© 2026 Ajoxi. All rights reserved.


Same latency on Mandarin and English. Here is how

Hitting latency parity across 32 languages without bloating the model required a model-routing layer we did not see coming. Notes from the latency war room.

Table of Contents
  • 1. Introduction
  • 2. Where the milliseconds actually go
  • 3. The routing fix that bought 200ms
  • 4. The classifier nobody saw coming
  • 5. The TTS side of the problem
  • 6. Measuring parity, not averages
  • 7. What is still slow
  • 8. What we would do differently next time

Introduction

Our voice product had two complaints that we could not reconcile internally. From English-speaking enterprise customers, the AI felt snappy. From Mandarin-speaking customers — and increasingly from Vietnamese, Tagalog, and Hindi-speaking customers — the AI felt sluggish. The gap, measured end-to-end from caller silence to AI first-syllable, was 260ms on English and 540ms on Mandarin. Both numbers were within the published latency budget. Only one of them felt acceptable on a phone call.

The instinct of any engineering team that sees a latency gap is to grind on the model. Faster inference. Smaller model. Better quantisation. We did all of that. It bought us 60ms across the board, which closed nothing — the relative gap was still there, and English remained twice as fast.

The real fix came from a different observation: we were treating latency as a model problem when it was actually a routing problem.

Where the milliseconds actually go

Breaking the 540ms Mandarin response into segments was the first thing that clarified the problem. The accounting looked roughly like this:

  • Audio capture and endpointing — 80ms
  • Speech-to-text first-token — 140ms
  • Intent + retrieval pipeline — 90ms
  • Large model first-token (generation) — 180ms
  • Text-to-speech first-audio — 50ms

On the English path, the same segments came in at 80 / 60 / 90 / 80 / 40ms. The two segments that diverged dramatically were speech-to-text and large-model generation. STT was slower in Mandarin because the acoustic model had been trained with a longer context window, necessary for tonal disambiguation, which pushed first-token latency out by 80ms. Model generation was slower because the tokeniser produced more tokens for text of equivalent meaning in Mandarin than in English.

Neither was a model defect. Both were tradeoffs that had been quietly accumulated by independent teams optimising for their own metrics — STT for accuracy, the LLM for generation quality. The latency cost was real, but it was nobody's P&L.
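To make the segment comparison concrete, here is a minimal sketch that ranks the per-segment differences using the figures quoted above. The names `SEGMENTS`, `MANDARIN_MS`, `ENGLISH_MS`, and `segment_deltas` are illustrative and not part of any production tooling.

```python
# Per-segment first-response latencies quoted in this post, in milliseconds.
SEGMENTS = [
    "audio capture and endpointing",
    "speech-to-text first-token",
    "intent + retrieval pipeline",
    "large model first-token",
    "text-to-speech first-audio",
]
MANDARIN_MS = [80, 140, 90, 180, 50]
ENGLISH_MS = [80, 60, 90, 80, 40]


def segment_deltas(slow, fast):
    """Return (segment, extra_ms) pairs, largest extra latency first."""
    deltas = [(name, s - f) for name, s, f in zip(SEGMENTS, slow, fast)]
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)


for name, extra in segment_deltas(MANDARIN_MS, ENGLISH_MS):
    print(f"{name:32s} +{extra}ms")
```

Sorting by the per-segment difference rather than by absolute latency is what surfaces generation and speech-to-text as the two segments worth attention.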

The routing fix that bought 200ms

We stopped routing all languages through a single LLM. Instead, we built a thin classification layer in front of generation that detects, in under 8ms, three things: the language of the incoming utterance, the conversational intent class, and whether the request is one of about 40 high-frequency patterns we identified by clustering 2M call segments.

When the classifier identifies a high-frequency pattern in a non-English language, we route the generation to a smaller, language-specialised model that handles that pattern directly. The smaller model has been distilled on millions of completions of the same pattern, so its quality on those specific paths is within the noise of the big model — but its first-token latency is roughly a third.

When the classifier sees a tail intent — anything not in the 40 patterns — the request falls through to the big model, just like before. Most of the latency drop comes from the fact that the high-frequency patterns are also the patterns that account for ~78% of call volume. The tail cases pay the original latency, but they are rare.
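The routing decision described above can be sketched roughly as follows. This is an illustration under assumptions, not the production router: the `Classification` fields, the model names, and the use of `None` to mark a tail intent are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

BIG_MODEL = "big-general"  # hypothetical name for the single large model


@dataclass(frozen=True)
class Classification:
    language: str               # language code of the utterance, e.g. "zh" or "en"
    intent: str                 # conversational intent class
    pattern_id: Optional[int]   # one of the ~40 high-frequency patterns, or None for a tail intent


def choose_model(c: Classification) -> str:
    """Pick a generation model for one classified utterance."""
    # High-frequency pattern in a non-English language: use the distilled,
    # language-specialised model (naming scheme is an assumption).
    if c.pattern_id is not None and c.language != "en":
        return f"small-{c.language}"
    # Tail intents, and everything in English, stay on the big model.
    return BIG_MODEL


print(choose_model(Classification("zh", "book_appointment", 7)))
print(choose_model(Classification("zh", "unusual_request", None)))
```

Keeping the fall-through as the default branch means an unrecognised request behaves exactly as it did before the router existed.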

The classifier nobody saw coming

The thin classifier was the part of the project nobody had budgeted for. Building it took longer than building the routing system itself, because the failure modes were subtle. A classifier that misroutes 1% of requests to a small model that does not handle them produces hallucinations, not just degraded responses.

Three things made the classifier work. First, it was trained on real production traffic, not synthetic data. Second, it returned a confidence score that the router could threshold — anything below 0.91 confidence falls through to the big model automatically. Third, we shadow-deployed it for six weeks against the existing pipeline, with the big model providing ground truth, before any traffic was actually routed to small models.

The shadow deployment found four classes of mistakes the offline test set had missed. Two were easy to fix in training data. Two required deliberate routing rules — anything that touches identity verification, payments, or appointment cancellation always goes to the big model regardless of classifier confidence, because the cost of a hallucination on those paths is higher than the latency win.
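A minimal sketch of how the confidence floor and the always-big-model rules might be layered on top of a proposed route. The 0.91 floor and the three sensitive areas come from the text above; the intent labels, the function name `final_route`, and the model names are hypothetical.

```python
CONFIDENCE_FLOOR = 0.91  # below this, the request falls through to the big model

# Intent labels are illustrative; the post names the areas, not the identifiers.
ALWAYS_BIG_INTENTS = frozenset(
    {"identity_verification", "payment", "appointment_cancellation"}
)


def final_route(proposed_model, intent, confidence, big_model="big-general"):
    """Apply the safety rules to a route the classifier has proposed."""
    # Hard rule first: sensitive paths ignore classifier confidence entirely.
    if intent in ALWAYS_BIG_INTENTS:
        return big_model
    # Soft rule: low-confidence classifications are not trusted to a small model.
    if confidence < CONFIDENCE_FLOOR:
        return big_model
    return proposed_model


print(final_route("small-zh", "order_status", 0.97))
print(final_route("small-zh", "payment", 0.99))
print(final_route("small-zh", "order_status", 0.80))
```

Checking the hard rules before the threshold keeps the sensitive paths safe even if the floor is later tuned.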

The TTS side of the problem

While the LLM and STT changes were going through review, a parallel team was working on the speech output. The 50ms TTS first-audio on English was effectively unbeatable; the 70-90ms first-audio latency on most non-English languages was driven by smaller voice models, less aggressive caching, and a single shared GPU pool that prioritised English under load.

We split the TTS infrastructure into language-pinned pools. Mandarin TTS now runs on its own pool, with its own scaling rules, on hardware close to the regions where most of our Mandarin traffic originates. The latency dropped from 70-90ms to 45ms within two weeks of cutover. None of this was clever; it was infrastructure work that nobody had bothered to do because the marginal improvement on any single language had not, in isolation, justified it.

The lesson — and we are now writing it down internally — is that "we did not invest in this because no single user-facing metric justified it" is exactly the kind of decision that compounds into a 280ms gap between two languages over four years.

Measuring parity, not averages

One small but consequential change we made internally was to the way the team's latency dashboard reports. We used to track the p50 and p95 of end-to-end latency, averaged across all languages, against a single SLO. The dashboard looked green most of the time.

The new dashboard reports p50 and p95 per language against a parity SLO — the gap between the slowest and fastest supported language has its own budget, separately enforced. We set the parity budget at 60ms. When the gap exceeds the budget, on-call gets paged.
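As a rough illustration of the parity check, the sketch below computes the gap between the slowest and fastest language and compares it with the 60ms budget. The sample latencies are invented for the example, and whether the budget applies to p50 or p95 is an assumption left to the caller.

```python
PARITY_BUDGET_MS = 60  # parity budget stated in this post


def parity_gap(latency_ms_by_language):
    """Return (gap_ms, slowest_language, fastest_language)."""
    slowest = max(latency_ms_by_language, key=latency_ms_by_language.get)
    fastest = min(latency_ms_by_language, key=latency_ms_by_language.get)
    gap = latency_ms_by_language[slowest] - latency_ms_by_language[fastest]
    return gap, slowest, fastest


def should_page(latency_ms_by_language, budget_ms=PARITY_BUDGET_MS):
    """True when the slowest-to-fastest gap exceeds the parity budget."""
    gap, _, _ = parity_gap(latency_ms_by_language)
    return gap > budget_ms


# Illustrative p50 values only, not measured numbers.
sample_p50 = {"en": 230, "zh": 260, "vi": 310}
print(parity_gap(sample_p50), should_page(sample_p50))
```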

The parity dashboard does something the average-based one could not: it makes regressions in the slowest language visible at the same urgency as regressions in the fastest one. When you average across 32 languages with equal weight, a 200ms regression in Vietnamese moves the average by about 6ms. When you measure parity, the same regression can move the gap by the full 200ms.
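The dilution effect is easy to reproduce. The sketch below assumes 32 equally weighted languages that all start at an illustrative 300ms, then applies a 200ms regression to one of them; the baseline figure and language keys are made up for the example.

```python
from statistics import mean


def spread(latencies):
    """Gap between the slowest and fastest language, in milliseconds."""
    values = list(latencies.values())
    return max(values) - min(values)


# 32 languages, all at an illustrative 300ms baseline.
baseline = {f"lang_{i:02d}": 300.0 for i in range(31)}
baseline["vi"] = 300.0

# A 200ms regression in a single language.
regressed = {**baseline, "vi": 500.0}

mean_shift = mean(list(regressed.values())) - mean(list(baseline.values()))
gap_shift = spread(regressed) - spread(baseline)

print(f"mean moved {mean_shift:.2f}ms, parity gap moved {gap_shift:.0f}ms")
```

With equal weighting the mean moves by 200 / 32 = 6.25ms, while the spread moves by the full 200ms. Weighting the average by traffic would change the first number but not the second.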

What is still slow

After the routing work, the TTS pools, and the parity dashboard, our worst language (Cantonese, primarily because the supporting acoustic-model corpus is smaller) sits at 320ms. English is at 220ms. The remaining 100ms gap is real and we know roughly where it lives — the Cantonese STT acoustic model is older and not yet on the new architecture, and we will retrain it in Q3.

But "the slowest language is 100ms behind the fastest" is a different conversation from "the slowest language is 280ms behind the fastest." The former is a known regression with a quarterly fix planned. The latter was a customer-experience emergency we did not know we had.

What we would do differently next time

Three things, in priority order, if we were rebuilding the latency stack from scratch with what we know now.

  • Build the classifier-router first, then the model. The routing layer ended up being the highest-leverage piece. Treating it as a "we will add it later if we need it" item meant we lived with single-model latency for 18 months longer than we needed to.
  • Set parity SLOs before language SLOs. The temptation is to set per-language targets and let parity emerge. Parity does not emerge; it diverges, slowly, in the direction of the team's primary language.
  • Resource the per-language infrastructure as if each language were a separate product. Mandarin TTS deserved its own GPU pool from day one, not from the day we measured the latency gap.

None of this is novel research. None of it required a paper. It required treating latency parity as a product commitment, then funding the unglamorous infrastructure to back the commitment. We have the parity we need now. We did not have it for four years.
