Introduction
Our voice product had two complaints that we could not reconcile internally. From English-speaking enterprise customers, the AI felt snappy. From Mandarin-speaking customers — and increasingly from Vietnamese, Tagalog, and Hindi-speaking customers — the AI felt sluggish. The gap, measured end-to-end from caller silence to AI first-syllable, was 260ms on English and 540ms on Mandarin. Both numbers were within the published latency budget. Only one of them felt acceptable on a phone call.
The instinct of any engineering team that sees a latency gap is to grind on the model. Faster inference. Smaller model. Better quantisation. We did all of that. It bought us 60ms across the board, which closed nothing — the relative gap was still there, and English remained twice as fast.
The real fix came from a different observation: we were treating latency as a model problem when it was actually a routing problem.
Where the milliseconds actually go
Breaking the 540ms Mandarin response into segments was the first thing that clarified the problem. The accounting looked roughly like this:
- Audio capture and endpointing — 80ms
- Speech-to-text first-token — 140ms
- Intent + retrieval pipeline — 90ms
- Large model first-token (generation) — 180ms
- Text-to-speech first-audio — 50ms
On the English path, the same segments came in at 80 / 60 / 90 / 80 / 40. The two segments that diverged dramatically were speech-to-text and large-model generation. STT was slower in Mandarin because the acoustic model had been trained with a longer context window, necessary for tonal disambiguation, which pushed first-token latency out by 80ms. Model generation was slower because the tokeniser produced more tokens in Mandarin than in English for text of equivalent meaning.
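As a rough illustration of the accounting above, the per-segment comparison can be sketched in a few lines of Python. The segment keys are illustrative labels rather than identifiers from any real pipeline; the numbers are the ones quoted in the breakdown.

```python
# Per-segment latency in milliseconds, copied from the breakdown above.
# The dictionary keys are illustrative names, not real identifiers.
MANDARIN_MS = {
    "capture_endpointing": 80,
    "stt_first_token": 140,
    "intent_retrieval": 90,
    "llm_first_token": 180,
    "tts_first_audio": 50,
}
ENGLISH_MS = {
    "capture_endpointing": 80,
    "stt_first_token": 60,
    "intent_retrieval": 90,
    "llm_first_token": 80,
    "tts_first_audio": 40,
}


def segment_deltas(slow, fast):
    """Return {segment: slow - fast}, largest difference first."""
    deltas = {name: slow[name] - fast[name] for name in slow}
    return dict(sorted(deltas.items(), key=lambda kv: kv[1], reverse=True))


deltas = segment_deltas(MANDARIN_MS, ENGLISH_MS)
for name, delta in deltas.items():
    print(f"{name:22s} +{delta}ms")
```

Sorting by delta puts generation and speech-to-text at the top, which is the point of doing the breakdown per segment rather than looking only at the end-to-end number.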
Neither was a model defect. Both were tradeoffs that had been quietly accumulated by independent teams optimising for their own metrics — STT for accuracy, the LLM for generation quality. The latency cost was real, but it was nobody's P&L.
The routing fix that bought 200ms
We stopped routing all languages through a single LLM. Instead, we built a thin classification layer in front of generation that detects, in under 8ms, three things: the language of the incoming utterance, the conversational intent class, and whether the request is one of about 40 high-frequency patterns we identified by clustering 2M call segments.
When the classifier identifies a high-frequency pattern in a non-English language, we route the generation to a smaller, language-specialised model that handles that pattern directly. The smaller model has been distilled on millions of completions of the same pattern, so its quality on those specific paths is within the noise of the big model, but its first-token latency is roughly a third of the big model's.
When the classifier sees a tail intent (anything not in the 40 patterns), the request falls through to the big model, just as before. Most of the latency drop comes from the fact that the 40 high-frequency patterns account for ~78% of call volume. The tail cases pay the original latency, but they are the minority.
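The routing decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the production router: the `Classification` type, the model names, and the language codes are all hypothetical, and keeping English on the big model is an assumption, since the text only describes small-model routing for non-English languages.

```python
from dataclasses import dataclass
from typing import Optional

BIG_MODEL = "big-model"  # hypothetical name for the general-purpose LLM


@dataclass(frozen=True)
class Classification:
    language: str              # e.g. "zh", "en" (illustrative codes)
    intent: str                # conversational intent class
    pattern_id: Optional[int]  # a high-frequency pattern, or None for tail


def choose_model(c: Classification) -> str:
    """Pick a generation model for one utterance."""
    # Tail intents (no matching pattern) keep the original big-model path.
    if c.pattern_id is None:
        return BIG_MODEL
    # Assumption: English stays on the big model; only non-English
    # high-frequency patterns go to a language-specialised small model.
    if c.language == "en":
        return BIG_MODEL
    return f"small-{c.language}"


print(choose_model(Classification("zh", "order_status", 7)))
print(choose_model(Classification("zh", "unusual_request", None)))
```

The fall-through is the default branch, so anything the classifier cannot place in a known pattern behaves exactly as it did before routing existed.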
The classifier nobody saw coming
The thin classifier was the part of the project nobody had budgeted for. Building it took longer than building the routing system itself, because the failure modes were subtle. A classifier that misroutes 1% of requests to a small model that does not handle them produces hallucinations, not just degraded responses.
Three things made the classifier work. First, it was trained on real production traffic, not synthetic data. Second, it returned a confidence score that the router could threshold — anything below 0.91 confidence falls through to the big model automatically. Third, we shadow-deployed it for six weeks against the existing pipeline, with the big model providing ground truth, before any traffic was actually routed to small models.
The shadow deployment found four classes of mistakes the offline test set had missed. Two were easy to fix in training data. Two required deliberate routing rules — anything that touches identity verification, payments, or appointment cancellation always goes to the big model regardless of classifier confidence, because the cost of a hallucination on those paths is higher than the latency win.
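The two safeguards described here, the confidence threshold and the hard routing rules, can be combined into a single gate. The sketch below assumes hypothetical intent labels for the three sensitive areas named in the text; only the 0.91 threshold comes from the source.

```python
CONFIDENCE_THRESHOLD = 0.91  # from the text: below this, fall through

# Hypothetical intent labels for the sensitive paths named in the text.
ALWAYS_BIG_MODEL_INTENTS = frozenset({
    "identity_verification",
    "payments",
    "appointment_cancellation",
})


def allow_small_model(intent: str, confidence: float) -> bool:
    """Return True only when routing to a small model is permitted."""
    if intent in ALWAYS_BIG_MODEL_INTENTS:
        return False  # hard rule, applied regardless of confidence
    return confidence >= CONFIDENCE_THRESHOLD


print(allow_small_model("order_status", 0.95))  # True
print(allow_small_model("order_status", 0.90))  # False
print(allow_small_model("payments", 0.99))      # False
```

Checking the hard rules before the threshold means a very confident classification can never override them.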
The TTS side of the problem
While the LLM and STT changes were going through review, a parallel team was working on the speech output. The 50ms TTS first-audio on English was effectively unbeatable; the 70-90ms first-audio latency on most non-English languages was driven by smaller voice models, less aggressive caching, and a single shared GPU pool that prioritised English under load.
We split the TTS infrastructure into language-pinned pools. Mandarin TTS now runs on its own pool, with its own scaling rules, on hardware close to the regions where most of our Mandarin traffic originates. The latency dropped from 70-90ms to 45ms within two weeks of cutover. None of this was clever; it was infrastructure work that nobody had bothered to do because the marginal improvement on any single language had not, in isolation, justified it.
The lesson — and we are now writing it down internally — is that "we did not invest in this because no single user-facing metric justified it" is exactly the kind of decision that compounds into a 280ms gap between two languages over four years.
Measuring parity, not averages
One small but consequential change we made internally was to the way the team's latency dashboard reports. We used to track the p50 and p95 of end-to-end latency, averaged across all languages, against a single SLO. The dashboard looked green most of the time.
The new dashboard reports p50 and p95 per language against a parity SLO — the gap between the slowest and fastest supported language has its own budget, separately enforced. We set the parity budget at 60ms. When the gap exceeds the budget, on-call gets paged.
The parity dashboard does something the average-based one could not: it makes regressions in the slowest language visible at the same urgency as regressions in the fastest one. When you average across 32 languages with equal weight, a 200ms regression in Vietnamese moves the average by about 6ms (200 / 32). When you measure parity, it can move the gap by the full 200ms.
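The difference between the two dashboards is easy to see in a small sketch. The function names and the per-language baseline of 300ms are hypothetical, chosen only to make the arithmetic visible; the 60ms budget, the 32 languages, and the 200ms regression are the figures from the text.

```python
PARITY_BUDGET_MS = 60  # parity budget stated in the text


def parity_gap_ms(p50_by_language):
    """Gap between the slowest and fastest language, in milliseconds."""
    values = p50_by_language.values()
    return max(values) - min(values)


def should_page(p50_by_language, budget_ms=PARITY_BUDGET_MS):
    """True when the parity gap exceeds its budget."""
    return parity_gap_ms(p50_by_language) > budget_ms


def mean_shift_ms(regression_ms, language_count):
    """Shift in an unweighted cross-language mean when one language regresses."""
    return regression_ms / language_count


# Hypothetical data: 32 languages at an identical 300ms baseline,
# then a 200ms regression in a single language.
baseline = {f"lang{i}": 300 for i in range(32)}
regressed = dict(baseline, lang0=500)

print(mean_shift_ms(200, 32))    # 6.25
print(parity_gap_ms(regressed))  # 200
print(should_page(regressed))    # True
```

The same regression is a rounding error on the averaged view and a page on the parity view.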
What is still slow
After the routing work, the TTS pools, and the parity dashboard, our worst language (Cantonese, primarily because the supporting acoustic-model corpus is smaller) sits at 320ms. English is at 220ms. The remaining 100ms gap is real and we know roughly where it lives — the Cantonese STT acoustic model is older and not yet on the new architecture, and we will retrain it in Q3.
But "the slowest language is 100ms behind the fastest" is a different conversation from "the slowest language is 280ms behind the fastest." The former is a known regression with a quarterly fix planned. The latter was a customer-experience emergency we did not know we had.
What we would do differently next time
Three things, in priority order, if we were rebuilding the latency stack from scratch with what we know now.
- Build the classifier-router first, then the model. The routing layer ended up being the highest-leverage piece. Treating it as a "we will add it later if we need it" item meant we lived with single-model latency for 18 months longer than we needed to.
- Set parity SLOs before language SLOs. The temptation is to set per-language targets and let parity emerge. Parity does not emerge; it diverges, slowly, in the direction of the team's primary language.
- Resource the per-language infrastructure as if each language were a separate product. Mandarin TTS deserved its own GPU pool from day one, not from the day we measured the latency gap.
None of this is novel research. None of it required a paper. It required treating latency parity as a product commitment, then funding the unglamorous infrastructure to back the commitment. We have the parity we need now. We did not have it for four years.