For three months we tracked every call the AI handled — by language, by accent, by call type — and graded the transcript against a human reviewer. The accuracy numbers were better than we expected. The failure modes were more interesting.
Most "AI accuracy" numbers you read in marketing copy are quietly meaningless. They benchmark a model on a clean dataset, score the dataset against itself, and produce a number that has almost nothing to do with how the system performs on a Tuesday afternoon when a customer is calling from a noisy warehouse and asking three questions at once.
We wanted a number that meant something. So from January through March 2026, every call our AI receptionist handled across the production fleet (8,427 calls in total) was transcribed, graded by the AI in real time, then independently re-graded by a human reviewer who never saw the AI verdict. We tracked language, accent (where identifiable), call type, time of day, length, and the specific path the conversation took.
The point was not to publish a flattering number. The point was to find the calls where the AI thought it had succeeded and was wrong, and the calls where the AI had genuinely solved the customer's problem but our internal metrics had marked it as a miss. Both classes were larger than we expected.
There were three rules. First, no cherry-picking: every call across the period was eligible, including the 47 calls that crashed mid-conversation and the 312 that were transferred to a human within the first 15 seconds. Second, the human reviewers were blind to the AI's self-assessment: they graded the transcript against the customer's stated intent, not against whatever the AI claimed it had done. Third, the rubric was published before we looked at any data, to keep us from quietly redefining "success" once the numbers came in.
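Each call was scored on three independent dimensions: whether the AI identified the caller's intent correctly, whether it completed the task, and conversation hygiene, which received a numeric score.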
A call was only counted as a "win" if it cleared all three: correct intent, complete task, hygiene score of 4 or 5. The third gate is the one that quietly eliminated calls we would otherwise have called successes.
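The three-gate rule can be written down as a small predicate. This is a minimal sketch, not the production grader: the `CallScore` type and its field names are hypothetical, and the 1 to 5 hygiene scale is an assumption (the text only says a score of 4 or 5 passes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallScore:
    """One reviewer verdict for one call (hypothetical structure)."""
    intent_correct: bool  # did the AI identify the caller's stated intent?
    task_complete: bool   # was the task actually finished?
    hygiene: int          # conversation-hygiene score; a 1-5 scale is assumed


def is_win(score: CallScore, min_hygiene: int = 4) -> bool:
    """A call is a win only if it clears all three gates."""
    return (
        score.intent_correct
        and score.task_complete
        and score.hygiene >= min_hygiene
    )


def win_rate(scores) -> float:
    """Share of calls that cleared every gate."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scored calls")
    return sum(is_win(s) for s in scores) / len(scores)
```

A call with the right intent and a finished task still fails if hygiene is 3, which is how the third gate removes calls that would otherwise look like successes.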
Across the full 8,427-call dataset, the AI cleared all three gates on 91.4% of calls. That number was higher than we predicted internally; most of us had bets in the 85–88% range. The gap between our prediction and the result did not come from the model outperforming expectations. Our intuition had been shaped by the calls we remembered, which were disproportionately the bad ones.
Broken down by call type, the variance was significant.
The complaint and multi-issue numbers are the ones we focused on. A 96% number on appointment booking is what we ship to win deals; a 72% number on multi-issue calls is what we ship to keep them.
We operate in 32 languages and we publish a single accuracy number per language. The aggregate gap between our best language (English-US) and our worst supported language (Vietnamese) was 4.1 percentage points — narrower than we expected, and narrower than we have seen reported in published academic benchmarks.
But aggregate language scores hide the part of the problem that actually matters: accent variance within a single language. English calls handled in the United States scored 92.7%. English calls from Indian-English-speaking customers scored 91.3%, well within the noise. English calls from strongly regional Scottish-English speakers scored 83.4%. That is a 9.3-point gap against US English, and it lives entirely inside the column labeled "English".
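The accent breakdown amounts to grouping graded calls by a finer key than language alone. The sketch below assumes each call is a dict with `language`, `accent`, and `win` fields; those names are illustrative, not the schema used in the study. The worked figures are the three English rates quoted above.

```python
from collections import defaultdict


def win_rate_by_group(calls, key):
    """Return {group: share of calls in that group that cleared every gate}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for call in calls:
        group = key(call)
        totals[group] += 1
        wins[group] += 1 if call["win"] else 0
    return {group: wins[group] / totals[group] for group in totals}


def spread_in_points(rates):
    """Gap between the best and worst group, in percentage points."""
    return (max(rates.values()) - min(rates.values())) * 100


# Rates quoted in the text for English calls.
english_by_accent = {
    "US": 0.927,
    "Indian English": 0.913,
    "Scottish English (strongly regional)": 0.834,
}
print(round(spread_in_points(english_by_accent), 1))  # 9.3
```

Grouping with `key=lambda c: c["language"]` gives the per-language figure; `key=lambda c: (c["language"], c["accent"])` exposes the variance that figure hides.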
The point is not that our model is bad at Scottish English. The point is that publishing a per-language number, while leaving accent variance unmeasured, is a way of hiding the calls your AI is failing on. We are now reporting accent variance internally, and we expect to publish it externally within two quarters.
When we clustered the 723 missed calls (8.6% of the dataset), five failure modes accounted for 81% of them. None of these will surprise anyone who has shipped voice AI before, but the relative weights did surprise us.
The caller has two or more intents and the AI commits to the first one before they have finished stating the second. "I want to reschedule my appointment and also do you sell the X-100 model" is exactly the kind of input that nukes the conversation state.
The caller references something said earlier in the conversation that the AI did not store as context — a name, a price, a date. This is the failure mode that the next-generation context window most directly addresses.
Endpointing failures. The AI starts speaking, the caller talks over it, the AI hears its own voice mixed with the caller's and produces a transcript that is missing half the input.
Caller dictates a phone number or address. AI reads it back. Caller corrects one digit. AI updates the wrong digit. This is a 100%-fixable problem with a structured-data confirmation prompt, and we shipped one in February.
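One way a structured confirmation can avoid the wrong-digit update is to confirm the number in short groups, so a correction always targets a named group rather than a guessed position. The sketch below illustrates that idea only; it is not the prompt that shipped in February, and the 3-3-4 grouping, the function names, and the prompt wording are all assumptions.

```python
def chunk_digits(number: str, sizes=(3, 3, 4)) -> list[str]:
    """Split a dictated number into fixed-size groups (3-3-4 is assumed)."""
    digits = [ch for ch in number if ch.isdigit()]
    if len(digits) != sum(sizes):
        raise ValueError("unexpected number of digits")
    groups, start = [], 0
    for size in sizes:
        groups.append("".join(digits[start:start + size]))
        start += size
    return groups


def confirmation_prompt(groups: list[str], index: int) -> str:
    """Read back one group, digit by digit, and ask for a yes/no."""
    spoken = " ".join(groups[index])
    return f"Group {index + 1} of {len(groups)}: {spoken}. Is that right?"


def apply_correction(groups: list[str], index: int, corrected: str) -> list[str]:
    """Replace exactly one group; every other group stays untouched."""
    if not corrected.isdigit() or len(corrected) != len(groups[index]):
        raise ValueError("correction must match the group length")
    updated = list(groups)
    updated[index] = corrected
    return updated


groups = chunk_digits("555-010-4477")
print(confirmation_prompt(groups, 2))  # Group 3 of 3: 4 4 7 7. Is that right?
print(apply_correction(groups, 2, "4478"))  # ['555', '010', '4478']
```

Because the caller rejects a specific group and re-dictates it, the update cannot land on a digit in a different part of the number.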
The AI correctly identifies that it can't solve the call and transfers, but the transfer routing sends the caller to the wrong queue. This is not really an AI failure; it is a dial-plan failure that the AI exposes.
Three shipped changes came directly out of the study, and a fourth is in development. We tried to ship the highest-leverage fix per failure mode rather than the most academically interesting one.
The fourth change — a richer per-call context buffer that addresses implicit-context drop — is the most complex of the four, and it is being trialed in shadow mode against a held-out call set before it goes live.
One small finding that is worth naming: the human reviewers, despite a published rubric, drifted across the three months. In month one, reviewers marked 87% of calls as successful. In month three, the same reviewers, scoring randomly drawn month-one calls a second time, marked 91% as successful. The transcripts had not changed.
The drift was not bad faith; it was familiarity. As reviewers got used to the AI's phrasing, they became more generous on the conversation-hygiene dimension. We caught this by re-scoring a 5% sample of month-one calls in month three, and we corrected the headline number for it. The 91.4% reported above is the drift-corrected figure. The raw number was 92.1%.
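The re-scoring check can be sketched as a paired comparison: draw a random sample of early calls, have the same reviewers grade them again later, and compare the two pass rates on exactly those calls. The function names and the dict-of-verdicts structure are assumptions, and the sketch stops at measuring the drift; how the headline figure was then adjusted is not described here.

```python
import random


def draw_rescore_sample(call_ids, fraction=0.05, seed=0):
    """Pick a reproducible random sample of calls to grade a second time."""
    ids = sorted(call_ids)
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


def drift_in_points(first_pass, second_pass):
    """Change in pass rate between two gradings of the same calls.

    Both arguments map call_id -> bool verdict. Only calls present in both
    gradings are compared, so the transcripts are held constant.
    """
    shared = first_pass.keys() & second_pass.keys()
    if not shared:
        raise ValueError("no calls were graded twice")
    n = len(shared)
    rate_first = sum(first_pass[c] for c in shared) / n
    rate_second = sum(second_pass[c] for c in shared) / n
    return (rate_second - rate_first) * 100


# Toy verdicts, not study data: one call flips from fail to pass.
month_one = {1: True, 2: False, 3: True, 4: False}
month_three = {1: True, 2: True, 3: True, 4: False}
print(drift_in_points(month_one, month_three))  # 25.0
```

A positive result means reviewers became more generous on identical transcripts, which is the pattern described above.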
If a vendor publishes an accuracy number without describing how they controlled for human grader drift, the number is suspect by default. We did not invent this problem; we just paid attention to it.
We are measuring three things next, in priority order. First, accent-resolved accuracy within each major supported language, with a public commitment to publish it once the methodology stabilises. Second, post-call outcome accuracy: did the caller actually receive the thing they came for, measured 7 and 30 days after the call by reconciling with the customer's downstream system. Third, the cost-per-resolved-call number that combines accuracy with the per-minute economics, because a 98% accurate AI that costs $1.40/min loses to a 92% accurate AI that costs $0.18/min for almost any real business.
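The cost comparison can be made concrete with a rough formula: expected spend per call divided by the share of calls that are resolved. Both the formula and the 3-minute average call length are assumptions for illustration; only the accuracy and per-minute figures come from the text.

```python
def cost_per_resolved_call(price_per_min: float, avg_minutes: float,
                           accuracy: float) -> float:
    """Spend per call divided by the fraction of calls resolved."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return price_per_min * avg_minutes / accuracy


AVG_MINUTES = 3.0  # assumed average call length, not a figure from the study

premium = cost_per_resolved_call(1.40, AVG_MINUTES, 0.98)
budget = cost_per_resolved_call(0.18, AVG_MINUTES, 0.92)
print(round(premium, 2))  # 4.29
print(round(budget, 2))   # 0.59
```

Under these assumptions the cheaper system resolves a call for roughly one seventh of the cost, and because call length appears on both sides, the ratio does not depend on the 3-minute guess.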
We will publish the next set of numbers in Q3. If they show regression we will publish that too. The point of measuring this stuff is not to find a flattering chart; it is to find the calls we are still getting wrong.