We measured AI receptionist accuracy across 8,400 real calls

For three months we tracked every call the AI handled — by language, by accent, by call type — and graded the transcript against a human reviewer. The accuracy numbers were better than we expected. The failure modes were more interesting.

Introduction

Most "AI accuracy" numbers you read in marketing copy are quietly meaningless. They benchmark a model on a clean dataset, score the dataset against itself, and produce a number that has almost nothing to do with how the system performs on a Tuesday afternoon when a customer is calling from a noisy warehouse and asking three questions at once.

We wanted a number that meant something. So for three months between January and March 2026, every call our AI receptionist handled across the production fleet — 8,427 calls in total — was transcribed, graded by the AI in real time, then independently re-graded by a human reviewer who never saw the AI verdict.

We tracked language, accent (where identifiable), call type, time of day, length, and the specific path the conversation took.

The point was not to publish a flattering number. The point was to find the calls where the AI thought it had succeeded and was wrong, and the calls where the AI had genuinely solved the customer's problem but our internal metrics had marked it as a miss. Both classes were larger than we expected.

How we sampled and graded the calls

There were three rules. First, no cherry-picking: every call across the period was eligible, including the 47 calls that crashed mid-conversation and the 312 that were transferred to a human within the first 15 seconds. Second, the human reviewers were blind to the AI's self-assessment — they graded the transcript against the customer's stated intent, not against whatever the AI claimed it had done.

Third, we published the rubric before we looked at any data, to keep us from quietly redefining "success" once the numbers came in.

Each call was scored on three independent dimensions:

Intent capture — did the AI correctly identify what the caller was actually trying to do? (Binary: yes/no.)
Task completion — was the caller's stated need actually resolved on this call, or did it require a human follow-up the caller did not ask for?
Conversation hygiene — was the transcript free of hallucinations, false promises, or contradictions? Graded 1–5 by the reviewer.

A call was only counted as a "win" if it cleared all three: correct intent, complete task, hygiene score of 4 or 5. The third gate is the one that quietly eliminated calls we would otherwise have called successes.
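The three-gate rule is simple enough to state in code. What follows is a minimal Python sketch of it, not the production grader; the names `CallGrade`, `is_win`, and `win_rate` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class CallGrade:
    """One reviewer verdict for one call (hypothetical structure)."""
    intent_correct: bool   # intent capture: binary yes/no
    task_completed: bool   # task completion: resolved on this call
    hygiene: int           # conversation hygiene: 1-5


def is_win(grade: CallGrade, min_hygiene: int = 4) -> bool:
    """A call is a win only if it clears all three gates."""
    if not 1 <= grade.hygiene <= 5:
        raise ValueError("hygiene must be between 1 and 5")
    return (
        grade.intent_correct
        and grade.task_completed
        and grade.hygiene >= min_hygiene
    )


def win_rate(grades: list[CallGrade]) -> float:
    """Fraction of calls that cleared all three gates."""
    if not grades:
        raise ValueError("need at least one graded call")
    return sum(is_win(g) for g in grades) / len(grades)
```

Under this rule, a call with the correct intent and a completed task still fails if the reviewer scores hygiene at 3, which is how the third gate removes calls that would otherwise look like successes.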

The headline numbers

Across the full 8,427-call dataset, the AI cleared all three gates on 91.4% of calls. That number was higher than we predicted internally — most of us had bets in the 85–88% range.

The gap between our prediction and the result did not come from the model improving faster than we expected; it came from our intuition being shaped by the calls we remembered, which were disproportionately the bad ones.

Broken down by call type, the spread was wide, from 96.1% on appointment booking to 71.9% on multi-issue calls:

Appointment booking — 96.1% (the simplest path, well-rehearsed intent)
General information — 93.8% (hours, location, services)
Pricing inquiries — 89.2% (some pricing paths required disambiguation)
Cancellation / reschedule — 87.6%
Complaints and escalations — 78.3%
Multi-issue calls — 71.9% (two or more independent intents in the same call)

The complaint and multi-issue numbers are the ones we focused on. A 96% number on appointment booking is what we ship to win deals; a 72% number on multi-issue calls is what we ship to keep them.

Language and accent: where the surprises lived

We operate in 32 languages and we publish a single accuracy number per language. The aggregate gap between our best language (English-US) and our worst supported language (Vietnamese) was 4.1 percentage points — narrower than we expected, and narrower than we have seen reported in published academic benchmarks.

But aggregate language scores hide the part of the problem that actually matters: accent variance within a single language. English calls handled in the United States scored 92.7%. English calls from Indian-English-speaking customers scored 91.3% — well within the noise. English calls from strongly regional Scottish-English speakers scored 83.4%. That is a 9-point gap, and it lives entirely inside the column labeled "English".
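To make the arithmetic concrete, here is a small Python sketch that computes the within-language spread from the three English figures quoted above. The function name `accent_gap` and the dictionary labels are hypothetical; only the three percentages come from the study.

```python
# Per-accent win rates for English, as quoted in the text (percent).
english_scores = {
    "English (United States)": 92.7,
    "English (Indian)": 91.3,
    "English (Scottish, strongly regional)": 83.4,
}


def accent_gap(scores: dict[str, float]) -> float:
    """Percentage-point gap between the best and worst accent group."""
    if not scores:
        raise ValueError("need at least one accent group")
    return round(max(scores.values()) - min(scores.values()), 1)


gap = accent_gap(english_scores)
print(f"Within-English gap: {gap} points")
```

A single per-language number averages over these groups, so the gap never appears in the published figure.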

The point is not that our model is bad at Scottish English. The point is that publishing a per-language number, while leaving accent variance unmeasured, is a way of hiding the calls your AI is failing on. We are now reporting accent variance internally, and we expect to publish it externally within two quarters.

The five failure modes that explained 80% of misses

When we clustered the 723 missed calls (8.6% of the dataset), five failure modes accounted for 81% of them. None of these will surprise anyone who has shipped voice AI before, but the relative weights did surprise us.

1. Intent-stacking (29% of misses)

The caller has two or more intents and the AI commits to the first one before they have finished stating the second. "I want to reschedule my appointment and also do you sell the X-100 model" is exactly the kind of input that nukes the conversation state.

2. Implicit-context drop (22%)

The caller references something said earlier in the conversation that the AI did not store as context — a name, a price, a date. This is the failure mode that the next-generation context window most directly addresses.

3. Caller talked over the prompt (15%)

Endpointing failures. The AI starts speaking, the caller talks over it, the AI hears its own voice mixed with the caller's and produces a transcript that is missing half the input.

4. Number-readback errors (9%)

Caller dictates a phone number or address. AI reads it back. Caller corrects one digit. AI updates the wrong digit. This is a 100%-fixable problem with a structured-data confirmation prompt, and we shipped one in February.

5. False handoff (6%)

The AI correctly identifies that it can't solve the call and transfers — but the transfer routing sends it to the wrong queue. This is not really an AI failure; it is a dial-plan failure that the AI exposes.
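For a sense of scale, the shares above can be turned into approximate call counts. This is a rough Python sketch: the percentages and the 723-call total come from the text, but the resulting counts are estimates, because the published shares are already rounded.

```python
TOTAL_MISSES = 723  # missed calls, from the text

# Share of misses per failure mode, in percent, as listed above.
shares_pct = {
    "Intent-stacking": 29,
    "Implicit-context drop": 22,
    "Caller talked over the prompt": 15,
    "Number-readback errors": 9,
    "False handoff": 6,
}

named_pct = sum(shares_pct.values())   # share covered by the five modes
other_pct = 100 - named_pct            # everything outside the five modes

# Approximate counts only: derived from rounded percentages.
approx_counts = {
    name: round(TOTAL_MISSES * pct / 100) for name, pct in shares_pct.items()
}

for name, count in approx_counts.items():
    print(f"{name}: ~{count} calls")
print(f"Named modes: {named_pct}% of misses, other: {other_pct}%")
```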

What we shipped because of this study

Three shipped changes came directly out of the study, and a fourth is in development. We tried to ship the highest-leverage fix per failure mode rather than the most academically interesting one.

Multi-intent detector. A small classifier that runs in parallel with intent capture and flags utterances that contain two or more independent intents. When flagged, the conversation pauses and the AI explicitly enumerates: "I heard you want to reschedule and you also asked about the X-100. Let's do the reschedule first — is that okay?" The classifier added 11ms of latency and reduced intent-stacking misses by 64%.
Structured number confirmation. Any phone number, address, or amount is now confirmed digit-by-digit on readback, with a single-digit edit handler. Number-readback misses are down 88%.
Endpointing recalibration. We re-tuned the silence threshold per language and added a "did you finish that thought?" recovery on suspected truncation. Talk-over failures fell from 15% to 9% of the miss set.
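The single-digit edit handler in the second change can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the shipped implementation: `read_back` and `apply_single_digit_edit` are hypothetical names, and the sketch assumes the caller's correction has already been resolved to a 1-based position and a replacement digit.

```python
def read_back(digits: str) -> str:
    """Space the digits out so they are read one at a time."""
    return " ".join(digits)


def apply_single_digit_edit(digits: str, position: int, new_digit: str) -> str:
    """Replace exactly one digit at a 1-based position, leaving the rest intact."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("digits must contain only 0-9")
    if len(new_digit) != 1 or not (new_digit.isascii() and new_digit.isdigit()):
        raise ValueError("new_digit must be a single character 0-9")
    if not 1 <= position <= len(digits):
        raise ValueError("position is out of range")
    i = position - 1
    return digits[:i] + new_digit + digits[i + 1:]


# Caller corrects the seventh digit of the number that was read back.
number = "5551234567"
corrected = apply_single_digit_edit(number, position=7, new_digit="9")
print(read_back(corrected))
```

Keeping the edit scoped to one explicit position is the design point: the failure described earlier was the AI updating the wrong digit, and a handler that changes only the addressed position cannot do that.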

The fourth change — a richer per-call context buffer that addresses implicit-context drop — is the most complex of the four, and it is being trialed in shadow mode against a held-out call set before it goes live.

A note on human grader drift

One small finding that is worth naming: the human reviewers, despite a published rubric, drifted across the three months. In month one, reviewers marked 87% of calls as successful. In month three, the same reviewers, scoring randomly drawn month-one calls a second time, marked 91% as successful. The transcripts had not changed.

The drift was not bad faith; it was familiarity. As reviewers got used to the AI's phrasing, they became more generous on the conversation-hygiene dimension. We caught this by re-scoring a 5% sample of month-one calls in month three, and we corrected the headline number for it. The 91.4% reported above is the drift-corrected figure. The raw number was 92.1%.
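Measuring drift on a re-scored sample comes down to comparing two verdicts on the same calls. The Python sketch below shows that comparison only. The function name `drift_points` is hypothetical, the example data is synthetic and merely chosen to match the 87% and 91% rates quoted above, and the way the drift estimate was applied to the headline figure is not described in this post, so it is not modeled here.

```python
def drift_points(first_pass: list[bool], second_pass: list[bool]) -> float:
    """Percentage-point change in success rate between two gradings
    of the same calls (positive means the second pass was more generous)."""
    if not first_pass or len(first_pass) != len(second_pass):
        raise ValueError("both passes must grade the same non-empty set of calls")
    n = len(first_pass)
    return 100.0 * (sum(second_pass) - sum(first_pass)) / n


# Synthetic sample of 100 calls matching the quoted rates (not study data).
month_one_verdicts = [True] * 87 + [False] * 13
month_three_verdicts = [True] * 91 + [False] * 9

drift = drift_points(month_one_verdicts, month_three_verdicts)
print(f"Reviewer drift on re-scored sample: {drift:+.1f} points")
```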

If a vendor publishes an accuracy number without describing how they controlled for human grader drift, the number is suspect by default. We did not invent this problem; we just paid attention to it.

What we are measuring next

Three things, in priority order. First, accent-resolved accuracy within each major supported language, with a public commitment to publish it once the methodology stabilizes. Second, post-call outcome accuracy: did the caller actually receive the thing they came for, measured 7 and 30 days after the call by reconciling with the customer's downstream system?

Third, the cost-per-resolved-call number that combines accuracy with the per-minute economics — because a 98% accurate AI that costs $1.40/min loses to a 92% accurate AI that costs $0.18/min, for almost any real business.
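One simple way to combine the two numbers is to divide the cost of a call by the probability that it resolves. The Python sketch below does that for the two price points in the example. It is an assumption-laden illustration, not the metric definition: `cost_per_resolved_call` is a hypothetical name, the 3-minute average call length is invented for the example, and the cost of human follow-up on missed calls is ignored.

```python
def cost_per_resolved_call(
    cost_per_min: float, accuracy: float, avg_minutes: float
) -> float:
    """Expected spend per successfully resolved call."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    if cost_per_min < 0 or avg_minutes <= 0:
        raise ValueError("cost must be non-negative and call length positive")
    return cost_per_min * avg_minutes / accuracy


AVG_MINUTES = 3.0  # assumed average call length, for illustration only

expensive = cost_per_resolved_call(1.40, 0.98, AVG_MINUTES)
cheap = cost_per_resolved_call(0.18, 0.92, AVG_MINUTES)

print(f"98% accurate at $1.40/min: ${expensive:.2f} per resolved call")
print(f"92% accurate at $0.18/min: ${cheap:.2f} per resolved call")
print(f"Ratio: {expensive / cheap:.1f}x")
```

The ratio does not depend on the assumed call length, since it cancels out. It would shift if missed calls carried a large follow-up cost, which this sketch leaves out.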

We will publish the next set of numbers in Q3. If they show regression we will publish that too. The point of measuring this stuff is not to find a flattering chart; it is to find the calls we are still getting wrong.

Run your voice on Ajoxi.

AI receptionists, wholesale routes, virtual numbers — built on one platform with transparent pricing and a 24/7 NOC.
