Ajoxi
© 2026 Ajoxi. All rights reserved.


We measured AI receptionist accuracy across 8,400 real calls

For three months we tracked every call the AI handled — by language, by accent, by call type — and graded the transcript against a human reviewer. The accuracy numbers were better than we expected. The failure modes were more interesting.

Table of Contents
  • 1. Introduction
  • 2. How we sampled and graded
  • 3. The headline numbers
  • 4. Language and accent
  • 5. The five failure modes
  • 6. What we shipped because of it
  • 7. Human grader drift
  • 8. What we are measuring next

Introduction

Most "AI accuracy" numbers you read in marketing copy are quietly meaningless. They benchmark a model on a clean dataset, score the dataset against itself, and produce a number that has almost nothing to do with how the system performs on a Tuesday afternoon when a customer is calling from a noisy warehouse and asking three questions at once.

We wanted a number that meant something. So for three months, from January through March 2026, every call our AI receptionist handled across the production fleet (8,427 calls in total) was transcribed, graded by the AI in real time, then independently re-graded by a human reviewer who never saw the AI verdict. We tracked language, accent (where identifiable), call type, time of day, length, and the specific path the conversation took.

The point was not to publish a flattering number. The point was to find the calls where the AI thought it had succeeded and was wrong, and the calls where the AI had genuinely solved the customer's problem but our internal metrics had marked it as a miss. Both classes were larger than we expected.

How we sampled and graded the calls

There were three rules. First, no cherry-picking: every call across the period was eligible, including the 47 calls that crashed mid-conversation and the 312 that were transferred to a human within the first 15 seconds. Second, the human reviewers were blind to the AI's self-assessment: they graded the transcript against the customer's stated intent, not against whatever the AI claimed it had done. Third, the rubric was published before anyone looked at the data, to keep us from quietly redefining "success" once the numbers came in.

Each call was scored on three independent dimensions:

  • Intent capture — did the AI correctly identify what the caller was actually trying to do? (Binary: yes/no.)
  • Task completion — was the caller's stated need actually resolved on this call, or did it require a human follow-up the caller did not ask for?
  • Conversation hygiene — was the transcript free of hallucinations, false promises, or contradictions? Graded 1–5 by the reviewer.

A call counted as a "win" only if it cleared all three: correct intent, complete task, hygiene score of 4 or 5. The third gate is the one that quietly eliminated calls we would otherwise have called successes.
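A minimal sketch of the three-gate rule described above, for readers who want to apply the same rubric to their own graded calls. The `GradedCall` type and the function names are hypothetical, chosen for illustration; only the gate logic (correct intent, completed task, hygiene of 4 or 5) comes from the rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedCall:
    """One human-graded call. Field names are illustrative assumptions."""
    intent_captured: bool  # gate 1: binary intent capture
    task_completed: bool   # gate 2: need resolved on this call
    hygiene: int           # gate 3: reviewer score from 1 to 5


def is_win(call: GradedCall) -> bool:
    """A call is a win only if it clears all three gates."""
    if not 1 <= call.hygiene <= 5:
        raise ValueError("hygiene must be between 1 and 5")
    return call.intent_captured and call.task_completed and call.hygiene >= 4


def win_rate(calls: list[GradedCall]) -> float:
    """Share of calls that cleared all three gates."""
    if not calls:
        raise ValueError("no calls to score")
    return sum(is_win(c) for c in calls) / len(calls)
```

Note that a call with correct intent and a completed task still fails if the reviewer scores hygiene at 3 or below, which is how the third gate removes calls that would otherwise look like successes.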

The headline numbers

Across the full 8,427-call dataset, the AI cleared all three gates on 91.4% of calls. That number was higher than we predicted internally; most of us had bets in the 85–88% range. The gap between our prediction and the result did not come from the model being better than we expected. It came from our intuition being shaped by the calls we remembered, which were disproportionately the bad ones.

Broken down by call type, the variance was significant:

  • Appointment booking — 96.1% (the simplest path, well-rehearsed intent)
  • General information — 93.8% (hours, location, services)
  • Pricing inquiries — 89.2% (some pricing paths required disambiguation)
  • Cancellation / reschedule — 87.6%
  • Complaints and escalations — 78.3%
  • Multi-issue calls — 71.9% (two or more independent intents in the same call)

The complaint and multi-issue numbers are the ones we focused on. A 96% number on appointment booking is what we ship to win deals; a 72% number on multi-issue calls is what we ship to keep them.

Language and accent: where the surprises lived

We operate in 32 languages and we publish a single accuracy number per language. The aggregate gap between our best language (English-US) and our worst supported language (Vietnamese) was 4.1 percentage points — narrower than we expected, and narrower than we have seen reported in published academic benchmarks.

But aggregate language scores hide the part of the problem that actually matters: accent variance within a single language. English calls handled in the United States scored 92.7%. English calls from Indian-English-speaking customers scored 91.3%, well within the noise. English calls from strongly regional Scottish-English speakers scored 83.4%. That is a 9.3-point gap, and it lives entirely inside the column labeled "English".

The point is not that our model is bad at Scottish English. The point is that publishing a per-language number, while leaving accent variance unmeasured, is a way of hiding the calls your AI is failing on. We are now reporting accent variance internally, and we expect to publish it externally within two quarters.
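To make the masking effect concrete, here is a small sketch that computes win rates at two levels of grouping. The records are synthetic and exist only to show the mechanics; they are not drawn from the study, and the field names (`language`, `accent`, `win`) are assumptions.

```python
from collections import defaultdict


def win_rates(records, key):
    """Win rate per group; `key` maps a record to its group label."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        wins[group] += rec["win"]  # True counts as 1, False as 0
    return {group: wins[group] / totals[group] for group in totals}


# Synthetic records for illustration only.
records = (
    [{"language": "en", "accent": "A", "win": True}] * 9
    + [{"language": "en", "accent": "A", "win": False}] * 1
    + [{"language": "en", "accent": "B", "win": True}] * 1
    + [{"language": "en", "accent": "B", "win": False}] * 1
)

# One number per language hides the weaker accent group.
by_language = win_rates(records, key=lambda r: r["language"])

# Grouping by (language, accent) exposes it.
by_accent = win_rates(records, key=lambda r: (r["language"], r["accent"]))
```

In this toy set the per-language figure sits close to the larger group's rate, while the smaller accent group's rate is only visible in the second grouping.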

The five failure modes that explained 81% of misses

When we clustered the 723 missed calls (8.6% of the dataset), five failure modes accounted for 81% of them. None of these will surprise anyone who has shipped voice AI before, but the relative weights did surprise us.

1. Intent-stacking (29% of misses)

The caller has two or more intents and the AI commits to the first one before they have finished stating the second. "I want to reschedule my appointment and also do you sell the X-100 model" is exactly the kind of input that nukes the conversation state.

2. Implicit-context drop (22%)

The caller references something said earlier in the conversation that the AI did not store as context — a name, a price, a date. This is the failure mode that the next-generation context window most directly addresses.

3. Caller talked over the prompt (15%)

Endpointing failures. The AI starts speaking, the caller talks over it, the AI hears its own voice mixed with the caller's and produces a transcript that is missing half the input.

4. Number-readback errors (9%)

Caller dictates a phone number or address. AI reads it back. Caller corrects one digit. AI updates the wrong digit. This is a largely fixable problem with a structured-data confirmation prompt, and we shipped one in February.

5. False handoff (6%)

The AI correctly identifies that it can't solve the call and transfers — but the transfer routing sends it to the wrong queue. This is not really an AI failure; it is a dial-plan failure that the AI exposes.
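The arithmetic behind the miss set can be checked in a few lines. The totals and percentage shares below are the figures quoted in this section; the per-mode call counts are estimates only, because the published shares are rounded to whole percentages.

```python
TOTAL_CALLS = 8_427
MISSED_CALLS = 723

# Share of the miss set per failure mode, in whole percent, as listed above.
FAILURE_SHARE_PCT = {
    "intent-stacking": 29,
    "implicit-context drop": 22,
    "caller talked over the prompt": 15,
    "number-readback errors": 9,
    "false handoff": 6,
}

# Miss rate across the whole dataset, in percent.
miss_rate_pct = 100 * MISSED_CALLS / TOTAL_CALLS

# Combined share of the miss set explained by the five modes.
covered_pct = sum(FAILURE_SHARE_PCT.values())

# Estimated calls per mode; approximate because the shares are rounded.
approx_counts = {
    mode: round(MISSED_CALLS * pct / 100)
    for mode, pct in FAILURE_SHARE_PCT.items()
}
```

The remaining share of misses falls outside these five clusters.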

What we shipped because of this study

Three shipped changes came directly out of the study, and a fourth is in development. We tried to ship the highest-leverage fix per failure mode rather than the most academically interesting one.

  • Multi-intent detector. A small classifier that runs in parallel with intent capture and flags utterances that contain two or more independent intents. When flagged, the conversation pauses and the AI explicitly enumerates: "I heard you want to reschedule and you also asked about the X-100. Let's do the reschedule first — is that okay?" The classifier added 11ms of latency and reduced intent-stacking misses by 64%.
  • Structured number confirmation. Any phone number, address, or amount is now confirmed digit-by-digit on readback, with a single-digit edit handler. Number-readback misses are down 88%.
  • Endpointing recalibration. We re-tuned the silence threshold per language and added a "did you finish that thought?" recovery on suspected truncation. Talk-over failures fell from 15% to 9% of the miss set.
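As an illustration of the structured number confirmation idea, here is a minimal sketch of a digit-by-digit readback and a single-digit edit. This is not the shipped handler; the function names and the 1-based position convention are assumptions made for the example.

```python
def _check_digits(digits: str) -> None:
    """Reject anything that is not a non-empty string of ASCII digits."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of digits")


def read_back(digits: str) -> str:
    """Format a digit string for digit-by-digit readback."""
    _check_digits(digits)
    return " ".join(digits)


def apply_single_digit_edit(digits: str, position: int, new_digit: str) -> str:
    """Replace exactly one digit and leave every other digit untouched.

    `position` is 1-based, matching how a caller would say "the fourth digit".
    """
    _check_digits(digits)
    if len(new_digit) != 1:
        raise ValueError("new_digit must be a single digit")
    _check_digits(new_digit)
    if not 1 <= position <= len(digits):
        raise ValueError("position out of range")
    index = position - 1
    return digits[:index] + new_digit + digits[index + 1:]
```

Treating the number as a fixed sequence of digits, rather than re-transcribing the whole utterance after a correction, is what keeps a one-digit fix from disturbing the rest of the number.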

The fourth change — a richer per-call context buffer that addresses implicit-context drop — is the most complex of the four, and it is being trialed in shadow mode against a held-out call set before it goes live.

A note on human grader drift

One small finding that is worth naming: the human reviewers, despite a published rubric, drifted across the three months. In month one, reviewers marked 87% of calls as successful. In month three, the same reviewers, scoring randomly drawn month-one calls a second time, marked 91% as successful. The transcripts had not changed.

The drift was not bad faith; it was familiarity. As reviewers got used to the AI's phrasing, they became more generous on the conversation-hygiene dimension. We caught this by re-scoring a 5% sample of month-one calls in month three, and we corrected the headline number for it. The 91.4% reported above is the drift-corrected figure. The raw number was 92.1%.

If a vendor publishes an accuracy number without describing how they controlled for human grader drift, the number is suspect by default. We did not invent this problem; we just paid attention to it.
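The re-scoring check can be sketched as a comparison of two grading passes over the same calls. The verdict lists below are synthetic, shaped to match the 87% and 91% figures quoted above; the function names are hypothetical, and this sketch only measures drift. It does not reproduce whatever correction was applied to the headline number.

```python
def success_rate(verdicts: list[bool]) -> float:
    """Share of calls a reviewer marked as successful."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts) / len(verdicts)


def grader_drift_points(first_pass: list[bool], second_pass: list[bool]) -> float:
    """Change in success rate, in percentage points, when the same calls
    are scored a second time. Positive means graders became more generous."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must cover the same calls")
    return 100 * (success_rate(second_pass) - success_rate(first_pass))


# Synthetic verdicts for 100 re-scored calls (illustration only).
month_one_pass = [True] * 87 + [False] * 13
month_three_pass = [True] * 91 + [False] * 9

drift = grader_drift_points(month_one_pass, month_three_pass)
```

Because the transcripts are identical in both passes, any non-zero drift is attributable to the graders rather than to the AI.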

What we are measuring next

Three things, in priority order. First, accent-resolved accuracy within each major supported language — and a public commitment to publish it once the methodology stabilises. Second, post-call outcome accuracy: did the caller actually receive the thing they came for, measured 7 and 30 days after the call by reconciling with the customer's downstream system. Third, the cost-per-resolved-call number that combines accuracy with the per-minute economics — because a 98% accurate AI that costs $1.40/min loses to a 92% accurate AI that costs $0.18/min, for almost any real business.
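A back-of-the-envelope version of the cost-per-resolved-call comparison, using the two price and accuracy points from the paragraph above. The formula is a simplifying assumption: it spreads the cost of every call over the calls that were actually resolved and ignores the cost of human follow-up on misses. The 3-minute average call length is hypothetical.

```python
def cost_per_resolved_call(
    accuracy: float, cost_per_minute: float, avg_minutes: float
) -> float:
    """Total per-call spend divided by the share of calls resolved."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    if cost_per_minute < 0 or avg_minutes <= 0:
        raise ValueError("cost must be non-negative and minutes positive")
    return cost_per_minute * avg_minutes / accuracy


AVG_MINUTES = 3.0  # hypothetical average call length

premium = cost_per_resolved_call(0.98, 1.40, AVG_MINUTES)
budget = cost_per_resolved_call(0.92, 0.18, AVG_MINUTES)

# The ratio does not depend on call length, since it cancels out.
ratio = premium / budget
```

Under these assumptions the more accurate option costs several times as much per resolved call, so the cheaper option stays ahead unless each missed call is very expensive to clean up.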

We will publish the next set of numbers in Q3. If they show regression we will publish that too. The point of measuring this stuff is not to find a flattering chart; it is to find the calls we are still getting wrong.
