Introduction
For two decades the contact-center industry agreed on a process: a supervisor would randomly sample a small percentage of recorded calls — typically 2 to 5 per agent per week — score them against a 30-row rubric, and roll the scores up into a coaching report. The whole industry was built around this practice. Vendors sold scorecards. Auditors validated rubrics. Conferences had tracks for it.
The math never worked. A 1,000-agent contact center handling 200,000 calls a week, sampling 2% per agent, gives you 4,000 scored calls, which sounds like a lot until you realise that the sample is uniformly random across calls that are 99% benign. Sampling 4,000 random conversations to find the 60 calls where the agent committed compliance violations is the definition of a needle in a haystack. By the time the haystack has been sorted, the agent has moved on, the customer has churned, and the violation has compounded.
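To put a number on the haystack, here is a rough back-of-the-envelope sketch using the figures above (60 violating calls in the week, a 2% uniform random sample). The function names are ours, and the calculation assumes each call is sampled independently at the stated rate, which is a close approximation when the sample is small relative to total volume.

```python
def expected_hits(violations: int, sample_rate: float) -> float:
    """Expected number of violating calls that land in a uniform random sample."""
    return violations * sample_rate


def prob_no_hits(violations: int, sample_rate: float) -> float:
    """Chance that the sample contains none of the violating calls.

    Assumes each call is sampled independently with probability sample_rate.
    """
    return (1.0 - sample_rate) ** violations


print(f"expected violations sampled: {expected_hits(60, 0.02):.1f}")  # 1.2
print(f"chance of sampling none:     {prob_no_hits(60, 0.02):.0%}")   # 30%
```

Under these assumptions, a week of random sampling surfaces about one of the 60 violating calls, and in roughly three weeks out of ten it surfaces none.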
The point of QA was never to score the average call. It was to find the calls that mattered — the failures, the saves, the edge cases, the conversations where the agent did something exceptional or alarming. Random sampling is structurally bad at finding any of those.
What changed in the past 18 months
Two things changed enough to make the old model obviously broken. First, transcription quality crossed a usefulness threshold — across most major languages, transcripts are now reliable enough to score with software, not just with human ears. Second, large language models got good enough at supervised classification that a custom-trained model can score a call on 30 dimensions in 4 seconds for less than a penny.
Together these meant that for the first time in contact center history, every single call could be scored against the rubric — not 2%, not 5%, but 100%. Once every call is scored, the question stops being "which calls do we sample?" and becomes "which calls deserve a human supervisor's attention?"
That is the question worth solving, and it is a ranking problem, not a sampling problem.
What goes into the risk score
We score every call on three independent dimensions, then combine them into a single rank that the supervisor sees in their console.
- Compliance risk — did the call contain language that triggers regulatory exposure? Mini-Miranda, TCPA consent, debt-collection FDCPA boundaries, protected health disclosures. This dimension is binary-ish — most calls score near zero, a small tail scores high.
- Outcome risk — is this customer likely to churn, complain, or escalate as a result of this call? Combines sentiment trajectory, unresolved-issue signals, explicit complaint language, and the customer's account-tier value.
- Coaching value — would a supervisor watching this call learn something they can teach? High-rank coaching calls are the unusual saves, the clean handoffs, and the controlled de-escalations. They are not failures; they are exemplars.
The three dimensions are not weighted equally and the weights are not the same across customers. A debt-collection operator weights compliance risk heaviest. A high-touch enterprise SaaS support team weights outcome risk. A training-heavy onboarding team weights coaching value. The weights are exposed in the supervisor settings and we set sensible defaults per industry.
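As an illustration only, one plausible way to combine the three dimensions is a weighted average. The sketch below assumes each sub-score is normalized to the range 0 to 1. The weight values, industry keys, and names such as `SubScores` and `combined_score` are hypothetical and are not the product's actual defaults or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubScores:
    # Each sub-score is assumed to be normalized to [0, 1].
    compliance: float
    outcome: float
    coaching: float


# Hypothetical per-industry defaults, chosen only to show the shape of the idea.
DEFAULT_WEIGHTS = {
    "debt_collection": {"compliance": 0.6, "outcome": 0.3, "coaching": 0.1},
    "enterprise_saas": {"compliance": 0.2, "outcome": 0.6, "coaching": 0.2},
    "onboarding": {"compliance": 0.2, "outcome": 0.2, "coaching": 0.6},
}


def combined_score(scores: SubScores, weights: dict[str, float]) -> float:
    """Weighted average of the three sub-scores; weights need not sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = (
        weights["compliance"] * scores.compliance
        + weights["outcome"] * scores.outcome
        + weights["coaching"] * scores.coaching
    )
    return weighted / total


call = SubScores(compliance=0.9, outcome=0.2, coaching=0.1)
for industry, weights in DEFAULT_WEIGHTS.items():
    print(industry, round(combined_score(call, weights), 2))
```

With these made-up weights, the same call ranks much higher for a debt-collection operator than for an onboarding team, which is the behaviour the per-customer weights are meant to produce.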
The supervisor console: a ranked queue
The console looks deliberately different from the old "random sample queue" UI. Instead of a paginated list of recent calls, it is a single ranked queue, sorted by combined risk score, refreshing every two minutes. The top of the queue is the 20 or so calls that need attention today. Everything below is a long tail.
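The "top of the queue" idea can be sketched in a few lines. This is a minimal illustration, not the console's implementation: `ScoredCall`, `combined_risk`, and `top_of_queue` are hypothetical names, and the combined score is assumed to be already computed for each call.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredCall:
    call_id: str
    combined_risk: float  # assumed to be precomputed per call


def top_of_queue(calls, limit=20):
    """Return the `limit` highest-risk calls, highest risk first."""
    return heapq.nlargest(limit, calls, key=lambda c: c.combined_risk)


calls = [
    ScoredCall("a", 0.12),
    ScoredCall("b", 0.91),
    ScoredCall("c", 0.47),
    ScoredCall("d", 0.05),
]
print([c.call_id for c in top_of_queue(calls, limit=2)])  # ['b', 'c']
```

Selecting only the top slice avoids sorting the whole long tail on every refresh, since the supervisor only ever works from the head of the queue.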
Each call card carries the three sub-scores, a 90-second summary of what the call was about, and the specific snippets the model flagged. A supervisor can listen to the snippet without listening to the full call. If the snippet is the whole story — and most of the time it is — the supervisor confirms or rejects the flag in 30 seconds and moves on.
Random-sample QA used to take 8 minutes per call on average. The ranked queue averages 2 minutes 40 seconds, so supervisor time per reviewed call has fallen by about two thirds. Coverage has gone from 2% to 100%. The supervisor is spending their time on the calls that actually move the metric.
The objections we heard
When we showed the ranked queue to QA leaders at the design-partner customers, three objections came up consistently.
Objection 1: "The model will miss things a human would catch."
True in the abstract; mostly false in practice. The model misses things a human specialist would catch: a compliance auditor reviewing the same call could find subtleties the model does not flag. But the comparison is not against a specialist. It is against a supervisor randomly sampling 2% of calls. The model reviews 100% and flags the obvious 5%. The specialist reviews the 5%. Net coverage is far better than under the old system.
Objection 2: "Agents will game the score."
Probably true. Any metric that gets attention gets gamed. The defence is twofold: the score is multi-dimensional, so gaming one axis pushes you up another; and the score is not the agent's performance review. The score is a triage signal for the supervisor. The performance review is what the supervisor concludes after they listen to the actual call. We sell the score as a queue, not a scoreboard.
Objection 3: "It changes the supervisor's job."
Also true, and we should be honest about it. The supervisor of a ranked-queue contact center spends less time scoring calls against a 30-row rubric and more time coaching, escalating, and intervening. The supervisors who liked the old job — the methodical, scorecard-driven part of it — are not necessarily thrilled with the new one. We talked openly to design-partner ops leaders about this before launch. It is a workflow change, not just a tool change.
What the new system found in the first 90 days
Across the four design partners who ran the ranked queue exclusively for 90 days, the supervisor teams escalated 4.7x more calls per week to retention or compliance than they had under random sampling. The increase was not a "more scrutiny" effect; it was almost entirely calls that random sampling had structurally missed.
One customer caught a debt-collection script drift — a single agent had started using language that crossed an FDCPA boundary, on roughly 9% of their calls — that random sampling had not surfaced for 11 weeks. The ranked queue surfaced it within 48 hours of the drift starting, because the compliance-risk score on those specific calls jumped two standard deviations above the agent's baseline.
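The "two standard deviations above the agent's baseline" check is a z-score test. The sketch below is a minimal version of that idea with hypothetical names and made-up numbers; it assumes the baseline is a list of the agent's recent compliance-risk scores and uses the sample standard deviation.

```python
from statistics import mean, stdev


def z_score(value, baseline):
    """How many sample standard deviations `value` sits above the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline scores")
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation")
    return (value - mean(baseline)) / spread


def is_drift(value, baseline, threshold=2.0):
    """Flag a call whose score is at least `threshold` deviations above baseline."""
    return z_score(value, baseline) >= threshold


# Made-up baseline: an agent whose compliance-risk scores normally sit near zero.
baseline = [0.02, 0.04, 0.03, 0.05, 0.01]
print(is_drift(0.40, baseline))  # True: far outside the agent's normal range
print(is_drift(0.04, baseline))  # False: within normal variation
```

Comparing each call against the agent's own history, rather than a global cutoff, is what lets a change in one agent's language stand out even when the absolute scores are modest.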
Another customer found that the agents the random-sample system had flagged as the lowest performers were not, in fact, the lowest performers; they were just the loudest. The ranked queue, which scores every call for outcome risk, identified two quiet but consistently underperforming agents whose calls had been sampled at the same rate as everyone else's and had landed in the "average" pile.
What we deliberately will not do with the score
Three things, on purpose.
We will not auto-score agents on the queue's output. The score is a triage signal. Agent performance ratings still go through a supervisor. We are deliberately keeping the human in the loop, not because the model cannot do it, but because we have watched enough auto-scored systems erode trust to want to avoid the pattern entirely.
We will not surface the agent's real-time score to the agent during the call. There is a school of thought that says agents should see their own coaching scores in real time. There is a stronger school of thought that says doing so degrades the call. We side with the second school.
We will not export the score to performance-management systems by default. Customers who want to wire the score into a separate HR tool can do so with explicit configuration, but we make the default the safer choice. If you want a number on a spreadsheet, you should have to ask for it on purpose.