Introduction
For about a decade, conversation-intelligence vendors have been selling a single number per call: positive, neutral, or negative sentiment, usually on a -1 to +1 scale. The number is the headline of every QA dashboard. The number is what shows up in the weekly slide for the VP. The number is, in our experience, almost useless.
Consider two calls. Call A starts at -0.6 sentiment — angry customer, hostile opener — and ends at +0.4 sentiment after the agent solves the problem and earns a thank-you. Call B starts at +0.2 and slowly degrades to -0.3 as the customer realises their issue is not going to be resolved on this call. Aggregate sentiment for Call A is mildly negative. Aggregate sentiment for Call B is mildly negative. They are the same number. They are not the same call.
The single-score model treats sentiment as a property of the call, when it is actually a property of every moment of the call. The aggregation is what destroys the information.
What the trajectory model captures
We score sentiment at the utterance level — every speaker turn gets its own value. Then we treat the resulting series as a time signal, not a number to be averaged.
From the time signal we extract four features that actually correlate with downstream outcomes:
- Start value — the sentiment of the first 30 seconds. Captures how the customer arrived.
- End value — the sentiment of the last 60 seconds. Captures how the customer left.
- Minimum — the deepest dip in the call. Captures whether the conversation hit a friction point.
- Slope — the linear trend across the full call. Captures whether the conversation was getting better or worse.
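The four features above can be sketched in a few lines of plain Python. This is a minimal illustration, not the production pipeline: the `Utterance` and `TrajectoryFeatures` types, the function names, the choice to average scores inside the 30-second and 60-second windows, and the fallback to the first or last turn when a window is empty are all assumptions made for the example. The slope is an ordinary least-squares fit of score against time.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Utterance:
    start_sec: float  # offset of the speaker turn from the start of the call
    score: float      # utterance-level sentiment on the -1 to +1 scale


@dataclass(frozen=True)
class TrajectoryFeatures:
    start: float
    end: float
    minimum: float
    slope: float  # sentiment units per second


def ols_slope(xs, ys):
    """Least-squares slope of ys against xs; 0.0 if xs has no spread."""
    x_bar, y_bar = mean(xs), mean(ys)
    spread = sum((x - x_bar) ** 2 for x in xs)
    if spread == 0:
        return 0.0
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / spread


def extract_features(utterances, call_length_sec):
    """Hypothetical helper: reduce a scored call to the four features."""
    if not utterances:
        raise ValueError("need at least one scored utterance")
    turns = sorted(utterances, key=lambda u: u.start_sec)
    # Assumption: a window's value is the mean of the turns that begin in it.
    head = [u.score for u in turns if u.start_sec < 30.0]
    tail = [u.score for u in turns if u.start_sec >= call_length_sec - 60.0]
    return TrajectoryFeatures(
        start=mean(head) if head else turns[0].score,
        end=mean(tail) if tail else turns[-1].score,
        minimum=min(u.score for u in turns),
        slope=ols_slope([u.start_sec for u in turns], [u.score for u in turns]),
    )


# Made-up scores for a five-minute call that opens hostile and recovers.
call = [Utterance(0, -0.6), Utterance(20, -0.6), Utterance(60, -0.7),
        Utterance(120, -0.2), Utterance(200, 0.2), Utterance(280, 0.4)]
features = extract_features(call, call_length_sec=300)
print(features)
```

Fitting the slope against timestamps rather than turn index keeps a burst of short turns from dominating the trend, though either choice is defensible.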
A call with a low start, a low minimum, and a high end with positive slope is a "save": the agent took a hostile opener and resolved it. A call with a high start, a moderate minimum, and a low end is a "quiet fade": the customer was reasonable, the agent didn't solve the problem, and the customer's patience ran out. Both can land on the same average sentiment. Different calls. Different coaching implications.
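As a rough illustration, the two patterns can be written as a rule over the four features. The function name `label_call` and the cut-offs `low` and `high` are hypothetical values chosen for this sketch, not tuned thresholds, and the rule does not try to encode what counts as a "moderate" minimum.

```python
def label_call(start, minimum, end, slope, low=-0.3, high=0.2):
    """Coarse label from the four trajectory features.

    `low` and `high` are illustrative assumptions, not tuned values.
    """
    if start <= low and minimum <= low and end >= high and slope > 0:
        return "save"
    if start >= high and end <= low and slope < 0:
        return "quiet fade"
    return "other"


# Call A and Call B from the introduction; the slope magnitudes are made up,
# only their signs matter here.
print(label_call(start=-0.6, minimum=-0.6, end=0.4, slope=0.004))   # save
print(label_call(start=0.2, minimum=-0.3, end=-0.3, slope=-0.002))  # quiet fade
```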
What trajectory actually predicts
The reason to bother with trajectory is that the features predict things the single score does not. Specifically, we tracked four post-call outcomes against trajectory features across 240,000 calls at three design-partner customers and found:
- 30-day churn correlated more strongly with end value and slope than with average sentiment. A customer whose call had a negative slope was 2.3x more likely to churn within 30 days than a customer with the same average sentiment but a positive slope.
- Repeat-contact rate (whether the customer called back within 7 days) correlated almost entirely with end value, regardless of how the call started.
- CSAT survey response (where surveys were sent) correlated with end value and the minimum. The minimum is what people remember most clearly when they fill in a survey three days later.
- Agent attrition risk — separately — correlated with the agent's average minimum across their last 50 calls. Agents who consistently encountered deep dips, even on calls that recovered, were 1.8x more likely to leave the role within six months.
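The agent-level signal in the last bullet is a rolling statistic: the mean of the per-call minimum over an agent's most recent 50 calls. A minimal sketch, assuming a hypothetical `AgentDipTracker` class that is fed one minimum per completed call:

```python
from collections import deque
from statistics import mean


class AgentDipTracker:
    """Hypothetical tracker for one agent's rolling average minimum."""

    def __init__(self, window=50):
        # deque with maxlen drops the oldest call once the window is full
        self._minima = deque(maxlen=window)

    def record_call(self, call_minimum):
        self._minima.append(call_minimum)

    def average_minimum(self):
        """Mean of the stored minima, or None before any call is recorded."""
        return mean(self._minima) if self._minima else None


tracker = AgentDipTracker(window=50)
for call_minimum in (-0.2, -0.7, -0.5, -0.6):  # made-up values
    tracker.record_call(call_minimum)
print(tracker.average_minimum())
```

Calls that recovered still contribute their dip, which matches the observation that deep dips matter for the agent even when the call ends well.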
Coaching the saves, not just flagging the failures
The biggest workflow change trajectory enables is on the positive side. Traditional QA flags the worst calls. The best calls — the saves, the de-escalations, the moments where an agent took -0.7 sentiment and walked it back to +0.5 — are invisible to a system that only knows the average.
In a trajectory-aware queue, saves rise to the top of the coaching feed alongside the failures. A supervisor running a Friday-afternoon stand-up has both — "here is the call where Maria turned a hostile customer around in 90 seconds, listen to the phrasing she used at minute 2" — and the team learns from the exemplar, not just from the cautionary tale.
This is the more important half of the value. The failures coach the floor. The saves coach the culture.
Trajectory has pitfalls too
Two failure modes are worth naming, because we have walked into both of them.
The first is that sentiment is hard to score at utterance level for some languages and tones. Sarcasm is the obvious case — "that's just great" reads as positive on a literal score and negative in context. We have a specific sub-model for sarcasm and irony, and it still gets it wrong 12% of the time. The trajectory plot for a sarcastic customer looks like a roller-coaster, and the four extracted features become unreliable. We mark these calls and route them to a human reviewer rather than letting the trajectory feed into downstream metrics.
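One way to express that routing step is a gate that runs before any feature reaches downstream metrics. The sketch below is an assumption about shape only: it supposes the sarcasm sub-model emits a per-utterance probability, and the function name `route_call` and both thresholds are hypothetical.

```python
def route_call(sarcasm_probs, flag_threshold=0.5, max_flagged_share=0.1):
    """Return "human_review" or "metrics" for one call.

    sarcasm_probs: per-utterance sarcasm probabilities (assumed model output).
    Both thresholds are illustrative assumptions, not tuned values.
    """
    if not sarcasm_probs:
        return "metrics"
    flagged = sum(1 for p in sarcasm_probs if p >= flag_threshold)
    if flagged / len(sarcasm_probs) > max_flagged_share:
        return "human_review"
    return "metrics"


print(route_call([0.1, 0.2, 0.9, 0.8]))  # human_review
print(route_call([0.1] * 10))            # metrics
```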
The second is over-coaching on the slope feature. If supervisors learn that "positive slope is good," they will coach agents to aggressively end calls on a positive note — which is rational in a customer-experience sense but can lead to artificially upbeat closes that mask unresolved issues. The slope is a signal. It is not a target.
What we shipped because of this
Three product changes came out of moving from single-score to trajectory.
The supervisor console now shows a small trajectory sparkline next to every call summary, with the four features called out as numbers. The sparkline is more information-dense than any single score, and supervisors learned to read it within a day.
The post-call coaching report includes a "trajectory snippet" — a 20-second audio clip pulled from the inflection point of the call, where sentiment changed direction. The inflection point is, almost always, the moment that decided the call. Listening to 20 seconds at the inflection point is roughly as informative as listening to the whole call.
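A simple way to locate such a point is to look for the turn where consecutive score changes flip sign and keep the flip with the largest swing. The sketch below takes that approach; it is an assumption about how the snippet could be chosen, not a description of the shipped logic, and `find_turning_point` and `clip_window` are hypothetical names. The clip is a 20-second window centred on the chosen turn and clamped to the call boundaries.

```python
def find_turning_point(scores):
    """Index of the turn where sentiment reverses with the largest swing.

    Returns None if the series never changes direction.
    """
    best_idx, best_swing = None, 0.0
    for i in range(1, len(scores) - 1):
        before = scores[i] - scores[i - 1]
        after = scores[i + 1] - scores[i]
        if before * after < 0:  # direction flipped at turn i
            swing = abs(before) + abs(after)
            if swing > best_swing:
                best_idx, best_swing = i, swing
    return best_idx


def clip_window(center_sec, call_length_sec, clip_sec=20.0):
    """(start, end) of a clip centred on center_sec, kept inside the call."""
    latest_start = max(call_length_sec - clip_sec, 0.0)
    start = min(max(center_sec - clip_sec / 2, 0.0), latest_start)
    return start, min(start + clip_sec, call_length_sec)


times = [0, 30, 60, 90, 120, 150]             # made-up turn offsets in seconds
scores = [-0.2, -0.5, -0.7, -0.1, 0.3, 0.4]   # made-up utterance scores
idx = find_turning_point(scores)
print(idx, clip_window(times[idx], call_length_sec=180))  # 2 (50.0, 70.0)
```

Raw utterance scores are noisy, so a real implementation would probably smooth the series before looking for reversals.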
The customer-success dashboard reports churn risk using the four trajectory features as inputs, not aggregated sentiment. Customers with a negative end value and a negative slope are flagged for proactive outreach within 24 hours. The new model has caught 41% more at-risk accounts than the previous one, with the same precision.
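The outreach flag itself is a two-condition rule. A minimal sketch, assuming a hypothetical `outreach_deadline` helper that receives the end value, the slope, and the time the call ended; the churn-risk model that consumes all four features is not shown.

```python
from datetime import datetime, timedelta


def outreach_deadline(end_value, slope, call_ended_at):
    """Deadline for proactive outreach, or None if the call is not flagged."""
    if end_value < 0 and slope < 0:
        return call_ended_at + timedelta(hours=24)
    return None


ended = datetime(2024, 1, 1, 12, 0)  # made-up timestamp
print(outreach_deadline(-0.3, -0.002, ended))  # 2024-01-02 12:00:00
print(outreach_deadline(0.4, 0.004, ended))    # None
```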
What is left to do
The single biggest open problem is multi-issue calls — where a customer raises two or three independent issues during the same call, and each has its own sentiment arc. The current trajectory model collapses them into a single series, which loses information when one issue is resolved well and another is not. We are working on per-issue trajectory decomposition; it is harder than it sounds because it requires the system to know where one issue ends and the next begins, which is itself a hard problem.
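To make the information loss concrete, the sketch below assumes the hard part is already solved: the issue boundaries are handed in as timestamps. Given that, splitting the series and looking at each segment separately is straightforward. The names `split_by_issue` and `net_change` are hypothetical, and the data is made up.

```python
from bisect import bisect_right


def split_by_issue(turns, boundaries):
    """Split (start_sec, score) turns into one segment per issue.

    boundaries: sorted start times of the second and later issues,
    assumed to be known in advance for this illustration.
    """
    segments = [[] for _ in range(len(boundaries) + 1)]
    for start_sec, score in turns:
        segments[bisect_right(boundaries, start_sec)].append((start_sec, score))
    return segments


def net_change(segment):
    """Last score minus first score within one segment."""
    return segment[-1][1] - segment[0][1]


turns = [(0, -0.4), (40, -0.1), (80, 0.3),      # issue 1: recovers
         (120, 0.1), (160, -0.2), (200, -0.5)]  # issue 2: degrades
first, second = split_by_issue(turns, boundaries=[100])
print(round(net_change(first), 2), round(net_change(second), 2))  # 0.7 -0.6
print(round(net_change(turns), 2))                                # -0.1
```

Taken as a single series, the call shows a small net decline; taken per issue, it shows one clear recovery and one clear failure.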
In the meantime, the single-score model is still wrong, and the four-feature trajectory model is still right. We will take the partial fix over the elegant-but-unmeasured original any day.