If you ran a call center in 2022, the AI conversation was about chatbots. Slow, scripted, easy to break, useful for FAQs and not much else. In 2026 the conversation is different: real voice agents are handling tier-1 inbound at production scale, and the unit economics have crossed the line.
We've shipped this stack into European call centers, telco operators, and enterprise customer service teams. This post is a tour of what's actually working, what isn't, and the architecture decisions that matter when you put a voice agent in front of a real customer.
The cost line has crossed
A reasonable production cost for an AI voice agent in 2026 is around £0.18 per call. A human agent in a UK call center costs roughly £3.20 per call once you include salary, benefits, supervision, and facilities. That's a 17× delta — and unlike the human number, the AI number is dropping every quarter.
What the architecture actually looks like
A production voice agent has six moving parts. Skip any of them and you'll ship something that demos beautifully and fails in week 1.
- Speech-to-text (STT) — Deepgram or OpenAI Realtime, low-latency streaming
- Language model — Claude or GPT-4o, with tool calling for system access
- Memory & context — short-term conversation memory + retrieval over your knowledge base
- Tool layer — typed functions that read/write to your CRM, billing, and ticketing
- Text-to-speech (TTS) — ElevenLabs or Cartesia, natural voice with the right brand profile
- Telephony — SIP/Twilio/Vonage, with proper barge-in and DTMF handling
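The six parts above can be sketched as a single configuration object. This is purely illustrative — there is no real SDK behind these names, and the vendors shown are just the options listed above:

```javascript
// Hypothetical wiring of the six components — names are illustrative, not a real SDK.
const stack = {
  stt:       { vendor: "Deepgram",   mode: "streaming" },          // low-latency speech-to-text
  llm:       { vendor: "Claude",     tools: ["lookupAccount", "createTicket"] },
  memory:    { shortTerm: "conversation buffer", retrieval: "knowledge-base search" },
  tools:     { systems: ["CRM", "billing", "ticketing"], typed: true },
  tts:       { vendor: "ElevenLabs", mode: "streaming" },          // audio starts before the LLM finishes
  telephony: { transport: "SIP",     bargeIn: true, dtmf: true },
};

// A deployment is only production-ready when all six layers are present.
const ready = Object.keys(stack).length === 6 && Object.values(stack).every(Boolean);
```

The point of writing it down like this: if any key is missing, you have a demo, not a product.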
The unsexy ones are the ones that break first. Telephony and barge-in handling are where most demos fall apart on a real network. Tool integration is where 'AI prototype' turns into 'project that runs out of budget'.
Six failure modes you have to design around
1. Latency that kills the conversation
Anything over 800ms end-to-end and humans hang up. Target sub-500ms. This means streaming STT, streaming LLM responses, and TTS that starts speaking before the model has finished thinking.
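A useful way to think about this is as a latency budget. The numbers below are assumptions for illustration, not measurements — the key idea is that you budget time-to-first-token and time-to-first-audio, never full completion:

```javascript
// Illustrative latency budget (ms) for a sub-500ms round trip — numbers are assumptions.
const budget = {
  network: 60,         // telephony leg, both directions
  stt: 120,            // streaming STT, time-to-final on a short utterance
  llmFirstToken: 180,  // time to FIRST token, not full completion
  ttsFirstAudio: 100,  // time to first audio chunk out of the TTS
};

const total = Object.values(budget).reduce((a, b) => a + b, 0); // 460ms, under the 500ms target
```

If any single stage eats 400ms on its own, no amount of tuning elsewhere saves the conversation.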
2. Hallucinated transfers and ticket numbers
LLMs love to invent things. 'Your ticket number is RMC-485219' — fabricated, not in any system. The fix: never let the model state IDs without a tool call that actually generates one. Constrain it.
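One way to enforce that constraint is a guard that runs on every outbound utterance before it reaches TTS: the agent may only speak an ID that a tool actually issued. The sketch below is illustrative — `createTicket` stands in for your real ticketing integration, and the ID format is made up:

```javascript
// Only IDs issued by a real tool call are allowed out of the agent's mouth.
const issuedIds = new Set();

// Stand-in for the real ticketing system — the SYSTEM generates the ID, never the model.
function createTicket(summary) {
  const id = `TCK-${Math.floor(100000 + Math.random() * 900000)}`;
  issuedIds.add(id);
  return id;
}

// Guard run on every outbound utterance before TTS.
function validateUtterance(text) {
  const claimed = text.match(/\b[A-Z]{3}-\d{6}\b/g) ?? [];
  return claimed.every((id) => issuedIds.has(id));
}
```

An utterance claiming an ID the system never issued gets blocked and regenerated, instead of reaching the customer.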
3. Accent and code-switching
STT models trained on US English drop accuracy on Indian English, Caribbean English, and any kind of code-switching. Pick a model that supports your customer base, and test on real call recordings before launch.
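"Test on real call recordings" concretely means computing word error rate (WER) per accent bucket against hand-made transcripts, and gating launch on the worst bucket, not the average. A minimal word-level WER implementation:

```javascript
// Word error rate: word-level Levenshtein distance divided by reference length.
function wordErrorRate(reference, hypothesis) {
  const r = reference.toLowerCase().split(/\s+/);
  const h = hypothesis.toLowerCase().split(/\s+/);
  // DP table: d[i][j] = edit distance between first i reference words and first j hypothesis words.
  const d = Array.from({ length: r.length + 1 }, (_, i) =>
    Array.from({ length: h.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= r.length; i++)
    for (let j = 1; j <= h.length; j++)
      d[i][j] = Math.min(
        d[i - 1][j] + 1,                                   // deletion
        d[i][j - 1] + 1,                                   // insertion
        d[i - 1][j - 1] + (r[i - 1] === h[j - 1] ? 0 : 1)  // substitution or match
      );
  return d[r.length][h.length] / r.length;
}
```

Run this per accent group before launch; a model that scores 5% WER on US English and 25% on Caribbean English is a model that will quietly fail a slice of your customers.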
4. Customer rage
When the customer is angry, the agent has to escalate fast and warm-transfer with full context. Don't make the customer repeat themselves. Don't make them wait. Don't have the AI argue.
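In practice that means two pieces of logic: a fast escalation trigger, and a context payload handed to the human so the customer never starts over. This sketch is illustrative — the sentiment score, thresholds, and field names are assumptions, not a fixed schema:

```javascript
// Escalate early: any one of these signals is enough — don't wait for all three.
function shouldEscalate(turn) {
  return turn.sentiment < -0.5 || turn.customerAskedForHuman || turn.failedToolCalls >= 2;
}

// Everything the human agent needs to pick up mid-conversation.
function warmTransferPayload(call) {
  return {
    summary: call.summary,          // what the customer wants, in one line
    transcript: call.transcript,    // full turns, so the customer never repeats themselves
    attemptedActions: call.toolLog, // what the AI already tried (and where it failed)
    sentiment: call.lastSentiment,
  };
}
```

The design choice that matters: escalation is an OR over signals, not an AND. A customer who asks for a human gets one, regardless of what the sentiment model thinks.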
5. Compliance and call recording
Two-party consent jurisdictions, GDPR, sector-specific recording rules. Build the consent prompt into the start of the call, log it, and keep an audit trail.
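A minimal version of "build it in, log it, keep an audit trail" looks like the sketch below. The prompt wording, field names, and in-memory log are illustrative only — the actual consent language and retention rules must come from your legal team, per jurisdiction:

```javascript
// Consent gate at the top of the call — recording starts only if this returns true.
// In production the audit log goes to durable, append-only storage, not an array.
const auditLog = [];

function handleConsent(callId, jurisdiction, accepted) {
  auditLog.push({
    callId,
    jurisdiction,
    accepted,
    prompt: "This call may be recorded for quality and training purposes.", // illustrative wording
    at: new Date().toISOString(),
  });
  return accepted;
}
```

The point is that consent is a logged event with a timestamp and the exact prompt played, not a boolean you infer later.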
6. Silent regressions
You ship, it works, three weeks later it's quietly worse. You need an evaluation harness running on real call samples — not just unit tests. Score every call, flag anomalies, alert when scores drop.
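The core of that harness is small: score every call, keep a rolling baseline, and alarm when the mean drifts. A minimal sketch, with an illustrative 0.05 drop threshold:

```javascript
// Mean quality score over a batch of scored calls (scores in [0, 1]).
function meanScore(scores) {
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}

// Compare the current window against the baseline; true means "page someone".
function regressionAlert(baselineScores, currentScores, maxDrop = 0.05) {
  const drop = meanScore(baselineScores) - meanScore(currentScores);
  return drop > maxDrop;
}
```

What produces the scores (LLM-as-judge, human review, resolution outcomes) is a separate decision; the part teams skip is this loop itself, which is why regressions stay silent for three weeks.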
The cost model in detail
```javascript
// Per-call cost (£) — production AI agent vs human agent
const human = 3.20;  // UK loaded cost, tier-1 inbound
const ai = 0.18;     // STT + LLM + TTS + telephony, 4-min average
const savings = human - ai;            // 3.02 per call
const ratio = human / ai;              // ~17.8x
const setupCost = 60000;               // illustrative build cost; substitute your own
const breakEven = setupCost / savings; // ≈ 19,900 calls to recoup the build
```

On a 100k-call-per-month operation, that's roughly £302k saved per month at full deflection. Real-world deflection is more like 40–60%, so call it £150k/month — still a number that justifies a serious engineering investment.
When this is the wrong answer
Voice AI is not the right answer for everything. It's not the right answer for high-emotional-load calls (bereavement, complaints with legal exposure, vulnerable customers). It's not right for ultra-low-volume calls where the build cost will never pay back. And it's not right when the underlying systems are too broken for the agent to read or write to.
What we'd do if we were starting today
- Pick the highest-volume, lowest-emotional-load call type as the pilot (balance, status, top-up)
- Ship a 2-week pilot on a small slice of real traffic, with humans in the loop
- Measure: deflection rate, resolution rate, escalation reasons, customer sentiment
- Tune for two weeks, then expand to next-highest-volume call type
- Run an evaluation harness from day one, not as an afterthought
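The "measure" step above reduces to four numbers per batch of scored calls. A sketch, with illustrative field names:

```javascript
// The four pilot metrics from a batch of scored calls — field names are illustrative.
function pilotMetrics(calls) {
  const n = calls.length;
  return {
    deflectionRate: calls.filter((c) => !c.escalated).length / n,  // handled without a human
    resolutionRate: calls.filter((c) => c.resolved).length / n,    // customer's issue actually solved
    escalationReasons: calls.filter((c) => c.escalated).map((c) => c.reason),
    meanSentiment: calls.reduce((a, c) => a + c.sentiment, 0) / n,
  };
}
```

Deflection and resolution are different numbers — a call the AI "handled" but didn't resolve is a callback waiting to happen, which is why both are on the list.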
If this sounds like the kind of work you need help with, that's exactly what we ship. The Call Center Automation Pack is 4–6 weeks, fixed price, production-ready.
Written by
RMC Engineering