Voice AI Agents: Building Natural Phone Interactions
Text-based AI agents are getting good. Voice AI agents are harder.
When you call a business and an AI answers, the bar is high. People expect phone calls to feel like phone calls—not like talking to a really slow computer.
Here's what we've learned about building voice AI agents that actually work.
Why Voice Is Hard
Voice AI has challenges that text AI doesn't:
- Real-time constraints: Responses must start within 1-2 seconds or the conversation feels broken. Text chat tolerates 5-10 second delays.
- No do-overs: In text, you can edit. In voice, you can't unsay something, so mistakes are more noticeable.
- Acoustic complexity: Background noise, accents, mumbling, interruptions. Text is clean; voice is messy.
- Emotional intensity: Phone calls often happen when people have urgent issues, so tone matters more.
- No visual cues: You can't show options, links, or images. Everything must be spoken.
These constraints make voice AI fundamentally different from chat AI.
Architecture Overview
A voice AI agent typically involves:
Phone System → Speech-to-Text → LLM Agent → Text-to-Speech → Phone System
- Phone System (inbound): telephony infrastructure
- Speech-to-Text: transcription engine
- LLM Agent: reasoning
- Text-to-Speech: synthesis engine
- Phone System (outbound): playback
Each component adds latency. Optimising end-to-end latency is critical.
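To make "each component adds latency" concrete, here is a minimal sketch of one conversational turn passing through the three middle stages, with each stage timed. The stage functions (`fake_stt`, `fake_llm`, `fake_tts`) are hypothetical stand-ins, not real vendor calls; in practice each would wrap whichever provider you use.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TurnTrace:
    """Per-stage latency for one conversational turn, in milliseconds."""
    stages: dict = field(default_factory=dict)

    @property
    def total_ms(self) -> float:
        return sum(self.stages.values())


def run_turn(audio, stt, llm, tts):
    """Run audio -> text -> reply -> speech, recording how long each stage takes."""
    trace = TurnTrace()

    def timed(name, stage, payload):
        start = time.perf_counter()
        result = stage(payload)
        trace.stages[name] = (time.perf_counter() - start) * 1000
        return result

    text = timed("stt", stt, audio)
    reply = timed("llm", llm, text)
    speech = timed("tts", tts, reply)
    return speech, trace


# Hypothetical stand-ins for real STT, LLM and TTS services.
def fake_stt(audio: bytes) -> str:
    return audio.decode()


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode()
```

Calling `run_turn(b"hello", fake_stt, fake_llm, fake_tts)` returns the synthesised bytes plus a trace whose `stages` dictionary shows where the time went. This sequential version is the slow baseline; a production pipeline would stream between stages rather than wait for each to finish.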
Speech-to-Text (STT)
Converts caller's speech to text for the LLM.
Options:
- Streaming STT: Transcribes as speaker talks (lower latency)
- Utterance-based STT: Waits for speaker to finish (higher accuracy)
For conversational agents, streaming is usually necessary.
Key considerations:
- Accuracy on your caller demographic (accents, vocabulary)
- Latency to first transcription
- Handling of background noise
- Ability to detect when speaker is done
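Detecting when the speaker is done is often called endpointing. Many STT services handle this for you, but the underlying idea can be sketched as "speech followed by enough continuous silence". The frame size, energy threshold and silence window below are illustrative assumptions, not recommended values.

```python
def find_end_of_speech(frame_energies, frame_ms=20,
                       energy_threshold=0.02, min_silence_ms=600):
    """Return the index of the frame where the utterance is judged complete.

    frame_energies: per-frame loudness values (e.g. RMS), one per frame_ms.
    Returns None if the caller has not yet finished (or has not started).
    """
    frames_needed = min_silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0          # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1         # only count silence after speech began
            if silent_run >= frames_needed:
                return index
    return None
```

The silence window is the trade-off to tune: a short window makes the agent feel quick but cuts off callers who pause mid-sentence, while a long window adds directly to response latency.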
LLM Processing
The agent logic is the same as for text agents, but with much tighter latency constraints.
Optimisations:
- Streaming responses: Start speaking before full response is generated
- Shorter responses: Callers tolerate long responses far less in voice than in text
- Faster models: Might sacrifice some capability for speed
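One common way to start speaking before the full response is generated is to cut the token stream into sentences and hand each sentence to TTS as soon as it is complete. The sketch below assumes tokens arrive as plain strings from some streaming LLM client; `sentences_from_tokens` is a name introduced here for illustration.

```python
import re
from typing import Iterable, Iterator

# A sentence is treated as complete once its end punctuation is followed by
# whitespace, so "3.5" mid-stream is not mistaken for a sentence break.
_SENTENCE_BREAK = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream contains them."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_BREAK.search(buffer)
            if match is None:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends
```

For example, `list(sentences_from_tokens(["Sure", ", I can", " update that.", " What's your", " order number?"]))` gives `["Sure, I can update that.", "What's your order number?"]`. The first sentence can be sent to TTS while the model is still generating the second.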
Text-to-Speech (TTS)
Converts agent response to spoken audio.
Options:
- Neural TTS: More natural, slightly slower
- Standard TTS: Less natural, faster
- Custom voices: Trained on your brand voice
Key considerations:
- Naturalness (avoid robotic)
- Latency to first audio
- Expressiveness (tone, pacing)
- Pronunciation of domain terms
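For pronunciation of domain terms, many TTS engines accept custom lexicons or SSML hints. A provider-independent fallback is to rewrite known terms before the text reaches TTS. The lexicon entries below are made-up examples, and `apply_lexicon` is a hypothetical helper, not part of any TTS library.

```python
import re

# Hypothetical domain lexicon: written form -> how it should be spoken.
LEXICON = {
    "NBN": "N B N",
    "kWh": "kilowatt hours",
}


def apply_lexicon(text: str, lexicon: dict) -> str:
    """Replace whole-word occurrences of lexicon terms with their spoken form."""
    if not lexicon:
        return text
    # Longest terms first so a short term never pre-empts a longer one.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda match: lexicon[match.group(1)], text)
```

With this lexicon, `apply_lexicon("Your NBN plan used 12 kWh.", LEXICON)` returns `"Your N B N plan used 12 kilowatt hours."`.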
Latency Budget
For conversational voice AI, target under 2 seconds end-to-end.
Typical breakdown:
Speech-to-Text: 200-500ms
LLM processing: 500-1500ms
Text-to-Speech: 200-400ms
Network overhead: 100-200ms
Total: 1000-2600ms
The upper end of that range overshoots the 2-second target, so to hit it:
- Stream everything (don't wait for complete outputs)
- Optimise model selection for speed
- Use edge deployment where possible
- Pre-compute common responses
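The budget arithmetic is worth keeping as code so it can be re-checked whenever a component changes. This sketch uses the figures from the breakdown above; the dictionary keys and function names are just illustrative.

```python
# (best case, worst case) in milliseconds, taken from the breakdown above.
BUDGET_MS = {
    "speech_to_text": (200, 500),
    "llm_processing": (500, 1500),
    "text_to_speech": (200, 400),
    "network_overhead": (100, 200),
}
TARGET_MS = 2000


def total_range(budget):
    """Sum the best-case and worst-case latency across all stages."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def overshoot_ms(budget, target_ms):
    """How far the worst case exceeds the target (0 if it fits)."""
    _, worst = total_range(budget)
    return max(0, worst - target_ms)


def biggest_contributor(budget):
    """Stage with the largest worst-case latency: the first place to optimise."""
    return max(budget, key=lambda stage: budget[stage][1])
```

Here `total_range(BUDGET_MS)` gives `(1000, 2600)`, the worst case overshoots the target by 600 ms, and the largest contributor is LLM processing, which is why model selection and streaming get so much attention.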
Conversation Design for Voice
Voice conversations need a different design from text conversations.
Shorter Responses
Text: "I can help you with that. To update your delivery address, I'll need your order number. Once I have that, I can look up your current shipping details and make the change for you. Would you like to proceed?"
Voice: "Sure, I can update that. What's your order number?"
Be concise. Voice users can't skim.
Clear Turn-Taking
In text, simultaneous messages work fine. In voice, two parties talking over each other means neither is understood.
Design for clear turn-taking:
- Short pauses after questions
- Clear conversation structure
- Explicit handoffs ("Go ahead")
- Handle interruptions gracefully
Barge-In Handling
When caller interrupts mid-response:
Options:
- Stop immediately and listen (feels responsive)
- Finish current sentence then listen (feels smoother)
- Continue until critical info delivered (risky)
Usually: Stop immediately. Callers interrupt for a reason.
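The "stop immediately" option can be sketched with `asyncio`: playback runs as a task and is cancelled the moment the caller is detected speaking. Here `speak` is a stand-in for real TTS playback, and the `caller_spoke` event is assumed to be set by your voice-activity detection.

```python
import asyncio


async def speak(sentences, spoken):
    """Stand-in for TTS playback: 'plays' one sentence at a time."""
    for sentence in sentences:
        spoken.append(sentence)
        await asyncio.sleep(0.05)  # pretend each sentence takes 50 ms to play


async def speak_with_barge_in(sentences, caller_spoke: asyncio.Event):
    """Play sentences, but stop as soon as caller_spoke is set."""
    spoken = []
    playback = asyncio.create_task(speak(sentences, spoken))
    interrupt = asyncio.create_task(caller_spoke.wait())
    await asyncio.wait({playback, interrupt},
                       return_when=asyncio.FIRST_COMPLETED)
    for task in (playback, interrupt):
        task.cancel()  # no-op for whichever task already finished
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return spoken


async def demo(interrupt_after=None):
    """Run playback, optionally simulating a caller interruption."""
    caller_spoke = asyncio.Event()
    if interrupt_after is not None:
        asyncio.get_running_loop().call_later(interrupt_after, caller_spoke.set)
    script = ["One.", "Two.", "Three.", "Four.", "Five."]
    return await speak_with_barge_in(script, caller_spoke)
```

Running `asyncio.run(demo(interrupt_after=0.12))` returns only the sentences that started before the interruption, while `asyncio.run(demo())` plays all five. Knowing which sentences were actually spoken matters, because the agent should not assume the caller heard the part that was cut off.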
Confirmation Without Tedium
Text: Easy to display what the agent heard for confirmation.
Voice: Reading everything back gets tedious.
Balance:
- Confirm critical info explicitly ("That's 0-4-1-2, 3-4-5, 6-7-8?")
- Summarise actions ("I've updated your address to 45 Smith Street, Newtown")
- Don't read back obvious inputs
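Reading digits back in small groups, as in the phone number example above, is easy to automate. This sketch assumes a 10-digit number grouped 4-3-3; the grouping is a parameter because other number formats will need different groups.

```python
def read_back_digits(number: str, groups=(4, 3, 3)) -> str:
    """Format a number for spoken confirmation, e.g. '0-4-1-2, 3-4-5, 6-7-8'.

    Non-digit characters (spaces, dashes) in the input are ignored.
    """
    digits = [ch for ch in number if ch.isdigit()]
    if len(digits) != sum(groups):
        raise ValueError(f"expected {sum(groups)} digits, got {len(digits)}")
    chunks = []
    position = 0
    for size in groups:
        chunks.append("-".join(digits[position:position + size]))
        position += size
    return ", ".join(chunks)
```

The hyphens and commas give the TTS engine a cue to pause between digits and between groups, which is how people naturally read numbers aloud. How strongly an engine honours that punctuation varies, so it is worth listening to the result.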
Error Recovery
When the agent misunderstands:
Bad:
Agent: "I didn't understand that. Please try again."
Caller: [Gives up]
Good:
Agent: "I heard you say you want to cancel. Did you mean cancel your order, or something else?"
Caller: "No, I want to check my balance"
Agent: "Got it—checking your balance. One moment."
Offer what was heard. Let caller correct.
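One way to implement "offer what was heard" is to choose the reply based on how confident the intent match is. The sketch assumes your intent classifier returns candidate phrases with scores between 0 and 1; the thresholds are illustrative and would need tuning on real calls.

```python
def recovery_prompt(candidates, confident=0.8, plausible=0.4):
    """Pick a reply from (intent_phrase, score) candidates.

    intent_phrase is a verb phrase such as "cancel your order".
    """
    if candidates:
        phrase, score = max(candidates, key=lambda item: item[1])
        if score >= confident:
            return f"Got it, I'll {phrase}. One moment."
        if score >= plausible:
            # Offer what was heard so the caller can confirm or correct it.
            return (f"I heard that you want to {phrase}. "
                    "Is that right, or is it something else?")
    # Nothing usable was heard: ask an open question instead of "try again".
    return "Sorry, I missed that. What can I help you with?"
```

For example, `recovery_prompt([("cancel your order", 0.55), ("check your balance", 0.30)])` returns a confirmation question about cancelling rather than a flat "I didn't understand that".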
Use Cases That Work Well
Appointment Scheduling
- Clear structure (date, time, service type)
- Limited options (constrained responses work well)
- Confirmation is natural ("Thursday at 2pm?")
Order Status
- Quick lookup, quick response
- Caller usually knows order number
- Straightforward information delivery
FAQ and Information
- Caller has specific question
- Agent retrieves and speaks answer
- Works if knowledge base is good
Triage and Routing
- Understand caller intent
- Route to appropriate queue or resource
- Gather context before handoff
Use Cases That Struggle
Complex Problem Solving
Multiple rounds of troubleshooting work better in text, where the caller can follow along.
Emotional Situations
Complaints, cancellations, bad news. Often better handled by humans.
Long Information Collection
Forms with many fields. Voice fatigue sets in. Consider moving to text or callback.
Ambiguous Requests
Text agents can ask clarifying questions without feeling slow. Voice clarifications feel like the system doesn't work.
Testing Voice Agents
Acoustic Testing
- Various accents and speech patterns
- Background noise (office, car, street)
- Audio quality issues (bad connection, speakerphone)
- Edge cases (coughing, long pauses, um/uh)
Conversation Flow Testing
- Complete task flows
- Error recovery paths
- Interruption handling
- Timeout behaviour
Latency Testing
- Measure end-to-end latency under load
- Test across connection types
- Identify bottlenecks
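When measuring latency under load, averages hide the slow calls that callers actually notice, so it helps to report percentiles and the share of turns over target. A minimal sketch using Python's standard library, assuming you already collect end-to-end latency per turn in milliseconds:

```python
import statistics


def latency_report(samples_ms, target_ms=2000):
    """Summarise end-to-end turn latencies (needs at least two samples)."""
    # n=100 yields 99 cut points; index 49 is the median, index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    over = sum(1 for sample in samples_ms if sample > target_ms)
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "over_target": over / len(samples_ms),
    }
```

With samples of `[800, 900, 1000, 1200, 1500, 1800, 2100, 2600, 950, 1100]`, the median is 1150 ms, which looks comfortable, yet 20% of turns still miss the 2-second target.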
User Testing
- Real users, not just team members
- Record and analyse conversations
- Track completion rates and time
Integration Considerations
Telephony Platforms
Options:
- Twilio, Vonage, or Bandwidth for API-based telephony
- Amazon Connect or Google Contact Center AI for a full contact-centre solution
- On-premise integration with SIP trunking
Handoff to Humans
When the voice agent can't help:
- Warm transfer (agent stays on, introduces)
- Cold transfer (direct to queue)
- Callback arrangement (human calls back)
Pass context to the human agent so the caller doesn't have to repeat themselves.
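What "context" means will depend on your contact-centre tooling, but a small structured record is a reasonable starting point. The field names below are assumptions for illustration, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HandoffContext:
    """What the voice agent learned before handing the call to a human."""
    intent: str                    # verb phrase, e.g. "change a delivery address"
    transfer_type: str             # "warm", "cold" or "callback"
    collected: dict = field(default_factory=dict)   # details already gathered
    attempts: list = field(default_factory=list)    # what the agent already tried


def agent_briefing(ctx: HandoffContext) -> str:
    """One-line summary for the human agent's screen or a warm-transfer intro."""
    details = ", ".join(f"{key}: {value}" for key, value in ctx.collected.items())
    tried = "; ".join(ctx.attempts)
    return (f"Caller wants to {ctx.intent}. "
            f"Known details: {details or 'none yet'}. "
            f"Already tried: {tried or 'nothing'}.")


def to_payload(ctx: HandoffContext) -> str:
    """Serialise the context as JSON to attach to the transferred call."""
    return json.dumps(asdict(ctx))
```

The same record serves all three transfer types: spoken aloud in a warm transfer, attached to the queue entry in a cold transfer, or stored with the callback request.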
Recording and Compliance
Voice calls often have:
- Recording requirements (or prohibitions)
- Consent requirements ("This call may be recorded")
- Data retention obligations
- Accessibility requirements
Build compliance in from the start.
Our Voice AI Approach
We've built voice-enabled customer service systems that handle real calls. The key principles:
- Optimise for speed (latency kills voice UX)
- Design for conversation (not just Q&A)
- Plan for handoffs (voice can't do everything)
- Test extensively with real acoustic conditions
Voice AI is harder than text. But for the right use cases, it opens capabilities that text can't match.
Talk to us about voice AI for your business.