
Voice AI Agents: Building Natural Phone Interactions

November 19, 2025 · 5 min read · Team 400

Text-based AI agents are getting good. Voice AI agents are harder.

When you call a business and an AI answers, the bar is high. People expect phone calls to feel like phone calls—not like talking to a really slow computer.

Here's what we've learned about building voice AI agents that actually work.

Why Voice Is Hard

Voice AI has challenges that text AI doesn't:

Real-time constraints: The response must start within 1-2 seconds or the conversation feels broken. Text chat tolerates delays of 5-10 seconds.

No do-overs: In text, you can edit. In voice, you can't unsay something. Mistakes are more noticeable.

Acoustic complexity: Background noise, accents, mumbling, interruptions. Text is clean; voice is messy.

Emotional intensity: Phone calls often happen when people have urgent issues. Tone matters more.

No visual cues: Can't show options, links, or images. Everything must be spoken.

These constraints make voice AI fundamentally different from chat AI.

Architecture Overview

A voice AI agent typically involves:

Phone System → Speech-to-Text → LLM Agent → Text-to-Speech → Phone System

  • Phone System (inbound): Telephony infrastructure
  • Speech-to-Text: Transcription engine
  • LLM Agent: Reasoning
  • Text-to-Speech: Synthesis engine
  • Phone System (outbound): Playback

Each component adds latency. Optimising end-to-end latency is critical.

Speech-to-Text (STT)

Converts caller's speech to text for the LLM.

Options:

  • Streaming STT: Transcribes as speaker talks (lower latency)
  • Utterance-based STT: Waits for speaker to finish (higher accuracy)

For conversational agents, streaming is usually necessary.

Key considerations:

  • Accuracy on your caller demographic (accents, vocabulary)
  • Latency to first transcription
  • Handling of background noise
  • Ability to detect when speaker is done
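
Detecting when the speaker is done is often the trickiest item on that list. As a rough sketch (not how any particular STT engine does it), a simple energy-based endpointer treats a run of quiet frames after speech as the end of the turn. The function name, the 20ms frame size, the 0.01 energy threshold, and the 600ms silence window are all illustrative assumptions. Production systems usually rely on a trained voice activity detector instead.

```python
from typing import Optional, Sequence


def detect_end_of_speech(
    frame_energies: Sequence[float],
    frame_ms: int = 20,             # assumed frame size
    energy_threshold: float = 0.01, # assumed "this is speech" level
    silence_ms: int = 600,          # assumed pause that ends a turn
) -> Optional[int]:
    """Return the index of the frame where the turn is judged over, else None."""
    frames_needed = silence_ms // frame_ms
    silent_run = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0          # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1         # only count silence that follows speech
            if silent_run >= frames_needed:
                return i
    return None


# 100ms of line noise, 200ms of speech, then 600ms of silence.
frames = [0.0] * 5 + [0.5] * 10 + [0.0] * 30
print(detect_end_of_speech(frames))
```

The silence window is the main trade-off: shorter feels snappier but cuts off callers who pause to think, longer is safer but adds directly to response latency.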

LLM Processing

The agent logic is the same as for text agents, but under much tighter latency constraints.

Optimisations:

  • Streaming responses: Start speaking before full response is generated
  • Shorter responses: Callers tolerate long responses far less in voice than in text
  • Faster models: May mean trading some capability for speed
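
One way to start speaking before the full response is generated is to cut the token stream into sentences and hand each one to TTS as soon as it's complete. Here's a minimal sketch, assuming the LLM client yields plain text fragments. `sentences_from_tokens` is a hypothetical helper, not part of any SDK.

```python
import re
from typing import Iterable, Iterator

# A sentence boundary: whitespace that directly follows ., ! or ?
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = _BOUNDARY.search(buffer)
        while match:
            sentence = buffer[: match.start()].strip()
            if sentence:
                yield sentence          # ready to send to TTS right away
            buffer = buffer[match.end():]
            match = _BOUNDARY.search(buffer)
    if buffer.strip():
        yield buffer.strip()            # flush whatever is left at the end


stream = ["Sure", ",", " I can", " update that", ".", " What's", " your order", " number", "?"]
for sentence in sentences_from_tokens(stream):
    print(sentence)
```

This naive splitter will break early on abbreviations such as "Dr. Smith", so a real deployment would likely need a smarter boundary rule.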

Text-to-Speech (TTS)

Converts agent response to spoken audio.

Options:

  • Neural TTS: More natural, slightly slower
  • Standard TTS: Less natural, faster
  • Custom voices: Trained on your brand voice

Key considerations:

  • Naturalness (avoid a robotic sound)
  • Latency to first audio
  • Expressiveness (tone, pacing)
  • Pronunciation of domain terms
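
For pronunciation of domain terms, one lightweight approach is to rewrite tricky terms before the text reaches the TTS engine. The lexicon entries and the `apply_lexicon` helper below are made up for illustration, and the sketch assumes every term consists of letters and digits only. Many engines also accept SSML pronunciation hints, which may be a better fit where available.

```python
import re

# Hypothetical lexicon: written form -> how we want it spoken.
LEXICON = {
    "SIP": "sip",
    "STT": "S T T",
    "SQL": "sequel",
}


def apply_lexicon(text: str, lexicon: dict) -> str:
    """Replace whole-word occurrences of lexicon terms with their spoken form."""
    if not lexicon:
        return text
    # Longest terms first so a longer term wins over a shorter one it contains.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: lexicon[m.group(1)], text)


print(apply_lexicon("Your SIP trunk uses STT.", LEXICON))
```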

Latency Budget

For conversational voice AI, target under 2 seconds end-to-end.

Typical breakdown:

Speech-to-Text:     200-500ms
LLM processing:     500-1500ms
Text-to-Speech:     200-400ms
Network overhead:   100-200ms
Total:              1000-2600ms

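The upper end of that range overshoots the 2-second target by about 600ms, so a typical stack needs tuning. To hit the target: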

  • Stream everything (don't wait for complete outputs)
  • Optimise model selection for speed
  • Use edge deployment where possible
  • Pre-compute common responses
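
The arithmetic behind the budget is simple enough to keep in a script and rerun whenever you swap a component. The sketch below uses the ranges from the breakdown above. The function names are our own, and the numbers are the typical figures quoted here, not measurements of any specific vendor.

```python
# Per-stage latency ranges in milliseconds, from the breakdown above.
STAGES_MS = {
    "speech_to_text": (200, 500),
    "llm_processing": (500, 1500),
    "text_to_speech": (200, 400),
    "network_overhead": (100, 200),
}
TARGET_MS = 2000  # the 2-second end-to-end target


def total_range(stages: dict) -> tuple:
    """Sum the best-case and worst-case latency across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


def overshoot(stages: dict, target_ms: int) -> int:
    """How far the worst case exceeds the target (0 if within budget)."""
    _, worst = total_range(stages)
    return max(0, worst - target_ms)


best, worst = total_range(STAGES_MS)
print(f"best case: {best}ms, worst case: {worst}ms")
print(f"worst-case overshoot: {overshoot(STAGES_MS, TARGET_MS)}ms")
```

The LLM stage has the widest range, which is why model selection and streaming tend to be the first places to look.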

Conversation Design for Voice

Voice conversations need a different design from text conversations.

Shorter Responses

Text: "I can help you with that. To update your delivery address, I'll need your order number. Once I have that, I can look up your current shipping details and make the change for you. Would you like to proceed?"

Voice: "Sure, I can update that. What's your order number?"

Be concise. Voice users can't skim.

Clear Turn-Taking

In text, simultaneous messages work fine. In voice, talking over each other is bad.

Design for clear turn-taking:

  • Short pauses after questions
  • Clear conversation structure
  • Explicit handoffs ("Go ahead")
  • Handle interruptions gracefully

Barge-In Handling

When a caller interrupts mid-response, the options are:

  • Stop immediately and listen (feels responsive)
  • Finish current sentence then listen (feels smoother)
  • Continue until critical info delivered (risky)

Usually: Stop immediately. Callers interrupt for a reason.
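
The three options can be expressed as a small policy function that decides what, if anything, the agent still says after an interruption. This is only a sketch: `BargeInPolicy`, `still_to_speak`, and the idea of tagging sentences as critical are our own illustrative assumptions, not features of a particular telephony platform.

```python
from enum import Enum


class BargeInPolicy(Enum):
    STOP_IMMEDIATELY = "stop_immediately"
    FINISH_SENTENCE = "finish_sentence"
    PROTECT_CRITICAL = "protect_critical"


def still_to_speak(pending: list, policy: BargeInPolicy) -> list:
    """Decide what remains to be spoken once the caller barges in.

    `pending` is a list of (sentence, is_critical) tuples; pending[0] is the
    sentence being played when the interruption is detected.
    """
    if policy is BargeInPolicy.STOP_IMMEDIATELY:
        return []                       # cut audio and listen
    if policy is BargeInPolicy.FINISH_SENTENCE:
        return pending[:1]              # finish the current sentence only
    # PROTECT_CRITICAL: keep talking through the last critical sentence.
    last_critical = -1
    for i, (_, is_critical) in enumerate(pending):
        if is_critical:
            last_critical = i
    return pending[: last_critical + 1]


pending = [
    ("Your order has shipped.", False),
    ("The tracking number is 1-2-3-4-5.", True),
    ("Is there anything else?", False),
]
print(still_to_speak(pending, BargeInPolicy.STOP_IMMEDIATELY))
print(still_to_speak(pending, BargeInPolicy.PROTECT_CRITICAL))
```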

Confirmation Without Tedium

Text: Easy to display what the agent heard for confirmation

Voice: Reading everything back gets tedious

Balance:

  • Confirm critical info explicitly ("That's 0-4-1-2, 3-4-5, 6-7-8?")
  • Summarise actions ("I've updated your address to 45 Smith Street, Newtown")
  • Don't read back obvious inputs
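
Reading digits back in small groups makes explicit confirmation easier to follow. A minimal sketch of the grouping used in the phone number example above. The `spoken_digits` helper and the default 4-3-3 grouping are assumptions chosen to match that example.

```python
def spoken_digits(number: str, groups: tuple = (4, 3, 3)) -> str:
    """Format a number for read-back, e.g. '0412345678' -> '0-4-1-2, 3-4-5, 6-7-8'."""
    digits = [ch for ch in number if ch.isdigit()]  # ignore spaces, dashes, etc.
    parts = []
    pos = 0
    for size in groups:
        chunk = digits[pos:pos + size]
        if chunk:
            parts.append("-".join(chunk))
        pos += size
    if pos < len(digits):
        parts.append("-".join(digits[pos:]))        # any leftover digits
    return ", ".join(parts)


print(f"That's {spoken_digits('0412 345 678')}?")
```

The commas give the TTS engine natural places to pause between groups, although how long it pauses depends on the engine.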

Error Recovery

When the agent misunderstands:

Bad:
Agent: "I didn't understand that. Please try again."
Caller: [Gives up]

Good:
Agent: "I heard you say you want to cancel. Did you mean cancel your order, or something else?"
Caller: "No, I want to check my balance"
Agent: "Got it—checking your balance. One moment."

Offer what was heard. Let caller correct.
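
In code, this pattern comes down to choosing a prompt based on how confident the agent is in what it heard. The sketch below assumes the intent classifier returns a phrase and a confidence score between 0 and 1. The `recovery_prompt` name, the 0.75 threshold, and the exact wording are illustrative.

```python
from typing import Optional


def recovery_prompt(
    heard_intent: Optional[str],
    confidence: float,
    confirm_below: float = 0.75,  # assumed threshold; tune on real calls
) -> Optional[str]:
    """Return a clarifying prompt, or None if the agent should just proceed."""
    if heard_intent is None:
        # Nothing usable was recognised: ask an open but short question.
        return "Sorry, I missed that. Could you tell me in a few words what you need?"
    if confidence < confirm_below:
        # Offer what was heard so the caller can confirm or correct it.
        return f"I heard you say you want to {heard_intent}. Did you mean that, or something else?"
    return None


print(recovery_prompt("cancel your order", 0.5))
print(recovery_prompt("check your balance", 0.9))
```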

Use Cases That Work Well

Appointment Scheduling

  • Clear structure (date, time, service type)
  • Limited options (constrained responses work well)
  • Confirmation is natural ("Thursday at 2pm?")

Order Status

  • Quick lookup, quick response
  • Caller usually knows order number
  • Straightforward information delivery

FAQ and Information

  • Caller has specific question
  • Agent retrieves and speaks answer
  • Works if knowledge base is good

Triage and Routing

  • Understand caller intent
  • Route to appropriate queue or resource
  • Gather context before handoff

Use Cases That Struggle

Complex Problem Solving

Multiple rounds of troubleshooting work better in text, where the caller can follow along and refer back to earlier steps.

Emotional Situations

Complaints, cancellations, bad news. Often better handled by humans.

Long Information Collection

Forms with many fields. Voice fatigue sets in. Consider moving to text or callback.

Ambiguous Requests

Text agents can ask clarifying questions without feeling slow. In voice, each clarification costs a full spoken round trip, so repeated clarifications make it feel like the system doesn't work.

Testing Voice Agents

Acoustic Testing

  • Various accents and speech patterns
  • Background noise (office, car, street)
  • Audio quality issues (bad connection, speakerphone)
  • Edge cases (coughing, long pauses, um/uh)

Conversation Flow Testing

  • Complete task flows
  • Error recovery paths
  • Interruption handling
  • Timeout behaviour

Latency Testing

  • Measure end-to-end latency under load
  • Test across connection types
  • Identify bottlenecks
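
When measuring latency under load, averages hide the slow calls that callers actually notice, so it helps to report percentiles. A small sketch using the nearest-rank method. The sample values are invented for illustration, not real measurements.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of latency samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)   # 1-based rank
    return ordered[max(rank, 1) - 1]


# Invented end-to-end latencies in milliseconds from a test run.
latencies_ms = [900, 1100, 1200, 1300, 1400, 1500, 1600, 1800, 2100, 3200]
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
```

Recording the same percentiles per stage (STT, LLM, TTS, network) is what makes bottlenecks visible.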

User Testing

  • Real users, not just team members
  • Record and analyse conversations
  • Track completion rates and time

Integration Considerations

Telephony Platforms

Options:

  • Twilio, Vonage, or Bandwidth for API-based telephony
  • Amazon Connect or Google Contact Center for a full solution
  • On-premises integration with SIP trunking

Handoff to Humans

When the voice agent can't help:

  • Warm transfer (agent stays on, introduces)
  • Cold transfer (direct to queue)
  • Callback arrangement (human calls back)

Pass context to the human agent so the caller doesn't have to repeat themselves.
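
What that context looks like depends on your contact centre tooling, but a small structured record goes a long way. The `HandoffContext` class and its fields below are a hypothetical shape, not a schema from any telephony platform.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffContext:
    intent: str                                      # what the caller wants, in plain words
    transfer_type: str = "warm"                      # "warm", "cold", or "callback"
    collected: dict = field(default_factory=dict)    # details already gathered
    attempts: list = field(default_factory=list)     # what the agent already tried

    def briefing(self) -> str:
        """One short paragraph the human agent can read or hear before pickup."""
        parts = [f"Caller wants to {self.intent}."]
        if self.collected:
            details = ", ".join(f"{key} {value}" for key, value in self.collected.items())
            parts.append(f"Already provided: {details}.")
        if self.attempts:
            parts.append("Tried so far: " + "; ".join(self.attempts) + ".")
        return " ".join(parts)


context = HandoffContext(
    intent="change a delivery address",
    collected={"order number": "A1234"},
    attempts=["address lookup failed"],
)
print(context.briefing())
```

For a warm transfer the briefing can be spoken to the human agent. For a cold transfer or callback it can be attached to the queue entry or ticket.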

Recording and Compliance

Voice calls often have:

  • Recording requirements (or prohibitions)
  • Consent requirements ("This call may be recorded")
  • Data retention obligations
  • Accessibility requirements

Build compliance in from the start.

Our Voice AI Approach

We've built voice-enabled customer service systems that handle real calls. The key principles:

  • Optimise for speed (latency kills voice UX)
  • Design for conversation (not just Q&A)
  • Plan for handoffs (voice can't do everything)
  • Test extensively with real acoustic conditions

Voice AI is harder than text. But for the right use cases, it opens up capabilities that text can't match.

Talk to us about voice AI for your business.