Voice AI Agents: Building Natural Phone Interactions

November 19, 2025•6 min read•Team 400

Text-based AI agents are getting good. Voice AI agents are harder.

When you call a business and an AI answers, the bar is high. People expect phone calls to feel like phone calls, not like talking to a really slow computer.

Here's what we've learned about building voice AI agents that actually work.

Why Voice Is Hard

Voice AI has challenges that text AI doesn't:

Real-time constraints: Responses must start within 1-2 seconds or the conversation feels broken. Text chat tolerates 5-10 second delays.

No do-overs: In text, you can edit. In voice, you can't unsay something. Mistakes are more noticeable.

Acoustic complexity: Background noise, accents, mumbling, interruptions. Text is clean; voice is messy.

Emotional intensity: Phone calls often happen when people have urgent issues. Tone matters more.

No visual cues: Can't show options, links, or images. Everything must be spoken.

These constraints make voice AI fundamentally different from chat AI.

Architecture Overview

A voice AI agent typically involves:

Phone System → Speech-to-Text → LLM Agent → Text-to-Speech → Phone System
     ↑              ↓               ↓              ↓              ↓
Telephony      Transcription    Reasoning      Synthesis      Playback
Infrastructure   Engine                         Engine

Each component adds latency. Optimising end-to-end latency is critical.
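To see where the time goes, it helps to time each stage separately. The sketch below is a minimal illustration, not a production pipeline: fake_stt, fake_llm, and fake_tts are hypothetical stand-ins for whatever real services you use, and run_turn is a name we made up for this example.

```python
import time
from typing import Any, Dict, Tuple

# Hypothetical stand-ins for real STT, LLM, and TTS services.
def fake_stt(audio: bytes) -> str:
    return "what is my order status"

def fake_llm(text: str) -> str:
    return "Sure. What's your order number?"

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

def run_turn(audio: bytes) -> Tuple[bytes, Dict[str, float]]:
    """Run one caller turn through the pipeline, timing each stage in ms."""
    timings: Dict[str, float] = {}
    value: Any = audio
    for name, stage in (("stt", fake_stt), ("llm", fake_llm), ("tts", fake_tts)):
        start = time.perf_counter()
        value = stage(value)  # output of one stage feeds the next
        timings[name] = (time.perf_counter() - start) * 1000
    # The right-hand side is evaluated before "total" is added to the dict.
    timings["total"] = sum(timings.values())
    return value, timings
```

With real services behind the three stages, the timings dictionary shows which component to optimise first.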

Speech-to-Text (STT)

Converts caller's speech to text for the LLM.

Options:

Streaming STT: Transcribes as speaker talks (lower latency)
Utterance-based STT: Waits for speaker to finish (higher accuracy)

For conversational agents, streaming is usually necessary.

Key considerations:

Accuracy on your caller demographic (accents, vocabulary)
Latency to first transcription
Handling of background noise
Ability to detect when speaker is done
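Detecting when the speaker is done is often approximated by waiting for a run of quiet audio after speech. The sketch below assumes you already have a per-frame energy value; the 20ms frame size, the energy threshold, and the 600ms silence window are illustrative assumptions, not recommendations. Real STT engines usually provide their own endpointing.

```python
from typing import Iterable, Optional

def find_end_of_speech(frame_energies: Iterable[float],
                       frame_ms: int = 20,
                       silence_threshold: float = 0.01,
                       min_silence_ms: int = 600) -> Optional[int]:
    """Return the frame index at which the caller is judged to be done.

    Returns None if no speech was heard, or if the silence after the
    speech never lasts long enough.
    """
    needed = min_silence_ms // frame_ms  # quiet frames required in a row
    heard_speech = False
    quiet_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            heard_speech = True
            quiet_run = 0  # speech resets the silence counter
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= needed:
                return i
    return None
```

A shorter silence window makes the agent feel snappier but cuts off callers who pause mid-sentence; a longer one does the opposite.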

LLM Processing

This is the agent logic: the same as for text agents, but under tighter latency constraints.

Optimisations:

Streaming responses: Start speaking before full response is generated
Shorter responses: Callers tolerate long responses far less in voice than in text
Faster models: Accept some loss of capability in exchange for speed
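One common way to start speaking early is to cut the LLM's token stream into sentences and hand each one to TTS as soon as it is complete. Here is a minimal sketch; sentences_from_tokens is a hypothetical helper, and the punctuation-based split is a simplification that will mis-split abbreviations such as "Dr." and decimals such as "3.5".

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "?", "!")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete in the token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left if the stream ends without punctuation.
    if buffer.strip():
        yield buffer.strip()

tokens = ["Sure", ",", " I", " can", " update", " that", ".",
          " What", "'s", " your", " order", " number", "?"]
for sentence in sentences_from_tokens(tokens):
    print(sentence)
# Sure, I can update that.
# What's your order number?
```

The first sentence can be synthesised and played while the model is still generating the second.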

Text-to-Speech (TTS)

Converts agent response to spoken audio.

Options:

Neural TTS: More natural, slightly slower
Standard TTS: Less natural, faster
Custom voices: Trained on your brand voice

Key considerations:

Naturalness (avoid robotic)
Latency to first audio
Expressiveness (tone, pacing)
Pronunciation of domain terms
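For pronunciation of domain terms, one simple approach is to rewrite the text before it reaches the TTS engine. The sketch below uses a small lookup table; the entries in LEXICON are made-up examples, and many TTS engines offer their own lexicon or markup features that may be a better fit.

```python
import re
from typing import Dict

# Hypothetical lexicon: written form -> how we want it spoken.
LEXICON: Dict[str, str] = {
    "ETA": "E T A",
    "SIP": "sip",
    "Pty Ltd": "proprietary limited",
}

def apply_lexicon(text: str, lexicon: Dict[str, str] = LEXICON) -> str:
    """Replace whole-word lexicon entries with their spoken forms."""
    if not lexicon:
        return text
    # Longest entries first so multi-word terms win over shorter ones.
    entries = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(e) for e in entries) + r")\b"
    )
    return pattern.sub(lambda match: lexicon[match.group(0)], text)

print(apply_lexicon("Your ETA is 3pm, says Acme Pty Ltd."))
# Your E T A is 3pm, says Acme proprietary limited.
```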

Latency Budget

For conversational voice AI, target under 2 seconds end-to-end.

Typical breakdown:

Speech-to-Text:     200-500ms
LLM processing:     500-1500ms
Text-to-Speech:     200-400ms
Network overhead:   100-200ms
Total:              1000-2600ms

To hit targets:

Stream everything (don't wait for complete outputs)
Optimise model selection for speed
Use edge deployment where possible
Pre-compute common responses
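The arithmetic behind the breakdown is worth making explicit, because the worst case does not fit the target. The sketch below sums the ranges from the table above; the names BUDGET_MS, total_range, and overshoot are ours, chosen for illustration.

```python
from typing import Dict, Tuple

# (best case, worst case) in milliseconds, taken from the breakdown above.
BUDGET_MS: Dict[str, Tuple[int, int]] = {
    "speech_to_text": (200, 500),
    "llm_processing": (500, 1500),
    "text_to_speech": (200, 400),
    "network_overhead": (100, 200),
}
TARGET_MS = 2000

def total_range(budget: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the best-case and worst-case latencies across all stages."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst

def overshoot(budget: Dict[str, Tuple[int, int]], target_ms: int = TARGET_MS) -> int:
    """How far the worst case exceeds the target, in ms (0 if it fits)."""
    _, worst = total_range(budget)
    return max(0, worst - target_ms)

print(total_range(BUDGET_MS))  # (1000, 2600)
print(overshoot(BUDGET_MS))    # 600
```

Run sequentially, the worst case overshoots a 2 second target by 600ms, and the LLM stage is the largest single contributor. That is why streaming and model selection matter so much.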

Conversation Design for Voice

Voice conversations need a different design from text conversations.

Shorter Responses

Text: "I can help you with that. To update your delivery address, I'll need your order number. Once I have that, I can look up your current shipping details and make the change for you. Would you like to proceed?"

Voice: "Sure, I can update that. What's your order number?"

Be concise. Voice users can't skim.

Clear Turn-Taking

In text, simultaneous messages work fine. In voice, two people talking at once means neither is understood.

Design for clear turn-taking:

Short pauses after questions
Clear conversation structure
Explicit handoffs ("Go ahead")
Handle interruptions gracefully

Barge-In Handling

When caller interrupts mid-response:

Options:

Stop immediately and listen (feels responsive)
Finish current sentence then listen (feels smoother)
Continue until critical info delivered (risky)

Usually: Stop immediately. Callers interrupt for a reason.
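A minimal sketch of the "stop immediately" option follows. It tracks playback at sentence granularity, which is a simplification: a real system would cut the audio mid-sentence. The Playback class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Playback:
    """Tracks which sentences of the agent's reply have been spoken."""
    sentences: List[str]
    position: int = 0
    stopped: bool = False

    def speak_next(self) -> Optional[str]:
        """Return the next sentence to play, or None if done or interrupted."""
        if self.stopped or self.position >= len(self.sentences):
            return None
        sentence = self.sentences[self.position]
        self.position += 1
        return sentence

    def barge_in(self) -> List[str]:
        """Caller interrupted: stop now and return what was never spoken."""
        self.stopped = True
        return self.sentences[self.position:]
```

Keeping the unspoken remainder is useful: the agent knows what the caller did not hear and can decide whether any of it still needs to be said.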

Confirmation Without Tedium

Text: Easy to display what the agent heard for confirmation
Voice: Reading everything back gets tedious

Balance:

Confirm critical info explicitly ("That's 0-4-1-2, 3-4-5, 6-7-8?")
Summarise actions ("I've updated your address to 45 Smith Street, Newtown")
Don't read back obvious inputs
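Reading digits back in small groups is easy to automate. The sketch below reproduces the phone number example above; read_back_digits is a hypothetical helper, and the default 4-3-3 grouping is an assumption that suits ten-digit numbers like the one shown.

```python
from typing import Sequence

def read_back_digits(number: str, groups: Sequence[int] = (4, 3, 3)) -> str:
    """Format a number for spoken confirmation, digit by digit, in groups."""
    digits = [ch for ch in number if ch.isdigit()]
    if len(digits) != sum(groups):
        raise ValueError(
            f"expected {sum(groups)} digits, got {len(digits)}"
        )
    parts = []
    start = 0
    for size in groups:
        parts.append("-".join(digits[start:start + size]))
        start += size
    return ", ".join(parts)

print(f"That's {read_back_digits('0412 345 678')}?")
# That's 0-4-1-2, 3-4-5, 6-7-8?
```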

Error Recovery

When the agent misunderstands:

Bad:
Agent: "I didn't understand that. Please try again."
Caller: [Gives up]

Good:
Agent: "I heard you say you want to cancel. Did you mean cancel your order, or something else?"
Caller: "No, I want to check my balance"
Agent: "Got it, checking your balance. One moment."

Offer what was heard. Let caller correct.
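One way to implement this is to choose the reply based on how confident the system is in what it heard. The sketch below is illustrative only: the 0.8 threshold, the wording of the prompts, and the fallback options are all assumptions to tune for your own callers.

```python
from typing import Optional

def recovery_prompt(heard_intent: Optional[str],
                    confidence: float,
                    confirm_below: float = 0.8) -> str:
    """Pick a reply that offers what was heard instead of a bare retry."""
    if heard_intent is None:
        # Nothing usable was heard: offer concrete options, not "try again".
        return ("Sorry, I missed that. Are you calling about an order, "
                "your balance, or something else?")
    if confidence < confirm_below:
        # Unsure: say what was heard so the caller can correct it.
        return (f"I heard you say you want to {heard_intent}. "
                "Is that right, or did you mean something else?")
    return "Got it. One moment."

print(recovery_prompt("cancel your order", 0.55))
# I heard you say you want to cancel your order. Is that right, or did you mean something else?
```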

Use Cases That Work Well

Appointment Scheduling

Clear structure (date, time, service type)
Limited options (constrained responses work well)
Confirmation is natural ("Thursday at 2pm?")

Order Status

Quick lookup, quick response
Caller usually knows order number
Straightforward information delivery

FAQ and Information

Caller has specific question
Agent retrieves and speaks answer
Works if knowledge base is good

Triage and Routing

Understand caller intent
Route to appropriate queue or resource
Gather context before handoff

Use Cases That Struggle

Complex Problem Solving

Multiple rounds of troubleshooting work better in text, where the caller can follow along.

Emotional Situations

Complaints, cancellations, bad news. Often better handled by humans.

Long Information Collection

Forms with many fields. Voice fatigue sets in. Consider moving to text or callback.

Ambiguous Requests

Text agents can ask clarifying questions without feeling slow. Voice clarifications feel like the system doesn't work.

Testing Voice Agents

Acoustic Testing

Various accents and speech patterns
Background noise (office, car, street)
Audio quality issues (bad connection, speakerphone)
Edge cases (coughing, long pauses, um/uh)

Conversation Flow Testing

Complete task flows
Error recovery paths
Interruption handling
Timeout behaviour

Latency Testing

Measure end-to-end latency under load
Test across connection types
Identify bottlenecks
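Averages hide the slow calls that callers actually notice, so it is worth reporting percentiles. Below is a minimal sketch using Python's standard library; latency_report is a hypothetical helper, the sample values are invented, and the 2000ms default simply reuses the 2 second target from earlier.

```python
import statistics
from typing import Dict, Sequence

def latency_report(samples_ms: Sequence[float],
                   target_ms: float = 2000) -> Dict[str, float]:
    """Summarise end-to-end latency samples (needs at least two samples)."""
    ordered = sorted(samples_ms)
    # 99 cut points: index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    over = sum(1 for s in ordered if s > target_ms)
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "max": ordered[-1],
        "over_target": over / len(ordered),
    }

# Invented measurements, in milliseconds.
report = latency_report([800, 1200, 1500, 1900, 2600])
print(report["p50"], report["max"], report["over_target"])
```

Here the median looks healthy while one call in five misses the target, which is exactly the kind of problem an average would blur.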

User Testing

Real users, not just team members
Record and analyse conversations
Track completion rates and time
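Completion rate and time are simple to compute once calls are logged. The sketch below assumes each call record is a dictionary with "completed" and "duration_s" keys; that record shape is our assumption, not a standard format.

```python
from typing import Dict, List

def summarise_calls(calls: List[Dict]) -> Dict[str, float]:
    """Compute completion rate and average duration of completed calls."""
    if not calls:
        raise ValueError("no calls to summarise")
    completed = [c for c in calls if c["completed"]]
    rate = len(completed) / len(calls)
    avg = (sum(c["duration_s"] for c in completed) / len(completed)
           if completed else 0.0)
    return {"completion_rate": rate, "avg_completed_duration_s": avg}
```

Tracking these per use case, rather than across all calls, makes it easier to spot which flows struggle.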

Integration Considerations

Telephony Platforms

Options:

Twilio, Vonage, or Bandwidth for API-based telephony
Amazon Connect or Google Contact Center AI for a full contact centre solution
On-premise integration with SIP trunking

Handoff to Humans

When voice agent can't help:

Warm transfer (agent stays on, introduces)
Cold transfer (direct to queue)
Callback arrangement (human calls back)

Pass context to the human agent so the caller doesn't have to repeat themselves.
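What "context" means in practice is a small structured record that travels with the transfer. The sketch below shows one possible shape; HandoffContext and its field names are hypothetical, and how the payload reaches the human agent's screen depends on your telephony platform.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict

TRANSFER_TYPES = ("warm", "cold", "callback")

@dataclass
class HandoffContext:
    """What the human agent should see before saying hello."""
    caller_intent: str
    collected: Dict[str, str] = field(default_factory=dict)
    summary: str = ""
    transfer_type: str = "warm"

    def __post_init__(self) -> None:
        if self.transfer_type not in TRANSFER_TYPES:
            raise ValueError(f"unknown transfer type: {self.transfer_type}")

    def to_json(self) -> str:
        """Serialise for whatever system displays it to the human agent."""
        return json.dumps(asdict(self), sort_keys=True)

ctx = HandoffContext(
    caller_intent="change delivery address",
    collected={"order_number": "A1234"},  # invented example value
    summary="Caller verified; wants a new address on an open order.",
)
print(ctx.to_json())
```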

Recording and Compliance

Voice calls often have:

Recording requirements (or prohibitions)
Consent requirements ("This call may be recorded")
Data retention obligations
Accessibility requirements

Build compliance in from the start.

Our Voice AI Approach

As specialists in voice AI, we've built voice-enabled customer service systems that handle real calls. The key principles:

Optimise for speed (latency kills voice UX)
Design for conversation (not just Q&A)
Plan for handoffs (voice can't do everything)
Test extensively with real acoustic conditions

Voice AI is harder than text. But for the right use cases, it opens capabilities that text can't match.

Work with AI specialists in Melbourne who understand the unique challenges of voice AI implementation. Our team helps businesses design and deploy voice agents that deliver natural, effective customer experiences.
