
Building Voice Agents with OpenAI's Realtime API - What You Need to Know

March 16, 2026 · 8 min read · Michael Ridland


Voice is one of those AI capabilities that people get excited about but struggle to implement well. The demos always sound impressive. The reality of building a production voice agent has traditionally involved stitching together speech-to-text, an LLM for reasoning, and text-to-speech, then dealing with the latency, lost nuance, and integration headaches that come with a three-model pipeline.

OpenAI's approach with the Realtime API fundamentally changes this architecture. Instead of chaining separate models, the Realtime API processes and generates audio directly through a single model. One API, one connection, audio in, audio out. The reduction in latency alone makes conversations feel genuinely natural rather than like talking to a slow call centre IVR.

I've been watching this space closely because several of our clients are exploring voice interfaces for customer service, internal operations, and field work. Here's what I've learned about how it works, what it does well, and where it still has rough edges.

How the Realtime API Works

The traditional approach to voice AI looks like this: audio comes in, gets converted to text by a speech-to-text model, the text goes to an LLM for processing, the response text goes to a text-to-speech model, and audio comes out. Each step adds latency and loses information. Tone of voice, emphasis, hesitation - these nuances get stripped out during transcription and have to be artificially reconstructed in synthesis.

The Realtime API collapses this into a single model called gpt-realtime. Audio goes in, the model processes it natively (understanding both the words and how they're spoken), and audio comes out. The model maintains the conversational context in its native audio format rather than converting everything to text as an intermediate step.

The connection happens over WebSockets, which gives you persistent, low-latency bidirectional communication. You send audio frames, the model processes them, and audio responses stream back in real time. For developers who've worked with WebSocket APIs before, the pattern is familiar.

What makes this more than just a faster pipeline is that the model can handle interruptions naturally. If someone starts talking while the agent is responding, the model detects this and adjusts. In a traditional pipeline, handling interruptions means adding voice activity detection, managing audio buffers, and coordinating cancellation across three separate models. With the Realtime API, it's built into the model's behaviour.

Voice Agents Through the Agents SDK

OpenAI also provides voice agent support through their Agents SDK (the JavaScript/TypeScript version). This builds on the Realtime API but adds the agentic patterns - tool calling, handoffs between agents, and structured workflows.

The combination is powerful. You can build a voice agent that answers a phone call, understands the caller's request, calls APIs to look up account information, hands off to a specialist agent for complex queries, and does all of this while maintaining a natural voice conversation. The caller just talks. They don't need to "press 1 for billing" or spell out their account number.

We covered the Agents SDK in detail in a previous post. The voice capabilities sit on top of the same primitives - agents, tools, handoffs, and tracing - but with audio as the input and output modality instead of text.

What Actually Works Well

Latency is genuinely low. The single-model approach means response times feel conversational. There's still a small delay while the model processes, but it's closer to the natural pause between turns in a human conversation than the awkward multi-second gaps you get with chained models. For phone-based agents, this makes the difference between something people will tolerate and something they'll hang up on.

Voice quality has improved significantly. OpenAI offers multiple voices, including Cedar and Marin, which are exclusive to the Realtime API. The output sounds more expressive and natural than earlier versions. It's not perfect - you can still tell it's synthetic if you're listening carefully - but it's well past the threshold where most people accept it as a reasonable interaction.

Tool calling works during voice conversations. The agent can pause mid-conversation to call an API, look up data, or perform an action, then resume talking with the results. The model handles this transition smoothly, and the caller hears a natural continuation rather than dead air or a "please hold" message. This is where the practical business value starts to appear - a voice agent that can actually do things, not just talk.

SIP integration for phone systems. The Realtime API supports Session Initiation Protocol, which means you can connect it directly to phone infrastructure. Your voice agent can answer actual phone calls, not just work through a web browser or app. For businesses that handle high volumes of inbound calls, this is where things get interesting.

Image and MCP support. The API now supports image inputs alongside audio, and it can connect to remote MCP servers for extended tool access. A voice agent could look at a photo the customer sent, describe what it sees, and take action - all within the same conversation.

Where It Gets Tricky

Cost adds up quickly. Audio tokens are expensive relative to text tokens. gpt-realtime pricing sits at $32 per million audio input tokens and $64 per million audio output tokens. For high-volume voice applications handling hundreds or thousands of concurrent calls, the costs can be substantial. Cached input tokens are much cheaper at $0.40 per million, so if your conversations follow predictable patterns, there are optimisation opportunities. But you need to model the economics carefully before committing to production deployment.

Handling edge cases in voice is harder than text. Background noise, accents, multiple speakers, phone line quality - these all affect how well the model understands input. In a text-based chat, a typo is obvious and the model handles it gracefully. In voice, a misheard word can send the conversation in a completely wrong direction, and recovering is awkward. You need good error handling and confirmation patterns built into your agent's instructions.

Not all conversations should be voice. This sounds obvious, but I've seen organisations try to voice-enable every interaction. Some things are better as text - anything involving numbers, addresses, email addresses, or complex data that needs to be confirmed character by character. A good voice agent knows when to say "Let me send you a text message with those details" rather than trying to spell out an email address over the phone.

Testing is harder. With text-based agents, you can write automated tests that send input strings and check output strings. Voice agent testing requires generating audio inputs, evaluating audio outputs, and handling the inherent variability in speech. It's doable but requires more infrastructure than text-based testing.

Regulatory considerations for Australian businesses. If your voice agent handles customer data, you need to think about recording, consent, and data residency. Australian Privacy Principles apply regardless of whether the conversation is with a human or an AI. Make sure your legal team has reviewed the implications before you deploy customer-facing voice agents.

Where Voice Agents Make Sense

Based on what we've seen working with clients, here are the use cases where voice agents deliver the most value:

High-volume first-line support. Answering common questions, routing calls to the right department, checking order status, confirming appointments. These are repetitive conversations with predictable patterns that a voice agent handles well, freeing up human agents for complex issues.

After-hours coverage. Many Australian businesses can't justify 24/7 human staffing but still get calls outside business hours. A voice agent that can handle basic enquiries and schedule callbacks for complex issues is a practical middle ground.

Field worker interfaces. For people working with their hands - tradies, warehouse staff, field technicians - typing on a phone isn't practical. A voice agent they can talk to for looking up procedures, logging work, or reporting issues fits naturally into their workflow.

Accessibility. Voice interfaces make services accessible to people who find text-based interfaces difficult - whether due to visual impairment, literacy challenges, or simply preference. This matters for government services and healthcare in particular.

Getting Started

If you're exploring voice agents for your organisation, start small. Pick one bounded use case - maybe after-hours call handling or a specific FAQ line - and build a proof of concept. Test it with real users (not just internal team members who know the script) and pay close attention to where conversations break down.

The technology is good enough for production use in the right scenarios. But "good enough" doesn't mean "deploy everywhere." Be selective about where you apply it, invest in the conversational design (the prompts and instructions matter enormously), and plan for the cases where the agent can't help and needs to hand off to a human.

For Australian organisations looking at voice AI, our AI agent development team can help you evaluate the options and build a pilot. We work across multiple platforms - OpenAI, Azure AI, and others - so we can recommend what fits your specific requirements rather than defaulting to one vendor. And if you're interested in broader AI-powered customer service beyond just voice, our AI for customer service solutions cover the full spectrum from chatbots to voice to multi-channel agents.

For the technical details on OpenAI's voice agent capabilities, check out the official voice agents guide and the Realtime API documentation.