Introducing OpenAI’s gpt-realtime for Voice Agents

OpenAI’s new gpt-realtime model and updated Realtime API mark a shift from laggy, turn-based interactions to fluid, real-time conversations. With end-to-end speech processing, multimodal inputs, and production-ready features, developers can now build voice agents that feel natural, responsive, and truly collaborative.
10 mins read
Sep 29, 2025

Think about the last time you interacted with a voice assistant.

You ask a question, and then comes that familiar, brief silence as your words are processed and a response is formulated. That slight but perceptible delay is a fundamental barrier in human-computer interaction: it is the gap that separates a simple transaction from a truly natural conversation.

At the end of August, OpenAI introduced a significant step toward closing that gap with the release of gpt-realtime, its most advanced speech-to-speech model, and major updates to its Realtime API. This is more than an incremental improvement in speed; it represents a foundational shift toward building AI agents that can interact with human conversation’s fluidity, nuance, and immediacy.

For us as developers, this announcement opens a new frontier. It challenges us to move beyond the turn-based, request-and-response model and create applications that feel less like tools and more like real-time collaborators. In this newsletter, we will explore what this means in practice as we cover the following:

  • What makes this a fundamental shift from traditional AI pipelines?

  • The key capabilities of the new gpt-realtime model.

  • A first look at the new Audio Playground for testing voice agents.

  • The crucial role of the Realtime API in enabling instant conversations.

  • The new production-ready features of the updated Realtime API.

The core innovation: A unified speech-to-speech system#

To appreciate the significance of this update, we first need to understand how most voice AI has worked until now. For years, the standard approach to building a voice agent involved stitching together separate, specialized systems in a pipeline: a speech-to-text model to transcribe the user’s audio, a large language model to process the text and decide on a response, and finally, a text-to-speech model to convert that response back into audio.

While functional, this pipeline approach has two major drawbacks. First, each handoff between models adds a delay, creating noticeable latency, the time delay between when a user speaks and when they begin to hear the AI’s response. Second, when human speech is converted to plain text, crucial information is lost: the tone, pace, and emotion that give our words meaning.

The gpt-realtime model and Realtime API introduce a new, more elegant paradigm. Instead of a complex, multi-step process, it operates as a single, end-to-end model. This unified system processes and generates audio directly, eliminating the intermediate text steps.

This architectural shift is the key to reducing latency and, for the first time, preserving the rich nuances of human expression from input to output.
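To make the contrast concrete, here is a toy sketch of the traditional three-stage pipeline. None of these functions are real APIs; the stage names, delays, and return values are all hypothetical, purely to illustrate that total latency is the sum of every hand-off and that the ASR step reduces rich audio to bare text:

```python
import time

# Toy illustration (not real APIs): the classic three-stage voice pipeline.
# Each hand-off adds latency, and the speech-to-text step discards tone,
# pace, and emotion before the LLM ever sees the input.

def speech_to_text(audio: bytes) -> str:          # hypothetical ASR stage
    time.sleep(0.3)                               # simulated transcription delay
    return "what's my order status"

def llm_respond(text: str) -> str:                # hypothetical LLM stage
    time.sleep(0.6)                               # simulated generation delay
    return "Your order ships tomorrow."

def text_to_speech(text: str) -> bytes:           # hypothetical TTS stage
    time.sleep(0.3)                               # simulated synthesis delay
    return b"\x00" * 1600                         # stand-in audio buffer

def pipeline_agent(audio: bytes) -> bytes:
    """Turn-based pipeline: total latency is the *sum* of all three stages."""
    start = time.monotonic()
    reply_audio = text_to_speech(llm_respond(speech_to_text(audio)))
    print(f"pipeline latency: {time.monotonic() - start:.1f}s")
    return reply_audio
```

A unified speech-to-speech model collapses these three stages into one, so there is no intermediate text to lose information in and no per-stage hand-off delay to accumulate.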

A closer look at the gpt-realtime model’s capabilities#

At the heart of this announcement is a model trained to excel at the real-world tasks required for building sophisticated voice agents. The improvements are not just theoretical; they are measurable and directly impact the quality of interaction we can create.

Natural expression and deeper understanding#

A truly effective voice agent must sound natural, and gpt-realtime produces higher-quality, more emotionally resonant speech precisely because it no longer loses vital data during a text conversion step. For instance, we can now instruct the model to “adopt a gentle and encouraging tone, speaking slowly to help a student with a difficult concept,” or to “respond with the formal, deliberate cadence of a history professor,” giving us precise creative control over the user experience.
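In practice, this kind of style direction is passed to the model as session instructions. The sketch below builds a `session.update` client event, which is the documented way to configure a Realtime session; the instruction text is just the example from above, and the voice name reflects one of the newly announced voices:

```python
import json

# Sketch: a Realtime API `session.update` client event that sets the agent's
# speaking style via natural-language instructions. The event type and
# `session` fields follow the Realtime API's event shape; the instruction
# text and voice choice are illustrative.
def build_style_event(style: str, voice: str = "marin") -> str:
    event = {
        "type": "session.update",
        "session": {
            "instructions": style,
            "voice": voice,
        },
    }
    return json.dumps(event)

payload = build_style_event(
    "Adopt a gentle and encouraging tone, speaking slowly "
    "to help a student with a difficult concept."
)
```

Because the instructions are plain language, swapping the tutoring persona for the "formal, deliberate cadence of a history professor" is a one-line change.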

However, the true power of this new architecture lies in its ability to preserve the subtleties of human expression, moving beyond mere words to capture actual tone and intent from the user. This deeper comprehension is also a direct result of processing audio end-to-end. The model can now follow the natural, often unstructured, flow of human speech, interpreting non-verbal sounds like laughter that provide crucial context to a conversation. The agent can adapt without missing a beat if a user moves between languages in a single thought. This improved intelligence also extends to capturing critical details with higher accuracy, ensuring that specific information like a phone number or tracking number is understood correctly the first time.

Enhanced intelligence for complex tasks#

Beyond comprehension, gpt-realtime shows significant gains in reasoning and instruction following. On the Big Bench Audio evaluation, a benchmark that measures reasoning over audio input, the model scores 82.8% accuracy. On MultiChallenge, an audio benchmark that measures adherence to instructions across multi-turn conversations, it scores 30.5%, a significant improvement over previous models.

One of the most impactful new features is improved asynchronous function calling. This allows the model to execute a tool, such as looking up information in a database, without pausing the conversation. The agent can continue interacting with the user while the tool runs in the background, eliminating the awkward silences that break conversational flow. On the ComplexFuncBench audio evaluation, gpt-realtime scores 66.5% on function calling, a substantial leap that makes our agents far more capable.
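Tools are registered on the session as function definitions with a JSON-Schema parameter description. The shape below follows the Realtime API's session configuration; the `lookup_order` tool itself is hypothetical, standing in for whatever database lookup your agent needs:

```python
# Sketch: registering a tool the model can call mid-conversation. The
# function-tool shape (`type`, `name`, `description`, JSON-Schema
# `parameters`) follows the Realtime API's session config; the
# `lookup_order` tool is a hypothetical example.
def order_lookup_tool() -> dict:
    return {
        "type": "function",
        "name": "lookup_order",
        "description": "Fetch the status of an order by its tracking number.",
        "parameters": {
            "type": "object",
            "properties": {
                "tracking_number": {"type": "string"},
            },
            "required": ["tracking_number"],
        },
    }

session_update = {
    "type": "session.update",
    "session": {"tools": [order_lookup_tool()]},
}
```

When the model decides to call `lookup_order`, our application runs the lookup and streams the result back while the conversation keeps flowing.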

Informational note: With this release, OpenAI has introduced two new, highly realistic voices, Marin and Cedar. These voices are noted to have the most significant improvements in natural-sounding speech and are available exclusively in the Realtime API.

A look at the Realtime Audio Playground#

This all sounds powerful in theory, but how does it work in practice? And can we try it out? The answer is yes. OpenAI has integrated these new capabilities directly into their developer platform via the Audio Playground, allowing us to test and configure real-time agents before writing a single line of code.

OpenAI's gpt-realtime Audio Playground

Click the “Create” button to open the main interface for creating a new “audio prompt,” which is essentially a real-time voice agent.

OpenAI’s gpt-realtime Audio Playground for prompting

Let’s break down what we’re seeing. The interface is split into two key areas: the central conversation space and the right-hand configuration panel, where we define the agent’s behavior. Here are some of the key settings available in the configuration panel:

  • Model selection: This is where we select our model. As shown, we have chosen gpt-realtime to power the conversation.

  • Audio controls: We have fine-grained control over the conversational flow. Silence duration lets us define how long the agent waits after the user stops speaking before it responds, while Noise reduction helps clean up the audio input for better accuracy in real-world environments.

  • Tool integration: The Functions and MCP servers sections are where we connect our agent to the outside world. This is the UI equivalent of the API functionality we will discuss next, allowing us to easily add tools like a database lookup or a connection to a GitHub repository.

  • User transcript model: An interesting detail is selecting a User transcript model (like whisper-1). This highlights that while the core interaction is speech-to-speech, a transcription model is still used in the background to provide a text log of the conversation for developers to review and debug.

The playground allows us to configure all these parameters, click “Start session,” and have a live conversation with our agent directly in the browser. It’s an invaluable tool for rapid prototyping and understanding how different configurations will impact the end user experience.
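The playground knobs map directly onto session fields in the Realtime API. The sketch below mirrors the panel described above: server-side voice activity detection with a silence threshold, input noise reduction, and a whisper-1 transcript model. The field names follow the documented session object; the specific values are illustrative defaults, not recommendations:

```python
# Sketch: the API-side equivalent of the playground's configuration panel.
# Field names follow the Realtime API session object; values are examples.
session_config = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "silence_duration_ms": 500,   # "Silence duration" in the UI
        },
        "input_audio_noise_reduction": {
            "type": "near_field",          # "Noise reduction" in the UI
        },
        "input_audio_transcription": {
            "model": "whisper-1",          # "User transcript model" in the UI
        },
    },
}
```

Anything we dial in while prototyping in the browser can later be reproduced verbatim in code.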

Understanding the Realtime API: The bridge to our model#

While gpt-realtime is the intelligent model, the Realtime API is the essential communication layer that connects our applications to it. To understand its importance, we must distinguish it from a standard request-response API.

In a typical API call, we send a complete prompt, wait for the server to process it, and receive a complete answer in return. This is efficient for many tasks but creates a noticeable delay that feels unnatural in a live conversation.

The Realtime API, first introduced as a public beta in October 2024, was built specifically to solve this problem. It operates on a streaming principle, creating a persistent, two-way connection between our application and the model. Instead of sending data in one block, our application sends a continuous stream of audio, and the model streams its response back as it is generated. This architecture eliminates the delay, allowing for the fluid, low-latency interactions required for a truly conversational agent.
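That persistent connection is a WebSocket rather than a one-shot HTTP request. The sketch below assembles the endpoint and bearer-token header following OpenAI's documented connection pattern; actually opening the socket (for example with the `websockets` library) and streaming audio frames is left out, so treat this as a connection-setup outline:

```python
import os

# Sketch: parameters for the Realtime API's persistent WebSocket connection.
# The endpoint and Authorization header follow OpenAI's documented pattern;
# the model name is passed as a query parameter.
REALTIME_ENDPOINT = "wss://api.openai.com/v1/realtime"

def connection_params(model: str = "gpt-realtime") -> tuple[str, dict]:
    url = f"{REALTIME_ENDPOINT}?model={model}"
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
    }
    return url, headers

url, headers = connection_params()
```

Once the socket is open, both sides exchange JSON events (like the `session.update` examples in this article) and audio chunks over the same long-lived connection.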

The relationship between the gpt-realtime model and the Realtime API is symbiotic. The model provides advanced conversational intelligence, and the API provides the robust infrastructure to deliver that intelligence instantly. As gpt-realtime has evolved with new capabilities, the Realtime API has been upgraded in lockstep to support them. The features we are about to discuss are the new tools within the API that unlock the full potential of this powerful new model.

What’s new in the Realtime API: Building production-ready agents#

Now that we understand the Realtime API’s fundamental role as our streaming bridge to the model, we can explore its latest evolution. With its move from public beta to general availability, the API introduces several features designed for reliability, capability, and seamless integration in production environments. These updates allow us to build agents that are far more aware and connected to the world around them. Let’s explore the three most significant new capabilities:

  • Image input: The Realtime API is now truly multimodal, with the native ability to accept image inputs alongside audio. This allows us to ground conversations in visual context, creating richer and more effective interactions where words alone are not enough.

  • Remote MCP server support: This feature makes extending an agent’s capabilities significantly easier by connecting it to external tools and services using the Model Context Protocol (MCP). Consider MCP as a universal adapter for AI tools; instead of writing custom integration code for every external service, we can point our agent to a public MCP server.

  • SIP support: We can now connect our applications directly to the public phone network using the Session Initiation Protocol (SIP), the standard signaling protocol for setting up voice calls over the internet. This means the agents we build are no longer confined to apps or websites; they can handle real phone calls, opening up a vast range of use cases.
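Two of these capabilities surface as simple JSON payloads. The sketch below shows the shapes as I understand them from the GA docs — an MCP server added to the session's tool list, and an image attached as a conversation item — but the server URL, label, and image data are placeholders, so treat this as an assumed outline rather than a verified call:

```python
# 1) Remote MCP server: declared alongside function tools in the session.
#    The label and URL are placeholders for a real MCP deployment.
mcp_session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "mcp",
                "server_label": "docs",                   # hypothetical label
                "server_url": "https://example.com/mcp",  # placeholder URL
                "require_approval": "never",
            }
        ]
    },
}

# 2) Image input: attached as a user message item alongside text or audio,
#    grounding the spoken conversation in visual context.
image_item = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_image", "image_url": "data:image/png;base64,..."},
            {"type": "input_text", "text": "What's in this screenshot?"},
        ],
    },
}
```

With MCP, adding a new capability means pointing at a different server URL instead of writing bespoke integration code for each service.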

Informational note: The Realtime API now supports reusable prompts to improve developer workflow and ensure consistency. This allows us to save a complete agent configuration as a single, reusable template, including its system instructions, tool definitions, and variables. This is especially valuable for deploying specialized agents at scale, as it eliminates redundant setup and guarantees predictable behavior across all sessions.
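A reusable prompt is referenced by id when configuring a session. The sketch below follows OpenAI's reusable-prompt object shape (an `id`, an optional pinned `version`, and substitutable `variables`); the prompt id and variable names here are placeholders:

```python
# Sketch: pointing a session at a saved, reusable prompt instead of
# repeating instructions and tool definitions inline. The prompt id,
# version, and variables are placeholders.
prompt_session = {
    "type": "session.update",
    "session": {
        "prompt": {
            "id": "pmpt_example123",              # placeholder prompt id
            "version": "2",                       # pin a specific version
            "variables": {"customer_name": "Ada"},
        }
    },
}
```

Pinning a version is what guarantees the predictable, identical behavior across sessions that the note above describes.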

A look at pricing and availability#

The gpt-realtime model and the generally available Realtime API are now accessible to all developers. To encourage broader adoption for production applications, OpenAI has reduced the price by 20% compared to the previous preview version, making it more cost-effective to build and scale sophisticated voice agents.

The pricing is structured per 1 million tokens and varies by the type of input and output. It is important to note the significant cost savings offered for cached inputs, which helps make longer conversations more economical.

Beyond the price reduction, OpenAI has also introduced more fine-grained controls to help developers manage costs effectively in production. We can now set intelligent token limits and truncate multiple conversational turns simultaneously. This is particularly useful for managing long-running sessions, ensuring that costs remain predictable without abruptly ending a user’s interaction.
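The cost controls mentioned above surface as session fields and client events. The sketch below shows one plausible shape: a per-response output-token cap set on the session, and a `conversation.item.truncate` event that trims the stored audio of an earlier turn. The field and event names follow the Realtime API as I understand it; the item id and values are placeholders:

```python
# Sketch: two cost-control levers for long-running sessions.
# Cap how many tokens any single response may generate.
limit_event = {
    "type": "session.update",
    "session": {"max_response_output_tokens": 1024},
}

# Trim the retained audio of an earlier turn so it stops accruing
# context cost; the item id is a placeholder for a real turn's id.
truncate_event = {
    "type": "conversation.item.truncate",
    "item_id": "item_example",   # placeholder id of an earlier assistant turn
    "content_index": 0,
    "audio_end_ms": 5000,        # keep only the first 5 seconds of audio
}
```

Together these keep a long tutoring or support session within a predictable budget without cutting the user off mid-conversation.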

A true collaborator#

As we build more powerful and interactive systems, ensuring their responsible use is paramount. The Realtime API is built with multiple layers of protection, from active classifiers that can halt conversations violating harmful-content policies to preset voices that prevent impersonation. Furthermore, the Agents SDK gives us the tools to implement our own safety guardrails, while enterprise-grade privacy commitments and support for EU Data Residency ensure that these powerful tools can be deployed confidently.

The launch of gpt-realtime and the production-ready Realtime API marks a significant moment in the evolution of AI interaction. We are moving beyond the era of turn-based commands and into a new landscape where AI can act as a true real-time partner in our creative and educational endeavors. The potential for building more intuitive, responsive, and genuinely helpful learning experiences is immense, and we are excited to see what our community builds next.

To help you harness the power of the Model Context Protocol, we have launched a new hands-on course dedicated to building these powerful, tool-enabled agents.

Mastering MCP: Building Advanced Agentic Applications


This course teaches you how to use the Model Context Protocol (MCP) to build real-world AI applications. You’ll explore the evolution of agentic AI, why LLMs need supporting systems, and how MCP works, from its architecture and life cycle to its communication protocols. You’ll build both single- and multi-server setups through hands-on projects like a weather assistant, learning to structure prompts and connect resources for context-aware systems. You’ll also extend the MCP application to integrate external frameworks like LlamaIndex and implement RAG for advanced agent behavior. The course covers observability essentials, including MCP authorization, authentication, logging, and debugging, to prepare your systems for production. It concludes with a capstone project where you’ll design and build a complete “Image Research Assistant,” a multimodal application that combines vision and research capabilities through a fully interactive web interface.

7hrs
Intermediate
1 Cloud Lab
12 Playgrounds

Written By:
Fahim ul Haq