Introducing OpenAI’s gpt-realtime for Voice Agents

OpenAI’s new gpt-realtime model and updated Realtime API mark a shift from laggy, turn-based interactions to fluid, real-time conversations. With end-to-end speech processing, multimodal inputs, and production-ready features, developers can now build voice agents that feel natural, responsive, and truly collaborative.
10 mins read
Sep 29, 2025

Think about the last time you interacted with a voice assistant.

You ask a question, and then comes that familiar, brief silence as your words are processed and a response is formulated. That slight but perceptible delay is a fundamental barrier in human-computer interaction: it is the gap that separates a simple transaction from a truly natural conversation.

At the end of August, OpenAI introduced a significant step toward closing that gap with the release of gpt-realtime, its most advanced speech-to-speech model, and major updates to its Realtime API. This is more than an incremental improvement in speed; it represents a foundational shift toward building AI agents that can interact with human conversation’s fluidity, nuance, and immediacy.

For us as developers, this announcement opens a new frontier. It challenges us to move beyond the turn-based, request-and-response model and create applications that feel less like tools and more like real-time collaborators. In this newsletter, we will explore what this means in practice as we cover the following:

  • What makes this a fundamental shift from traditional AI pipelines?

  • The key capabilities of the new gpt-realtime model.

  • A first look at the new Audio Playground for testing voice agents.

  • The crucial role of the Realtime API in enabling instant conversations.

  • The new production-ready features of the updated Realtime API.

The core innovation: A unified speech-to-speech system#

To appreciate the significance of this update, we first need to understand how most voice AI has worked until now. For years, the standard approach to building a voice agent involved stitching together separate, specialized systems in a pipeline: a speech-to-text model to transcribe the user’s audio, a large language model to process the text and decide on a response, and finally, a text-to-speech model to convert that response back into audio.

While functional, this pipeline approach has two major drawbacks. First, each handoff between models adds a delay, creating noticeable latency, the time delay between when a user speaks and when they begin to hear the AI’s response. Second, when human speech is converted to plain text, crucial information is lost: the tone, pace, and emotion that give our words meaning.

The gpt-realtime model and Realtime API introduce a new, more elegant paradigm. Instead of a complex, multi-step process, it operates as a single, end-to-end model. This unified system processes and generates audio directly, eliminating the intermediate text steps.

This architectural shift is the key to reducing latency and, for the first time, preserving the rich nuances of human expression from input to output.
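To make the contrast concrete, here is a toy sketch of the traditional three-stage pipeline. None of these functions are real APIs; the stage names, delays, and return values are all hypothetical, purely to illustrate that total latency is the sum of every hand-off and that the ASR step reduces rich audio to bare text:

```python
import time

# Toy illustration (not real APIs): the classic three-stage voice pipeline.
# Each hand-off adds latency, and the speech-to-text step discards tone,
# pace, and emotion before the LLM ever sees the input.

def speech_to_text(audio: bytes) -> str:          # hypothetical ASR stage
    time.sleep(0.3)                               # simulated transcription delay
    return "what's my order status"

def llm_respond(text: str) -> str:                # hypothetical LLM stage
    time.sleep(0.6)                               # simulated generation delay
    return "Your order ships tomorrow."

def text_to_speech(text: str) -> bytes:           # hypothetical TTS stage
    time.sleep(0.3)                               # simulated synthesis delay
    return b"\x00" * 1600                         # stand-in audio buffer

def pipeline_agent(audio: bytes) -> bytes:
    """Turn-based pipeline: total latency is the *sum* of all three stages."""
    start = time.monotonic()
    reply_audio = text_to_speech(llm_respond(speech_to_text(audio)))
    print(f"pipeline latency: {time.monotonic() - start:.1f}s")
    return reply_audio
```

A unified speech-to-speech model collapses these three stages into one, so there is no intermediate text to lose information in and no per-stage hand-off delay to accumulate.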

A closer look at the gpt-realtime model’s capabilities#

At the heart of this announcement is a model trained to excel at the real-world tasks required for building sophisticated voice agents. The improvements are not just theoretical; they are measurable and directly impact the quality of interaction we can create.

Natural expression and deeper understanding#

A truly effective voice agent must sound natural, and gpt-realtime produces higher-quality, more emotionally resonant speech precisely because it no longer loses vital data during a text conversion step. For instance, we can now instruct the model to “adopt a gentle and encouraging tone, speaking slowly to help a student with a difficult concept,” or to “respond with the formal, deliberate cadence of a history professor,” giving us precise creative control over the user experience.
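In practice, this kind of style direction is passed to the model as session instructions. The sketch below builds a `session.update` client event, which is the documented way to configure a Realtime session; the instruction text is just the example from above, and the voice name reflects one of the newly announced voices:

```python
import json

# Sketch: a Realtime API `session.update` client event that sets the agent's
# speaking style via natural-language instructions. The event type and
# `session` fields follow the Realtime API's event shape; the instruction
# text and voice choice are illustrative.
def build_style_event(style: str, voice: str = "marin") -> str:
    event = {
        "type": "session.update",
        "session": {
            "instructions": style,
            "voice": voice,
        },
    }
    return json.dumps(event)

payload = build_style_event(
    "Adopt a gentle and encouraging tone, speaking slowly "
    "to help a student with a difficult concept."
)
```

Because the instructions are plain language, swapping the tutoring persona for the "formal, deliberate cadence of a history professor" is a one-line change.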

However, the true power of this new architecture lies in its ability to preserve the subtleties of human expression, moving beyond mere words to capture actual tone and intent from the user. This deeper comprehension is also a direct result of processing audio end-to-end. The model can now follow the natural, often unstructured, flow of human speech, interpreting non-verbal sounds like laughter that provide crucial context to a conversation. The agent can adapt without missing a beat if a user moves between languages in a single thought. This improved intelligence also extends to capturing critical details with higher accuracy, ensuring that specific information like a phone number or tracking number is understood correctly the first time.

Enhanced intelligence for complex tasks#

Beyond comprehension, gpt-realtime shows significant gains in reasoning and instruction following. On the Big Bench Audio evaluation, a benchmark that measures reasoning over audio input, the model scores 82.8% accuracy. On MultiChallenge, an audio benchmark that measures adherence to instructions across multi-turn conversations, it scores 30.5%, a significant improvement over previous models.

One of the most impactful new features is improved asynchronous function calling. This allows the model to execute a tool, such as looking up information in a database, without pausing the conversation. The agent can continue interacting with the user while the tool runs in the background, eliminating the awkward silences that break conversational flow. On the ComplexFuncBench audio evaluation, gpt-realtime scores 66.5% on function calling, a substantial leap that makes our agents far more capable.
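Tools are registered on the session as function definitions with a JSON-Schema parameter description. The shape below follows the Realtime API's session configuration; the `lookup_order` tool itself is hypothetical, standing in for whatever database lookup your agent needs:

```python
# Sketch: registering a tool the model can call mid-conversation. The
# function-tool shape (`type`, `name`, `description`, JSON-Schema
# `parameters`) follows the Realtime API's session config; the
# `lookup_order` tool is a hypothetical example.
def order_lookup_tool() -> dict:
    return {
        "type": "function",
        "name": "lookup_order",
        "description": "Fetch the status of an order by its tracking number.",
        "parameters": {
            "type": "object",
            "properties": {
                "tracking_number": {"type": "string"},
            },
            "required": ["tracking_number"],
        },
    }

session_update = {
    "type": "session.update",
    "session": {"tools": [order_lookup_tool()]},
}
```

When the model decides to call `lookup_order`, our application runs the lookup and streams the result back while the conversation keeps flowing.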

Informational note: With this release, OpenAI has introduced two new, highly realistic voices, Marin and Cedar. These voices are noted to have the most significant improvements in natural-sounding speech and are available exclusively in the Realtime API.

A look at the Realtime Audio Playground#

This all sounds powerful in theory, but how does it work in practice? And can we try it out? The answer is yes. OpenAI has integrated these new capabilities directly into their developer platform via the Audio Playground, allowing us to test and configure real-time agents before writing a single line of code.

OpenAI's gpt-realtime Audio Playground

Click the “Create” button to open the main interface for creating a new “audio prompt,” which is essentially a real-time voice agent.

OpenAI’s gpt-realtime Audio Playground for prompting

Let’s break down what we’re seeing. The interface is split into two key areas: the central conversation space and the right-hand configuration panel, where we define the agent’s behavior. Here are some of the key settings available in the configuration panel:

  • Model selection: This is where we select our model. As shown, we have chosen gpt-realtime to power the conversation.

  • Audio controls: We have fine-grained control over the conversational flow. Silence duration lets us define how long the agent waits after the user stops speaking before it responds, while Noise reduction helps clean up the audio input for better accuracy in real-world environments.

  • Tool integration: The Functions and MCP servers sections are where we connect our agent to the outside world. This is the UI equivalent of the API functionality we will discuss next, allowing us to easily add tools like a database lookup or a connection to a GitHub repository.

  • User transcript model: An interesting detail is selecting a User transcript model (like whisper-1). This highlights that while the core interaction is speech-to-speech, a transcription model is still used in the background to provide a text log of the conversation for developers to review and debug.

The playground allows us to configure all these parameters, click “Start session,” and have a live conversation with our agent directly in the browser. It’s an invaluable tool for rapid prototyping and understanding how different configurations will impact the end user experience.
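The playground knobs map directly onto session fields in the Realtime API. The sketch below mirrors the panel described above: server-side voice activity detection with a silence threshold, input noise reduction, and a whisper-1 transcript model. The field names follow the documented session object; the specific values are illustrative defaults, not recommendations:

```python
# Sketch: the API-side equivalent of the playground's configuration panel.
# Field names follow the Realtime API session object; values are examples.
session_config = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "silence_duration_ms": 500,   # "Silence duration" in the UI
        },
        "input_audio_noise_reduction": {
            "type": "near_field",          # "Noise reduction" in the UI
        },
        "input_audio_transcription": {
            "model": "whisper-1",          # "User transcript model" in the UI
        },
    },
}
```

Anything we dial in while prototyping in the browser can later be reproduced verbatim in code.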

Understanding the Realtime API: The bridge to our model#

While gpt-realtime is the intelligent model, the Realtime API is the essential communication layer that connects our applications to it. To understand its importance, we must distinguish it from a standard request-response API.

In a typical API call, we send a complete prompt, wait for the server to process it, and receive a complete answer in return. This is efficient for many tasks but creates a noticeable delay that feels unnatural in a live conversation.

The Realtime API, first introduced as a public beta in October 2024, was built specifically to solve this problem. It operates on a streaming principle, creating a persistent, two-way connection between our application and the model. Instead of sending data in one block, our application sends a continuous stream of audio, and the model streams its response back as it is generated. This architecture eliminates the delay, allowing for the fluid, low-latency interactions required for a truly conversational agent.
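That persistent connection is a WebSocket rather than a one-shot HTTP request. The sketch below assembles the endpoint and bearer-token header following OpenAI's documented connection pattern; actually opening the socket (for example with the `websockets` library) and streaming audio frames is left out, so treat this as a connection-setup outline:

```python
import os

# Sketch: parameters for the Realtime API's persistent WebSocket connection.
# The endpoint and Authorization header follow OpenAI's documented pattern;
# the model name is passed as a query parameter.
REALTIME_ENDPOINT = "wss://api.openai.com/v1/realtime"

def connection_params(model: str = "gpt-realtime") -> tuple[str, dict]:
    url = f"{REALTIME_ENDPOINT}?model={model}"
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
    }
    return url, headers

url, headers = connection_params()
```

Once the socket is open, both sides exchange JSON events (like the `session.update` examples in this article) and audio chunks over the same long-lived connection.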

The relationship between the gpt-realtime model and the Realtime API is symbiotic. The model provides advanced conversational intelligence, and the API provides the robust infrastructure to deliver that intelligence instantly. As gpt-realtime has evolved with new capabilities, the Realtime API has been upgraded in lockstep to support them. The features we are about to discuss are the new tools within the API that unlock the full potential of this powerful new model.

What’s new in the Realtime API: Building production-ready agents#

Now that we understand the Realtime API’s fundamental role as our streaming bridge to the model, we can explore its latest evolution. With its move from public beta to general availability, the API introduces several features designed for reliability, capability, and seamless integration in production environments. These updates allow us to build agents that are far more aware and connected to the world around them. Let’s explore the three most significant new capabilities:

  • Image input: The Realtime API is now truly multimodal, with the native ability to accept image inputs alongside audio. This allows us to ground conversations in visual context, creating richer and more effective interactions where words alone are not enough.

  • Remote MCP server support: This feature makes extending an agent’s capabilities significantly easier by connecting it to external tools and services using the Model Context Protocol (MCP). Consider MCP as a universal adapter for AI tools; instead of writing custom integration code for every external service, we can point our agent to a public MCP server.

  • SIP support: We can now connect our applications directly to the public phone network using the Session Initiation Protocol (SIP), the standard signaling protocol for setting up voice calls over the internet. This means the agents we build are no longer confined to apps or websites; they can handle real phone calls, opening up a vast range of use cases.
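Two of these capabilities surface as simple JSON payloads. The sketch below shows the shapes as I understand them from the GA docs — an MCP server added to the session's tool list, and an image attached as a conversation item — but the server URL, label, and image data are placeholders, so treat this as an assumed outline rather than a verified call:

```python
# 1) Remote MCP server: declared alongside function tools in the session.
#    The label and URL are placeholders for a real MCP deployment.
mcp_session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "mcp",
                "server_label": "docs",                   # hypothetical label
                "server_url": "https://example.com/mcp",  # placeholder URL
                "require_approval": "never",
            }
        ]
    },
}

# 2) Image input: attached as a user message item alongside text or audio,
#    grounding the spoken conversation in visual context.
image_item = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_image", "image_url": "data:image/png;base64,..."},
            {"type": "input_text", "text": "What's in this screenshot?"},
        ],
    },
}
```

With MCP, adding a new capability means pointing at a different server URL instead of writing bespoke integration code for each service.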

Informational note: The Realtime API now supports reusable prompts to improve developer workflow and ensure consistency. This allows us to save a complete agent configuration as a single, reusable template, including its system instructions, tool definitions, and variables. This is especially valuable for deploying specialized agents at scale, as it eliminates redundant setup and guarantees predictable behavior across all sessions.
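A reusable prompt is referenced by id when configuring a session. The sketch below follows OpenAI's reusable-prompt object shape (an `id`, an optional pinned `version`, and substitutable `variables`); the prompt id and variable names here are placeholders:

```python
# Sketch: pointing a session at a saved, reusable prompt instead of
# repeating instructions and tool definitions inline. The prompt id,
# version, and variables are placeholders.
prompt_session = {
    "type": "session.update",
    "session": {
        "prompt": {
            "id": "pmpt_example123",              # placeholder prompt id
            "version": "2",                       # pin a specific version
            "variables": {"customer_name": "Ada"},
        }
    },
}
```

Pinning a version is what guarantees the predictable, identical behavior across sessions that the note above describes.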

A look at pricing and availability#

The gpt-realtime model and the generally available Realtime API are now accessible to all developers. To encourage broader adoption for production applications, OpenAI has reduced the price by 20% compared to the previous preview version, making it more cost-effective to build and scale sophisticated voice agents.

The pricing is structured per 1 million tokens and varies by the type of input and output. It is important to note the significant cost savings offered for cached inputs, which helps make longer conversations more economical.

Beyond the price reduction, OpenAI has also introduced more fine-grained controls to help developers manage costs effectively in production. We can now set intelligent token limits and truncate multiple conversational turns simultaneously. This is particularly useful for managing long-running sessions, ensuring that costs remain predictable without abruptly ending a user’s interaction.
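The cost controls mentioned above surface as session fields and client events. The sketch below shows one plausible shape: a per-response output-token cap set on the session, and a `conversation.item.truncate` event that trims the stored audio of an earlier turn. The field and event names follow the Realtime API as I understand it; the item id and values are placeholders:

```python
# Sketch: two cost-control levers for long-running sessions.
# Cap how many tokens any single response may generate.
limit_event = {
    "type": "session.update",
    "session": {"max_response_output_tokens": 1024},
}

# Trim the retained audio of an earlier turn so it stops accruing
# context cost; the item id is a placeholder for a real turn's id.
truncate_event = {
    "type": "conversation.item.truncate",
    "item_id": "item_example",   # placeholder id of an earlier assistant turn
    "content_index": 0,
    "audio_end_ms": 5000,        # keep only the first 5 seconds of audio
}
```

Together these keep a long tutoring or support session within a predictable budget without cutting the user off mid-conversation.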

A true collaborator#

As we build more powerful and interactive systems, ensuring their responsible use is paramount. The Realtime API is built with multiple layers of protection, from active classifiers that can halt conversations violating harmful-content policies to preset voices that prevent impersonation. Furthermore, the Agents SDK gives us the tools to implement our own safety guardrails, while enterprise-grade privacy commitments and support for EU Data Residency ensure that these powerful tools can be deployed confidently.

The launch of gpt-realtime and the production-ready Realtime API marks a significant moment in the evolution of AI interaction. We are moving beyond the era of turn-based commands and into a new landscape where AI can act as a true real-time partner in our creative and educational endeavors. The potential for building more intuitive, responsive, and genuinely helpful learning experiences is immense, and we are excited to see what our community builds next.

To help you harness the power of the Model Context Protocol, we have launched a new hands-on course dedicated to building these powerful, tool-enabled agents.

Mastering MCP: Building Advanced Agentic Applications


This course teaches you how to use the Model Context Protocol (MCP) to build real-world AI applications. You’ll explore the evolution of agentic AI, why LLMs need supporting systems, and how MCP works, from its architecture and life cycle to its communication protocols. You’ll build both single- and multi-server setups through hands-on projects like a weather assistant, learning to structure prompts and connect resources for context-aware systems. You’ll also extend the MCP application to integrate external frameworks like LlamaIndex and implement RAG for advanced agent behavior. The course covers observability essentials, including MCP authorization, authentication, logging, and debugging, to prepare your systems for production. It concludes with a capstone project where you’ll design and build a complete “Image Research Assistant,” a multimodal application that combines vision and research capabilities through a fully interactive web interface.

7hrs
Intermediate
1 Cloud Lab
12 Playgrounds

Written By:
Fahim ul Haq