Advancing voice intelligence with new models in the API
We’re introducing three audio models in the API that unlock a new class of voice apps for developers. With these models, developers can build voice experiences that feel more natural, respond more intelligently, and take action in real time:

- GPT‑Realtime‑2, our first voice model with GPT‑5‑class reasoning, which can handle harder requests and carry the conversation forward naturally.
- GPT‑Realtime‑Translate, a new live translation model that translates speech from 70+ input languages into 13 output languages while keeping pace with the speaker.
- GPT‑Realtime‑Whisper, a new streaming speech‑to‑text model that transcribes speech live as the speaker talks.

Try GPT‑Realtime‑2. After you start the session, try saying one of these:

- I’m hosting a last-minute dinner tonight. I have 30 minutes, two vegetarian friends, one mushroom-hater, and a tiny kitchen. Help me plan a simple menu.
- I’m welcoming guests to a live event in Japan. Say a warm, natural welcome in Japanese, like a host kicking off something special.
- My order number is Orbit-742Q. Repeat it back clearly so I can confirm it’s right.
- Help me practice telling my team we hit our launch milestone. First say it with quiet confidence, then with more excitement.
- I’m planning trivia for a road trip. Give me three trick questions that sound deceptively simple, then explain each answer in one sentence.

Voice is becoming one of the most natural ways for people to use software. It lets someone ask for help while driving, change a travel plan while walking through an airport, get support in their preferred language, or move through a task without stopping to type. But building useful voice products takes more than fast turn-taking or a natural-sounding voice. A voice agent needs to understand what someone means, keep track of context, recover when a request changes, use tools while the conversation continues, and respond in a way that feels appropriate to the moment.

Together, the models we are launching move realtime audio from simple call-and-response toward voice interfaces that can actually do work: listen, reason, translate, transcribe, and take action as a conversation unfolds. As voice becomes a more natural way to use software, we’re seeing developers build around three emerging patterns in voice AI:

- Voice-to-action, where people describe what they need and the system reasons through the request, uses tools, and completes the task (see the sketch after this list). For example, Zillow is building an assistant that can listen, reason, and act on requests like: “find me homes within my BuyAbility, avoid busy streets, and schedule a tour for Saturday.”
- Systems-to-voice, where software turns context into live spoken guidance. For example, a travel app could proactively tell a traveler: “Your inbound flight is delayed, but you can still make your connection. I found the new gate, mapped the fastest route through the terminal, and your bag is still expected to transfer.”
- Voice-to-voice, where AI helps live conversations continue across languages, tasks, or changing context. For example, Deutsche Telekom is…
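To make the voice-to-action pattern concrete, here is a minimal sketch of a Realtime session that registers a tool the model can call mid-conversation. It is a sketch under assumptions, not a definitive integration: the model identifier gpt-realtime-2 and the schedule_tour tool are placeholders we invented for illustration, and the event shapes follow today’s Realtime WebSocket interface, so names may differ from the final API reference.

```ts
// Sketch of a "voice-to-action" Realtime session over WebSocket.
// Assumptions: model id "gpt-realtime-2" and current Realtime event shapes.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2", // model id is an assumption
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("open", () => {
  // Register a tool the model may call while the conversation continues.
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are a home-search assistant. Use tools to act on requests.",
      tools: [{
        type: "function",
        name: "schedule_tour", // hypothetical tool, for illustration only
        description: "Book a home tour for a listing on a given date",
        parameters: {
          type: "object",
          properties: {
            listing_id: { type: "string" },
            date: { type: "string", description: "ISO 8601 date" },
          },
          required: ["listing_id", "date"],
        },
      }],
    },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // When the model decides to act, it emits the tool call's arguments;
  // the app performs the action and feeds the result back into the session.
  if (event.type === "response.function_call_arguments.done") {
    const args = JSON.parse(event.arguments);
    const result = { confirmed: true, ...args }; // replace with a real backend call
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));
    // Ask the model to continue the spoken response using the tool result.
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});
```

The point of the pattern is that the action happens while audio keeps streaming: the app executes the tool call, returns a function_call_output item, and then asks the model to keep speaking with the result in hand.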
