OpenAI introduces three realtime audio models for developers
13 articles · Updated · OpenAI · May 7
The launch adds GPT-Realtime-2, live translation from 70-plus input languages into 13 output languages, and GPT-Realtime-Whisper streaming transcription, all available now through the Realtime API.
OpenAI said GPT-Realtime-2 brings GPT-5-class reasoning, a 128K context window, and tool-calling for voice agents, with benchmark gains over GPT-Realtime-1.5 on audio intelligence and instruction following.
The company said the models target voice assistants, multilingual support, and live captions, with safety classifiers running inside every session and pricing starting at $0.017 per minute for transcription.
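For orientation, here is a minimal sketch of what a translating voice agent on the new model might look like, assuming the beta realtime interface of the current openai Python SDK carries over. The model identifier comes from the announcement; the session fields and instructions are illustrative, not confirmed by OpenAI's documentation.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def main() -> None:
    # "gpt-realtime-2" is the model name from the announcement; the
    # beta realtime interface below is the SDK's current shape and
    # may differ once the new models ship.
    async with client.beta.realtime.connect(model="gpt-realtime-2") as conn:
        # Configure the session as a live Spanish-to-English translator.
        await conn.session.update(session={
            "modalities": ["audio", "text"],
            "instructions": (
                "You are a live translator. Render everything the user "
                "says in Spanish into natural English, preserving tone."
            ),
        })
        # A real agent would stream microphone audio via
        # conn.input_audio_buffer.append(...); here we just request a
        # response and log the event stream until the turn completes.
        await conn.response.create()
        async for event in conn:
            print(event.type)
            if event.type == "response.done":
                break

asyncio.run(main())
```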
With OpenAI and DeepL in a voice AI arms race, what does the future of human conversation and employment look like?
As AI voices become indistinguishable from humans, how can society prevent a new wave of undetectable, real-time fraud?
Agentic AI promises huge cost savings, but what are the hidden societal costs of automating millions of service jobs?
OpenAI’s Realtime Audio Suite: Breakthroughs in Accuracy, Latency, and Voice Customization for Developers
Overview
In May 2026, OpenAI launched three advanced audio models that significantly improve speech-transcription accuracy and generate more natural, customizable voices. The models support real-time applications through the Realtime API and streaming-oriented optimizations. The launch has intensified competition and raised ethical concerns about voice impersonation and deepfake audio, prompting OpenAI to run safety classifiers inside sessions and other industry players to develop watermarking technologies. The models let developers build interactive voice agents, customer-service tools, and accessibility applications, though even at entry pricing of $0.017 per minute for transcription, sustained usage costs have sparked worries about affordability for smaller developers. Regulatory pressure and user-adoption challenges remain key hurdles as OpenAI plans further enhancements and multimodal capabilities.
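As a concrete illustration of the streaming-transcription path the overview describes, the sketch below assumes the article's GPT-Realtime-Whisper model is reachable through the same SDK connection and reuses today's input-audio-transcription session fields and event names; all of those details are assumptions until OpenAI publishes the API reference.

```python
import asyncio
import base64
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def transcribe(chunks: list[bytes]) -> None:
    # "gpt-realtime-whisper" comes from the article; connecting to it
    # this way, and the session/event names below, are assumptions
    # modeled on the Realtime API's existing transcription events.
    async with client.beta.realtime.connect(model="gpt-realtime-whisper") as conn:
        await conn.session.update(session={
            "input_audio_transcription": {"model": "gpt-realtime-whisper"},
            "turn_detection": {"type": "server_vad"},
        })
        for chunk in chunks:  # chunks: raw PCM16 audio frames
            await conn.input_audio_buffer.append(
                audio=base64.b64encode(chunk).decode("ascii")
            )
        async for event in conn:
            if event.type == "conversation.item.input_audio_transcription.completed":
                print(event.transcript)
                break

# Example: asyncio.run(transcribe(mic_frames)), where mic_frames is a
# list of short PCM16 buffers captured from the microphone.
```

At the quoted $0.017 per minute, an hour of continuous captioning would cost roughly $1.02, which is the kind of sustained-usage math behind the affordability concerns noted above.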