OpenAI updates the Realtime API with gpt-realtime, its most advanced voice AI model yet
OpenAI’s Realtime API is now generally available after first launching in October 2024, bringing what the company calls its best voice AI model yet: gpt-realtime. This speech-to-speech system processes and generates audio directly, without converting it to text first, delivering faster and more natural interactions. It can interpret nonverbal cues, support function calls, switch languages mid-sentence, adjust tone or accent, and generate speech with emotional inflections. Reported benchmark results highlight its progress: 82.8% on Big Bench Audio, 30.5% on MultiChallenge, and 66.5% on ComplexFuncBench.
Developers also gain enhanced integration options, including support for Session Initiation Protocol (SIP) to enable phone calling and remote Model Context Protocol (MCP) servers for connecting external tools and services. Additional features include reusable prompts, token limits, and session-trimming controls to manage costs. Image input support enables screenshots or photos to be processed for text reading or content-based queries, with permissions configurable by developers.
OpenAI also added two new synthetic voices, Cedar and Marin, alongside updates to existing ones. Pricing has been reduced by 20%, with audio input tokens at $32 per million and cached input tokens at $0.40 per million. For EU users and privacy-sensitive businesses, data can be stored within the European Union under stricter compliance rules. The updated tools are available now through the Playground, with details in the official API documentation.
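To put the quoted prices in perspective, here is a minimal sketch of an input-cost estimate. It uses only the two rates stated above; the function name and the token counts are hypothetical, and output-token pricing, which is not given here, is left out.

```python
from decimal import Decimal

# Rates quoted in the article, in USD per one million tokens.
AUDIO_INPUT_PER_MILLION = Decimal("32.00")
CACHED_INPUT_PER_MILLION = Decimal("0.40")


def estimate_input_cost(audio_tokens: int, cached_tokens: int = 0) -> Decimal:
    """Estimate the USD cost of input tokens only (hypothetical helper)."""
    million = Decimal(1_000_000)
    cost = (
        Decimal(audio_tokens) * AUDIO_INPUT_PER_MILLION
        + Decimal(cached_tokens) * CACHED_INPUT_PER_MILLION
    ) / million
    # Round to a hundredth of a cent for display.
    return cost.quantize(Decimal("0.0001"))


# Illustrative workload: 250,000 fresh audio tokens plus 750,000 cached tokens.
print(estimate_input_cost(250_000, 750_000))  # 8.3000
```

On these assumed numbers, the same one million input tokens would cost $32 with no cache hits, so the gap between the two rates matters a great deal for long sessions that reuse context.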