StepFun's Voice AI Topped Every Benchmark. It Also Hears Your Sighs

Summary

StepFun, a Shanghai-based AI lab, has launched StepAudio 2.5 Realtime, a real-time, end-to-end voice model that processes audio directly instead of first converting speech to text. It supports both Chinese and English. In the reported benchmarks, the model scored 82.18 out of 100 on paralinguistic comprehension, ahead of GPT Realtime 1.5 and other competitors, and 80.41 in human evaluation, also the highest in the field. The model can interpret non-verbal cues in the input audio, such as the speaker's emotion, speech rate, and age. To address common persona stability problems such as out-of-character (OOC) drift, StepFun uses roleplay-specific reinforcement learning from human feedback (RLHF) and a large, diverse dataset, with the aim of keeping character behavior consistent even in unusual conversations. StepFun, founded in 2023 by Jiang Daxin and backed by $1.7 billion in funding, positions the model as a direct competitor to OpenAI’s advanced voice mode and claims superior results. The launch includes Xiao Yue, a highly customizable AI persona, and an API for developers to build custom characters. The model is available at platform.stepfun.com.