Qwen 3.5 Omni: Alibaba’s AI Model Can Now Hear, Watch, and Clone Your Voice

Summary

Alibaba’s Qwen team has released Qwen 3.5 Omni, a major upgrade to its omnimodal AI that natively processes text, images, audio, and video simultaneously, in real time, across 36 languages. Unlike typical text-centered models, Qwen 3.5 Omni handles all input types directly, without third-party tools. Available in Plus, Flash, and Light sizes, it features a 256,000-token context window and was trained on over 100 million hours of audio-visual data.

Enhancements include improved reasoning, a longer context window, wider language support, and advanced real-time features such as semantic interruption for smooth spoken interaction and the ARIA technique for maintaining audio clarity. Voice cloning and live web search are supported. On multilingual benchmarks, it outperformed major competitors such as ElevenLabs and GPT-Audio. The model can also perform "Audio-Visual Vibe Coding," generating code from video screen recordings. In tests, Qwen 3.5 Omni analyzed video content faster and more cohesively than ChatGPT 5.4 and switched between languages seamlessly. It is accessible via the Alibaba Cloud API and online demos.
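For readers curious what "accessible via the Alibaba Cloud API" looks like in code, below is a minimal sketch of a multimodal chat request in the OpenAI-compatible message format that Alibaba Cloud's Model Studio exposes. The model id `qwen3.5-omni-flash`, the example image URL, and the exact endpoint path are assumptions for illustration, not identifiers confirmed by the article; the sketch only builds the request and does not send it.

```python
# Hedged sketch: assembling a multimodal chat request for an Omni-style model
# via an OpenAI-compatible endpoint. Model id and URL are assumptions.
import json
import os

# One user turn mixing text and an image, using OpenAI-style content parts.
payload = {
    "model": "qwen3.5-omni-flash",  # hypothetical model id, not confirmed
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/frame.jpg"},
                },
            ],
        }
    ],
}


def build_request(api_key: str) -> dict:
    """Assemble URL, headers, and JSON body for a chat-completions POST."""
    return {
        # Assumed compatible-mode endpoint path; check the provider docs.
        "url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps(payload),
    }


req = build_request(os.environ.get("DASHSCOPE_API_KEY", "sk-demo"))
print(req["url"])
```

Audio or video inputs would be added as further content parts in the same user turn; streaming responses and the real-time voice features described above go through a separate live-session interface rather than this one-shot request shape.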