
Alibaba launches Qwen3.5-Omni series with omnimodal, multilingual, and captioning upgrades
Alibaba Cloud has introduced Qwen3.5-Omni, the latest entry in its large language model lineup, expanding the series with the Qwen3.5-Omni-Plus and Qwen3.5-Omni-Plus-Realtime models. Qwen3.5-Omni is positioned as the company's leading omnimodal large language model, with integrated understanding of text, image, audio, and audio-visual content. The architecture employs a Hybrid-Attention Mixture-of-Experts design for both its Thinker and Talker components, and the lineup spans instruct models at three capability tiers: Plus, Flash, and Light.
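The announcement does not detail the internals, but as a rough illustration of the Mixture-of-Experts idea behind such components, the toy PyTorch sketch below routes each token to its top-k experts. All dimensions, expert counts, and names here are placeholder assumptions, not Qwen3.5-Omni's actual design:

```python
# Toy top-k Mixture-of-Experts feed-forward layer (illustrative only;
# sizes, expert count, and routing details are placeholder assumptions).
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                       # x: (tokens, d_model)
        logits = self.router(x)                 # score every expert per token
        weights, idx = logits.topk(self.k, -1)  # keep only the top-k experts
        weights = weights.softmax(-1)           # normalize over the selected k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens routed to expert e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(10, 64)).shape)     # torch.Size([10, 64])
```

The appeal of this pattern is that only k experts run per token, so parameter count can grow without a proportional increase in per-token compute.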
Building on this foundation, the Qwen3.5-Omni models support a 256,000-token long-context input, can process more than 10 hours of audio, and handle over 400 seconds of 720p video at one frame per second. The models are pretrained on extensive multimodal datasets, including more than 100 million hours of audio-visual material, underpinning their perception and content generation across formats.
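As a small worked example of those capacity figures, a hypothetical pre-flight check might look like the following; the constants simply restate the announced limits, and the helper itself is illustrative rather than part of any Qwen API:

```python
# Hypothetical pre-flight check that restates the announced limits;
# this helper is illustrative and not part of any Qwen API.
MAX_CONTEXT_TOKENS = 256_000
MAX_AUDIO_SECONDS = 10 * 3600   # "more than 10 hours" of audio
MAX_VIDEO_SECONDS = 400         # 720p video sampled at 1 fps
VIDEO_FPS = 1                   # one frame per second, per the announcement

def fits_limits(audio_seconds: float = 0.0, video_seconds: float = 0.0) -> bool:
    """Return True if a clip stays within the stated per-request limits."""
    return (audio_seconds <= MAX_AUDIO_SECONDS
            and video_seconds <= MAX_VIDEO_SECONDS)

# A 6-minute 720p clip yields 360 frames at 1 fps and fits the budget.
print(fits_limits(video_seconds=360))  # True
print(fits_limits(video_seconds=600))  # False: exceeds the ~400 s video limit
```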
On the language front, Qwen3.5-Omni delivers major improvements, with speech recognition for 113 languages and dialects and speech generation in 36. Beyond this broader multilingual reach, Qwen3.5-Omni-Plus outperforms Gemini-3.1 Pro on audio tasks and matches its performance in audio-visual understanding. The series also features advanced captioning, producing screenplay-level descriptions with scene segmentation, timestamping, and detailed mapping of character relationships within audio content. The new models are available through both Offline and Realtime APIs.
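As a sketch of what an Offline-style call might look like, the example below uses the OpenAI-compatible endpoint through which earlier Qwen-Omni models have been served on Alibaba Cloud's DashScope; the model name and image URL are assumptions for illustration, not identifiers confirmed by this announcement:

```python
# pip install openai
import os
from openai import OpenAI

# DashScope exposes an OpenAI-compatible endpoint; earlier Qwen-Omni
# models are served this way. The model name below is assumed from the
# announcement and may differ in the actual model catalog.
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

stream = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # assumed name; check the published model list
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/scene.jpg"}},  # placeholder
            {"type": "text",
             "text": "Describe this scene at screenplay level, with timestamps."},
        ],
    }],
    stream=True,  # omni models on this endpoint have typically required streaming
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

The Realtime API would instead involve a persistent, bidirectional session for live audio in and out; its exact interface is not described in this announcement.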
