
Mistral AI unveils Voxtral TTS for nuanced & low-latency speech generation in 9 languages
Mistral AI has announced the launch of Voxtral TTS, a text-to-speech model designed for advanced multilingual voice generation. The model provides state-of-the-art results in nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
Unlike many text-to-speech systems, Voxtral TTS is lightweight, with 4 billion parameters. This design allows efficient deployment at scale while keeping speech output natural-sounding and reliable. The model also demonstrates advanced contextual understanding and speaker modeling, reproducing speaker characteristics such as natural pauses, rhythm, intonation, and emotional nuance.
Human evaluations support these claims: they show that Voxtral TTS surpasses ElevenLabs Flash v2.5 in naturalness and matches the quality and emotion-steering capabilities of ElevenLabs v3, while maintaining fast response times. Notably, Voxtral TTS adapts to a custom voice reference as short as three seconds, accurately reproducing accent, inflections, intonation, and natural disfluencies. For a typical input of a ten-second voice sample and 500 characters of text, the model delivers a latency of 70 milliseconds.
Users can experiment with Voxtral TTS in the Mistral AI Studio playground, use it in Le Chat, integrate it via API for $0.016 per 1,000 characters, or access open model weights on Hugging Face under a Creative Commons BY-NC 4.0 license.
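To put the API price in concrete terms, the following is a minimal sketch that estimates cost from the quoted rate of $0.016 per 1,000 characters. The function name `estimate_tts_cost_usd` is hypothetical, and the sketch assumes billing is strictly proportional to the character count as Python measures it with `len`. Actual billing rules, such as rounding or how whitespace is counted, may differ.

```python
from decimal import Decimal

# Rate quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS_USD = Decimal("0.016")


def estimate_tts_cost_usd(text: str) -> Decimal:
    """Estimate the API cost of synthesizing `text`.

    Assumption: cost scales linearly with len(text), with no
    minimum charge and no rounding. This is illustrative only.
    """
    return (Decimal(len(text)) / Decimal(1000)) * PRICE_PER_1K_CHARS_USD


if __name__ == "__main__":
    # A 500-character input, the size used in the latency figure above.
    sample = "a" * 500
    print(f"Estimated cost: ${estimate_tts_cost_usd(sample)}")
```

Under these assumptions, a 500-character request costs a little under one cent, and one million characters would cost $16.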
