

VibeVoice
Cost / License
- Free
- Open Source
Platforms
- Python
- Self-Hosted
- Hugging Face
Features
- Text to Speech
- AI-Powered
Tags
- ai-model
VibeVoice News & Activities
Recent activities
- Maoholguin added VibeVoice as alternative to Paraspeech
- Maoholguin added VibeVoice as alternative to FLUID - AI Dictation
- POX added VibeVoice as alternative to Vibe Transcribe, FUTO Voice Input, Voxtral and Speech Note
- POX added VibeVoice
VibeVoice information
What is VibeVoice?
VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
The model can synthesize speech up to 90 minutes long with up to 4 distinct speakers, surpassing the typical 1-2 speaker limits of many prior models.
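To give a sense of why the 7.5 Hz frame rate matters for long-form audio, the short sketch below works out how many tokenizer frames a 90-minute recording would need. The 7.5 Hz and 90-minute figures come from the description above; the 50 Hz comparison rate and the helper name `frames_for` are assumptions chosen purely for illustration and are not part of VibeVoice.

```python
def frames_for(duration_minutes: float, frame_rate_hz: float) -> int:
    """Return how many tokenizer frames cover the given audio duration."""
    return round(duration_minutes * 60 * frame_rate_hz)


# Figures taken from the description above.
VIBEVOICE_FRAME_RATE_HZ = 7.5
MAX_DURATION_MINUTES = 90

# Assumption: an illustrative higher frame rate used only for comparison.
COMPARISON_FRAME_RATE_HZ = 50.0

vibevoice_frames = frames_for(MAX_DURATION_MINUTES, VIBEVOICE_FRAME_RATE_HZ)
comparison_frames = frames_for(MAX_DURATION_MINUTES, COMPARISON_FRAME_RATE_HZ)

print(f"7.5 Hz frames for 90 min: {vibevoice_frames}")
print(f"50 Hz frames for 90 min:  {comparison_frames}")
print(f"Sequence length ratio:    {comparison_frames / vibevoice_frames:.2f}x")
```

Under these assumptions, a full 90-minute session is about 40,500 frames at 7.5 Hz, which is the kind of sequence length an LLM context can plausibly handle in one pass.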





