VibeVoice

Cost / License

  • Free
  • Open Source

Platforms

  • Python
  • Self-Hosted
  • Hugging Face

Features

  1.  Text to Speech
  2.  AI-Powered

Tags

  • ai-model

VibeVoice information

  • Developed by

    Microsoft
  • Licensing

    Free and open source (MIT license).
  • Alternatives

    54 alternatives listed
  • Supported Languages

    • English

AlternativeTo Category

AI Tools & Services

GitHub repository

  •  17,143 Stars
  •  1,873 Forks
  •  58 Open Issues
VibeVoice was added to AlternativeTo by Paul.

What is VibeVoice?

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

A core innovation of VibeVoice is its use of continuous speech tokenizers (acoustic and semantic) that operate at an ultra-low frame rate of 7.5 Hz. The low frame rate keeps sequences short, which makes long audio much cheaper to process while preserving audio fidelity. VibeVoice uses a next-token diffusion framework: a Large Language Model (LLM) interprets the textual context and dialogue flow, and a diffusion head generates the high-fidelity acoustic details.

The model can synthesize speech up to 90 minutes long with up to 4 distinct speakers, surpassing the typical 1-2 speaker limits of many prior models.
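
To give a sense of scale, the 7.5 Hz frame rate and the 90-minute limit quoted above can be turned into a rough frame count. The sketch below is plain arithmetic, not VibeVoice code: the helper `frames_for` is a hypothetical name, and the 50 Hz comparison rate is an assumed figure chosen only for contrast. It counts frames for a single tokenizer stream and ignores text tokens and any other sequence overhead.

```python
FRAME_RATE_HZ = 7.5        # tokenizer frame rate quoted in the description
MAX_DURATION_MIN = 90      # maximum synthesis length quoted in the description
COMPARISON_RATE_HZ = 50.0  # assumed higher frame rate, for contrast only


def frames_for(duration_min: float, frame_rate_hz: float) -> int:
    """Return the number of frames needed to cover `duration_min` minutes of audio."""
    return round(duration_min * 60 * frame_rate_hz)


vibevoice_frames = frames_for(MAX_DURATION_MIN, FRAME_RATE_HZ)
comparison_frames = frames_for(MAX_DURATION_MIN, COMPARISON_RATE_HZ)

print(vibevoice_frames)   # 40500
print(comparison_frames)  # 270000
```

Under these assumptions, a 90-minute session is about 40,500 frames at 7.5 Hz, compared with 270,000 at the assumed 50 Hz rate. That difference illustrates why a low frame rate matters for fitting long conversations into a language model's context.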

Official Links