Meta Unveils Voicebox: A Generative AI Model for Audio Applications
Meta has unveiled Voicebox, a generative AI model designed for audio applications and its latest venture in the field of artificial intelligence. The model follows the company's earlier AI speech technology, which can recognize over 4,000 languages. Voicebox is similar to a text-to-speech (TTS) system but more capable and versatile.
Voicebox offers a range of features, including the generation, editing, and sampling of audio. It aims to create natural-sounding voices for virtual assistants and metaverse characters, and to improve audio synthesis for content creators and people with visual impairments. Unlike text- and image-focused models such as ChatGPT or DALL-E, Voicebox produces audio files from user-provided text.
What sets Voicebox apart from other speech synthesis models is its efficient training process. Meta trained the model on audio recordings and transcriptions from non-specialized audiobooks in English, French, Spanish, German, Polish, and Portuguese. Despite this departure from the conventional training approach, the speech Voicebox generates has proven highly effective when compared with existing solutions such as Microsoft's VALL-E.
Voicebox's capabilities go beyond text-to-speech conversion. The model can predict and fill in missing speech segments based on the surrounding audio and the provided transcription. This ability supports speech generation and also makes it possible to repair previously recorded audio files, saving time and effort. For example, the model can replace poorly pronounced words or passages marred by background noise, giving a smoother listening experience.
Voicebox also supports multiple languages, producing text-to-speech output in English, French, German, Spanish, Polish, and Portuguese. It can produce accurate audio even when the language of the speech sample differs from that of the given text. With training data spanning various genres in these six languages, Voicebox delivers speech that closely mimics real-life conversation.
Despite these promising features, Meta has announced that Voicebox is not currently available to the public, citing security concerns. Interested readers can, however, explore the demonstration videos and the accompanying research paper included in the press release. As Meta continues to push the boundaries of AI innovation, Voicebox sets a new benchmark for high-quality, versatile audio synthesis.
