Meta recently announced Voicebox, a speech generation model that uses the Flow Matching algorithm to learn from a large amount of raw audio and transcribed text and then generate natural, expressive speech.
Voicebox is also not limited to speech in a specific domain. Given enough audio and text, it can produce clean, noise-free speech, and it can also edit content, convert style, or output speech with different vocal characteristics.
The Flow Matching algorithm lets the model learn directly from raw audio and text and generate speech from them, unlike earlier speech generation models, which required task-specific training and could only be trained on a single type of voice content.
The Flow Matching algorithm is designed to learn how speech varies relative to its text. Even when the text is identical, differences in delivery such as intonation, speaking rate, accent, or stress can give the same words different meanings.
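The article does not spell out the training objective, so the following is only a minimal sketch of what a flow matching training step can look like, assuming the straight-line noise-to-data path commonly used in the flow matching literature. The function names, array shapes, and the `sigma_min` value are illustrative assumptions rather than Voicebox's actual code, and the text and audio-context conditioning that a real speech model needs is omitted.

```python
import numpy as np

def cfm_pair(x0, x1, t, sigma_min=1e-5):
    """Build one training pair for conditional flow matching.

    x0: noise samples, x1: data samples (same shape, one row per example),
    t:  one time value in [0, 1] per example.
    Returns the interpolated point x_t and the target velocity u_t.
    """
    t = np.asarray(t, dtype=float).reshape(-1, 1)  # broadcast over features
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0  # time derivative of x_t
    return x_t, u_t

def cfm_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean(np.sum((v_pred - u_t) ** 2, axis=-1)))

# Toy data: 4 hypothetical feature frames of dimension 8 (assumed shapes).
rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))    # stand-in for real audio features
x0 = rng.normal(size=x1.shape)  # Gaussian noise
t = rng.uniform(size=4)         # random time per example

x_t, u_t = cfm_pair(x0, x1, t)
# A trained network would predict the velocity from (x_t, t, conditioning);
# here a zero prediction stands in just to show the loss call.
loss = cfm_loss(np.zeros_like(u_t), u_t)
```

In this sketch the network never sees `x0` or `x1` directly. It only sees the noisy point `x_t` and the time `t`, and is trained to predict the direction that carries noise toward data. Because many different recordings can share the same text, such a model can represent a distribution over deliveries instead of a single averaged rendering.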
Voicebox is currently trained on more than 50,000 hours of recorded speech, including public recordings in English, French, Spanish, German, Polish, and Portuguese together with their corresponding transcripts. As a generative AI model, it can quickly learn a wide range of pronunciations and reading styles. Given only a voice sample and a piece of text, it can read the text in the style of that sample, and the result can even be edited afterwards.
However, because the model could be abused, Meta is currently publishing only the underlying technology and is not releasing the model or its source code.


