Amazon has announced a new foundation model, Amazon Nova Sonic, which unifies speech understanding and speech generation in a single model, bringing the voice conversation performance of AI application services closer to that of real people. It can be accessed as an API through Amazon Bedrock and used for customer service call automation or for cross-industry AI agent services in fields such as travel, education, healthcare, and entertainment.
Traditional voice application development requires coordinating multiple models at once: a speech recognition model that converts speech into text, a large language model that understands the input and generates a response, and a text-to-speech model that converts the response back into audio. This not only increases development complexity but also makes it hard to preserve the vocal context and nuances that matter in natural conversation, such as tone, intonation, and speaking style.
Nova Sonic, by contrast, abandons the earlier multi-model design and unifies understanding and generation in a single model. This lets the model adjust its generated voice responses based on acoustic context such as the tone and style of the spoken input, producing intonation closer to that of natural conversation.
Nova Sonic can also pick up on the subtle dynamics of human conversation, including natural pauses and hesitations, enabling it to respond appropriately and handle interruptions gracefully. In addition, the model generates a text transcript of the spoken content, which developers can use to call specific tools and APIs and thereby build richer voice AI agent services.
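To make the tool-calling pattern concrete, the sketch below shows, in a minimal and hypothetical form, how an application might route the text transcript emitted alongside the model's audio to its own tools. All names here (`handle_transcript`, `check_weather`, `book_flight`) are illustrative assumptions, not part of any AWS SDK; a real integration would use the tool-use events of the Bedrock API rather than keyword matching.

```python
# Illustrative sketch only: the functions and keyword-based routing below
# are hypothetical, chosen to show the general idea of dispatching a
# speech-to-speech model's text transcript to application tools.

def check_weather(city: str) -> str:
    """Hypothetical tool: return a canned weather report."""
    return f"Sunny in {city}"

def book_flight(destination: str) -> str:
    """Hypothetical tool: pretend to book a flight."""
    return f"Flight to {destination} booked"

# Map intent keywords (assumed to be found in the transcript)
# to the tool that should handle them.
TOOLS = {
    "weather": check_weather,
    "flight": book_flight,
}

def handle_transcript(transcript: str, argument: str) -> str:
    """Dispatch a transcript to the first matching tool, if any."""
    for keyword, tool in TOOLS.items():
        if keyword in transcript.lower():
            return tool(argument)
    return "No matching tool"

print(handle_transcript("What's the weather like?", "Taipei"))
# → Sunny in Taipei
```

In practice the transcript would arrive as streaming events from the model, and tool selection would be driven by structured tool-use output rather than substring checks; the point is simply that having text alongside audio is what makes this dispatch step possible.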
You can experience the natural intonation Nova Sonic generates through the following link:
