Google’s AI model can generate music, dialogues for videos: How it works – Times of India



Google DeepMind researchers have developed an AI-powered model, called video-to-audio (V2A), that can generate audio and dialogue for videos. This development is a significant step towards creating fully audiovisual experiences using AI.

How Google’s V2A AI model works

The video-to-audio (V2A) technology works with videos generated by AI models, such as Google’s Veo, which was announced at Google I/O 2024. V2A works by combining video information with natural-language text prompts.
Users can provide additional instructions to guide the V2A system towards the specific sounds they want for a video, allowing creative control over the generated soundtrack.
“Today, we’re sharing progress on our video-to-audio (V2A) technology, which makes synchronised audiovisual generation possible. V2A combines video pixels with natural language text prompts to generate rich soundscapes for the on-screen action,” the company said.
“Our V2A technology is pairable with video generation models like Veo to create shots with a dramatic score, realistic sound effects or dialogue that matches the characters and tone of a video,” it added.
V2A first encodes the video input into a compressed representation, then uses a diffusion model to iteratively refine random noise into realistic audio that matches the video and any text prompts provided. Finally, the audio is decoded into a waveform and combined with the video data.
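To illustrate the shape of that pipeline, here is a minimal, purely illustrative Python sketch. V2A itself is not publicly available, so every function name, array shape and the toy denoising rule below are assumptions; only the overall flow described above (encode the video, condition on a text prompt, iteratively refine random noise, decode the audio) is taken from the description.

```python
# Illustrative sketch only: the real V2A system is not public, so all module names,
# shapes and the denoising rule here are hypothetical stand-ins mirroring the
# described pipeline: encode video -> condition diffusion on video + text ->
# iteratively refine noise -> decode audio.
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frames: np.ndarray) -> np.ndarray:
    """Hypothetical video encoder: compress frames into per-frame feature vectors."""
    # frames: (num_frames, height, width, channels) -> (num_frames, 128)
    return frames.reshape(frames.shape[0], -1)[:, :128].astype(np.float32)

def encode_prompt(prompt: str) -> np.ndarray:
    """Hypothetical text encoder: a toy stand-in for a learned language encoder."""
    vec = np.zeros(128, dtype=np.float32)
    for i, byte in enumerate(prompt.encode("utf-8")):
        vec[i % 128] += byte / 255.0
    return vec

def denoise_step(audio_latent, video_features, prompt_embedding, step, total_steps):
    """One iterative refinement step. A trained diffusion model would predict and
    remove noise conditioned on video and prompt; here we only nudge the latent
    towards the conditioning signal so the loop structure is visible."""
    conditioning = video_features.mean(axis=0) + prompt_embedding
    blend = (step + 1) / total_steps
    return (1 - blend) * audio_latent + blend * conditioning

def decode_audio(audio_latent: np.ndarray, num_samples: int = 16000) -> np.ndarray:
    """Hypothetical decoder: turn the refined latent into an audio waveform."""
    t = np.linspace(0, 1, num_samples, dtype=np.float32)
    freq = 200 + 10 * float(np.abs(audio_latent).mean())
    return np.sin(2 * np.pi * freq * t)

# Pipeline: encode video, refine random noise conditioned on video + prompt, decode.
frames = rng.random((24, 64, 64, 3))            # one second of 24 fps placeholder video
video_features = encode_video(frames)
prompt_embedding = encode_prompt("a dramatic orchestral score with distant thunder")

audio_latent = rng.standard_normal(128).astype(np.float32)   # start from pure noise
total_steps = 50
for step in range(total_steps):
    audio_latent = denoise_step(audio_latent, video_features, prompt_embedding,
                                step, total_steps)

waveform = decode_audio(audio_latent)
print(waveform.shape)  # the waveform would then be combined with the video
```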
Some of the use cases include generating soundtracks for silent videos and traditional footage, such as archival material and silent films.
“To generate higher quality audio and add the ability to guide the model towards generating specific sounds, we added more information to the training process, including AI-generated annotations with detailed descriptions of sound and transcripts of spoken dialogue,” Google DeepMind said.
The AI model is trained on video, audio and these additional annotations, which is said to help it associate specific audio events with various visual scenes while responding to the information provided in transcripts.
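To make that description more concrete, the following is a hedged sketch of how one such training example could be organised. The field names and sample values are hypothetical assumptions; Google has not published the actual data format.

```python
# Hypothetical layout of a single training example: paired video and audio, plus
# AI-generated sound annotations and a dialogue transcript, as described above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class V2ATrainingExample:
    video_path: str                                              # the visual input
    audio_path: str                                              # target soundtrack aligned with the video
    sound_annotations: List[str] = field(default_factory=list)   # AI-generated descriptions of the sound
    dialogue_transcript: str = ""                                # transcript of any spoken dialogue

# Hypothetical file names and values for illustration only.
example = V2ATrainingExample(
    video_path="clips/beach_scene.mp4",
    audio_path="clips/beach_scene.wav",
    sound_annotations=["waves crashing on sand", "seagulls calling in the distance"],
    dialogue_transcript="Let's head back before the tide comes in.",
)

# During training, the model would see the video, annotations and transcript as
# conditioning and learn to reproduce the paired audio, which is how it comes to
# link specific audio events with visual scenes.
print(example.sound_annotations)
```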

Limitations of the AI model

As per the researchers, the quality of the generated audio depends on the quality of the video input, and lip movements in videos generated by other models might not perfectly match the soundtrack created by V2A.




