Google DeepMind’s new AI model can generate soundtracks, dialogue for videos — what it is, other details

Google’s V2A system does not require humans to adjust the synchronisation of the generated audio with the video.

Google’s DeepMind research lab has unveiled a new AI model called V2A (video-to-audio) that can breathe life into silent videos by generating soundtracks and even dialogue. While video generation technology is growing rapidly, most current systems can only create videos without sound. The V2A technology enables synchronised audiovisual creation by combining video pixels with natural language prompts.

V2A can generate soundscapes that match the on-screen action when paired with video generation models like Veo. This means adding dramatic scores, realistic sound effects, or fitting dialogue that matches the characters and mood of the video.

Creators can provide text prompts to guide the AI towards a desired sound or mood. V2A can generate multiple soundtrack variations until a user finds the perfect fit.
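V2A has not been publicly released, so there is no real API to call, but the workflow described above would look roughly like the sketch below. Everything here is invented for illustration: the `v2a_generate` stub stands in for the unreleased model, and the file name, prompt, sample rate, and function names are all assumptions.

```python
import random

def v2a_generate(video_path: str, prompt: str, seed: int) -> list[float]:
    """Stub for the (unreleased) model call: returns fake audio samples.
    Only the shape of the workflow is meaningful here."""
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(48000)]  # 1 s of audio at 48 kHz

def soundtrack_variations(video_path: str, prompt: str, n: int = 4) -> list[list[float]]:
    """Generate several candidate soundtracks for one clip so a
    creator can audition them and pick the best fit."""
    return [v2a_generate(video_path, prompt, seed=s) for s in range(n)]

takes = soundtrack_variations("beach.mp4", "waves crashing, distant gulls")
print(len(takes), len(takes[0]))  # 4 candidates, 48000 samples each
```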

How does the V2A system work?

Google explains that the V2A system begins by compressing the video input into a compact representation. A diffusion model then iteratively refines audio starting from random noise, guided by the visual input and natural language prompts, to produce synchronised, realistic audio that matches the given prompt. The resulting audio is decoded into a waveform and combined with the video data.
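DeepMind has not published code for V2A, but the pipeline it describes maps onto a standard conditional diffusion loop: encode the video, start from noise, and repeatedly denoise under visual and text conditioning. The toy NumPy sketch below shows only the shape of that process; every function, dimension, and step count is an assumption, not the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
SAMPLES_PER_FRAME = 100  # toy audio resolution per video frame (assumption)

def compress_video(frames: np.ndarray) -> np.ndarray:
    """Toy video encoder: one brightness value per frame.
    (The real system uses a learned compressed representation.)"""
    return frames.reshape(len(frames), -1).mean(axis=1)

def embed_prompt(prompt: str) -> float:
    """Toy text encoder: a single scalar derived from the prompt."""
    return (sum(map(ord, prompt)) % 97) / 97.0

def denoise_step(audio: np.ndarray, target: np.ndarray, t: float) -> np.ndarray:
    """Toy reverse-diffusion step: blend the noisy latent toward the
    conditioning target, with less noise remaining as t -> 0."""
    return (1 - t) * target + t * audio

def generate_audio(frames: np.ndarray, prompt: str, steps: int = 50) -> np.ndarray:
    video_emb = compress_video(frames)                 # (num_frames,)
    text_bias = embed_prompt(prompt)
    # Conditioning signal: per-frame visual value plus prompt bias,
    # stretched to audio rate so the result stays time-aligned.
    target = np.repeat(video_emb + text_bias, SAMPLES_PER_FRAME)
    audio = rng.standard_normal(target.shape)          # start from pure noise
    for step in range(steps, 0, -1):
        audio = denoise_step(audio, target, t=step / steps)
    return audio                                       # decoded "waveform"

frames = rng.random((24, 16, 16))  # 24 fake video frames
track = generate_audio(frames, "dramatic orchestral score")
print(track.shape)  # (2400,): one audio sample per position, aligned to frames
```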

Google says it has enhanced the technology by including additional data during training: AI-generated annotations that describe sounds in detail, as well as transcripts of spoken dialogue. By incorporating this extra information, Google aims to improve the quality of the audio its models generate.
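DeepMind has not disclosed its training format, but conceptually each training example would pair a clip with its ground-truth audio plus the extra text signals mentioned above. The schema below is a sketch of that idea; all field names and the example values are invented.

```python
from dataclasses import dataclass, field

@dataclass
class V2ATrainingExample:
    """Invented schema illustrating the extra text signals described
    above (field names and structure are assumptions)."""
    video_clip: str            # path to the video clip
    audio_track: str           # path to the ground-truth audio
    sound_annotations: list[str] = field(default_factory=list)  # AI-generated sound descriptions
    dialogue_transcript: str = ""                               # transcript of spoken lines

example = V2ATrainingExample(
    video_clip="clips/market_scene.mp4",
    audio_track="clips/market_scene.wav",
    sound_annotations=["crowd chatter", "coins clinking", "distant traffic"],
    dialogue_transcript="Vendor: Fresh mangoes, two for ten!",
)
print(example.sound_annotations)
```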

Unlike many other video generation technologies, which require painstaking manual alignment of sound effects, visuals, and timing, Google’s V2A system automates synchronisation: the generated audio matches the video accurately without any manual intervention.


This article was first uploaded on June 19, 2024, at 6:37 pm.
