Google DeepMind’s new AI model can generate soundtracks, dialogue for videos — what it is, other details

Google’s V2A system does not require humans to adjust the synchronisation of the generated audio with the video.

Google’s DeepMind research lab has unveiled a new AI model called V2A (video-to-audio) that can breathe life into silent videos by generating soundtracks and even dialogue. While video generation technology is growing rapidly, most current systems can only create videos without sound. The V2A technology enables synchronised audiovisual creation by combining video pixels with natural language prompts.

V2A can generate soundscapes that match the on-screen action when paired with video generation models like Veo. This means adding dramatic scores, realistic sound effects, or fitting dialogue that matches the characters and mood of the video.

Creators can provide text prompts to guide the AI towards a desired sound or mood. V2A can generate multiple soundtrack variations until a user finds the perfect fit.
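V2A has not been publicly released, so there is no real API to call, but the workflow described above would look roughly like the sketch below. Everything here is invented for illustration: the `v2a_generate` stub stands in for the unreleased model, and the file name, prompt, sample rate, and function names are all assumptions.

```python
import random

def v2a_generate(video_path: str, prompt: str, seed: int) -> list[float]:
    """Stub for the (unreleased) model call: returns fake audio samples.
    Only the shape of the workflow is meaningful here."""
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(48000)]  # 1 s of audio at 48 kHz

def soundtrack_variations(video_path: str, prompt: str, n: int = 4) -> list[list[float]]:
    """Generate several candidate soundtracks for one clip so a
    creator can audition them and pick the best fit."""
    return [v2a_generate(video_path, prompt, seed=s) for s in range(n)]

takes = soundtrack_variations("beach.mp4", "waves crashing, distant gulls")
print(len(takes), len(takes[0]))  # 4 candidates, 48000 samples each
```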

How does the V2A system work?

Google explains that the V2A system begins by compressing the video input into a compact representation. A diffusion model then iteratively refines audio starting from random noise, guided by the visual input and natural language prompts, to produce synchronised, realistic audio that matches the given prompt. The resulting audio is decoded into a waveform and combined with the video data.
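DeepMind has not published code for V2A, but the pipeline it describes maps onto a standard conditional diffusion loop: encode the video, start from noise, and repeatedly denoise under visual and text conditioning. The toy NumPy sketch below shows only the shape of that process; every function, dimension, and step count is an assumption, not the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
SAMPLES_PER_FRAME = 100  # toy audio resolution per video frame (assumption)

def compress_video(frames: np.ndarray) -> np.ndarray:
    """Toy video encoder: one brightness value per frame.
    (The real system uses a learned compressed representation.)"""
    return frames.reshape(len(frames), -1).mean(axis=1)

def embed_prompt(prompt: str) -> float:
    """Toy text encoder: a single scalar derived from the prompt."""
    return (sum(map(ord, prompt)) % 97) / 97.0

def denoise_step(audio: np.ndarray, target: np.ndarray, t: float) -> np.ndarray:
    """Toy reverse-diffusion step: blend the noisy latent toward the
    conditioning target, with less noise remaining as t -> 0."""
    return (1 - t) * target + t * audio

def generate_audio(frames: np.ndarray, prompt: str, steps: int = 50) -> np.ndarray:
    video_emb = compress_video(frames)                 # (num_frames,)
    text_bias = embed_prompt(prompt)
    # Conditioning signal: per-frame visual value plus prompt bias,
    # stretched to audio rate so the result stays time-aligned.
    target = np.repeat(video_emb + text_bias, SAMPLES_PER_FRAME)
    audio = rng.standard_normal(target.shape)          # start from pure noise
    for step in range(steps, 0, -1):
        audio = denoise_step(audio, target, t=step / steps)
    return audio                                       # decoded "waveform"

frames = rng.random((24, 16, 16))  # 24 fake video frames
track = generate_audio(frames, "dramatic orchestral score")
print(track.shape)  # (2400,): one audio sample per position, aligned to frames
```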

Google says it has enhanced the technology by including additional data during training: AI-generated annotations that describe sounds in detail, as well as transcripts of spoken dialogue. By incorporating this extra information, Google aims to improve the quality of the audio its models generate.
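DeepMind has not disclosed its training format, but conceptually each training example would pair a clip with its ground-truth audio plus the extra text signals mentioned above. The schema below is a sketch of that idea; all field names and the example values are invented.

```python
from dataclasses import dataclass, field

@dataclass
class V2ATrainingExample:
    """Invented schema illustrating the extra text signals described
    above (field names and structure are assumptions)."""
    video_clip: str            # path to the video clip
    audio_track: str           # path to the ground-truth audio
    sound_annotations: list[str] = field(default_factory=list)  # AI-generated sound descriptions
    dialogue_transcript: str = ""                               # transcript of spoken lines

example = V2ATrainingExample(
    video_clip="clips/market_scene.mp4",
    audio_track="clips/market_scene.wav",
    sound_annotations=["crowd chatter", "coins clinking", "distant traffic"],
    dialogue_transcript="Vendor: Fresh mangoes, two for ten!",
)
print(example.sound_annotations)
```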

Unlike many other video generation technologies, which require painstaking manual alignment of sound effects, visuals, and timing, Google’s V2A system automates synchronisation: the generated audio matches the video accurately without any manual intervention.


This article was first uploaded on June 19, 2024, at 6:37 pm.
