Google’s AI research lab, DeepMind, has announced it is developing AI technology that can generate sound and dialogue for AI-generated video.

V2A, short for video-to-audio, combines a written description of the desired soundtrack with the video itself to create music, sound effects and dialogue that Google says will match the intended tone and characters.

In a blog post, DeepMind said that the AI model powering V2A was trained on a combination of video clips, sounds and dialogue transcripts.

“Video generation models are advancing at an incredible pace, but many current systems can only generate silent output,” DeepMind writes. “V2A technology [could] become a promising approach for bringing generated movies to life.”

Although GenAI sound design is nothing new, DeepMind claims its technology is the first to automatically match music and dialogue to video.

“By training on video, audio and the additional annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts,” according to DeepMind.

“Also, the system doesn’t need manual alignment of the generated sound with the video, which involves tediously adjusting different elements of sounds, visuals and timings,” the company added.

In January, UK AI voice generator company ElevenLabs reached unicorn status after raising $80m in its Series B funding round, which valued the company at more than $1bn.

ElevenLabs claims its users have generated more than 100 years of audio and that its software is currently used by 41% of Fortune 500 companies.

The startup touts its AI voice generator for multiple use cases in publishing, media creation, entertainment and gaming. 

Despite the fierce competition, DeepMind is not in any rush to release the AI technology to the public.

“To make sure our V2A technology can have a positive impact on the creative community, we’re gathering diverse perspectives and insights from leading creators and filmmakers, and using this valuable feedback to inform our ongoing research and development,” DeepMind said.

“Before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing,” the company added.

GenAI is predicted to be the fastest-growing segment of AI and to exceed $33bn in total revenue in 2027, according to forecasts by research and analysis company GlobalData.