DeepMind's new AI generates soundtracks and dialogue for videos.

DeepMind, Google's AI research lab, says it is developing AI tech to create soundtracks for videos.

In a post on its official blog, DeepMind says it sees the technology, V2A (short for “video-to-audio”), as an integral piece of the AI-generated media puzzle. Although many organizations, including DeepMind, have developed AI models that generate video, these models cannot create sound effects to synchronize with the videos they produce.

“Video generation models are advancing at an incredible pace, but many existing systems can only produce silent output,” writes DeepMind. “V2A technology [could] become a promising approach for bringing generated films to life.”

DeepMind's V2A tech takes a description of a soundtrack (e.g., “jellyfish pulsing underwater, sea life, ocean”) paired with a video and creates matching music, sound effects, and even dialogue, watermarked with DeepMind's SynthID technology. The AI model that powers V2A, a diffusion model, was trained on a combination of sounds and dialogue transcripts as well as video clips, DeepMind says.

According to DeepMind, “Through training on video, audio, and additional annotations, our technology learns to associate specific audio events with different visual scenes, while responding to information provided in annotations or transcripts.”
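
DeepMind hasn't published V2A's architecture or any code, but the setup it describes (a diffusion model that generates audio conditioned on video, optionally steered by a text annotation) maps onto a familiar pattern. The PyTorch sketch below is purely illustrative: every class name, dimension, and the simplistic denoising update are assumptions for the sake of the example, not DeepMind's implementation.

```python
# Hypothetical sketch of a video-to-audio diffusion setup, based only on
# DeepMind's public description. Nothing here is DeepMind's code; V2A is
# not released, and all names and dimensions below are invented.
import torch
import torch.nn as nn


class V2ADenoiser(nn.Module):
    """Toy denoiser: predicts the noise in an audio latent, conditioned on
    video features and an optional text prompt (both assumptions here)."""

    def __init__(self, audio_dim=128, video_dim=512, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + video_dim + text_dim, 1024),
            nn.SiLU(),
            nn.Linear(1024, audio_dim),
        )

    def forward(self, noisy_audio, video_feats, text_feats):
        # Concatenate the noisy audio latent with the conditioning signals,
        # mirroring the "video + annotations" pairing DeepMind describes.
        x = torch.cat([noisy_audio, video_feats, text_feats], dim=-1)
        return self.net(x)


@torch.no_grad()
def sample_audio(model, video_feats, text_feats, steps=50, audio_dim=128):
    """Heavily simplified reverse-diffusion loop: start from pure noise and
    iteratively denoise, guided by the video (and optional text) features."""
    audio = torch.randn(video_feats.shape[0], audio_dim)
    for _ in range(steps):
        predicted_noise = model(audio, video_feats, text_feats)
        audio = audio - predicted_noise / steps  # crude update, for illustration only
    return audio


model = V2ADenoiser()
video = torch.randn(1, 512)  # stand-in for encoded video frames
text = torch.randn(1, 512)   # stand-in for an encoded prompt; could be zeros if omitted
latent = sample_audio(model, video, text)
print(latent.shape)  # torch.Size([1, 128]); a real system would decode this to a waveform
```

A real system would replace the stand-in tensors with learned video and text encoders and use a proper noise schedule, but the core idea is the same: the audio is generated step by step while being conditioned on what appears on screen.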

There's no word on whether any of the training data was copyrighted, or whether the creators of the data were informed of DeepMind's work. We've reached out to DeepMind for clarification and will update this post if we hear back.

AI-powered sound-generating tools aren't new. The startup Stability AI released one last week, and ElevenLabs launched one in May. Nor are models for creating video sound effects: a Microsoft project can generate talking and singing videos from a still image, and platforms like Pika and GenreX have trained models to take a video and make a best guess at what music or effects are appropriate in a given scene.

But DeepMind claims its V2A tech is unique in that it can understand the raw pixels of a video and automatically synchronize the sounds it produces with the video, optionally without a description.

V2A isn't perfect, and DeepMind acknowledges as much. Because the underlying model wasn't trained on many videos with artifacts or distortions, it doesn't produce particularly high-quality audio for them. And in general, the generated audio isn't super convincing; my colleague Natasha Lomas described it as “a smorgasbord of stereotypical sounds,” and I can't say I disagree.

For these reasons, and to prevent misuse, DeepMind says it won't be releasing the technology to the public anytime soon, if ever.

“To ensure that our V2A technology can have a positive impact on the creative community, we are gathering diverse perspectives and insights from leading creators and filmmakers, and using this valuable feedback to inform our ongoing research and development,” writes DeepMind. “Our V2A technology will undergo rigorous safety reviews and testing before we consider opening it up to the wider public.”

DeepMind touts its V2A technology as a particularly useful tool for archivists and people working with historical footage. But along those lines, generative AI also threatens to upend the film and TV industry. It's going to take some seriously tough labor protections to ensure that generative media tools don't eliminate jobs, or entire professions, as the case may be.
