A Microsoft research team has unveiled VALL-E 2, a new AI speech synthesis system that the researchers say reaches “human-level performance,” producing voices indistinguishable from the source from just a few seconds of audio.
“(VALL-E 2) is the latest development in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time,” the research paper reads. It builds on its predecessor, VALL-E, introduced in early 2023. Neural codec language models represent speech as a sequence of codes.
What sets VALL-E 2 apart from other voice cloning techniques is its “repetition aware sampling” method and adaptive switching between sampling techniques, the team said. The strategy improves consistency and addresses the most common problems in traditional generative audio.
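As a rough illustration of the sampling idea described above, the sketch below picks the next audio code with nucleus (top-p) sampling, then checks how often that code already appears in a recent window of output. If it repeats too often, the code is redrawn from the full distribution instead. This is a minimal sketch under stated assumptions, not Microsoft's implementation: the function names and the `top_p`, `window`, and `max_ratio` values are hypothetical and chosen only for illustration.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample an index from the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng,
                            top_p=0.8, window=10, max_ratio=0.5):
    """Pick the next code, switching sampling method if it repeats too much.

    top_p, window and max_ratio are illustrative values (assumptions).
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Too repetitive: redraw from the full, untruncated distribution.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

In a decoding loop, `history` would be the list of codes generated so far, and each call would append its result to that list. The fallback gives low-probability codes a chance to break a loop that truncated sampling alone might stay stuck in.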
“VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally difficult due to their complexity or repetitive phrases,” the researchers wrote, noting that the technology could help generate speech for people who have lost the ability to speak.
As impressive as it is, however, the tool will not be made available to the public.
“At this time, we have no plans to add VALL-E 2 to any product or expand access to the public,” Microsoft said in its ethics statement, noting that such tools pose risks such as imitating voices without consent and using convincing AI voices in scams and other criminal activities.
The research team emphasized the need for a standardized method for digitally marking AI-generated content, acknowledging that detecting such material with high accuracy is still a challenge.
“If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model,” they wrote.
That said, VALL-E 2's results are very accurate compared to other tools. In a series of tests conducted by the research team, VALL-E 2 outperformed human benchmarks in the robustness, naturalness, and similarity of the generated speech.
Image: Microsoft
VALL-E 2 was able to achieve these results with only 3 seconds of audio. However, the research team noted that “using 10-second speech samples resulted in even better quality.”
Microsoft isn't the only AI company to show off cutting-edge AI models without releasing them. Meta's Voicebox and OpenAI's Voice Engine are two impressive voice cloners that face similar restrictions.
“There are many interesting use cases for generative speech models, but we are not currently making the Voicebox model or code publicly available due to potential risks of misuse,” a Meta AI spokesperson told Decrypt last year.
OpenAI likewise said it wants to address safety concerns before releasing its synthetic voice model.
“In line with our approach to AI safety and our voluntary commitments, we are choosing to preview but not widely release this technology at this time,” OpenAI explained in an official blog post.
This call for ethical guidelines is spreading throughout the AI community, especially as regulators begin to raise concerns about the impact of generative AI on our daily lives.
Edited by Ryan Ozawa.