Baidu unveils voice cloning artificial intelligence that can swap genders and remove accents using just a few seconds of audio

Artificial intelligence (AI) technology has truly come a long way. Nowadays, it’s used for more than just advanced computing machines and robots. For example, Chinese internet giant Baidu has applied AI to voice cloning technology that it’s currently developing, and the progress it has made so far is remarkable.

In just one year, the company has made a lot of improvements to its voice cloning AI technology, as it noted in a recent announcement. In particular, Baidu’s Deep Voice AI now has the ability to take no more than a few seconds of speech audio as a basis for its so-called voice clones. And not only that, the AI can also create entirely new voice clones from the sources by simply changing accents or even the genders of the speakers.

According to a report on Deep Voice’s improvements, the AI can now accurately clone an individual voice faster than before. More involved tasks still take a fair amount of time, though Baidu’s own data suggests the wait is not unreasonable.

In a recently published white paper, the China-based company revealed that it has devised two different training methods that can be used to improve the AI’s performance. With the first, the voice cloning AI produces audio believable enough that most listeners might not recognize it as machine-generated. With the second, it generates cloned audio a lot faster, at the expense of voice quality.

In a blog post on the company’s official website, Baidu’s research team noted that they are developing this technology with the aim to “revolutionize human-machine interfaces with the latest artificial intelligence techniques.” For now, the main motivation is said to be pushing further the idea that a single system could learn to reproduce thousands of speaker identities.

The researchers shared some details about their methodology in the same blog post. “In this study, we focus on two fundamental approaches for solving the problems with voice cloning: speaker adaptation and speaker encoding,” they explained. “Both techniques can be adapted to a multi-speaker generative speech model with speaker embeddings, without degrading its quality. In terms of naturalness of the speech and similarity to the original speaker, both demonstrate good performance, even with very few cloning audios.”

Right now, if you listen to Deep Voice you can still spot moments when the voice sounds robotic. It’s especially true for the voices that have been generated by taking existing audio snippets, removing the accent as well as reversing the gender of the speaker. In Baidu’s uploaded voice samples, it’s most apparent in the voice sample of the male speaker that was created from a source sample that was taken from a female speaker. However, as the research team continues working on improving the technology, they should be able to iron out all the kinks soon – it’s only going to be a matter of time.

At first, you might think that this is a cool new piece of technology that’s only proving how interesting the AI industry is becoming. But while that’s true to a certain extent, it also shows how dangerous AI can be. If it’s possible to take a person’s voice, even just a few seconds of it, and clone it to make that same voice appear to say something else, or worse, create new voice signatures entirely, then this could be disastrous for privacy and increase the risks of disinformation in the future.
