
Baidu’s ‘Deep Voice’ AI System can Clone your Voice



  • Baidu’s AI system needs just a three-second sample to clone your voice
  • Researchers used speaker adaptation and speaker encoding to develop it
  • Check out their audio samples and research paper below



Chinese internet search giant Baidu has developed an AI system that can clone an individual’s voice! A year in the making, the text-to-speech system, called Deep Voice, can generate synthetic human voices using deep neural networks.

Baidu Research claims that its trained model takes just three seconds to replicate and output a person’s voice.

Baidu’s research team used voice cloning techniques to develop the AI system, which they expect will have noteworthy applications in personalizing human-machine interfaces. They used a two-pronged approach to build their neural cloning system:

  • Speaker adaptation: A multi-speaker generative model is fine-tuned on a few audio samples from the new speaker, using a backpropagation-based approach.
  • Speaker encoding: A separate model generates a speaker embedding directly from the cloning audio, and that embedding is then used with the multi-speaker generative model. This helps reduce the cloning time.
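
To make the difference between the two approaches concrete, here is a minimal toy sketch in Python. It is not Baidu’s implementation: the tiny linear "generative model", its weights W, and the function names are all hypothetical stand-ins chosen for illustration, and the adaptation step here fits only the speaker embedding. The point is simply that adaptation runs an iterative optimization for each new speaker, while encoding is a single forward pass.

```python
# Toy illustration only: a frozen linear "generative model" stands in for the
# multi-speaker network. Nothing here reflects Baidu's actual architecture.

# Hypothetical frozen weights: speaker embedding (2 values) -> audio features (3 values).
W = [[0.6, 0.0],
     [0.8, 0.0],
     [0.0, 1.0]]


def generate(embedding):
    """Frozen generative model: map a speaker embedding to audio features."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in W]


def adapt_speaker(target, steps=200, lr=0.1):
    """Speaker adaptation: fit a new embedding by gradient descent."""
    embedding = [0.0, 0.0]
    for _ in range(steps):
        error = [g - t for g, t in zip(generate(embedding), target)]
        # Gradient of 0.5 * ||W e - target||^2 with respect to e is W^T error.
        grad = [sum(W[i][j] * error[i] for i in range(len(W)))
                for j in range(len(embedding))]
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding


def encode_speaker(target):
    """Speaker encoding: one forward pass through a separate encoder."""
    # Hypothetical "pre-trained" encoder. Here it is just W transposed, which
    # happens to invert this toy model exactly because W has orthonormal columns.
    return [sum(W[i][j] * target[i] for i in range(len(W)))
            for j in range(len(W[0]))]


true_embedding = [0.5, -1.0]               # the "voice" we want to clone
cloning_sample = generate(true_embedding)  # features of the cloning audio

adapted = adapt_speaker(cloning_sample)    # 200 optimization steps
encoded = encode_speaker(cloning_sample)   # a single pass
```

In this toy setup both routes recover approximately the same embedding, but the adaptation route needs many optimization steps while the encoder needs one pass, which mirrors why speaker encoding reduces cloning time.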

Speaker Adaptation and Speaker Encoding Approach

Both speaker adaptation and speaker encoding require only minimal audio, perform well, and can be integrated into the Deep Voice model along with speaker embeddings without compromising audio quality.

You can check out some audio samples provided by Baidu’s research team, which consist of original and synthesized voices. They have also published an official research paper, which you can access here.


Our take on this

Text-to-speech technology has been around for a while. Google’s DeepMind, Adobe and Lyrebird have made significant contributions to the field. Baidu also developed a text-to-speech system in 2017 and has made rapid progress in the field since.

As far as uses for this technology go, it could help improve digital virtual assistants like Apple’s Siri, Amazon’s Alexa or Google Assistant. Voice cloning technology might also be useful in the film industry.

One of the major areas where it could help is healthcare. Baidu claims that its technology will help people who have lost their voice to communicate again. It’s a bold claim, and it remains to be seen whether the technology is advanced enough to do this yet.

Early reviews have been mixed. While the idea has been received positively, the execution hasn’t been great so far. If you listen to the audio samples mentioned above, the output doesn’t sound much like the voice being cloned. Baidu has asked for a few more months to perfect the technology, so the jury is still out for now.

