Google Cloud’s Machine Learning Powered Text-to-Speech is Available for Everyone!

Pranav Dar 07 May, 2019


Overview

Google Cloud’s Text-to-Speech and Speech-to-Text offerings are now available to the general public
The latest update is packed with features, the key one being the release of 17 new WaveNet-powered voices
A TensorFlow implementation of WaveNet is available on GitHub; the link is in the article below

Introduction

Text-to-speech and Speech-to-text are fascinating concepts and ones that have seen a ton of research thanks to machine learning. We are no longer limited to hearing mechanical voices emanating from machines. If you’re still skeptical, a look at the Google Duplex demo will quickly convince you otherwise.


Google Cloud’s Text-to-Speech and Speech-to-Text offerings have been around for almost a year, but they were still fairly limited in their ability to synthesize speech, and to do so in multiple languages. However, all bets are off in the latest release. A bunch of updates have been added, making it far easier to hear natural-sounding voices from machines and to generate much more accurate transcripts.

And of course, the Text-to-Speech API is now available to the general public!

This Text-to-Speech API now works in 14 languages and supports 30 standard voices along with 26 WaveNet voices. A demo is available for you to try out here. The table below shows the entire list:

The key takeaway for data scientists in this release is surely the launch of 17 new WaveNet voices. WaveNet is a model developed by DeepMind that uses machine learning to generate the raw audio behind these text-to-speech voices. It’s a deep neural network capable of producing remarkably human-like speech. It is the algorithm that powers the voice you hear in the Google Assistant. You can read more about WaveNet here.
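To make the WaveNet voices a little more concrete, here is a minimal sketch of how a request body for one of them might be put together. It assumes the public REST endpoint and field names (`input`, `voice`, `audioConfig`) as I understand them from Google's documentation. The voice name `en-US-Wavenet-D` is only an example and the helper `build_synthesis_request` is hypothetical. Nothing is sent over the network here, so please check the official reference before relying on it.

```python
import json

# Assumed REST endpoint for speech synthesis (v1 of the Text-to-Speech API).
SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_synthesis_request(text, language_code="en-US", voice_name="en-US-Wavenet-D"):
    """Hypothetical helper: return a JSON-serialisable body for a synthesis call."""
    if not text:
        raise ValueError("text must be non-empty")
    return {
        "input": {"text": text},
        # The voice name is an example; pick one from the published voice list.
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3"},
    }


body = build_synthesis_request("Hello from a WaveNet voice.")
print(json.dumps(body, indent=2))
```

You would POST this body to the endpoint with your own API key or OAuth token. As far as I know, the response carries the audio as a base64-encoded string that you decode and write to a file.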

On the Speech-to-Text front, Google Cloud can now recognize the different speakers in an audio clip thanks to machine learning. You need to specify how many speakers there are in the audio, and Google’s service then gets to work. It can even tag each word with a speaker number.
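Once each word carries a speaker number, turning the transcript into a readable conversation is mostly a grouping exercise. The sketch below shows one way to do that. The sample data is made up, and the field names `word` and `speakerTag` are my assumption about how the word-level entries in the response are shaped, so treat this as an illustration and not as the service's actual output.

```python
from itertools import groupby

# Made-up sample, shaped like word-level entries from a speaker-tagged transcript.
sample_words = [
    {"word": "Hi", "speakerTag": 1},
    {"word": "there", "speakerTag": 1},
    {"word": "Hello", "speakerTag": 2},
    {"word": "How", "speakerTag": 1},
    {"word": "are", "speakerTag": 1},
    {"word": "you", "speakerTag": 1},
]


def speaker_turns(words):
    """Collapse consecutive words with the same speaker tag into (tag, text) turns."""
    turns = []
    for tag, group in groupby(words, key=lambda w: w["speakerTag"]):
        turns.append((tag, " ".join(w["word"] for w in group)))
    return turns


turns = speaker_turns(sample_words)
for tag, text in turns:
    print(f"Speaker {tag}: {text}")
```

Because `groupby` only merges adjacent items, a speaker who talks, gets interrupted, and talks again shows up as two separate turns, which is usually what you want in a transcript.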

You don’t even need to wait for Google to release any research paper detailing each step – head over to this GitHub repository and download the TensorFlow implementation of WaveNet!

Our take on this

This is a brilliant example of combining NLP with audio processing. When I heard the Google Duplex demo earlier this year, I was blown away and instantly wanted to figure out how I could create this technology on my machine. And with WaveNet’s TensorFlow implementation, that part becomes easier than ever before.

Once you’re done playing around with the demo, you can use the full services. The pricing details are available on Google’s page. It’s not very expensive, so I would recommend trying it out at least once if NLP is your field of choice.

