
Google’s Neural Network Extracts the Audio Source by Looking at the Person’s Face



  • The Google Research team has developed a convolutional neural network that uses visual cues to separate one speaker’s audio from the rest
  • The dataset was built from 100,000 videos from YouTube
  • Segments with clean speech and a single visible speaker were extracted from these, yielding around 2000 hours of video clips



If you’ve taken Andrew Ng’s Machine Learning course, you will remember the example of two different audio sources at a cocktail party. He goes on to show how an algorithm can separate a voice from the background noise so that you clearly hear only that voice.

In their latest paper, the Google Research team has put forth an audio-visual deep learning model that takes this example to another level altogether. With this study, the team was able to produce videos in which the speech of specific people is enhanced while all other sound is reduced to an almost negligible level, or cut out completely.

Once the user selects the face of the person in the video they want to hear, the algorithm enhances that person’s speech and suppresses everything else. It currently works on videos that have a single audio track. Check out an example of this in the video below:

What separates this research from anything done before it is the unique audio-visual component. When a person speaks, the movement of their mouth should correlate with the sound being produced, and this helps identify which parts of the audio belong to that specific person. As you can see in the image below, the input is a video with multiple people speaking simultaneously. The algorithm performs audio-visual source separation, and the output is a decomposition of the input audio track into clean speech tracks (one for each person speaking).
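To make the intuition about mouth movement and sound concrete, here is a toy sketch in Python. It is not the model from the paper, which learns this association from data. It only shows how a mouth-movement signal could point to the matching audio. The per-frame numbers and the names `mouth_opening`, `energy_track_a` and `energy_track_b` are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-frame measurements (invented values, not real data).
mouth_opening = [0.1, 0.8, 0.9, 0.2, 0.1, 0.7, 0.8, 0.1]   # face picked by the user
energy_track_a = [0.2, 0.9, 1.0, 0.3, 0.1, 0.8, 0.9, 0.2]  # loud when the mouth is open
energy_track_b = [0.9, 0.2, 0.1, 0.8, 0.9, 0.1, 0.2, 0.9]  # loud when the mouth is closed

score_a = pearson(mouth_opening, energy_track_a)
score_b = pearson(mouth_opening, energy_track_b)

# The track whose loudness follows the mouth most closely is the likelier match.
best_track = "a" if score_a > score_b else "b"
print(best_track)
```

In this toy setup the first track rises and falls with the mouth, so it is picked as the speaker’s audio. A real video offers far richer cues than a single loudness curve, which is why a learned model is used instead.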

What separates Google Research from most others is the ease of access they have to large amounts of data. They curated 100,000 high-quality videos of lectures from YouTube to build their dataset. From these, they extracted segments with clean speech and a single speaker visible in the video, which gave them around 2000 hours of video clips. These clips were in turn used to generate “mixtures of face videos and their corresponding speech from separate video sources, along with non-speech background noise we obtained from AudioSet”.
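The mixing step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the team’s pipeline: sine tones stand in for clean speech clips, uniform random samples stand in for AudioSet noise, and the helper names (`make_tone`, `make_noise`, `mix`) and the `noise_gain` value are hypothetical.

```python
import math
import random

def make_tone(freq_hz, num_samples, sample_rate=8000):
    """Sine tone used here as a stand-in for a clean speech clip."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(num_samples)]

def make_noise(num_samples, seed=0):
    """Uniform random samples used as a stand-in for background noise."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]

def mix(speech_clips, noise, noise_gain=0.3):
    """Add the clean clips and the scaled noise sample by sample.

    Everything is truncated to the shortest input so the sum is well defined.
    """
    n = min(len(noise), *(len(clip) for clip in speech_clips))
    return [sum(clip[i] for clip in speech_clips) + noise_gain * noise[i]
            for i in range(n)]

clean_a = make_tone(220, 800)   # "speaker A", taken from one video
clean_b = make_tone(330, 600)   # "speaker B", taken from a different video
noise = make_noise(1000)

mixture = mix([clean_a, clean_b], noise)

# One synthetic training example: the mixture is the input, and the clean
# clips (trimmed to the same length) are the targets to recover.
training_example = {
    "input": mixture,
    "targets": [clean_a[:len(mixture)], clean_b[:len(mixture)]],
}
```

Because the clean clips are known before mixing, every synthetic mixture comes with its ground truth for free, which is what makes this kind of data usable for supervised training.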

The team trained a convolutional neural network on this data to split each mixture into a separate audio stream for each speaker. You can read the research paper in full here or go through their official blog post.
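This article does not describe how the network represents its output. One common formulation in speech separation, assumed here purely for illustration, is to compute one time-frequency mask per speaker and multiply it with the spectrogram of the mixture. The Python sketch below uses tiny invented magnitude grids and computes "oracle" masks directly from the known sources. In a trained system the network would have to predict those masks from the mixture and the face video. The function names `ratio_masks` and `apply_mask` are hypothetical.

```python
def ratio_masks(source_mags, eps=1e-8):
    """One mask per source: its share of the total magnitude in each bin."""
    num_frames = len(source_mags[0])
    num_bins = len(source_mags[0][0])
    masks = []
    for src in source_mags:
        mask = [[src[t][f] / (sum(s[t][f] for s in source_mags) + eps)
                 for f in range(num_bins)]
                for t in range(num_frames)]
        masks.append(mask)
    return masks

def apply_mask(mixture_mag, mask):
    """Element-wise product of a mixture spectrogram and a mask."""
    return [[m * w for m, w in zip(mix_row, mask_row)]
            for mix_row, mask_row in zip(mixture_mag, mask)]

# Invented magnitude spectrograms: 2 time frames x 3 frequency bins.
speaker_1 = [[4.0, 0.0, 1.0], [3.0, 0.0, 0.0]]
speaker_2 = [[0.0, 2.0, 1.0], [1.0, 5.0, 0.0]]

# Simplifying assumption: the magnitudes of the two speakers simply add up.
mixture = [[a + b for a, b in zip(row_1, row_2)]
           for row_1, row_2 in zip(speaker_1, speaker_2)]

masks = ratio_masks([speaker_1, speaker_2])
estimate_1 = apply_mask(mixture, masks[0])
estimate_2 = apply_mask(mixture, masks[1])
```

With oracle masks and additive magnitudes, the estimates match the original sources up to the small `eps` term. Real audio does not add this neatly, and predicted masks are imperfect, so actual separation quality depends on how well the model has learned.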


Our take on this

This could be used for speech enhancement, speech recognition in videos, video conferencing, in healthcare to improve hearing aids for people who are hard of hearing, and in situations where a microphone needs to pick up a specific speaker in a crowded setting.

The model could even be used for automatic video captioning. The initial results have been very promising (to say the least). In a field that has seen as many challenges as breakthroughs, this could speed up research in the community.

