PixelPlayer – Identify and Extract Musical Instrument Sounds from Videos with MIT’s AI

pranavdar 06 Jul, 2018
3 min read

Overview

  • Researchers from MIT have developed an AI system, called PixelPlayer, that identifies and isolates instrument sounds from videos
  • The system, developed through self-supervised learning, was trained on over 60 hours of video
  • Three neural networks are at play in this system – one for video, one for audio, and a third for separating the sound

 

Introduction

Countless times, while listening to music on YouTube, I've been mesmerized by one of the instruments in the video. But isolating and extracting that instrument's sound has so far been a difficult and cumbersome task for casual listeners and amateur musicians. Unless you own, and know how to use, a sophisticated audio-editing tool, you're out of luck.

This is where machine learning and AI come in. Researchers from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a deep learning model that takes a video as input, then identifies and isolates the sound of specific instruments. It can even make that instrument's sound louder or softer.

The system has been built using self-supervised learning, which doesn't require any pre-labelled data. Of course, this makes it difficult to fully interpret how the system arrives at a given result (how it isolates a particular instrument, in this case), but that is something the researchers are working to understand.
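To make "self-supervised" a bit more concrete: the paper trains the system with a mix-and-separate recipe, in which the audio from two different videos is summed and the original solo tracks serve as free ground truth. Below is a minimal Python sketch of that idea; the function name, shapes, and waveform representation are illustrative assumptions, not MIT's actual code.

```python
# Minimal sketch of mix-and-separate self-supervision (illustrative, not MIT's code).
# Two solo clips are mixed; the originals become free "ground truth" targets,
# so no human labels are needed.
import numpy as np

def make_training_example(audio_a: np.ndarray, audio_b: np.ndarray):
    """audio_a, audio_b: mono waveforms taken from two different videos."""
    n = min(len(audio_a), len(audio_b))
    mixture = audio_a[:n] + audio_b[:n]               # what the network hears
    targets = np.stack([audio_a[:n], audio_b[:n]])    # what it must recover
    return mixture, targets

# During training, the model sees only `mixture` (plus frames from both videos)
# and is penalized on how far its separated outputs are from `targets`.
```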

So how does it all work?

The system, called PixelPlayer, was trained on over 60 hours of video and can identify 20 different instruments. The deep learning model first locates the image regions that are producing sound. It then separates the sound into a number of components, each representing the sound coming from a particular pixel in the image (which is where the system's name comes from).
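As a rough illustration of what "the sound from each pixel" means in practice: separation of this kind is usually done on the mixture's spectrogram, with the model predicting a time-frequency mask for each pixel. The sketch below applies such a mask and converts the result back to a waveform; the STFT parameters and names are assumptions for illustration, not values taken from the paper.

```python
# Sketch: recover one pixel's sound from the mixture, assuming the model has
# already produced `mask`, a (freq x time) weighting for that pixel.
import numpy as np
from scipy.signal import stft, istft

def sound_for_pixel(mixture: np.ndarray, mask: np.ndarray, sr: int = 11025):
    """mixture: mono waveform; mask: array matching the spectrogram's shape."""
    _, _, spec = stft(mixture, fs=sr, nperseg=1024, noverlap=768)  # complex spectrogram
    masked = spec * mask                           # keep only this pixel's contribution
    _, waveform = istft(masked, fs=sr, nperseg=1024, noverlap=768)
    return waveform
```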

Three neural networks are at play within the system: one analyzes the visuals in the video, another works on the audio, and a third first "associates specific pixels with specific soundwaves" and then separates the different sounds.
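For readers who want to picture how those three pieces fit together, here is a heavily simplified PyTorch sketch. The real PixelPlayer networks are far deeper (the paper uses a dilated ResNet on the frames and a U-Net-style network on the audio); the layer sizes, channel count K, and class names below are placeholders, not the authors' implementation.

```python
# Simplified sketch of the three-network layout (placeholder architectures).
import torch
import torch.nn as nn

K = 16  # number of shared audio/visual feature channels (assumed value)

class VideoNet(nn.Module):      # 1) analyzes the frames: a feature vector per pixel
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, K, 3, padding=1))
    def forward(self, frame):                 # frame: (B, 3, H, W)
        return self.conv(frame)               # -> (B, K, H, W)

class AudioNet(nn.Module):      # 2) analyzes the sound: K spectrogram components
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, K, 3, padding=1))
    def forward(self, spec):                  # spec: (B, 1, F, T) mixture spectrogram
        return self.conv(spec)                # -> (B, K, F, T)

class Synthesizer(nn.Module):   # 3) ties a pixel to its sound: predicts a mask
    def forward(self, pixel_feat, audio_feat):   # (B, K) and (B, K, F, T)
        mask = (pixel_feat[:, :, None, None] * audio_feat).sum(dim=1)
        return torch.sigmoid(mask)               # (B, F, T) mask for that pixel's sound
```

Applying the predicted mask to the mixture spectrogram (as in the earlier sketch) and inverting it would then yield the sound attributed to that pixel.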

The part that surprised the researchers is that the system even recognizes actual musical elements. Their research found that "certain harmonic frequencies seem to correlate to instruments like violin, while quick pulse-like patterns correspond to instruments like the xylophone".

The research paper outlines PixelPlayer in more detail, including the experiments and their results. Check out the video below, which shows this technology working its magic:

 

Our take on this

To put things into context, this isn't the first attempt at using machine learning and AI in the music industry. We have previously seen Google's entry into this sphere with NSynth, and a Data Science Music challenge from the University of Michigan, among other things. Many professional musicians are using AI not only to make music, but to create videos from scratch as well!

This kind of AI could potentially be used to understand environmental sounds as well. I can see it being incorporated into self-driving car technology to make it even safer. I personally can't wait for MIT to release the code on GitHub. Have you ever worked on any sound processing projects or datasets? Connect with me in the comments below.

 

Subscribe to AVBytes here to get regular data science, machine learning and AI updates in your inbox!

 
