
Learn Audio Beat Tracking for Music Information Retrieval (with Python code)



Music is all around us. Whenever we hear music that connects with our heart and mind, we lose ourselves in it. Subconsciously, we tap along with the beats we hear – you must have noticed your foot automatically following the rhythm. There is no logical explanation for why we do this; we simply fall into the groove of the music and our mind starts to resonate with the tune.

What if we could train an artificial system that could catch the rhythm as we do? A cool application could be to build an expressive humanoid robot that runs a real-time beat tracking algorithm, enabling it to stay in sync with the music as it dances.

Looks interesting, right?

In this article, we will understand the concept of beats and the challenges we face when trying to track them. Then we will go through approaches to solving the problem, along with state-of-the-art solutions in the industry.

Note: This article assumes that you have a basic knowledge of audio data analysis in Python. If not, you can go through this article first and then come back to this one.


Table of Contents

  • Overview of Beat Tracking
    • What is Beat tracking?
    • Applications of Beat Tracking
    • Challenges in beat tracking
  • Approaches to solve Beat tracking
    • Dynamic Programming
    • RNNs + Deep Bayesian Networks


Overview of Beat Tracking

What is Beat tracking?

Audio beat tracking is commonly defined as determining the time instants in an audio recording at which a human listener is likely to tap his or her foot to the music. Beat tracking enables the “beat-synchronous” analysis of music.

Being an important and relevant Music Information Retrieval (MIR) task, research in this area has been very active. The goal of the automatic beat tracking task is to track all the beat locations in a collection of sound files and output the beat onset times for each file.

To give you an intuition of the task, download and listen to the below audio file:


Now check out the audio below which has annotated timings for the respective beats.



Applications of Beat Tracking

There are numerous applications of beat tracking, but the most intriguing for me is beat synchronous light effects. Check out the video below for a real life demonstration.

On a more technical note, beat tracking can be used for music information retrieval tasks such as music transcription. For example, you can annotate the beats of a drum track so that it can then be shared with other music creators and enthusiasts.

Other applications of beat tracking include:

  • Music Classification/Recommendation on the basis of beats
  • Audio performance analysis
  • Audio editing for a digital DJ

Challenges in beat tracking

Beat tracking may sound like a straightforward concept, but it is actually still an unsolved problem. For a simple tune, an algorithm may easily figure out the beats. But real-life audio is often much more complex and can be noisy. For example, noise from the environment could befuddle an algorithm and lead it to produce false positives while detecting beats.
Technically speaking, there are three major challenges when dealing with beat tracking:
  1. Unclear pulse level
  2. Sudden tempo changes
  3. Vague/Noisy information

Approaches to solve Beat tracking

We now have a gist of what audio beat tracking is. So let us have a look at the approaches that are used to solve the challenge.

There is an annual evaluation campaign for music information retrieval algorithms, coupled to the ISMIR conference, called Music Information Retrieval Evaluation eXchange (MIREX). It has a task called audio beat tracking. Researchers participate in MIREX and submit their approaches. Here I will explain two of those approaches. The first one is simple and the most primitive, whereas the second one is state-of-the-art.


Approach 1: Onset Detection and Dynamic Programming

Suppose we have an audio clip such as the one given below:

What we can do is find the locations where there is a sudden burst of sound (aka an onset) and mark those moments in time:

These onsets are likely to correspond to the beats, but they would also contain many false positives, such as a person's vocals or background noise. To minimize these false positives, we can use dynamic programming to pick out the subsequence of onsets whose spacing best matches a consistent tempo, and treat those as the beats. If you want to get into the details of how dynamic programming works, you can refer to this article.

We will look at an implementation below to get a clearer perspective.

# import modules
import librosa
import IPython.display as ipd

# read the audio file
x, sr = librosa.load('Media-103515.wav')
ipd.Audio(x, rate=sr)

# approach 1 - onset detection and dynamic programming
# (recent versions of librosa require keyword arguments here)
tempo, beat_times = librosa.beat.beat_track(y=x, sr=sr, start_bpm=60, units='time')

# sonify the detected beats and overlay them on the original audio
clicks = librosa.clicks(times=beat_times, sr=sr, length=len(x))
ipd.Audio(x + clicks, rate=sr)



Approach 2: Ensemble of Recurrent Neural Networks coupled with Dynamic Bayesian Network

Instead of relying on handcrafted sound cues, we can use a machine learning/deep learning approach. Shown below is the architecture of a framework for solving beat tracking. To get into the nitty-gritty of it, you can read the official research paper.

The gist of the approach is this – we preprocess the audio signal, and then use recurrent neural networks to find the most probable beat times. The researchers use a committee of Recurrent Neural Networks (RNNs) and then combine their outputs using a dynamic Bayesian network.


The madmom library contains implementations of various state-of-the-art algorithms in the field of beat tracking and is available on GitHub. It incorporates low-level feature extraction and high-level feature analysis based on machine learning methods. The code for approach 2 is available in this repository as Python files which output the beat locations from an input audio file.

Let’s look at an implementation in Python:

# import modules
import madmom
import librosa
import IPython.display as ipd

# read the audio file (needed below to play back the result)
x, sr = librosa.load('train/train1.wav')

# approach 2 - dbn tracker
proc = madmom.features.beats.DBNBeatTrackingProcessor(fps=100)
act = madmom.features.beats.RNNBeatProcessor()('train/train1.wav')
beat_times = proc(act)

# sonify the detected beats and overlay them on the original audio
clicks = librosa.clicks(times=beat_times, sr=sr, length=len(x))
ipd.Audio(x + clicks, rate=sr)


Let's get into the details of the approach we just saw. There are four steps to get audio beats from a music signal.

Step 1: Preprocessing audio signal

As with all unstructured data, an artificial system does not grasp an audio signal easily. So it has to be converted into a format that a machine learning model can work with (i.e., the data must be preprocessed). As shown in the image below, there are quite a few methods you can use to preprocess audio data.


Step 2: Training and Fine Tuning the RNN models

Now that you have preprocessed the data, you can apply a machine learning/deep learning model to decipher the patterns in the data.

Theoretically, if you train a single deep learning model with enough training examples and a proper architecture, it could perform well on the problem. But this is not always possible, for reasons such as insufficient training data. So to boost performance, we can train multiple RNN models, each on a single genre of music, so that each model catches the patterns of that genre. This helps mitigate some of the data insufficiency problems.

For this approach, we first train an LSTM model (an improved version of the RNN) and set it as our base model. Then we fine-tune multiple LSTM models derived from the base model, each on a different genre of musical sequences. At test time, we pass the audio signal through all the models.

Step 3: Choosing the best RNN model

At this step, we simply choose the model with the least error among all the models, by comparing each one's output with the score we get from the base model.
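A toy numpy sketch of this selection step (the activations below are random stand-ins, not real model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
base_act = rng.random(500)  # stand-in for the base model's beat activation

# three hypothetical fine-tuned models, deviating from the base
# by different amounts of noise
model_acts = [base_act + rng.normal(0.0, s, 500) for s in (0.30, 0.05, 0.20)]

# pick the model whose activation deviates least from the base model's
errors = [np.mean((a - base_act) ** 2) for a in model_acts]
best = int(np.argmin(errors))
```

Here the second model (index 1) deviates least from the base activation, so it is the one whose output gets passed on to the next stage.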

Step 4: Applying Dynamic Bayesian Network

Regardless of whether you use one LSTM or multiple LSTMs, there is a fundamental shortcoming: the final peak-finding stage does not try to find a global optimum when selecting the final beat locations. Instead, it determines the dominant tempo of the piece (or of a segment of a certain length) and aligns the beat positions to this tempo – it simply chooses the best start position and then progressively places each beat at the position with the highest activation value within a certain region around the pre-determined position.
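The greedy peak-picking described above can be sketched on a toy activation function (this is an illustration using scipy's `find_peaks`, not the paper's exact procedure):

```python
import numpy as np
from scipy.signal import find_peaks

# toy beat activation: a Gaussian bump every 0.5 s (at 100 frames/s),
# plus a little noise
fps = 100
frames = np.arange(1000)
centers = np.arange(50, 1000, 50)
act = np.exp(-0.5 * ((frames[:, None] - centers) / 2.0) ** 2).sum(axis=1)
act += 0.05 * np.random.default_rng(1).random(1000)

# greedy local peak picking: maxima above a threshold, at least 0.3 s apart
peaks, _ = find_peaks(act, height=0.5, distance=int(0.3 * fps))
beat_times = peaks / fps
```

Each bump yields one peak here, but nothing in this procedure enforces a globally consistent tempo across the whole sequence – each peak is accepted purely on local evidence.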

To circumvent this problem, we feed the output of the chosen neural network into a dynamic Bayesian network (DBN), which jointly infers the tempo and phase of the beat sequence. Another advantage of using a DBN is that we can model both beat and non-beat states, which has been shown to perform better than modelling only beat states. I will not go into a full description of how DBNs work, but you can refer to this video if you are interested.

This is how we get the beats of a musical sequence from raw audio signals.


End Notes

I hope this article gave you an intuition of how to solve a beat tracking problem in Python. Beat tracking has potential applications in the music information retrieval domain, where we can identify similar kinds of music using the beat rhythms we extract. If you have any comments/suggestions, do let me know in the comments below!


