JalFaizy Shaikh — Updated On August 27th, 2021


When you get started with data science, you start simple. You work through simple projects such as the Loan Prediction problem or Big Mart Sales Prediction. These problems have structured data arranged neatly in a tabular format. In other words, you are spoon-fed the hardest part of the data science pipeline.

The datasets in real life are much more complex.

You first have to understand the data, collect it from various sources and arrange it in a format that is ready for processing. This is even more difficult when the data is in an unstructured format such as image or audio, because you have to represent image or audio data in a standard way before it is useful for analysis.


The abundance of unstructured data

Interestingly, unstructured data represents a huge under-exploited opportunity. It is closer to how we communicate and interact as humans. It also contains a lot of useful and powerful information. For example, when a person speaks, you not only get what he / she says but also the person's emotions from the voice.

Also the body language of the person can show you many more features about a person, because actions speak louder than words! So in short, unstructured data is complex but processing it can reap easy rewards.

In this article, I intend to give an overview of audio / voice processing with a case study, so that you get a hands-on introduction to solving audio processing problems.

Let’s get on with it!


Table of Contents

  • What do you mean by Audio data?
    • Applications of Audio Processing
  • Data Handling in Audio domain
  • Let’s solve the UrbanSound challenge!
  • Intermission: Our first submission
  • Let’s solve the challenge! Part 2: Building better models
  • Future Steps to explore


What do you mean by Audio data?

Directly or indirectly, you are always in contact with audio. Your brain is continuously processing and understanding audio data and giving you information about the environment. A simple example is the conversations you have with people every day: the other person discerns your speech in order to carry on the discussion. Even when you think you are in a quiet environment, you tend to catch much more subtle sounds, like the rustling of leaves or the splatter of rain. This is the extent of your connection with audio.

So can you somehow catch this audio floating all around you and do something constructive with it? Yes, of course! There are devices that help you capture these sounds and represent them in a computer-readable format. Examples of these formats are:

  • wav (Waveform Audio File) format
  • mp3 (MPEG-1 Audio Layer 3) format
  • WMA (Windows Media Audio) format

If you think about what audio looks like, it is nothing but wave-like data, where the amplitude of the audio changes with respect to time. This can be represented pictorially as follows.

Applications of Audio Processing

We have discussed that audio data can be useful for analysis, but what are the potential applications of audio processing? Here are a few of them:

  • Indexing music collections according to their audio features.
  • Recommending music for radio channels
  • Similarity search for audio files (aka Shazam)
  • Speech processing and synthesis – generating artificial voice for conversational agents

Here’s an exercise for you; can you think of an application of audio processing that can potentially help thousands of lives?


Data Handling in Audio domain

As with all unstructured data formats, audio data has a couple of preprocessing steps which have to be followed before it is presented for analysis. We will cover these in detail in a later article; here we will get an intuition for why they are done.

The first step is to load the data into a machine-understandable format. For this, we simply take values at fixed time steps. For example, in a 2-second audio file we might extract a value every half second. This is called sampling the audio data, and the rate at which values are taken is called the sampling rate.
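To make the idea concrete, here is a minimal sketch using NumPy; it is not part of the original walkthrough. The 1 Hz sine wave, the `sample` helper and the two sampling rates are assumptions chosen purely for illustration (22050 Hz is the rate librosa uses by default when loading a file).

```python
import numpy as np

def sample(signal_fn, duration_s, sampling_rate_hz):
    """Evaluate a continuous-time signal at evenly spaced time steps."""
    n_samples = int(duration_s * sampling_rate_hz)
    times = np.arange(n_samples) / sampling_rate_hz
    return times, signal_fn(times)

# hypothetical signal: a pure 1 Hz tone
tone = lambda t: np.sin(2 * np.pi * 1.0 * t)

# one value every half second -> a sampling rate of 2 Hz, 4 values for 2 seconds
coarse_t, coarse = sample(tone, duration_s=2, sampling_rate_hz=2)

# a much higher rate gives 44100 values for the same 2 seconds
fine_t, fine = sample(tone, duration_s=2, sampling_rate_hz=22050)

print(coarse_t)    # time stamps of the coarse samples
print(coarse)      # values picked up at those time stamps
print(len(fine))   # number of samples at the higher rate
```

In this particular setup the coarse samples all land on the zero crossings of the tone, so the sampled values are numerically zero and the tone is lost. That is one way to see why a higher sampling rate is usually preferred.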

Another way of representing audio data is to convert it into a different domain of data representation, namely the frequency domain. When we sample audio data in the time domain, we need many data points to represent the whole signal, and the sampling rate should be as high as possible.

On the other hand, if we represent audio data in the frequency domain, much less computational space is required. To get an intuition, take a look at the image below.


Here, we separate one audio signal into 3 different pure signals, which can now be represented as three unique values in the frequency domain.
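The same decomposition can be sketched in a few lines of NumPy. This is an illustrative example under assumed values, not the computation behind the figure: the 1000 Hz sampling rate and the three tone frequencies (5, 20 and 50 Hz) are made up for the demonstration.

```python
import numpy as np

sampling_rate = 1000                             # samples per second (assumed)
t = np.arange(sampling_rate) / sampling_rate     # one second of time stamps

# hypothetical mixture of three pure tones
component_freqs = [5, 20, 50]
signal = sum(np.sin(2 * np.pi * f * t) for f in component_freqs)

# real-input FFT: one magnitude per frequency bin
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sampling_rate)

# the three largest peaks sit at the three component frequencies
top3 = sorted(freqs[np.argsort(spectrum)[-3:]])
print(top3)
```

A thousand time-domain values are summarised here by three frequencies, which is the sense in which the frequency domain can be the more compact representation.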

There are a few more ways in which audio data can be represented, for example using the MFC (mel-frequency cepstrum; we will cover this in a later article). These are nothing but different ways to represent the data.

Now the next step is to extract features from these audio representations, so that our algorithm can work on these features and perform the task it is designed for. Here’s a visual representation of the categories of audio features that can be extracted.

After these features are extracted, they are sent to the machine learning model for further analysis.
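As a rough illustration of the feature extraction step, the sketch below computes three simple hand-made features with NumPy: two from the time domain (zero-crossing rate, RMS energy) and one from the frequency domain (spectral centroid). The `simple_features` helper and the two test tones are assumptions for the example; the case study later in this article uses MFCC features from librosa instead.

```python
import numpy as np

def simple_features(y, sr):
    """Return a small feature vector for a mono signal y sampled at sr Hz."""
    # time domain: fraction of consecutive samples whose sign differs
    zero_crossing_rate = np.mean(np.signbit(y[:-1]) != np.signbit(y[1:]))
    # time domain: root-mean-square energy
    rms = np.sqrt(np.mean(y ** 2))
    # frequency domain: magnitude-weighted mean frequency (spectral centroid)
    magnitude = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1 / sr)
    centroid = np.sum(freqs * magnitude) / np.sum(magnitude)
    return np.array([zero_crossing_rate, rms, centroid])

sr = 8000
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 100 * t + 0.1)     # hypothetical 100 Hz tone
high_tone = np.sin(2 * np.pi * 1000 * t + 0.1)   # hypothetical 1000 Hz tone

low_features = simple_features(low_tone, sr)
high_features = simple_features(high_tone, sr)
print(low_features)
print(high_features)
```

Both tones have the same energy, but the zero-crossing rate and the centroid separate them, which is the kind of signal a classifier can pick up on.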


Let’s solve the UrbanSound challenge!

Let us get a more practical view through a real-life project, the Urban Sound challenge. This practice problem is meant to introduce you to audio processing in the usual classification scenario.

The dataset contains 8732 sound excerpts (<=4s) of urban sounds from 10 classes, namely:

  • air conditioner,
  • car horn,
  • children playing,
  • dog bark,
  • drilling,
  • engine idling,
  • gun shot,
  • jackhammer,
  • siren, and
  • street music

Here’s a sound excerpt from the dataset. Can you guess which class it belongs to?

To play this in the Jupyter notebook, you can simply follow along with the code.

import IPython.display as ipd
ipd.Audio('../data/Train/2022.wav')

Now let us load this audio in our notebook as a numpy array. For this, we will use the librosa library in Python. To install librosa, just type this in the command line

pip install librosa

Now we can run the following code to load the data

import librosa

data, sampling_rate = librosa.load('../data/Train/2022.wav')

When you load the data, it gives you two objects: a numpy array of the audio file and the corresponding sampling rate at which it was extracted. Now to represent this as a waveform (which it originally is), use the following code

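%pylab inline
import os
import pandas as pd
import librosa
# import the display submodule explicitly (needed on older librosa versions)
import librosa.display
import glob

plt.figure(figsize=(12, 4))
# waveplot was removed in librosa 0.10; on newer versions use librosa.display.waveshow
librosa.display.waveplot(data, sr=sampling_rate)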

The output comes out as follows

Let us now visually inspect our data and see if we can find patterns in the data

Class: jackhammer

Class: drilling

Class: dog_bark

We can see that it may be difficult to differentiate between jackhammer and drilling, but it is still easy to discern between dog_bark and drilling. To see more such examples, you can use this code

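import random

data_dir = '../data'
# assumption: the training metadata (columns ID and Class) is stored in train.csv,
# by analogy with the test.csv file used later in this article
train = pd.read_csv(os.path.join(data_dir, 'train.csv'))

# pick a random clip from the training set
i = random.choice(train.index)

audio_name = train.ID[i]
path = os.path.join(data_dir, 'Train', str(audio_name) + '.wav')

print('Class: ', train.Class[i])
x, sr = librosa.load(path)

plt.figure(figsize=(12, 4))
librosa.display.waveplot(x, sr=sr)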



Intermission: Our first submission

We will take a similar approach to the one we used for the Age Detection problem: look at the class distribution and simply predict the most frequent class for every test case.

Let us see the distributions for this problem.


jackhammer 0.122907
engine_idling 0.114811
siren 0.111684
dog_bark 0.110396
air_conditioner 0.110396
children_playing 0.110396
street_music 0.110396
drilling 0.110396
car_horn 0.056302
gun_shot 0.042318
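Proportions like the ones above can be obtained with pandas' `value_counts(normalize=True)`. The sketch below runs on a tiny made-up frame (`train_toy` is hypothetical, as are its five rows), since the real metadata file is not reproduced here.

```python
import pandas as pd

# toy stand-in for the training metadata; the real file has one row per clip
train_toy = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4],
    'Class': ['siren', 'jackhammer', 'jackhammer', 'dog_bark', 'jackhammer'],
})

# share of each class, largest first
distribution = train_toy.Class.value_counts(normalize=True)
print(distribution)

# the most frequent class is what a majority-class baseline predicts everywhere
majority_class = distribution.idxmax()
print(majority_class)
```

On a labelled set with the same distribution, a majority-class baseline scores an accuracy equal to the top proportion, which is why the largest value in the table is the number to beat.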

We see that the jackhammer class has more samples than any other class. So let us create our first submission with this idea.

test = pd.read_csv('../data/test.csv')
test['Class'] = 'jackhammer'
test.to_csv('sub01.csv', index=False)

This seems like a good benchmark for any challenge, but for this problem it seems a bit unfair, because the dataset is not very imbalanced.


Let’s solve the challenge! Part 2: Building better models

Now let us see how we can leverage the concepts we learned above to solve the problem. We will follow these steps to solve the problem.

Step 1: Load audio files
Step 2: Extract features from audio
Step 3: Convert the data to pass it in our deep learning model
Step 4: Run a deep learning model and get results

Below is the code showing how I implemented these steps.

Step 1 and 2 combined: Load audio files and extract features

import numpy as np

def parser(row):
   # function to load files and extract features
   file_name = os.path.join(os.path.abspath(data_dir), 'Train', str(row.ID) + '.wav')

   # handle exception to check if there isn't a file which is corrupted
   try:
      # here kaiser_fast is a technique used for faster extraction
      X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
      # we extract mfcc feature from data
      mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
   except Exception as e:
      print("Error encountered while parsing file: ", file_name)
      return None, None
   feature = mfccs
   label = row.Class
   return [feature, label]

# result_type='expand' (pandas 0.23+) turns each returned pair into two columns
temp = train.apply(parser, axis=1, result_type='expand')
temp.columns = ['feature', 'label']
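The `np.mean(... .T, axis=0)` step in `parser` is what turns clips of different lengths into feature vectors of one fixed size. The sketch below shows this on random stand-in matrices rather than real MFCCs; the `pool` helper and the frame counts are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for MFCC matrices of shape (n_mfcc, n_frames)
short_clip = rng.normal(size=(40, 44))     # roughly a 1 second clip
long_clip = rng.normal(size=(40, 173))     # roughly a 4 second clip

def pool(mfcc):
    # transpose to (n_frames, n_mfcc), then average over the frames
    return np.mean(mfcc.T, axis=0)

short_vec = pool(short_clip)
long_vec = pool(long_clip)
print(short_vec.shape, long_vec.shape)
```

Both clips end up as 40 numbers, matching the `input_shape=(40,)` the network expects. The price is that all information about how the sound evolves over time is averaged away.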


Step 3: Convert the data to pass it in our deep learning model

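import numpy as np
from sklearn.preprocessing import LabelEncoder
# the same function that older Keras versions also expose as np_utils.to_categorical
from keras.utils import to_categorical

X = np.array(temp.feature.tolist())
y = np.array(temp.label.tolist())

lb = LabelEncoder()

# integer-encode the class names, then one-hot encode them
y = to_categorical(lb.fit_transform(y))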
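To see what the label conversion produces, here is a NumPy-only sketch of the same two steps (integer encoding, then one-hot encoding) on four made-up labels. It mirrors the behaviour of `LabelEncoder` followed by `to_categorical` but does not call either; it is an illustration, not a replacement.

```python
import numpy as np

labels = np.array(['siren', 'dog_bark', 'siren', 'jackhammer'])

# np.unique sorts the class names and returns each label's integer index,
# which is the same ordering LabelEncoder uses
classes, encoded = np.unique(labels, return_inverse=True)

# one row per sample, one column per class, a single 1 per row
one_hot = np.eye(len(classes))[encoded]

print(classes)
print(encoded)
print(one_hot)
```

The number of columns in the one-hot matrix is what the model code reads back as `num_labels`.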

Step 4: Run a deep learning model and get results

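import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import Adam
from keras.utils import to_categorical
from sklearn import metrics

num_labels = y.shape[1]

# build model
model = Sequential()

model.add(Dense(256, input_shape=(40,)))
# NOTE: the published listing omits the layers after the first Dense layer;
# the four lines below are an assumed minimal completion, not necessarily
# the author's exact architecture
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_labels))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')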

Now let us train our model

# val_x and val_y are a held-out validation set prepared the same way as X and y
model.fit(X, y, batch_size=32, epochs=5, validation_data=(val_x, val_y))
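The training call uses `val_x` and `val_y`, which the listings in this article never define. One possible way to obtain them is a simple random hold-out split, sketched below with NumPy. The `holdout_split` helper, the 20% fraction and the stand-in arrays are all assumptions; the sample counts in the training log (5435 and 1359) are consistent with roughly that ratio, but the exact split used is not stated.

```python
import numpy as np

def holdout_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# stand-in arrays with the same layout as the feature matrix and one-hot labels
X_all = np.random.default_rng(1).normal(size=(100, 40))
y_all = np.eye(10)[np.arange(100) % 10]

train_x, train_y, val_x, val_y = holdout_split(X_all, y_all)
print(train_x.shape, val_x.shape)
```

With a split like this in place, the model would be fitted on the training part only and evaluated on the held-out part after every epoch.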

This is the result I got on training for 5 epochs

Train on 5435 samples, validate on 1359 samples
Epoch 1/10
5435/5435 [==============================] - 2s - loss: 12.0145 - acc: 0.1799 - val_loss: 8.3553 - val_acc: 0.2958
Epoch 2/10
5435/5435 [==============================] - 0s - loss: 7.6847 - acc: 0.2925 - val_loss: 2.1265 - val_acc: 0.5026
Epoch 3/10
5435/5435 [==============================] - 0s - loss: 2.5338 - acc: 0.3553 - val_loss: 1.7296 - val_acc: 0.5033
Epoch 4/10
5435/5435 [==============================] - 0s - loss: 1.8101 - acc: 0.4039 - val_loss: 1.4127 - val_acc: 0.6144
Epoch 5/10
5435/5435 [==============================] - 0s - loss: 1.5522 - acc: 0.4822 - val_loss: 1.2489 - val_acc: 0.6637

Seems OK, but the score can obviously be improved. (PS: I could get an accuracy of 80% on my validation dataset.) Now it's your turn: can you improve on this score? If you do, let me know in the comments below!


Future steps to explore

Now that we have seen a simple application, we can think of a few more methods which can help us improve our score

  1. We applied a simple neural network model to the problem. Our immediate next step should be to understand where the model fails and why. We want to build an understanding of the algorithm's failures so that the next model we build does not make the same mistakes.
  2. We can build more powerful models than our “better models”, such as convolutional neural networks or recurrent neural networks. These models have been proven to solve such problems with greater ease.
  3. We touched on the concept of data augmentation, but we did not apply it here. You could try it to see if it works for the problem.
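For the data augmentation idea in the last point, a minimal sketch could look like the following. Both transforms (`add_noise`, `time_shift`), their parameters and the synthetic 440 Hz clip are assumptions chosen for illustration; whether they help on this dataset would have to be tested.

```python
import numpy as np

def add_noise(y, noise_level=0.005, seed=0):
    """Mix in Gaussian noise scaled relative to the signal's peak amplitude."""
    rng = np.random.default_rng(seed)
    return y + noise_level * np.max(np.abs(y)) * rng.normal(size=y.shape)

def time_shift(y, max_shift, seed=0):
    """Rotate the clip by a random number of samples (wraps around)."""
    rng = np.random.default_rng(seed)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(y, shift)

sr = 22050
clip = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # stand-in for a loaded clip

noisy = add_noise(clip)
shifted = time_shift(clip, max_shift=sr // 10)
print(noisy.shape, shifted.shape)
```

Each augmented copy keeps the label of the clip it came from, so the training set grows without any new recordings.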


End Notes

In this article, I have given a brief overview of audio processing with a case study on the UrbanSound challenge. I have also shown the steps you perform when dealing with audio data in Python with the librosa package. With this “shastra” in your hand, I hope you will try your own algorithms in the Urban Sound challenge, or try solving your own audio problems in daily life. If you have any suggestions/ideas, do let me know in the comments below!

Learn, engage, hack and get hired!


About the Author

JalFaizy Shaikh

Faizan is a Data Science enthusiast and a Deep learning rookie. A recent Comp. Sc. undergrad, he aims to utilize his skills to push the boundaries of AI research.


40 thoughts on "Getting Started with Audio Data Analysis using Deep Learning (with case study)"

kishor Peddolla says: August 24, 2017 at 12:38 pm
Hi Faizan, It was great explanation thank you. and i am working like same problem but it is on the financial(bank customer) speech recognition problem, would you please help on this, Thank you in advance Regards, Kishor Peddolla
Karthikeyan Sankaran says: August 24, 2017 at 12:47 pm
Nice article, Faizan. Gives a good foundation to exploring audio data. Keep up the good work. Thanks Regards Karthik
Kalyanaraman says: August 24, 2017 at 12:49 pm
Thanks. This is something I had been thinking for sometime.
Faizan Shaikh says: August 24, 2017 at 4:54 pm
Thanks Karthikeyan
Faizan Shaikh says: August 24, 2017 at 4:55 pm
Thanks kalyanaraman
Faizan Shaikh says: August 24, 2017 at 5:06 pm
Hey Kishor, Sure! Your problem seems interesting. I might add that Speech recognition is more complex than audio classification, as it involves natural language processing too. Can you explain what approach you followed as of now to solve the problem? Also, I would suggest creating a thread on discussion portal so that more people from the community could contribute to help you
Manoj says: August 24, 2017 at 5:09 pm
Nice article. I liked the introduction to python libraries for audio. Any chance, you cover hidden markov models for audio and related libraries. Thank you
Georgios Sarantitis says: August 24, 2017 at 6:02 pm
Hello Faizan and thank you for your introduction to sound recognition and clustering! Just a kind remark, I noticed that you have imported the Convolutional and maxpooling layers which you do not use so I guess there's no need for them to be there....But I did say WOW when I saw them - I thought you would implement a CNN solution...
Nagu says: August 24, 2017 at 9:33 pm
Hi Faizan This is a very good article to get started on Audio analysis. I do not think any other books out there could have given this type of explanation ! Keep up the great work !!!
Krish says: August 25, 2017 at 12:19 am
Great Work! Appreciate your effort in documenting this.
Faizan Shaikh says: August 26, 2017 at 12:27 pm
Thanks Krish
Faizan Shaikh says: August 26, 2017 at 12:27 pm
Thanks Nagu
Faizan Shaikh says: August 26, 2017 at 12:28 pm
Thanks Manoj! I'll try to cover this in the next article
Gowri says: August 26, 2017 at 3:25 pm
Great work faizan! I did go through this article and I find that most of machine learning articles require extensive knowledge of dataset or domain : like speech here. How does one do that and how do you decide to work on such problems ? Any references? I usually tend to follow moocs, but how to do self research and design end to end processes especially for machine learning?
Darli Yang says: September 05, 2017 at 6:40 pm
Hi Faizan, I got the following result, would you give some solutions to me:
In [132]: model.fit(X, y, batch_size=32, epochs=5)
Traceback (most recent call last):
  File "", line 1, in
    model.fit(X, y, batch_size=32, epochs=5)
  File "C:\Users\admin\Anaconda2\lib\site-packages\keras\models.py", line 867, in fit
    initial_epoch=initial_epoch)
  File "C:\Users\admin\Anaconda2\lib\site-packages\keras\engine\training.py", line 1522, in fit
    batch_size=batch_size)
  File "C:\Users\admin\Anaconda2\lib\site-packages\keras\engine\training.py", line 1378, in _standardize_user_data
    exception_prefix='input')
  File "C:\Users\admin\Anaconda2\lib\site-packages\keras\engine\training.py", line 144, in _standardize_input_data
    str(array.shape))
ValueError: Error when checking input: expected dense_7_input to have shape (None, 40) but got array with shape (5435L, 1L)
Faizan Shaikh says: September 06, 2017 at 9:01 pm
Hi Gowri, You are right to say that data science problems involve domain knowledge to solve problems, and this comes from experience in working on those kind of problems. When I take up a problem, I try to do as much research as I can and also, try to get hands on experience in it. Each person has his or her own learning process. So my process may or may not work for you. Still I would suggest a course that would help you https://www.coursera.org/learn/learning-how-to-learn
Phani says: September 16, 2017 at 2:29 am
Thank you for the great explanation. Do you mind making the source code including data files and iPython notebook available through gitHub?
Faizan Shaikh says: September 19, 2017 at 12:02 pm
Sure. Will do
Phani says: September 25, 2017 at 9:01 pm
Hi Faizan, A friendly reminder about the ipython notebook you promised. Here is the reason for my curiosity. While experimenting with urban sound dataset (https://serv.cusp.nyu.edu/projects/urbansounddataset/urbansound8k.html), with an identical deep feed forward neural network like yours, the best accuracy I have achieved is 65%. That is after lots of hyper parameterization. I know in this blog you have reported similar accuracy and further alluded that you could achieve 80% accuracy. That is impressive, and I am aiming for similar result. However, I have noticed your dataset size is not the full 8K set. In my experimentation, I am using audio folders1-8 for training, folder 9 for validation and folder 10 for testing. I get 65% accuracy both on the validation and testing sets. Hope you could share your notebook or help me towards 80% accuracy goal. While I am currently experimenting with data augmentation, your help is much appreciated. I am aiming for this higher accuracy before using the trained model/parameters for a custom project of mine to classify a personal audio dataset. Thank you in advance, Phani.
Phani says: September 25, 2017 at 9:25 pm
forgot to mention, for my training I am extracting 5 different datapoints (mfccs,chroma,mel,contrast,tonnetz) not just one (mfccs) like you did. With this fullset I get 65% accuracy. With mfccs alone I get only 53%. Also, 60% is the highest I saw so far in various other blogs with similar dataset. Interestingly convoluted networks (CNN) with mel features alone could not push this any further, making your results of 80% that much more impressive. Look forward to seeing your response. Thank you in advance.
Smitha says: October 26, 2017 at 11:42 am
Nice article... even I want to classify normal and pathological voice samples using keras... if I get any difficulty please help me regarding this....
Uraj singh says: November 13, 2017 at 12:35 pm
Thanks for suggesting the wonderful course !!
Sourish says: November 16, 2017 at 11:09 am
Hi Faizan, Thank you for introducing this concept. However there is a basic problem,I am facing. I can't install librosa, as every time I typed import librosa I got AttributeError: module 'llvmlite.binding' has no attribute 'get_host_cpu_name'. I googled a lot, but didn't find a solution for this. Can you please provide a solution here, so that I can proceed further. Thanks
Faizan Shaikh says: November 16, 2017 at 4:46 pm
The input which you give to the neural network is improper. You can answer the following questions to get the answer

1. What is the shape of input layer?
2. What is the shape of X?
Faizan Shaikh says: November 16, 2017 at 4:51 pm
Hi, A solution to similar issue was to reinstall llvm package by executing sudo apt-get install llvm
Faizan Shaikh says: November 16, 2017 at 4:58 pm
Sure
Sourish says: November 17, 2017 at 12:10 pm
Tried with that, however not solved the problem.mine is windows OS with anaconda environment. Thanks
Faizan Shaikh says: November 17, 2017 at 4:40 pm
As a last resort, you can rely on a docker system for testing out the code
Darli says: November 22, 2017 at 8:24 pm
I have solved this problem, Thanks!
Toke Hiber says: April 11, 2018 at 6:48 am
Hi sir. Thanks for this nice article. But how to I get datasets?
LouisCC says: April 17, 2018 at 2:54 pm
Hi, How do you read train.scv to get train variable ? Thank You in advance Louis
louisCC says: April 17, 2018 at 3:05 pm
Hello You can find the dataset here : https://drive.google.com/drive/folders/0By0bAi7hOBAFUHVXd1JCN3MwTEU
Maxwel says: April 18, 2018 at 8:16 am
Can i get the dataset please
Houda Abzd says: April 18, 2018 at 9:29 pm
Hi, I would like to use your example for my problem which is the separation of audio sources , I have some troubles using the code because I don't know what do you mean by "train" , and also I need your data to run the example to see if it is working in my python, so can you plz provide us all the data through gitHub?
Aishwarya Singh says: April 19, 2018 at 1:20 pm
Hi Maxwel, The link to the dataset is provided in the article itself.
Aishwarya Singh says: April 20, 2018 at 3:41 pm
Hi Louis, The link for the dataset is provided in the article itself. you can download it from there.
Aishwarya Singh says: April 20, 2018 at 3:53 pm
Hi Houda, The dataset has two parts, train and test. The link to download the datasets is provided in the article itself.
Houda bzd says: April 20, 2018 at 4:13 pm
Hi, thanks for the nice article, I have a problem dealing with the code, it gives me "name 'train' is not defined" even I have the dataset , can you help me plz ? Best.
Houda Abzd says: April 21, 2018 at 9:17 pm
Hi Aishwarya , First of all , thanks for your feedback, I download the data, otherwise, I get this error: TypeError: '<' not supported between instances of 'NoneType' and 'str' , this error comes with this command: y = np_utils.to_categorical(lb.fit_transform(y)) knowing that I am using python 3.6. any help or suggestion I will be upreciating that :) Best.
Aishwarya Singh says: April 23, 2018 at 1:46 pm
Hi, Glad you liked the article. Also, check the name you have set for the dataset you're trying to load. I guess it should be 'Train', not 'train'
