Shivani Sharma — August 28, 2021

This article was published as a part of the Data Science Blogathon


Analysts increasingly have to turn to modern machine learning algorithms to detect deviations in the behavior of the system under study. The most in-demand are computer vision algorithms for processing photos and video, and natural language processing techniques for text analysis. However, working with audio is an equally important area, and it is the subject of this article.


Understanding SincNet

Suppose you are tasked with analyzing a large number of customer phone calls to identify cases of pseudo-trusting management, i.e. cases where the same person represents the interests of several clients by telephone. The total volume of audio data was more than 500 GB, and the total duration was 445 days (about 11 thousand hours). Naturally, a few people cannot listen to all of these recordings, so to solve this problem I used automatic clustering of similar voices followed by analysis of the resulting groups.

SincNet was chosen as the model for obtaining voice vectors. Before describing the method, let's consider which approaches to extracting features from sound exist and why we chose SincNet.

Amplitude-time analysis is perhaps the simplest approach to sound processing.

Fig 1

In this approach, an audio signal is treated as a one-dimensional representation of the oscillations of a sound wave, sampled at a certain frequency (Fig. 1). One of the main advantages of this approach is that it preserves complete information about the signal: the signal is analyzed "as is", without discarding anything important. This is also its main disadvantage: it is difficult to separate the useful part of the signal from noise, the high dimensionality of the data makes fast, high-quality processing difficult, and few useful properties can be extracted from the signal directly.
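To make the dimensionality problem concrete, here is a minimal sketch that builds a synthetic one-second signal and counts samples. The 16 kHz rate matches the recordings described later in the article; the tone frequencies and noise level are made up for illustration.

```python
import numpy as np

fs = 16000                              # sampling rate in Hz (assumed)
t = np.arange(fs) / fs                  # one second of time stamps

# Synthetic stand-in for a voice: 130 Hz tone, one overtone and a little noise.
rng = np.random.default_rng(0)
signal = 0.6 * np.sin(2 * np.pi * 130 * t) + 0.3 * np.sin(2 * np.pi * 260 * t)
signal = signal + 0.05 * rng.standard_normal(t.size)

# The raw representation is just a long one-dimensional array of amplitudes.
print(signal.ndim, signal.shape)

# At this rate, 11 thousand hours of calls contain this many amplitude values.
total_samples = 11_000 * 3600 * fs
print(total_samples)
```

Even one second of speech is a 16,000-dimensional vector, which is why raw waveforms are rarely fed to classical models directly.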

Spectral analysis is an alternative approach. Its essence is that the original signal can be decomposed into components with arbitrary accuracy. In other words, any complex signal can be represented as a sum of sinusoids with certain frequencies and amplitudes. The spectrum computed over successive time frames is therefore a two-dimensional representation of the signal, which makes it possible to judge exactly how the signal energy is distributed over frequencies in time.
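A minimal sketch of this decomposition, assuming a short-time Fourier transform with a Hann window. The function name `spectrogram` and the frame sizes (25 ms frames, 10 ms hop at 16 kHz) are illustrative choices, not taken from the article.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram: rows are time frames, columns are frequency bins."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 16000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 1000 * t)        # pure 1 kHz test tone

spec = spectrogram(signal)
freqs = np.fft.rfftfreq(400, d=1 / fs)       # centre frequency of each bin
peak_hz = freqs[spec.mean(axis=0).argmax()]  # bin holding the most energy
print(spec.shape, peak_hz)
```

The one-dimensional signal becomes a time-by-frequency matrix, and the energy of the test tone concentrates in the bin around 1 kHz.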

Fig 2

To understand why spectral analysis has enabled significant progress in audio processing, let's recall what sound is and what it consists of, using the human voice as an example.

Fig 3

When air passes through the vocal cords, vibrations arise, which propagate in the medium in the form of elastic waves. Each sound (unless it is artificial) is a whole set of such waves. By studying the fundamental tone of a sound, its overtones, and formants, one can successfully solve certain problems associated with the analysis of audio.

For example, the pitch frequency (the fundamental, i.e. the lowest frequency of the voiced signal) is often used in gender determination tasks, since the average pitch differs between men and women: about 130 Hz for men versus 235 Hz for women. Analysis of the set of voice overtones is often useful in speaker identification, because this set depends on the speech apparatus, which is individual to each person. Finally, analysis of formants (regions where certain frequencies are amplified) is actively used in speech-to-text tasks.
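As an illustration of pitch analysis, the sketch below estimates the fundamental frequency of a synthetic tone from the peak of its autocorrelation. The helper names `estimate_pitch` and `tone`, the 60 to 400 Hz search range, and the test signals are assumptions for this example; real speech needs voicing detection and more robust estimators.

```python
import numpy as np

def estimate_pitch(signal, fs, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency from the autocorrelation peak."""
    signal = signal - signal.mean()
    # Full autocorrelation, keeping non-negative lags only.
    acf = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = int(fs / fmax)              # shortest period considered
    lag_max = int(fs / fmin)              # longest period considered
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return fs / best_lag

def tone(f0, fs, duration=0.1):
    """Synthetic voiced frame: a fundamental plus two weaker overtones."""
    t = np.arange(int(duration * fs)) / fs
    return (np.sin(2 * np.pi * f0 * t)
            + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
            + 0.25 * np.sin(2 * np.pi * 3 * f0 * t))

fs = 16000
low_pitch = estimate_pitch(tone(130.0, fs), fs)    # typical male average
high_pitch = estimate_pitch(tone(235.0, fs), fs)   # typical female average
print(round(low_pitch), round(high_pitch))
```

The estimates are only accurate to within a sample of lag, so they land close to, not exactly on, 130 Hz and 235 Hz.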

The spectrogram allows you to successfully analyze all the components described above, and it is largely due to this that the models created based on data on the distribution of energy over frequencies in the signal have fairly good accuracy.

But there is also a third approach to processing data in an audio signal, and this approach is associated with the now fashionable concept – neural networks. The idea is simple – let’s feed audio to the network input and expect that the network will learn to independently identify patterns in the data and solve the tasks we need, be it feature extraction, speaker identification, speech recognition, emotion analysis, etc.

Fig 4

Convolutional neural networks are often used as the architecture for extracting primary features from a signal; they show good results not only in computer vision tasks but also in machine hearing tasks. During training, a CNN learns the convolutions that best describe the data. The data itself does not have to be presented as the original signal. For example, representing the signal as a spectrum, FBANK, or MFCC features gives a significant increase in the quality and speed of training.

The problem is that by replacing the original signal with a derived representation we simplify the task for the network on the one hand, but on the other we deprive it of the opportunity to find the best representation of the signal itself. The essence of the SincNet approach is that the signal is fed to the network input in its original form, and band-pass filters are used as convolutions, with parameters selected by the network during training. For each filter, only 2 parameters are trained: the lower and upper cutoff frequencies. Thus, on the one hand, we allow the algorithm to see the raw data, and on the other hand, we teach the network to look at the data only in the context of certain frequency ranges.

The band-pass filters themselves can be represented as the difference between two low-pass filters (Fig. 5). In the time domain, such a filter is the difference of two sinc functions (hence the name of the network, SincNet). Convolving the original signal with the resulting filter is equivalent to selecting a certain frequency band in the signal.
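The construction can be sketched in a few lines of NumPy. This is an illustration of the idea, not the SincNet implementation: the function name `sinc_bandpass`, the 251-tap length, the Hamming window, and the 300 to 3400 Hz band are assumptions chosen for the example.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, fs, length=251):
    """Band-pass kernel: difference of two low-pass sinc filters, windowed."""
    n = np.arange(length) - (length - 1) / 2      # symmetric time axis

    def low_pass(cutoff):
        # Ideal low-pass impulse response; np.sinc(x) is sin(pi*x) / (pi*x).
        return 2 * cutoff / fs * np.sinc(2 * cutoff / fs * n)

    kernel = low_pass(f_high) - low_pass(f_low)   # keeps f_low..f_high
    return kernel * np.hamming(length)            # smooth the truncation

def rms(x):
    return np.sqrt(np.mean(x ** 2))

fs = 16000
kernel = sinc_bandpass(300.0, 3400.0, fs)

t = np.arange(fs) / fs
inside = np.sin(2 * np.pi * 1000 * t)     # tone inside the pass band
outside = np.sin(2 * np.pi * 6000 * t)    # tone outside the pass band

gain_inside = rms(np.convolve(inside, kernel, mode="valid")) / rms(inside)
gain_outside = rms(np.convolve(outside, kernel, mode="valid")) / rms(outside)
print(gain_inside, gain_outside)
```

The whole 251-tap kernel is determined by just two numbers, `f_low` and `f_high`. The tone inside the band passes almost unchanged, while the tone outside it is strongly attenuated.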

Figure 5

Comparing the convolutions learned by a standard CNN with those learned by SincNet leads to the conclusion that both networks ultimately learn the same thing, namely to pick out certain frequencies of interest in the signal. However, SincNet has a trump card: its first layer has built-in knowledge of exactly what the filter looks like, while a standard CNN has to find the best filter shape on its own (Fig. 6).

Figure 6

Because of this approach, SincNet has several advantages over standard convolutional networks, namely:

  1. Fewer parameters (each SincNet convolution always depends on only 2 parameters – low and high frequency);

  2. Fewer operations to calculate the filter, since the filter is symmetric: we calculate the first half and mirror it to get the second;

  3. Fast convergence;

  4. Good network interpretability – each convolution is a filter with clear boundaries. At the same time, the normalized sum of filters indicates the focus of attention on the main tone and formants.
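The first point is easy to check with simple arithmetic. The layer size below (80 filters of 251 taps) is a hypothetical example chosen for illustration, and biases are ignored.

```python
def first_layer_params(n_filters, kernel_len):
    """Trainable weights in the first convolutional layer (biases ignored)."""
    standard_cnn = n_filters * kernel_len   # every filter tap is learned
    sincnet = n_filters * 2                 # only low and high cutoff per filter
    return standard_cnn, sincnet

# Hypothetical layer size: 80 filters, 251 taps each.
standard, sinc = first_layer_params(80, 251)
print(standard, sinc)  # 20080 160
```

For SincNet the count does not depend on the kernel length at all, so longer filters cost no additional trainable parameters.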

All these advantages are supported by comparisons of quality metrics, where SincNet shows better results than the classic combinations DNN-MFCC, CNN-FBANK, and CNN-RAW.



We used wav files with a sampling rate of 16 kHz as input data. The caller's number was used as the label, on the assumption that the same person calls from the same number. Primary preprocessing of the data includes, in addition to labeling, the removal of pauses and silence using a third-party Voice Activity Detection model, as well as loudness normalization:

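import os

import numpy as np
import soundfile as sf  # assumed: sf is the soundfile package
import torch
from tqdm import tqdm

# read_audio, save_audio, get_speech_ts_adaptive and collect_chunks are the
# helper functions that ship with the silero-vad repository.

def audio_normalization(source_file, norm_file):
    # The read call is missing from the original listing; sf.read is assumed
    # here because the result is written back with sf.write.
    signal, fs = sf.read(source_file)
    signal = signal.astype(np.float64)
    signal = signal / np.max(np.abs(signal))  # peak-normalize to [-1, 1]
    sf.write(norm_file, signal, fs)

# load the VAD model
model = torch.jit.load(r'D:\caller_cluster\sincnet\silero-vad\files\model_mini.jit')
model.eval()

empty_file = []
for name in tqdm(wavs):  # wavs: names of the wav files in the 'calls' folder
    way = os.path.join('calls', name)
    wav = read_audio(way)
    speech_timestamps = get_speech_ts_adaptive(wav, model, step=500, num_samples_per_window=4000)
    if len(speech_timestamps) > 0:
        # remove pauses
        save_audio(way, collect_chunks(speech_timestamps, wav), 16000)
        # normalize
        audio_normalization(way, way)
    else:
        # delete the file if there is no speech
        empty_file.append(way)
        os.remove(way)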

The network settings were left at their defaults; the only changes were the paths to the training datasets:

from dnn_models import MLP
from dnn_models import SincNet as CNN 

At its core, SincNet consists of three models. The first is a CNN with sinc-based convolution kernels, used to extract primary features. The second collects all the features and transforms them into a lower-dimensional vector. The third takes the 2048-dimensional vector produced by the first two models and predicts which speaker the voice belongs to:

# Feature extractor (trunk): CNN with sinc-based first layer.
trunk_arch = {'input_dim': wlen, 'fs': fs, 'cnn_N_filt': cnn_N_filt, 'cnn_len_filt': cnn_len_filt,
              'cnn_max_pool_len': cnn_max_pool_len, 'cnn_use_laynorm_inp': cnn_use_laynorm_inp,
              'cnn_use_batchnorm_inp': cnn_use_batchnorm_inp, 'cnn_use_laynorm': cnn_use_laynorm,
              'cnn_use_batchnorm': cnn_use_batchnorm, 'cnn_act': cnn_act, 'cnn_drop': cnn_drop}
trunk = torch.nn.DataParallel(CNN(trunk_arch).to(device))
# Embedder: maps the trunk output to the embedding vector.
embedd_arch = {'input_dim': trunk_out_dim, 'fc_lay': fc_lay, 'fc_drop': fc_drop,
               'fc_use_batchnorm': fc_use_batchnorm, 'fc_use_laynorm': fc_use_laynorm,
               'fc_use_laynorm_inp': fc_use_laynorm_inp, 'fc_use_batchnorm_inp': fc_use_batchnorm_inp,
               'fc_act': fc_act}
# Assumed to mirror the trunk; the original listing shows only the settings dict.
embedder = torch.nn.DataParallel(MLP(embedd_arch).to(device))
# Classifier: takes the embeddings and outputs a vector with one score per speaker.
classifier_arch = {'input_dim': fc_lay[-1], 'fc_lay': class_lay, 'fc_drop': class_drop,
                   'fc_use_batchnorm': class_use_batchnorm, 'fc_use_laynorm': class_use_laynorm,
                   'fc_use_laynorm_inp': class_use_laynorm_inp, 'fc_use_batchnorm_inp': class_use_batchnorm_inp,
                   'fc_act': class_act}
# Assumed to mirror the trunk; the original listing shows only the settings dict.
classifier = torch.nn.DataParallel(MLP(classifier_arch).to(device))


Since the ultimate goal in our task was clustering rather than classification, we needed to train this stack so that the vectors at the output of the second model (the embedder) were close in n-dimensional space if they belonged to the same person and far apart if they belonged to different people. For this, the training process was modified using the metric learning technique.

Since different people can call from the same number and the same person can call from different numbers, we decided to first pre-train the model on a dataset whose labels we were 100% sure of. This dataset was collected from the voices of employees and labeled automatically.
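The article does not say which metric-learning loss was used, so the sketch below assumes a triplet margin loss, one common choice, purely to illustrate the objective. The embeddings are made-up two-dimensional vectors.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    d_pos = np.linalg.norm(anchor - positive)   # distance to the same speaker
    d_neg = np.linalg.norm(anchor - negative)   # distance to another speaker
    return float(max(0.0, d_pos - d_neg + margin))

# Made-up embeddings: two utterances of one speaker and one of another.
anchor = np.array([1.0, 0.0])
same_speaker = np.array([0.9, 0.1])
other_speaker = np.array([-1.0, 0.0])

well_separated = triplet_loss(anchor, same_speaker, other_speaker)
badly_separated = triplet_loss(anchor, other_speaker, same_speaker)
print(well_separated, badly_separated)
```

Minimizing such a loss pulls embeddings of the same voice together and pushes different voices apart, which is exactly the property that clustering needs.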

To assess how close the resulting vectors are, we applied the UMAP dimensionality reduction method (Fig. 8). After several epochs of training, it became clear that the model works correctly and successfully separates the voice of one person from the voice of another. Visually, this shows up as dots of the same color (the same voice) forming separate groups.

Figure 8

Despite the good work of the vectorizer, the accuracy of the classifier was only 0.225 (err = torch.mean(predict != label)), which may be confusing at first. However, this accuracy is computed per batch, and a batch contains random fragments of audio, including fragments from which it is impossible to tell unambiguously whose voice it is (for example, a short pause before the next word or extraneous noise). To assess the quality of the model, a slightly different approach is used: a phrase from the speaker is divided into short sections, each section is classified, and the most frequently predicted label is then assigned to the whole phrase. Evaluated this way, our model turns out to have fairly high accuracy. The graph below shows the learning process.
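The voting scheme described above can be sketched as follows. The function names and the per-section predictions are hypothetical and serve only to show why the phrase-level error can be much lower than the section-level error.

```python
from collections import Counter

def sentence_label(chunk_predictions):
    """Label a whole phrase with the most frequently predicted section label."""
    return Counter(chunk_predictions).most_common(1)[0][0]

def sentence_error(utterances):
    """Share of phrases whose majority label differs from the true label."""
    wrong = sum(sentence_label(preds) != true for preds, true in utterances)
    return wrong / len(utterances)

def chunk_error(utterances):
    """Share of individual sections that were classified incorrectly."""
    total = sum(len(preds) for preds, _ in utterances)
    wrong = sum(p != true for preds, true in utterances for p in preds)
    return wrong / total

# Hypothetical data: (per-section predictions, true speaker id).
utterances = [
    ([3, 3, 7, 3, 1], 3),
    ([5, 2, 5, 5], 5),
    ([4, 9, 9], 4),
]
print(chunk_error(utterances), sentence_error(utterances))
```

In this toy data 5 of 12 sections are misclassified, but only 1 of 3 phrases receives the wrong label after voting.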


As you can see, the test_sent_error of the resulting model had already decreased to only 0.055 by the 32nd epoch, which can be interpreted as accuracy = 0.945. This is a decent result. Training one epoch took about 6 minutes on a single GTX2070 and about 2 minutes on a Tesla V100 in the data lab. Thus, in less than a day, we managed to obtain a model of acceptable quality for our purposes.

As a result, a huge number of telephone conversations were processed using the model. The processing made it possible to identify about 55 groups of 2 to 5 clients whose applications were submitted by one and the same voice. Management was informed of this and appropriate measures were taken.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
