Complete Introductory Guide to Speech to Text with Transformers

Ritobrata Ghosh 17 Jul, 2023 • 9 min read


We all deal with audio data much more than we realize. The world is full of audio data and related problems waiting to be solved, and we can use Machine Learning to solve many of them. You are probably familiar with image, text, and tabular data being used to train Machine Learning models, and with Machine Learning being used to solve problems in these domains. With the advent of Transformer architectures, it has become possible to solve audio-related problems with much better accuracy than with previously known methods. We will learn the basics of Audio ML using speech-to-text with Transformers, and learn to use the Huggingface libraries to solve audio-related problems with Machine Learning.

Learning Objectives

  • Learn about the basics of audio Machine Learning and gain related background knowledge.
  • Learn how audio data is collected, stored, and processed for Machine Learning.
  • Learn about a common and valuable task: speech-to-text using Machine Learning.
  • Learn how to use Huggingface tools and libraries for your audio tasks, from finding datasets and trained models to using them to solve audio problems with the Huggingface Python libraries.

This article was published as a part of the Data Science Blogathon.


Since the Deep Learning revolution began in the early 2010s, when AlexNet won the ImageNet challenge by a wide margin, Transformer architectures have probably been the biggest breakthrough. Transformers have made previously intractable tasks possible and simplified the solutions to many problems. Although they were first intended for better results in natural language translation, they were soon adopted not only for other tasks in Natural Language Processing but also across domains: ViT (Vision Transformers) are applied to image-related tasks, Decision Transformers are used for decision making in Reinforcement Learning agents, and a recent paper called MagViT demonstrated the use of Transformers for various video-related tasks.

This all started with the now-famous paper Attention is All You Need, which introduced the Transformer architecture, built around the Attention mechanism. This article does not assume that you already know the inner workings of the Transformer architecture.

Although ChatGPT and GitHub Copilot are the names best known to the public and to everyday developers, Deep Learning has been used in many real-world use cases across many fields: Vision, Reinforcement Learning, Natural Language Processing, and so on.

In recent years, we have learned about many other use cases, such as drug discovery and protein folding. Audio is one of the fascinating fields that Deep Learning has not yet fully solved, at least not in the sense that Convolutional Neural Networks solved image classification on the ImageNet dataset.


  • I assume that you have experience working with Python. Basic Python knowledge is necessary. You should have an understanding of libraries and their common usage.
  • I also assume that you know the basics of Machine Learning and Deep Learning.
  • Previous knowledge of Transformers is not necessary but will be helpful.

Notes Regarding Audio Data: Inserting audio is not supported by this platform, so I have created a Colab notebook with all codes and audio data. You can find it here. Launch it in Google Colaboratory, and you can play all the audio in the browser from the notebook.

Introduction to Audio Machine Learning

You have probably seen audio ML in action. Saying “Hey Siri” or “Okay, Google” launches the assistants for their respective platforms. This is audio-related Machine Learning in action, and this particular application is known as “keyword detection”.

There is a good chance that many more problems in this domain can be solved using Transformers. Before jumping into the use of Transformers, though, let me quickly tell you how audio-related tasks were solved before them.

Before Transformers, audio data was usually converted to a melspectrogram, an image describing the audio clip at hand. It was then treated as an image and fed into Convolutional Neural Networks for training. During inference, the audio sample was first transformed into the melspectrogram representation, and the CNN would make its prediction based on that.

Exploring Audio Data

Now I will quickly introduce you to the `librosa` Python package. It is a very helpful package for dealing with audio data. I will generate a melspectrogram to give you an idea of what one looks like. You can find the librosa documentation on the web.

First, install the librosa library by running the following from your Terminal:

pip install librosa

Then, in your notebook, you have to import it simply like this:

import librosa

We will explore some basic functionalities of the library using some data that comes bundled with the library.

array, sampling_rate = librosa.load(librosa.ex("trumpet"))

We can see that the librosa.load() function returns an audio array along with its sampling rate for a trumpet sound. By default, librosa resamples the audio to 22,050 Hz when loading.
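To make the two return values concrete, here is a minimal sketch that does not need librosa. It builds a synthetic one-second tone (a stand-in for the trumpet clip, not the real data) and shows how the array length and the sampling rate together give the duration. The 440 Hz tone and the one-second length are assumptions for illustration only.

```python
import numpy as np

sampling_rate = 22_050   # librosa.load() uses 22,050 Hz unless told otherwise
n_samples = sampling_rate * 1   # hypothetical one-second clip

# Stand-in for the array returned by librosa.load(): a 440 Hz sine tone.
t = np.arange(n_samples) / sampling_rate
array = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

# Duration in seconds = number of samples / samples per second.
clip_duration = len(array) / sampling_rate
print(array.shape, array.dtype, clip_duration)
```

The same arithmetic applies to the real trumpet array: dividing its length by `sampling_rate` gives the clip length in seconds.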

import matplotlib.pyplot as plt
import librosa.display

librosa.display.waveshow(array, sr=sampling_rate)

This plots the audio data values to a graph like this:


On the X-axis, we see time, and on the Y-axis, we see the amplitude of the clip. Listen to it by:

from IPython.display import Audio as aud

aud(array, rate=sampling_rate)  # play back at the rate the clip was loaded with

You can listen to the sound in the Colab notebook I created for this blog post.
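Outside a notebook, one way to listen to a clip is to save it as a WAV file and open it in any audio player. The following is a minimal sketch using only NumPy and Python's standard `wave` module. The helper name `write_wav` and the synthetic tone are assumptions for illustration and are not part of librosa.

```python
import io
import wave

import numpy as np


def write_wav(target, array, sampling_rate):
    """Write a mono float array in [-1, 1] as 16-bit PCM WAV (path or file object)."""
    # Scale floats to the int16 range and store as little-endian 16-bit integers.
    pcm = (np.clip(array, -1.0, 1.0) * 32767).astype("<i2")
    with wave.open(target, "wb") as wf:
        wf.setnchannels(1)              # mono
        wf.setsampwidth(2)              # 2 bytes = 16 bits per sample
        wf.setframerate(sampling_rate)
        wf.writeframes(pcm.tobytes())


sampling_rate = 22_050
t = np.arange(sampling_rate) / sampling_rate
tone = 0.5 * np.sin(2 * np.pi * 440 * t)   # synthetic one-second tone

buffer = io.BytesIO()                   # a filename such as "tone.wav" works as well
write_wav(buffer, tone, sampling_rate)

# Read the header back to confirm what was written.
buffer.seek(0)
with wave.open(buffer, "rb") as wf:
    frames, rate = wf.getnframes(), wf.getframerate()
print(frames, rate)
```

Note that the sampling rate written to the file must match the rate of the array; otherwise the clip plays too fast or too slow.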

Plot a melspectrogram directly using librosa.

import numpy as np

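S = librosa.feature.melspectrogram(y=array, sr=sampling_rate,
                                   n_mels=128, fmax=8_000)

# Convert power values to decibels, relative to the loudest value
S_dB = librosa.power_to_db(S, ref=np.max)

# fmax matches the value passed to melspectrogram above
librosa.display.specshow(S_dB, x_axis="time",
                         y_axis="mel", sr=sampling_rate,
                         fmax=8_000)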

We use a melspectrogram because it packs a lot of information into a single image: it shows how the frequency content and the amplitude of the clip change over time. You can visit this nice article on Analytics Vidhya to learn more about spectrograms.

This is what much of the input data looked like in audio ML before Transformers, when it was used to train Convolutional Neural Networks.
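To show roughly what sits underneath such an image, here is a simplified NumPy-only sketch of a power spectrogram: the signal is cut into overlapping frames, each frame is windowed, and an FFT is taken per frame. It deliberately omits the mel filterbank and the decibel conversion that librosa applies, and the function name, frame size, and hop length are assumptions chosen for illustration.

```python
import numpy as np


def power_spectrogram(signal, n_fft=1024, hop_length=256):
    """Return a (n_fft // 2 + 1, n_frames) power spectrogram using a Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])                                        # shape: (n_frames, n_fft)
    spectrum = np.fft.rfft(frames, axis=1)    # one FFT per frame
    return (np.abs(spectrum) ** 2).T          # frequency on rows, time on columns


sampling_rate = 16_000
t = np.arange(sampling_rate) / sampling_rate
tone = np.sin(2 * np.pi * 1_000 * t)          # synthetic 1 kHz tone

S = power_spectrogram(tone)
freqs = np.fft.rfftfreq(1024, d=1 / sampling_rate)   # centre frequency of each row
peak_hz = freqs[S.mean(axis=1).argmax()]
print(S.shape, peak_hz)
```

For a pure tone, almost all of the energy lands in the row whose frequency matches the tone, which is the bright horizontal band you would see in the plotted image.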

Audio ML Using Transformers

As introduced in the “Attention is All You Need” paper, the attention mechanism successfully solves language-related tasks because, as seen from a high level, the Attention head decides which part of a sequence deserves more attention than the rest when predicting the next token.
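As a rough illustration of that idea, the following NumPy sketch implements scaled dot-product attention, the core operation described in the paper. It is a simplification: a real Transformer derives the queries, keys, and values from learned projections and uses several heads, whereas here the same random matrix is used for all three.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of every pair of positions
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))      # 5 sequence positions, 8 features each (made-up data)
output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape, weights.sum(axis=-1))
```

Each row of `weights` sums to 1 and says how much that position attends to every other position in the sequence.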

Now, audio is a very fitting example of sequence data. Audio is naturally a continuous signal, generated by vibrations in nature or, in the case of human speech and animal sounds, by speech organs. But computers can neither process nor store continuous data. All data is stored discretely.

The same is the case for audio. Only the values at certain time intervals are stored; these work well enough for listening to songs, watching movies, and communicating with each other over the phone or the internet.

And transformers, too, work on this data.
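The discretization described above can be sketched in a few lines. The example samples an idealized 440 Hz tone at 16,000 values per second and then rounds each value to a 16-bit integer, which is a common storage format. The tone, the 10 ms length, and the variable names are assumptions for illustration.

```python
import numpy as np

sampling_rate = 16_000                 # samples stored per second
duration_ms = 10                       # hypothetical clip length
n_samples = sampling_rate * duration_ms // 1000

# Sampling: keep the signal's value only at evenly spaced instants.
t = np.arange(n_samples) / sampling_rate
samples = np.sin(2 * np.pi * 440 * t)

# Quantization: round each value to one of the 16-bit integer levels.
pcm16 = np.round(samples * 32767).astype(np.int16)
print(n_samples, pcm16[:5])
```

A higher sampling rate keeps more instants per second, and more bits per sample keep finer amplitude detail; both increase storage.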

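Just as in NLP (Natural Language Processing), we can use different Transformer architectures for different needs. Speech-to-text models are commonly either Encoder-Decoder models or Encoder-only models with a CTC (Connectionist Temporal Classification) head; the wav2vec2 model we use later belongs to the second group.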


Training Data from Huggingface Hub

As mentioned, we will work with the Huggingface libraries for each step of the process. You can navigate to the Huggingface Dataset Hub to check out audio datasets. The dataset that we will work with here is the MINDS dataset. It is a dataset of speech data from speakers of different languages, and all of the examples in the dataset are fully annotated.

Let’s load the dataset and explore it a little bit.

First, install the Huggingface datasets library.

pip install "datasets[audio]"

Adding [audio] to pip install ensures that we download the datasets library with the added support of audio-related functionalities.

Then we explore the MINDS dataset. I highly advise you to go through the Huggingface page of the dataset and read the dataset card.


On the Huggingface dataset page, you can see the dataset has very relevant information such as tasks, available languages, and licenses to use the dataset.

Now we will load the data and learn more about it.

from datasets import load_dataset, Audio

minds = load_dataset("PolyAI/minds14", name="en-AU",
                     split="train")

minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

Note how the dataset is loaded. The dataset name goes first; we are interested only in Australian-accented English, and only in the training split.

Before feeding the data into a training or inference task, we want all our audio data to have the same sampling rate. That is done by casting the audio column with the `Audio` feature in the code above.
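To give a feel for what changing the sampling rate means, here is a deliberately simple NumPy sketch that resamples a clip by linear interpolation. This is not how the datasets library does it internally; production resamplers use proper band-limited filtering. The function name `resample_linear` and the 8 kHz input clip are assumptions for illustration.

```python
import numpy as np


def resample_linear(array, orig_sr, target_sr):
    """Naive resampling: interpolate the signal at the new sample instants."""
    n_target = int(round(len(array) * target_sr / orig_sr))
    old_times = np.arange(len(array)) / orig_sr      # instants of the stored samples
    new_times = np.arange(n_target) / target_sr      # instants we want instead
    return np.interp(new_times, old_times, array)


orig_sr, target_sr = 8_000, 16_000
t = np.arange(orig_sr) / orig_sr
clip_8k = np.sin(2 * np.pi * 220 * t)                # hypothetical one-second clip at 8 kHz

clip_16k = resample_linear(clip_8k, orig_sr, target_sr)
print(len(clip_8k), len(clip_16k))
```

The clip keeps its duration; only the number of stored values per second changes, which is why a model trained on 16 kHz audio should be given 16 kHz input.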

We can look into individual examples, like so:

example = minds[0]


{'path': '/root/.cache/huggingface/datasets/downloads/extracted/a19fbc5032eacf25eab0097832db7b7f022b42104fbad6bd5765527704a428b9/en-AU~PAY_BILL/response_4.wav',
'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/a19fbc5032eacf25eab0097832db7b7f022b42104fbad6bd5765527704a428b9/en-AU~PAY_BILL/response_4.wav',
'array': array([2.36119668e-05, 1.92324660e-04, 2.19284790e-04, ...,
9.40907281e-04, 1.16613181e-03, 7.20883254e-04]),
'sampling_rate': 16000},
'transcription': 'I would like to pay my electricity bill using my card can you please assist',
'english_transcription': 'I would like to pay my electricity bill using my card can you please assist',
'intent_class': 13,
'lang_id': 2}

It is very simple to understand. It is a nested Python dictionary. The path and the sampling rate are both stored. Look at the transcription key in the dictionary: it contains the label when we are interested in Automatic Speech Recognition. ["audio"]["array"] contains the audio data that we will use to train or infer.
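The access pattern can be tried without downloading anything. The sketch below builds a stand-in dictionary with the same keys as the example above; the silent two-second array and the shortened path are placeholders, not real data from the dataset.

```python
import numpy as np

# Stand-in with the same structure as one dataset example (values are placeholders).
example = {
    "path": "response_4.wav",
    "audio": {
        "path": "response_4.wav",
        "array": np.zeros(32_000, dtype=np.float32),   # placeholder waveform
        "sampling_rate": 16_000,
    },
    "transcription": "I would like to pay my electricity bill using my card can you please assist",
    "intent_class": 13,
    "lang_id": 2,
}

waveform = example["audio"]["array"]            # model input
sr = example["audio"]["sampling_rate"]
label = example["transcription"]                # target text for speech recognition
duration_s = len(waveform) / sr
print(duration_s, label)
```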

We can easily listen to any audio example that we want.

from IPython.display import Audio as aud

aud(example["audio"]["array"], rate=16_000)

You can listen to the audio in the Colab Notebook.

Now, we have a clear idea of how exactly the data looks and how it is structured. We can now move forward to getting inferences from a pretrained model for Automatic Speech Recognition.

Exploring the Huggingface Hub for Models

The Huggingface hub has many models that can be used for various tasks like text generation, summarization, sentiment analysis, image classification, and so on. We can sort the models in the hub based on the task we want. Our use case is speech-to-text, and we will explore models specifically designed for this task.

For this, you should navigate to the models page of the Huggingface Hub and then, on the left sidebar, click on your intended task. Here you can find models that you can use out of the box, or find a good candidate to fine-tune for your specific task.


In the above image, I have already selected Automatic Speech Recognition as the task, and I get all relevant models listed on the right.

Notice the different pretrained models. One architecture like wav2vec2 can have many models fine-tuned to particular datasets.

You need to do some searching, and keep in mind the resources (such as compute and memory) available to you for running or fine-tuning that model.

I think the wav2vec2-base-960h from Facebook will be apt for our task. Again, I encourage you to go to the model’s page and read the model card.
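As background on how a checkpoint like this turns audio into text: the wav2vec2 speech recognition checkpoints are trained with a CTC objective, where the model predicts one token per short audio frame and a decoding step collapses repeated tokens and removes a special blank token. The sketch below shows greedy CTC decoding on made-up frame predictions. The toy vocabulary, the blank id, and the function name are assumptions for illustration and do not reflect the model's real vocabulary.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated ids, then drop blanks, and join the remaining characters."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the id changes and is not the blank token.
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<blank>", 1: "H", 2: "E", 3: "L", 4: "O"}

# Made-up per-frame predictions. The blank between the two runs of 3 keeps both L's.
frame_ids = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
text = ctc_greedy_decode(frame_ids, vocab)
print(text)
```

When you use the Pipeline method, this decoding step is handled for you.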

Getting Inference with Pipeline Method

Huggingface has a very friendly API that can help with various transformers-related tasks. I suggest going through a Kaggle notebook I authored that gives you many examples of using the Pipeline method: A Gentle Introduction to Huggingface Pipeline.

Previously, we found the model we needed for our task, and now we will use it with the Pipeline method.

First, install the Huggingface transformers library.

pip install transformers

Then, import the pipeline function and select the task and model.

from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")

print(asr(example["audio"]["array"])) # example is one example from the dataset

The output is:


You can see that this matches very well with the annotation that we saw above.

This way, you can get inference out of any other example.
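If you want to put a number on how well a prediction matches the annotation, a common measure is the word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the prediction into the reference, divided by the number of reference words. The following is a minimal pure-Python sketch. The hypothesis string is made up for illustration and is not actual model output, and lowercasing both sides is an assumption about how you may want to normalize text.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "I would like to pay my electricity bill using my card can you please assist"
hypothesis = "i would like to pay my electricity bill using my car can you please assist"  # made up

wer = word_error_rate(reference, hypothesis)
print(round(wer, 3))
```

A WER of 0 means a perfect match; here one wrong word out of fifteen gives a WER of about 0.067.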


In this guide, I have covered the basics of audio data processing and exploration and the basics of audio Machine Learning. After a brief discussion of the Transformer architecture for audio machine learning, I showed you how to use audio datasets in the Huggingface hub and how to use pre-trained models using the Huggingface models hub.

You can use this workflow for many audio-related problems and solve them by leveraging transformer architectures.

Key Takeaways

  • Audio Machine Learning is concerned with solving real-world audio-related problems with Machine Learning techniques.
  • As audio data is stored as a sequence of numbers, it can be treated as a sequence-related problem and solved with the tooling we already have for solving other sequence-related problems.
  • As Transformers successfully solve sequence-related problems, we can use Transformer architectures to solve audio problems.
  • As speech data and audio data generally vary widely due to factors such as age, accent, and habits of speaking, it is usually better to use models fine-tuned on datasets that match your use case.
  • Huggingface offers many audio-related resources: datasets, trained models, and easy means of using and fine-tuning them.


1. Huggingface Audio ML course to learn more about Audio Machine Learning

2. Think DSP by Allen Downey to delve deeper into Digital Signal Processing

Frequently Asked Questions

Q1. What is Audio Machine Learning?

A. Audio Machine Learning is the field where Machine Learning techniques are used to solve problems related to audio data. Examples include: turning lights on and off in a smart home with keyword detection, asking voice assistants for a day’s weather with speech-to-text, etc.

Q2. How to collect audio data for Machine Learning?

A. Machine Learning usually requires a large amount of data. To collect data for Audio Machine Learning, one must first decide what problem to solve and then collect data related to it. For example, if you are creating a voice assistant named “Jarvis”, and want the phrase “Good day, Jarvis” to activate it, then you need to collect the phrase uttered by people from different regions, of different ages, and of different genders, and store the data with proper labels. In every audio task, labeling the data is very important.

Q3. What is audio classification in ML?

A. Audio classification is a Machine Learning task that aims to classify audio samples into a certain number of predetermined classes. For example, if an audio model is deployed in a bank, then audio classification can be used to classify incoming calls based on the intent of the customer to forward the call to the appropriate department- loans, savings accounts, cheques and drafts, mutual funds, etc.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 
