
Speech to Text Conversion- An application of NLP

This article was published as a part of the Data Science Blogathon


Speech is the most common means of communication, and the majority of the world's population relies on it to communicate with each other. A speech recognition system translates spoken language into text. There are various real-life examples of speech recognition systems. For instance, Apple Siri recognizes speech and transcribes it into text. A Speech-To-Text (STT) system takes a human speech utterance as input and produces a string of words as output. The objective of such a system is to extract, characterize and recognize the information in speech.


1.System Block Diagram

2.How does speech recognition work?

3.Converting an audio file into Text

4.How about converting to different audio languages?

5.Microphone speech to Text



System Block Diagram

[Figure: system block diagram]


1.Acoustic Model

A speech recognition engine uses the acoustic model to recognize speech. To create an acoustic model, we take audio recordings of speech together with their text transcriptions and use software to build statistical representations of the sounds that make up each word.

2.Language Model

A language model is a file that contains the probabilities of sequences of words. Language models are used for dictation applications, whereas grammars are used in desktop command and control or telephony interactive voice response (IVR) type applications.
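To make "probabilities of sequences of words" concrete, here is a minimal sketch of a bigram language model in plain Python. The tiny corpus and the function names are invented for illustration; a real language model is trained on a far larger text collection and usually applies smoothing, which this sketch leaves out.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(next_word | word) by counting adjacent word pairs."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {word: count / total for word, count in counter.items()}
    return model


def sequence_probability(model, sentence):
    """Multiply the bigram probabilities along the sentence (0.0 if a pair is unseen)."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob


# Toy corpus, made up for this example
corpus = ["recognize speech", "recognize speech", "wreck a nice beach"]
model = train_bigram_model(corpus)
print(sequence_probability(model, "recognize speech"))
print(sequence_probability(model, "recognize beach"))
```

A recognizer can use scores like these to prefer a word sequence that is common in the training text over an acoustically similar one that is rare.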

3.Speech Engine

A speech engine is the heart of the speech recognition system. It is the software that uses the acoustic model and the language model to convert spoken audio into text (speech-to-text, or STT). The reverse task, playing back text in a spoken voice, is commonly referred to as text-to-speech or TTS.

How does Speech recognition work?

[Figure: speech recognition process]

Hidden Markov Models (HMMs) and deep neural network models are used to convert the audio into text.

A Hidden Markov Model (HMM) is a statistical model that produces a sequence of symbols or quantities as output. HMMs are used in speech recognition because a speech signal can be treated as a piecewise stationary or short-time stationary signal: on a short time scale (e.g., 10 milliseconds), speech can be approximated as a stationary process.
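The short-time assumption is why a recognizer first cuts the signal into small frames. Below is a minimal sketch that splits a list of samples into 10 ms frames. The 16 kHz sample rate, the silent stand-in signal and the function name are assumptions made for this example; practical front ends typically use overlapping windows instead of the back-to-back frames shown here.

```python
def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split audio samples into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_length = int(sample_rate * frame_ms / 1000)
    if frame_length <= 0:
        raise ValueError("frame is shorter than one sample")
    # Drop the trailing partial frame so that every frame has the same length
    usable = len(samples) - len(samples) % frame_length
    return [samples[i:i + frame_length] for i in range(0, usable, frame_length)]


sample_rate = 16000            # assumed sample rate: 16 kHz
samples = [0.0] * sample_rate  # one second of silence as stand-in audio
frames = split_into_frames(samples, sample_rate)
print(len(frames), len(frames[0]))
```

Each frame can then be treated as roughly stationary and turned into one observation for the model.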

HMM codebook
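As a rough illustration of how an HMM turns a sequence of observations into a sequence of hidden states, here is a small sketch of Viterbi decoding, a standard way to find the most likely state path. The two states, the "low"/"high" energy observations and all probabilities are toy values invented for this example; they do not come from a trained speech model.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a sequence of observations."""
    # best[t][s] = (probability of the best path ending in state s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Toy model: is each frame silence or speech, given its energy level?
states = ("silence", "speech")
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
emit_p = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}
decoded = viterbi(["low", "high", "high", "low"], states, start_p, trans_p, emit_p)
print(decoded)
```

In a real recognizer the hidden states correspond to sub-word sound units and the observations are acoustic feature vectors, but the decoding idea is the same.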


In this blog, I demonstrate how to convert speech to text using Python. This is done with the help of the “SpeechRecognition” library and the “PyAudio” library. SpeechRecognition supports several speech recognition APIs; in this blog I used the Google speech recognition API.

Python Libraries

!pip install SpeechRecognition

Convert an audio file into text

The steps to convert an audio file into text are as follows:


  1. Import the speech recognition library.

  2. Initialize the recognizer class to recognize the speech. We are using Google speech recognition.

  3. Audio file formats supported by the speech recognition library include wav, AIFF, AIFF-C and FLAC. I used a ‘wav’ file in this example.

  4. Here we use an audio clip from the movie ‘Taken’, which says “I don’t know who you are. I don’t know what you want. If you’re looking for ransom, I can tell you I don’t have money.”

  5. By default, the Google recognizer reads English.


# import library
import speech_recognition as sr

# Initialize recognizer class (for recognizing the speech)
r = sr.Recognizer()

# Read the audio file as the source and store the audio in the audio_text variable
with sr.AudioFile('I-dont-know.wav') as source:
    audio_text = r.listen(source)

# recognize_google() raises RequestError if the API is unreachable and
# UnknownValueError if the speech cannot be understood, hence the exception handling
try:
    # using google speech recognition
    text = r.recognize_google(audio_text)
    print('Converting audio transcripts into text ...')
    print(text)
except (sr.UnknownValueError, sr.RequestError):
    print('Sorry.. run again...')
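Because only a few file formats are accepted (wav, AIFF, AIFF-C and FLAC), it can help to check the file extension before opening the file. The following is a minimal sketch; the helper name and the exact set of extensions are assumptions for this example, not part of the SpeechRecognition library.

```python
from pathlib import Path

# Extensions assumed to correspond to wav, AIFF, AIFF-C and FLAC files
SUPPORTED_EXTENSIONS = {".wav", ".aiff", ".aif", ".aifc", ".flac"}


def is_supported_audio(path):
    """Return True if the file extension looks like one of the supported formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_audio("I-dont-know.wav"))
print(is_supported_audio("clip.mp3"))

# Possible use before reading a file (hypothetical file name):
# if is_supported_audio("I-dont-know.wav"):
#     with sr.AudioFile("I-dont-know.wav") as source:
#         ...
```

A file in another format, such as mp3, would first have to be converted to one of the supported formats with a separate tool.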



How about converting to different audio languages?

English is one of the most common languages. But what if we want to convert speech in other languages, such as German or French? With this Speech-To-Text (STT) system, you can convert speech in many other languages to text. Let’s see how.

For example, if we want to read a French-language audio file, we need to add a language option to recognize_google(). The rest of the code remains the same.

#Adding french language option
text = r.recognize_google(audio_text, language = "fr-FR")



Again, the required language option is added to recognize_google() for language recognition. Here I am speaking in Tamil, an Indian language, and adding “ta-IN” as the language option.

# Adding "Tamil language"
print("Text: " + r.recognize_google(audio_text, language="ta-IN"))

I just said “how are you” in Tamil and it prints the text in Tamil accurately.
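If you switch languages often, the language tags can be kept in one place. Here is a minimal sketch, assuming a recognizer object that offers recognize_google(audio, language=...) as used in this blog. The dictionary and the function name are hypothetical helpers, and the dictionary covers only the tags mentioned in this article plus "en-US" for English.

```python
# Language tags from this article; extend the dictionary as needed
LANGUAGE_CODES = {"english": "en-US", "french": "fr-FR", "tamil": "ta-IN"}


def transcribe(recognizer, audio, language="english"):
    """Call recognize_google() with the tag that matches a language name."""
    try:
        code = LANGUAGE_CODES[language.lower()]
    except KeyError:
        raise ValueError(f"no language code configured for {language!r}") from None
    return recognizer.recognize_google(audio, language=code)


# Possible use with the objects from the examples in this blog:
# print("Text: " + transcribe(r, audio_text, "tamil"))
```

An unknown language name raises a ValueError before any request is sent to the API.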



Microphone speech into text

Microphones are used to take audio input from users. Many different libraries are available for converting microphone speech into text. Here we use PyAudio for this conversion.


  1. We need to install the PyAudio library, which is used to receive audio input and play audio output through the microphone and speaker. It lets us capture our voice through the microphone.

!pip install PyAudio

  2. We have to use the Microphone class instead of an audio file source. The remaining steps are the same.


# import library
import speech_recognition as sr

# Initialize recognizer class (for recognizing the speech)
r = sr.Recognizer()

# Read the microphone as the source and store the captured speech in audio_text
with sr.Microphone() as source:
    audio_text = r.listen(source)
    print("Time over, thanks")

# recognize_google() raises RequestError if the API is unreachable and
# UnknownValueError if the speech cannot be understood, hence the exception handling
try:
    # using google speech recognition
    print("Text: " + r.recognize_google(audio_text))
except (sr.UnknownValueError, sr.RequestError):
    print("Sorry, I did not get that")

I just said “How are you?”
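In a noisy room, microphone capture may work better if the recognizer first samples the background noise and if recording is capped at a fixed length. Below is a minimal sketch using the recognizer methods adjust_for_ambient_noise() and listen() with its phrase_time_limit option. The wrapper function, its name and the default values are assumptions for this example.

```python
def listen_and_transcribe(recognizer, source, language="en-US",
                          calibrate_seconds=1, phrase_time_limit=10):
    """Calibrate for background noise, record one phrase and return its transcript."""
    # Sample background noise first so the energy threshold fits the room
    recognizer.adjust_for_ambient_noise(source, duration=calibrate_seconds)
    # Stop recording after phrase_time_limit seconds even if speech continues
    audio = recognizer.listen(source, phrase_time_limit=phrase_time_limit)
    return recognizer.recognize_google(audio, language=language)


# Possible use with the objects from the microphone example:
# with sr.Microphone() as source:
#     print("Text: " + listen_and_transcribe(r, source))
```

The call can still raise the recognition errors handled in the example above, so it belongs inside the same kind of try/except block.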




Speech recognition is used in many areas, including:

  1. In-Car Systems

  2. Health Care

  3. Military

  4. Training air traffic controllers

  5. Telephony and other domains

  6. Usage in education and daily life


The Google speech recognition API is a straightforward way to convert speech into text, but it requires an internet connection to work. In this blog, we’ve seen how to convert speech into text using the Google speech recognition API. This can be very helpful for NLP projects, especially those that handle audio transcript data. If you have anything to add, please feel free to leave a comment! Thanks for reading. Continue learning and stay tuned for more!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
