This article was published as a part of the Data Science Blogathon.
In this article, we will take a closer look at how speech recognition really works. Now, when we say speech recognition, we’re really talking about ASR, or automatic speech recognition. With automatic speech recognition, the goal is to simply input any continuous audio speech and output the text equivalent. We want our ASR to be speaker-independent and have high accuracy. Such a system has long been a core goal of AI, and in the 1980s and 1990s, advances in probabilistic models began to make automatic speech recognition a reality.
Like many other AI problems we’ve seen, automatic speech recognition can be implemented by gathering a large pool of labeled data, training a model on that data, and then deploying the trained model to accurately label new data. The twist is that speech is structured in time and has a lot of variability.
We’ll identify the specific challenges we face when decoding spoken words and sentences into text. To understand how these challenges can be met, we’ll take a deeper dive into the sound signal itself as well as various speech models. The sound signal is our data. We’ll get into the signal analysis, phonetics, and how to extract features to represent speech data.
Models in speech recognition can conceptually be divided into an acoustic model and a language model. The acoustic model solves the problems of turning sound signals into some kind of phonetic representation. The language model houses the domain knowledge of words, grammar, and sentence structure for the language. These conceptual models can be implemented with probabilistic models using machine learning algorithms. Hidden Markov models have been refined with advances for automatic speech recognition over a few decades now, and are considered the traditional ASR solution. Meanwhile, the cutting edge of ASR today is end-to-end Deep Neural Network Models. We’ll talk about both.
Continuous speech recognition has had a rocky history. In the early 1970s, the United States funded automatic speech recognition research with a DARPA challenge. The goal was achieved a few years later by Carnegie-Mellon’s Harpy System. But the future prospects were disappointing and funding dried up. More recently computing power has made larger dimensions in neural network modeling a reality. So what makes speech recognition hard?
The first set of problems to solve relates to the audio signal itself, noise for instance. Cars going by, clocks ticking, other people talking, microphone static: our ASR has to know which parts of the audio signal matter and which parts to discard. Another factor is the variability of pitch and volume. One speaker sounds different from another even when saying the same word. In English, at least, pitch and loudness don’t change the ground truth of which word was spoken.
If I say hello at a different pitch, it’s still the same word with the same spelling. We could even think of these differences as another kind of noise that needs to be filtered out. Variability of word speed is another factor. Words spoken at different speeds need to be aligned and matched. If I say the word speech slowly or quickly, it’s still the same word with the same number of letters.
It is up to the ASR system to align the sequences of sound correctly. Word boundaries are another important factor. When we speak, words run into one another without a pause; we don’t naturally separate them. Humans understand speech anyway because we already know that the word boundaries should be in certain places. This brings us to another class of problems, those that are language or knowledge related.
We have domain knowledge of our language that allows us to automatically sort out ambiguities as we hear them, such as word groups that are reasonable in one context but not in another.
Also, spoken language is different from written language. There are hesitations, repetitions, fragments of sentences, and slips of the tongue, all of which a human listener is able to filter out. Imagine a computer that only knows language from audiobooks and newspapers read aloud. Such a system may have a hard time decoding unexpected sentence structures. Okay, we’ve identified lots of problems to solve here.
These include variability in pitch, volume, and speed, as well as ambiguity due to word boundaries, spelling, and context. I am going to introduce some ways to solve these problems with a number of models and technologies. I’ll start at the beginning with the voice itself.
When we speak, we create sinusoidal vibrations in the air. Higher pitches vibrate faster, with a higher frequency, than lower pitches. These vibrations can be detected by a microphone and transduced from the acoustical energy carried in the sound wave to electrical energy, which is recorded as an audio signal. The amplitude of the audio signal tells us how much acoustical energy is in the sound, that is, how loud it is. Our speech is made up of many frequencies at the same time. The actual signal is really a sum of all those frequencies stuck together. To properly analyze the signal, we would like to use the component frequencies as features. We can use a Fourier transform to break the signal into these components. The Fast Fourier Transform (FFT) algorithm is widely available for this task.
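As a rough illustration of this decomposition, the sketch below builds a toy signal from two sine waves and recovers their frequencies with a plain discrete Fourier transform written from the standard library. The sample rate, the two frequencies, and the function name `dft_magnitudes` are assumptions chosen for the example; in practice a library FFT routine would be used instead of this slow direct sum.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Return the magnitude of each DFT bin from 0 Hz up to the Nyquist bin."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        # Correlate the signal with a complex sinusoid at bin k.
        total = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(total))
    return magnitudes


sample_rate = 64  # samples per second (toy value, assumed for the example)
n = 64            # one second of "audio"

# A 5 Hz tone at amplitude 1.0 summed with a 12 Hz tone at amplitude 0.5.
signal = [
    math.sin(2 * math.pi * 5 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 12 * t / sample_rate)
    for t in range(n)
]

mags = dft_magnitudes(signal)

# Bin k corresponds to k * sample_rate / n Hz (1 Hz per bin here).
strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
peak_freqs = sorted(k * sample_rate / n for k in strongest)
print(peak_freqs)
```

The two strongest bins line up with the two tones that were mixed together, which is the sense in which the component frequencies can serve as features.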
We can use this splitting technique to convert the sound to a Spectrogram. To create a Spectrogram first, divide the signal into time frames. Then split each frame signal into frequency components with an FFT. Each time frame is now represented with a vector of amplitudes at each frequency. If we line up the vectors again in their time-series order, we can have a visual picture of the sound components, the Spectrogram.
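The framing procedure just described can be sketched in a few lines. This is a simplified version under stated assumptions: frames do not overlap and no window function is applied, whereas real pipelines usually use short overlapping, windowed frames. The function names and the toy signal (a 4 Hz tone followed by a 10 Hz tone) are made up for the example.

```python
import cmath
import math


def frame_spectrum(frame):
    """Magnitude spectrum of one frame via a direct DFT (a stand-in for an FFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(samples, frame_len):
    """Split the signal into non-overlapping frames and transform each one."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [frame_spectrum(samples[s:s + frame_len]) for s in starts]


sample_rate = 64  # toy value, assumed for the example

# First second: a 4 Hz tone. Second second: a 10 Hz tone.
signal = [math.sin(2 * math.pi * 4 * t / sample_rate) for t in range(64)]
signal += [math.sin(2 * math.pi * 10 * t / sample_rate) for t in range(64, 128)]

spec = spectrogram(signal, frame_len=32)

# Each row is one time frame; each column is one frequency bin (2 Hz wide here).
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
peak_hz = [b * sample_rate / 32 for b in peak_bins]
print(peak_hz)
```

Reading the rows in order shows the dominant frequency changing over time, which is exactly the information a spectrogram image displays.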
The Spectrogram can be lined up with the original audio signal in time. With the Spectrogram, we have a complete representation of our sound data. But we still have noise and variability embedded in the data. In addition, there may be more information here than we really need. Next, we’ll look at Feature Extraction techniques to reduce both the noise and the dimensionality of our data.
One human creates words and another human hears them. Our speech is constrained by both our voice-making mechanisms and what we can perceive with our ears. Let’s start with the ear and the pitches we can hear.
The Mel Scale was developed in 1937 and tells us what pitches human listeners can truly discern. It turns out that some frequencies sound the same to us but we hear differences in lower frequencies more distinctly than in higher frequencies. If we can’t hear a pitch, there is no need to include it in our data, and if our ear can’t distinguish two different frequencies, then they might as well be considered the same for our purposes.
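To make the idea of perceptual spacing concrete, here is a small sketch that converts between hertz and mels and then lays out band edges that are evenly spaced on the mel scale. The conversion formula used is one widely used approximation and does not come from this article; the 0 to 8000 Hz range and the choice of 10 bands are likewise assumptions for the example.

```python
import math


def hz_to_mel(hz):
    """Convert hertz to mels using a common logarithmic approximation."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_band_edges(low_hz, high_hz, n_bands):
    """Band edges in Hz that are evenly spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / n_bands
    return [mel_to_hz(low_mel + i * step) for i in range(n_bands + 1)]


edges = mel_band_edges(0.0, 8000.0, 10)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

for lo, hi in zip(edges, edges[1:]):
    print(f"{lo:8.1f} Hz to {hi:8.1f} Hz")
```

The printed bands are narrow at low frequencies and grow wider toward the top of the range, mirroring the observation that we distinguish low pitches more finely than high ones.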
For the purposes of feature extraction, we can put the frequencies of the spectrogram into bins that are relevant to our own ears and filter out the sound that we can’t hear. This reduces the number of frequencies we’re looking at by quite a bit. That’s not the end of the story though. We also need to separate the elements of sound that are speaker-independent. For this, we focus on the voice-making mechanism we use to create speech. Human voices vary from person to person even though our basic anatomical features are the same. We can think of a human voice production model as a combination of source and filter, where the source is unique to an individual and the filter is the articulation of words that we all use when speaking.
Cepstral analysis relies on this model for separating the two. The main thing to remember is that we’re dropping the component of speech unique to individual vocal cords and preserving the shape of the sound made by the vocal tract. Cepstral analysis combined with mel frequency analysis gets you 12 or 13 MFCC features related to speech. Delta and Delta-Delta MFCC features can optionally be appended to the feature set. This will double or triple the number of features but has been shown to give better results in ASR. The takeaway for using MFCC feature extraction is that we greatly reduce the dimensionality of our data and at the same time squeeze noise out of the system. Next, we’ll look at the sound from a language perspective, the phonetics of the words we hear.
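The effect of appending Delta and Delta-Delta features on the feature count can be sketched as follows. This is a deliberately simple version that takes half the difference between the next and previous frame; production toolkits typically compute deltas with a regression over a wider window. The MFCC values here are fabricated placeholders, and `delta` and `add_deltas` are hypothetical helper names.

```python
def delta(frames):
    """Rate of change of each coefficient across time.

    Uses half the difference between the following and preceding frame.
    The first and last frames are repeated as padding at the edges.
    """
    padded = [frames[0]] + list(frames) + [frames[-1]]
    return [
        [(nxt - prv) / 2.0 for prv, nxt in zip(padded[t], padded[t + 2])]
        for t in range(len(frames))
    ]


def add_deltas(frames):
    """Append delta and delta-delta coefficients to every frame."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Placeholder "MFCCs": 5 frames of 13 coefficients that grow linearly in time.
mfcc = [[float(t * (c + 1)) for c in range(13)] for t in range(5)]

features = add_deltas(mfcc)
print(len(mfcc[0]), "->", len(features[0]))
```

Starting from 13 coefficients per frame, the combined vector has three times as many entries, which is the tripling mentioned above.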
Phonetics is the study of sound in human speech. Linguistic analysis of languages around the world is used to break down human words into their smallest sound segments. In any given language, some number of phonemes define the distinct sounds in that language. In US English, there are generally 39 to 44 phonemes to find. A grapheme, in contrast, is the smallest distinct unit that can be written in a language. In US English, the smallest grapheme set we can define is the set of 26 letters in the alphabet plus a space. Unfortunately, we can’t simply map phonemes to graphemes or individual letters, because some letters map to multiple phoneme sounds and some phonemes map to more than one letter combination.
For example, in English, the letter C sounds different in cat, chat, and circle. Meanwhile, the long E phoneme we hear in receive and beat is represented by different letter combinations. One US English phoneme set is Arpabet. Arpabet was developed in 1971 for speech recognition research and contains thirty-nine phonemes: 15 vowel sounds and 24 consonants, each represented as a one or two-letter symbol.
Phonemes are often a useful intermediary between speech and text. If we can successfully produce an acoustic model that decodes a sound signal into phonemes the remaining task would be to map those phonemes to their matching words. This step is called Lexical Decoding and is based on a lexicon or dictionary of the data set. Why not just use our acoustic model to translate directly into words?
That’s a good question, and there are systems that do translate features directly to words. This is a design choice and depends on the dimensionality of the problem. If we want to train on a limited vocabulary of words, we might just skip the phonemes. But if we have a large vocabulary, converting to smaller units first reduces the number of comparisons that need to be made in the system overall.
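The lexical decoding step can be pictured as a dictionary lookup from phoneme sequences to words. The sketch below uses a tiny hand-written lexicon with Arpabet-style symbols and a greedy longest-match strategy. Both the lexicon entries and the greedy strategy are assumptions made for illustration; a real decoder searches over many competing hypotheses with probabilities rather than committing to the first match.

```python
# Toy pronunciation lexicon: phoneme tuple -> word (hand-written for the example).
LEXICON = {
    ("K", "AE", "T"): "cat",
    ("CH", "AE", "T"): "chat",
    ("B", "IY", "T"): "beat",
    ("R", "IH", "S", "IY", "V"): "receive",
}


def lexical_decode(phonemes, lexicon):
    """Segment a phoneme sequence into words by greedy longest match."""
    longest = max(len(key) for key in lexicon)
    words = []
    i = 0
    while i < len(phonemes):
        # Try the longest candidate first, then shrink.
        for length in range(min(longest, len(phonemes) - i), 0, -1):
            key = tuple(phonemes[i:i + length])
            if key in lexicon:
                words.append(lexicon[key])
                i += length
                break
        else:
            raise ValueError(f"no lexicon entry starting at position {i}")
    return words


decoded = lexical_decode(["CH", "AE", "T", "B", "IY", "T"], LEXICON)
print(decoded)
```

Note how the letter C in cat and chat corresponds to different phoneme symbols (K and CH), so the lookup has to work on sounds rather than spelling.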
We’ve learned a lot about speech audio. We’ve introduced signal analysis and feature extraction techniques to create data representations for that speech audio. Now, we need a lot of examples of audio, matched with text, the labels, that we can use to create our dataset. If we have those labeled examples, say a string of words matched with an audio snippet, we can turn the audio into spectrograms or MFCC representations for training a probabilistic model.
Fortunately for us, ASR is a problem that a lot of people have worked on. That means there is labeled audio data available to us and there are lots of tools out there for converting sound into various representations.
One popular benchmark data source for automatic speech recognition training and testing is the TIMIT Acoustic-Phonetic Corpus. This data was developed specifically for speech research in 1993 and contains 630 speakers voicing 10 phoneme-rich sentences each, sentences like, ‘George seldom watches daytime movies.’ Two popular large vocabulary data sources are the LDC Wall Street Journal Corpus, which contains 73 hours of newspaper reading, and the freely available LibriSpeech Corpus, with 1000 hours of readings from public domain books. Tools for converting these various audio files into spectrograms and other feature sets are available in a number of software libraries.
With feature extraction, we’ve addressed noise problems due to environmental factors as well as the variability of speakers. Phonetics gives us a representation of sounds and language that we can map to. That mapping, from the sound representation to the phonetic representation, is the task of our acoustic model. We still haven’t solved the problem of matching variable lengths of the same word. One technique for this is Dynamic Time Warping (DTW), which calculates the similarity between two signals even if their time lengths differ. This can be used in speech recognition, for instance, to align the sequence data of a new word to its most similar counterpart in a dictionary of word examples.
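Here is a minimal sketch of DTW using the classic dynamic-programming recurrence. To keep it short, each "signal" is a list of single numbers rather than a sequence of feature vectors, and the word templates are invented for the example, so treat it as an illustration of the alignment idea rather than a working recognizer.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays (stretch b)
                cost[i][j - 1],      # b advances, a stays (stretch a)
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical one-dimensional "templates" for two words.
templates = {
    "hello": [1, 3, 4, 3, 1],
    "bye": [4, 4, 1, 1, 0],
}

# The same contour as "hello", spoken at half speed.
slow_hello = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]

distances = {word: dtw_distance(slow_hello, seq) for word, seq in templates.items()}
best_match = min(distances, key=distances.get)
print(best_match, distances)
```

Even though the query is twice as long as the template, the warping path lets each template value absorb several query values, so the slow utterance still matches its own word best.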
As we’ll soon see, hidden Markov models are well-suited for solving this type of time series pattern sequencing within an acoustic model, as well. This characteristic explains their popularity in speech recognition solutions for the past 30 years. If we choose to use deep neural networks for our acoustic model, the sequencing problem reappears. We can address the problem with a hybrid HMM/DNN system, or we can solve it another way.
Later, we’ll talk about how we can solve the problem in DNNs with connectionist temporal classification or CTC. First, though, we’ll review HMMs and how they’re used in speech recognition.
We learned the basics of hidden Markov models. HMMs are useful for detecting patterns through time, which is exactly what we are trying to do with an acoustic model. HMMs can solve the challenge of time variability that we identified earlier, as in my earlier example of the same word, speech, spoken at different speeds. We could train an HMM with labeled time-series sequences to create individual HMM models for each particular sound unit. The units could be phonemes, syllables, words, or even groups of words. Training and recognition are fairly straightforward if our training and test data are isolated units.
We have many examples, we train on them, and we get a model for each word. Then recognition of a single word comes down to scoring the likelihood of the new observation under each model. It gets more complicated when our training data consists of continuous phrases or sentences, which we’ll refer to as utterances. How can the series of phonemes or words be separated in training?
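For the isolated-word case, the scoring step can be sketched with the forward algorithm. The two word models below are tiny hand-built HMMs over a made-up alphabet of two quantized acoustic symbols (0 and 1), so every probability in them is an assumption for illustration. Real systems train these parameters from data, use continuous features, and work in log space to avoid underflow.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(observation sequence | HMM), computed with the forward algorithm."""
    n_states = len(start)
    # Initialise with the first observation.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    # Propagate through the remaining observations.
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


# Two left-to-right, two-state word models (all numbers are hypothetical).
left_to_right = [[0.6, 0.4], [0.0, 1.0]]
models = {
    # "yes" tends to emit symbol 0 first, then symbol 1.
    "yes": {"start": [1.0, 0.0], "trans": left_to_right,
            "emit": [[0.9, 0.1], [0.2, 0.8]]},
    # "no" tends to emit symbol 1 first, then symbol 0.
    "no": {"start": [1.0, 0.0], "trans": left_to_right,
           "emit": [[0.1, 0.9], [0.8, 0.2]]},
}

observation = [0, 0, 1, 1]

scores = {
    word: forward_likelihood(observation, m["start"], m["trans"], m["emit"])
    for word, m in models.items()
}
recognized = max(scores, key=scores.get)
print(recognized, scores)
```

The self-loop probabilities are what let the same model absorb a word spoken faster or slower, since a state can emit one observation or many before moving on.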
In this example, we have the word brick, connected continuously in nine different utterance combinations. To train from continuous utterances, HMMs can be tied together as pairs. We define these connectors as HMMs. In this case, we would train her brick, my brick, a brick, brick house, brick walkway, and brick wall, by tying the connecting states together. This will increase dimensionality. Not only will we need an HMM for each word, but we also need one for each possible word connection, which could be a lot if there are a lot of words.
The same principle applies if we use phonemes. But for large vocabularies, the dimensionality increase isn’t as profound as with words. With a set of 40 phonemes, we need 1600 HMMs to account for the transitions. Still a manageable number. Once trained, the HMM models can be used to score new utterances through chains of probable paths.
So far, we have tools for addressing noise and speech variability through our feature extraction. We have HMM models that can convert those features into phonemes and address the sequencing problems for our full acoustic model. We haven’t yet solved the problems in language ambiguity though. With automatic speech recognition, the goal is to simply input any continuous audio speech and output the text equivalent. The system can’t tell from the acoustic model which combinations of words are most reasonable.
That requires knowledge. We either need to provide that knowledge to the model or give it a mechanism to learn this contextual information on its own. We’ll talk about possible solutions to these problems, next.
The job of the Language Model is to inject language knowledge into the words-to-text step in speech recognition, providing another layer of processing between words and text to resolve ambiguities in spelling and context. For example, since an Acoustic Model is based on sound, it can’t distinguish the correct spelling for words that sound the same, such as hear and here. Other sequences may not make sense but could be corrected with a little more information.
The words produced by the Acoustic Model are not absolute choices. They can be thought of as a probability distribution over many different words. Each possible sequence can be calculated as the likelihood that the particular word sequence could have been produced by the audio signal. A statistical language model provides a probability distribution over sequences of words.
If we have both of these, the Acoustic Model and the Language Model, then the most likely sequence would be a combination of all these possibilities with the greatest likelihood score. If all possibilities in both models were scored, this could be a very large dimension of computations.
We can get a good estimate though by only looking at some limited depth of choices. It turns out that in practice, the words we speak at any time are primarily dependent upon only the previous three to four words. N-grams are probabilities of single words, ordered pairs, triples, etc. With N-grams we can approximate the sequence probability with the chain rule.
The probability that the first word occurs is multiplied by the probability of the second given the first and so on to get probabilities of a given sequence. We can then score these probabilities along with the probabilities from the Acoustic Model to remove language ambiguities from the sequence options and provide a better estimate of the utterance in text.
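A small bigram model shows the chain-rule approximation at work and how it can break a tie between homophones. The training sentences, the add-one smoothing, and the acoustic scores are all assumptions invented for this sketch; the acoustic scores are set equal on purpose, since words that sound alike give the acoustic model nothing to choose between.

```python
import math
from collections import Counter

# Tiny made-up training corpus.
corpus = ["i hear you", "i hear music", "come over here", "i am here"]

bigram_counts = Counter()
context_counts = Counter()
vocab = set()
for sentence in corpus:
    words = ["<s>"] + sentence.split()  # <s> marks the sentence start
    vocab.update(words[1:])
    for prev, word in zip(words, words[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1


def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing so unseen pairs are not zero."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


def sentence_log_prob(sentence):
    """Chain rule under the bigram assumption: sum of log P(w_i | w_{i-1})."""
    words = ["<s>"] + sentence.split()
    return sum(math.log(bigram_prob(p, w)) for p, w in zip(words, words[1:]))


# Hypothetical acoustic log-likelihoods: identical, because the words sound the same.
acoustic_log_prob = {"i hear you": -4.0, "i here you": -4.0}

combined = {
    text: score + sentence_log_prob(text)
    for text, score in acoustic_log_prob.items()
}
best = max(combined, key=combined.get)
print(best)
```

Adding log probabilities is the same as multiplying the probabilities themselves, so the combined score favors the word sequence that is both acoustically plausible and likely as language.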
The previous discussion identified the problems of speech recognition and provided a traditional ASR solution using feature extraction, HMMs, and language models. These systems have gotten better and better since they were introduced in the 1980s.
As computers become more powerful and data more available, deep neural networks have become the go-to solution for all kinds of large probabilistic problems, including speech recognition. In particular, recurrent neural networks (RNNs) can be leveraged, because these types of networks have temporal memory, an important characteristic for training and decoding speech. This is a hot topic and an area of active research.
The information that follows is primarily based on recent research presentations. The tech is bleeding edge, and changing rapidly but we’re going to jump right in. Here we go.
If HMMs work, why do we need a new model? It comes down to potential. Suppose we have all the data we need and all the processing power we want. How far can an HMM model take us, and how far could some other model take us?
According to Baidu’s Adam Coates in a recent presentation, additional training of a traditional ASR system levels off in accuracy. Meanwhile, deep neural network solutions are unimpressive with small data sets, but they shine as we increase data and model sizes. Here’s the process we’ve looked at so far. We extract features from the audio speech signal with MFCC. We use an HMM acoustic model to convert to sound units, phonemes, or words. Then we use statistical language models such as N-grams to straighten out language ambiguities and create the final text sequence. It’s possible to replace these many tuned parts with a multiple-layer deep neural network. Let’s get a little intuition as to why they can be replaced.
In feature extraction, we’ve used models based on human sound production and perception to convert a spectrogram into features. This is similar, intuitively, to the idea of using Convolutional Neural Networks (CNNs) to extract features from image data. Spectrograms are visual representations of speech, so we ought to be able to let a CNN find the relevant features for speech in the same way. An acoustic model implemented with HMMs includes transition probabilities to organize time-series data. Recurrent Neural Networks can also track time-series data, through their memory.
The traditional model also uses HMMs to sequence sound units into words. The RNNs produce probability densities over each time slice, so we need a way to solve the sequencing issue. A Connectionist Temporal Classification (CTC) layer is used to convert the RNN outputs into words. So, we can replace the acoustic portion of the network with a combination of RNN and CTC layers. The end-to-end DNN still makes linguistic errors, especially on words that it hasn’t seen in enough examples. Existing N-gram models can still be used to correct these. Alternatively, a Neural Network Language Model (NLM) can be trained on massive amounts of available text. Using an NLM layer, the probabilities of spelling and context can be re-scored for the system.
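To give a feel for how CTC turns per-time-slice outputs into a label sequence, here is a sketch of the simplest decoding rule: take the most probable symbol in each slice, merge consecutive repeats, and drop the special blank symbol. The per-frame probabilities are invented for the example, and this greedy best-path rule is a simplification; full CTC training sums over all alignments, and practical decoders often use a beam search.

```python
BLANK = "_"  # special CTC symbol meaning "no label at this time slice"


def ctc_greedy_decode(frame_probs, alphabet):
    """Best-path CTC decoding: argmax per frame, merge repeats, remove blanks."""
    best_path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    decoded = []
    previous = None
    for symbol in best_path:
        # Keep a symbol only when it differs from the previous frame's symbol
        # and is not the blank.
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


alphabet = [BLANK, "a", "c", "t"]

# Hypothetical network outputs: one probability distribution per time slice.
frame_probs = [
    [0.10, 0.10, 0.70, 0.10],  # c
    [0.10, 0.10, 0.70, 0.10],  # c (repeat, merged)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.05, 0.05],  # a
    [0.10, 0.80, 0.05, 0.05],  # a (repeat, merged)
    [0.10, 0.10, 0.10, 0.70],  # t
]

text = ctc_greedy_decode(frame_probs, alphabet)
print(text)
```

Because repeats are merged, a sound held across many time slices still yields one label, which is how CTC handles the variable-speed problem without explicit alignment. The blank is what allows genuine double letters: a blank between two identical symbols keeps them from being merged.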
We’ve covered a lot of ground. We started with signal analysis taking apart the sound characteristics of the signal, and extracting only the features we required to decode the sounds and the words. We learned how the features could be mapped to sound representations of phonemes with HMM models, and how language models increase accuracy when decoding words and sentences.
Finally, we shifted our paradigm and looked into the future of speech recognition, where we may not need feature extraction or separate language models at all. I hope you’ve enjoyed learning this subject as much as I’ve enjoyed writing it 😃
With this, we have come to the end of this article. Thanks for reading this and following along. Hope you loved it! Bundle of thanks for reading it!
The media shown in this article are not owned by Analytics Vidhya and at the Author’s discretion.