Drishti Sharma — Published On September 15, 2022 and Last Modified On September 19th, 2022
This article was published as a part of the Data Science Blogathon.

Introduction

Contextual ASR, which takes a list of biasing terms as input along with the audio, has gained popularity as virtual assistant (VA) devices with speech interfaces have become increasingly common. Many of us have had experiences where a virtual assistant failed to interpret a spoken word correctly in context. For example, if the word “Apple” in the sentence “I like the interface provided by Apple” is interpreted by the VA as the fruit, that interpretation is incorrect given the context (interface) in this instance. Scenarios like this highlight the importance of devising robust ASR solutions that can decipher spoken audio correctly based on context.

In light of this, we will look at one such technique for contextual biasing in this article, which entails adding a contextual spelling correction model on top of the end-to-end ASR system. This method was proposed by Xiaoqiang Wang et al.

To begin, we will go over the highlights of this article, followed by a detailed discussion of the proposed methods in the following sections. Now, let’s get started!

Highlights

  • To make intelligent and competitive contextual ASR systems, a method for contextual biasing that involves adding a contextual spelling correction model on top of the end-to-end ASR system is proposed.

  • The proposed model includes the autoregressive contextual spelling correction (CSC) model and the non-autoregressive Fast Contextual Spelling Correction (FCSC) model.

  • Effective filtering algorithms for large-size context lists and performance balancing mechanisms to control the biasing degree of the model were also put forth.

  • Experiments suggest that the proposed method reduces the ASR system’s relative word error rate (WER) by up to 51% and outperforms traditional biasing methods. The FCSC (NAR) model decreases the model’s size by 43.2% and accelerates inference by 2.1 times compared to the CSC (AR) solution.

What is Contextual Biasing?

Contextual biasing is a crucial and challenging task for end-to-end automatic speech recognition (ASR) systems, which aims to achieve better recognition performance by biasing the ASR system with a user-defined list of important words and phrases, including person names, music lists, proper nouns, etc. that are submitted along with the audio to be transcribed. 

Contextual ASR is most often studied for virtual assistant devices with speech interfaces, given that these systems must recognize the names of a user’s contacts when that user wants to dial the phone number or the names of artists in a user’s music library when that user wants to listen to a certain song.

Traditionally, there are two main methods for adding contextual knowledge to E2E ASR systems. First, an external contextual language model (LM) can be added to the E2E decoding framework to bias the recognition results towards a context phrase list; typically, this is done using shallow fusion with a contextual finite state transducer (FST). Second, a context encoder can be built into the E2E ASR model to incorporate contextual information directly. However, this second method changes the source ASR model and has scalability issues with a large biasing phrase list.
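To make the first traditional method concrete, here is a minimal, hypothetical sketch of shallow-fusion-style biasing: a contextual bonus (a toy stand-in for the contextual FST score) is added to each beam hypothesis’s ASR log-probability. The names, probabilities, and weight `lam` are all illustrative assumptions, not values from the paper.

```python
import math

def shallow_fusion_score(asr_log_prob, hypothesis, context_phrases, lam=0.5):
    """Combine the ASR log-probability with a contextual bonus.

    Each context phrase found in the hypothesis adds lam to the score,
    a toy stand-in for the contextual FST used in shallow fusion.
    """
    bonus = sum(lam for phrase in context_phrases if phrase in hypothesis)
    return asr_log_prob + bonus

# Two competing beam hypotheses with their (made-up) ASR log-probabilities.
context = ["Anaya", "Kaito"]
hyps = {"call anna": math.log(0.6), "call Anaya": math.log(0.4)}
best = max(hyps, key=lambda h: shallow_fusion_score(hyps[h], h, context))
# The biased score lets "call Anaya" overtake the acoustically likelier "call anna".
```

Real systems apply this bonus incrementally during beam search via an FST over subword prefixes; the post-hoc rescoring above only illustrates the score combination.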

In contrast to these traditional methods, a new method was proposed to do contextual biasing on the ASR output with a contextual spelling correction model, which we will discuss in the following section.

Model Architectures

To have reliable ASR output, contextual biasing is done on the ASR output using a contextual spelling correction model. The proposed model includes two mechanisms: autoregressive (AR) and non-autoregressive (NAR).

1. Autoregressive contextual spelling correction (CSC) model:

  • As depicted in Figure 1, the autoregressive contextual spelling correction model, also known as Contextual Spelling Correction (CSC), is a seq2seq model with a text encoder, a context encoder, and a decoder, which uses the ASR hypothesis as the text encoder input and the context phrase list as the context encoder input.

  • The context encoder encodes each context phrase as hidden states, and the context embedding generator then averages these hidden states to generate context embedding.

  • The decoder uses the output of the previous step as input auto-regressively and subsequently attends to the outputs of both the encoders. These attentions are then added to generate the final attention, from which the decoder obtains information from the ASR hypothesis and context phrase embeddings to correct contextual misspelling errors.


    Fig. 1. Autoregressive contextual spelling correction (CSC) model (Source: Arxiv)

  • Since both the text encoder and the context encoder use transcriptions as inputs, it makes sense that their parameters would be shared, reducing the model size and facilitating context encoder training.

  • The final loss is the cross entropy of the ground truth label and output probabilities.
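The CSC architecture described above can be illustrated with a toy NumPy sketch: a shared “encoder” processes both the hypothesis and the context phrases, each context embedding is the average of that phrase’s hidden states, and one decoder step adds the attentions over the two encoders. The dimensions, random weights, and linear encoder are made-up stand-ins for the paper’s transformer layers, shown only to clarify the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                 # toy hidden size
W_shared = rng.normal(size=(d, d))    # parameters shared by both encoders

def encode(token_ids):
    """Toy shared encoder: one-hot 'embedding' followed by a linear layer,
    standing in for the transformer text/context encoder."""
    emb = np.eye(d)[np.array(token_ids) % d]
    return emb @ W_shared             # (seq_len, d)

def context_embedding(phrase_token_ids):
    """Context embedding generator: average the encoder hidden states."""
    return encode(phrase_token_ids).mean(axis=0)   # (d,)

def attend(query, keys):
    """Scaled dot-product attention returning one context vector."""
    scores = keys @ query / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys

# ASR hypothesis tokens and two biasing phrases (arbitrary toy ids).
text_states = encode([3, 1, 4, 1])                      # text encoder output
ctx_embs = np.stack([context_embedding([2, 7]),
                     context_embedding([5])])           # context encoder output

# One decoder step: attend to both encoders and add the two attentions.
query = rng.normal(size=d)                              # decoder hidden state
fused = attend(query, text_states) + attend(query, ctx_embs)
```

The parameter sharing appears here as the single `W_shared` used by both `encode` calls, mirroring the paper’s observation that both encoders consume transcriptions.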

2. Non-autoregressive contextual spelling correction (FCSC) model:

  • As shown in Figure 2, similar to the autoregressive CSC model, the proposed NAR model, denoted as Fast Contextual Spelling Correction (FCSC), consists of a text encoder, a context encoder, and a decoder, where the text encoder uses the ASR decoding results as input while the context encoder takes the biasing phrase list as input.

  • Both the encoders share the parameters. The decoder effectively takes the output of the text encoder as an input and attends to the context encoder to determine where corrections need to be made.

Fig. 2. Non-autoregressive contextual spelling correction (FCSC) model (Source: Arxiv)

  • The output hidden states of each context phrase are averaged by the context embedding generator to generate the context phrase embedding. The similarity layer then determines how similar the hidden states in the decoder output are to the context embeddings via an inner product operation:

    s_ij = (Q_i · K_j) / √d_k

Where Q_i is the hidden state of the decoder at the i-th position, K_j is the j-th context phrase embedding, and d_k is the dimension of K.

  • CLS tag (cls): The CLS tag has the same sequence length as the input ASR hypothesis, which determines whether to correct the token at this position. It employs a “BILO” representation in which “B”, “I” and “L” represent the beginning, inside, and the last position of a context phrase, respectively, and “O” represents a general position outside of a context phrase.

  • Context index (cind): It is the output of the similarity layer, i.e., the expected index of the ground-truth context phrase in the bias list. Since an empty context precedes the bias list, the context index for general tokens, which should not be corrected, is 0. As indicated by the following equation, the output hidden dimension of the similarity layer at each position i is the same as the input bias list size, and the context phrase corresponding to the largest value in s_i is selected during decoding for this position:

    cind_i = argmax_j (s_ij)

  • Based on the output CLS tag and context index, the final correction results are determined by changing the words tagged by the CLS tag with the context phrase selected by the context index cind.

  • Filter mechanisms for the context list are suggested during inference to increase inference effectiveness and address scalability problems for large context lists.

  • The key difference between CSC and FCSC is that FCSC focuses only on the biasing phrases, while CSC also has the potential to correct other errors made by the ASR system. In addition, FCSC generates results in parallel rather than label by label, which speeds up decoding.
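The final FCSC correction step, which combines the CLS tags and context indices described above, can be sketched as follows. This is an illustrative decoder that assumes simple “B…L” spans and a toy bias list; names and the exact span handling are assumptions, and the paper’s implementation may differ in detail.

```python
def fcsc_decode(tokens, cls_tags, cind, bias_list):
    """Apply FCSC outputs: replace each span tagged B..L with the context
    phrase selected by the context index at the span start; "O" tokens
    pass through unchanged. Context index 0 is the empty context."""
    out, i = [], 0
    while i < len(tokens):
        if cls_tags[i] == "B":
            j = i
            while j < len(tokens) and cls_tags[j] != "L":
                j += 1                       # walk to the end of the span
            out.extend(bias_list[cind[i] - 1].split())  # index 1 = first phrase
            i = j + 1
        else:                                # "O": keep the original token
            out.append(tokens[i])
            i += 1
    return " ".join(out)

bias = ["Anaya Kumar", "Kaito"]              # hypothetical biasing list
hyp = "please call anna kumar now".split()   # ASR hypothesis with a name error
tags = ["O", "O", "B", "L", "O"]             # "anna kumar" tagged for correction
cind = [0, 0, 1, 1, 0]                       # points to "Anaya Kumar"
fixed = fcsc_decode(hyp, tags, cind, bias)
```

Because the tags and indices are predicted for all positions at once, this replacement step is the only sequential work FCSC performs, which is where its speedup over label-by-label AR decoding comes from.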

Results

On testing, the following outcomes were observed:

1. WER Reduction: Table 1 illustrates that the NAR (FCSC) solution outperforms the autoregressive CSC model on all the test sets. This may be because FCSC does not need a length or duration predictor to align the encoder and decoder, avoiding a possible source of error accumulation.


                                       Table 1: Model performance on the name domain (Source: Arxiv)

Furthermore, the model performance on the two general biasing sets is shown in Table 2 from which it can be inferred that the model achieves approximately 50.7% relative WER improvement compared to baseline RNN-T.
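Relative WER improvement is computed against the baseline WER; as a quick sanity check on the ~50.7% figure, here is the arithmetic with hypothetical WER values chosen to land near that number (these are not the paper’s actual baseline measurements):

```python
def relative_wer_reduction(wer_baseline, wer_model):
    """Relative WER improvement over the baseline, as a percentage."""
    return 100 * (wer_baseline - wer_model) / wer_baseline

# Hypothetical example: a baseline WER of 14.2% reduced to 7.0%
# corresponds to roughly a 50.7% relative improvement.
r = relative_wer_reduction(14.2, 7.0)
```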


                                       Table 2: Model performance on general biasing sets (Source: Arxiv)

2. Latency improvement: The NAR model decreases the model’s size by 43.2% and speeds up inference by 2.1 times, which makes on-device deployment of the NAR model advantageous over the AR solution. Moreover, this latency can be further reduced at runtime by calculating the context phrase encodings in advance and loading them as a cache in the real application.
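The caching idea mentioned above can be sketched as: encode each biasing phrase once and reuse the stored embedding across utterances, since a user’s contact list or playlist rarely changes between requests. `encode_phrase` here is a deterministic toy stand-in for the real context encoder.

```python
import numpy as np

def encode_phrase(phrase):
    """Toy stand-in for the context encoder: a deterministic embedding
    seeded from the phrase's characters."""
    seed = sum(ord(c) for c in phrase)
    return np.random.default_rng(seed).normal(size=8)

class ContextCache:
    """Precompute phrase encodings once and reuse them per request,
    so the (mostly static) biasing list is not re-encoded per utterance."""
    def __init__(self):
        self._cache = {}

    def get(self, phrase):
        if phrase not in self._cache:
            self._cache[phrase] = encode_phrase(phrase)   # encode on first use
        return self._cache[phrase]

cache = ContextCache()
e1 = cache.get("Anaya Kumar")   # encoded and stored
e2 = cache.get("Anaya Kumar")   # served from the cache, no re-encoding
```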


                                                         Table 3: Model size and latency (Source: Arxiv)

3. Influence of Context List Size: WER tends to increase with Kr in a concave curve for both CSC and FCSC, indicating that raw context list size has little effect on model performance when it is large enough, and the proposed method can handle the scalability problem well. Figure 3 also illustrates that the WER curve of FCSC mostly lies below CSC for both Kr and Kf, demonstrating that FCSC consistently outperforms CSC with the change in decoding parameter.

                                       Fig. 3. Effect of filtered context list size (Kf) and raw context list size (Kr) on model performance. (Source: Arxiv)

4. Performance balancing of FCSC: When the model is overly biased, it can suffer from regressions in anti-context scenarios. To control regression on anti-context cases, a portion of the training set consists of general scripts without context phrases, which teaches the model to decide when to make corrections based on the context phrase list and the input ASR hypothesis. Nevertheless, regressions still occur in some cases.

To address this, performance balancing mechanisms are proposed to modify the biasing degree and control any WER regressions on general utterances that we don’t want to bias. For FCSC, a regression control mechanism is put forth that uses a controllable threshold parameter s0 to balance model performance between the biasing set and anti-context cases.
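One way such a threshold could work (an illustrative sketch, not the paper’s exact formulation): normalize the similarity scores over the bias list with a softmax and fall back to the raw ASR output whenever the winning phrase’s confidence does not clear s0.

```python
import math

def gated_correction(scores, s0=0.9):
    """Return the selected bias-list index, or 0 (empty context, i.e. keep
    the ASR output) when confidence is below threshold s0.

    scores[0] is the similarity score of the empty context; scores[1:]
    correspond to the biasing phrases."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]       # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    if best == 0 or probs[best] < s0:
        return 0            # not confident enough: leave the utterance alone
    return best             # confident: apply the selected biasing phrase

confident = gated_correction([0.0, 6.0, 1.0], s0=0.9)   # clear winner, corrects
unsure = gated_correction([0.0, 1.0, 0.9], s0=0.9)      # below threshold, keeps ASR
```

Raising s0 trades biasing-set accuracy for safety on anti-context utterances, which is exactly the trade-off the gap narrowing ratio below quantifies.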

The relative WER gap narrowing ratio (r) is defined as follows:

    r = (W − W0) / (W1 − W0)

Where W represents the WER at a given s0, and W0 and W1 represent the WER for s0 = 0 and s0 = 1, respectively. Figure 4 shows the variation of r with the change in s0. We can see that the WER on the Name set experiences a protracted flat period when s0 is small and then increases steeply only when s0 is close to 1.0, indicating that the majority of the cases in the Name set are corrected with enough confidence to give us a safe margin for carrying out performance balancing.


                                                         Fig. 4. WER relative change with s0 (Source: Arxiv)

Conclusion

To summarize, in this article, we learned the following:

1. A general, domain-insensitive contextual biasing method that adds a contextual spelling correction model on top of the end-to-end ASR system is proposed. The method includes two variants: i) the autoregressive contextual spelling correction (CSC) model and ii) the non-autoregressive Fast Contextual Spelling Correction (FCSC) model.

2. The CSC model consists of a text encoder, a context encoder, and a decoder. This design augments an AR E2E spelling correction model with a context encoder. Contextual information reaches the decoder by attending to the hidden representations from the context encoder via an attention mechanism.

3. For the FCSC model, the output of the text encoder is directly fed into the decoder. The decoder attends to the context encoder and determines the locations that need to be corrected and the candidate context index.

4. The NAR (FCSC) solution reduces the model size by 43.2% and speeds up inference by 2.1 times while still achieving WER improvement. Moreover, its results are generated in parallel, without the label-by-label prediction required in AR decoding; hence, decoding is fast.

5. Performance balancing mechanisms to control regressions on anti-context terms, and filtering algorithms to handle large context lists, were also implemented.

That concludes this article. Thanks for reading. If you have any questions or concerns, please post them in the comments section below. Happy learning!

Link to the Research Paper: https://arxiv.org/pdf/2203.00888.pdf

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
