This article was published as a part of the Data Science Blogathon.
Speech impairments affect millions of people and have many underlying causes, including neurological or hereditary diseases, physical handicaps, brain damage, and hearing loss. The resulting speech patterns are diverse and complex (stuttering, dysarthria, apraxia, and others), and training data for them is scarce, which hurts the recognition accuracy of automatic speech recognition (ASR) models. The poor performance of existing ASR solutions on disordered speech and other atypical speech patterns makes the technology unsuitable for many of the speakers who might benefit from it the most.
Source: https://bit.ly/3P70gtq
In light of this, Google AI researchers have proposed an approach for on-device personalization of ASR models with a focus on acoustic adaptation to handle deviant voice characteristics. They utilized a modification of the RNN-T model architecture, having 8 LSTM encoder layers and 2 LSTM layers for the language model component. In this post, we will look at this approach; now, let’s dive in!
1. Google AI researchers have proposed an approach for on-device personalization of ASR models with a focus on acoustic adaptation to handle deviant voice characteristics.
2. ASR personalization using the seed model was more effective (resulting in larger WER improvements) and more efficient (needing fewer transcript corrections) than personalizing from a base model trained on typical speech only.
3. With as few as 50 utterances, the proposed on-device training procedure can decrease the median WER by 71% and boost the median Assistant Task Success Rate (ATSR) from 40% to over 80%.
4. Future work includes investigating on-device personalization for a domain other than home automation, ideally, open conversations with longer phrases, which require more training data and increased training times.
To navigate the challenges outlined in the Introduction section, a popular solution is to fine-tune models originally optimized for typical speech on user-specific data. ASR model personalization works well for various atypical speech patterns, including accents and speech disorders. Figure 1 illustrates the overview of ASR personalization techniques.
Figure 1: Overview of ASR Personalization techniques
Based on the infrastructure for fine-tuning the model, ASR personalization can be broadly classified as i) Server-side Personalization and ii) On-device Personalization.
1) Server-side Personalization of ASR models: ASR personalization can be done via model fine-tuning, starting from a speaker-independent base model trained on typical speech and fine-tuning all or part of the model weights on the target speaker’s data. Fine-tuning is usually performed on the server, leveraging accelerators with large amounts of memory and simultaneous access to all of a speaker’s adaptation data.
👉 Challenge: A server-based training environment poses problems around data privacy, delayed model-update times, and communication costs for copying data and models between mobile devices and server infrastructure.
2) On-device Personalization of ASR Models: Because new smartphones are becoming powerful enough to perform offline ASR entirely on-device, the user’s audio data no longer has to be transferred to a server for inference. The on-device personalization scenario differs from the server-based scenario in the following ways: a) hardware constraints limit the maximum possible memory footprint and the training speed, b) limited battery capacity makes long training processes undesirable, and c) the amount of time any sensitive/private user data is stored should be kept as short as possible.
👉Challenge: Even in high-end mobile devices, memory consumption is one of the main limitations for on-device training. The number of layers that need to be updated directly impacts memory consumption.
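To make the link between updated layers and memory concrete, here is a rough back-of-the-envelope sketch. It only counts the extra memory needed for gradients and optimizer state of the LSTM layers that are being trained. The layer sizes, the 32-bit values, and the two optimizer slots per parameter are illustrative assumptions, not figures from the paper, and activations are ignored entirely.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Parameter count of one textbook LSTM layer (4 gates, one bias vector per gate)."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def training_overhead_bytes(
    num_trainable_layers: int,
    input_size: int = 1024,      # assumed, not from the paper
    hidden_size: int = 1024,     # assumed, not from the paper
    bytes_per_value: int = 4,    # assumed float32
    optimizer_slots: int = 2,    # assumed Adam-style optimizer (two moments)
) -> int:
    """Extra memory for gradients plus optimizer state of the trainable layers."""
    per_layer = lstm_layer_params(input_size, hidden_size)
    # One gradient value plus `optimizer_slots` state values per trainable parameter.
    values_per_param = 1 + optimizer_slots
    return num_trainable_layers * per_layer * values_per_param * bytes_per_value


if __name__ == "__main__":
    for layers in (2, 4, 8):
        mib = training_overhead_bytes(layers) / 2**20
        print(f"{layers} trainable layers: about {mib:.0f} MiB of training overhead")
```

Under these assumptions the overhead grows linearly with the number of trainable layers, which is why restricting the update to a subset of layers is attractive on a phone.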
i) Model Architecture: Google AI researchers modified the RNN-T model architecture with a focus on acoustic model (AM) adaptation, an approach that allows deployment on mobile devices and supports streaming. They utilized 8 LSTM encoder layers and 2 LSTM layers for the language model component. AM adaptation techniques attempt to reduce the mismatch in users’ voice characteristics, such as accent, pitch, and speaking rate.
Furthermore, the sliding-window approach (“consecutive training”) was simplified: a small number N of utterances is collected from a speaker, and the model is then trained for E epochs with a batch size of N. Following each training round, the new checkpoint is saved so that training can be continued as soon as the user has recorded the next batch of N utterances.
The advantage of this approach is that user data only needs to be retained until a single training round is completed. Furthermore, the user benefits directly from the improved model, which can be used to transcribe new phrases as they are recorded.
ii) Dataset: The experiments used a random subset of 100 speakers from the Euphonia corpus of disordered speech, which contains over 1M utterance audio clips from over 1000 speakers with different severities and types of speech impairments. For testing, i.e., whenever word error rates (WERs) are reported, the test sets are limited to utterances from the home automation (HA) domain. For training, two sets were built: a) all training utterances per speaker (covering all domains) and b) 50 randomly sampled utterances from the speaker’s HA training utterances.
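The control flow of consecutive training can be sketched in a few lines. This is only a minimal illustration of the round structure described above, not the researchers’ implementation: `consecutive_training`, `train_step`, and the checkpoint object are hypothetical placeholders, and the default values for N and E are arbitrary.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Checkpoint = TypeVar("Checkpoint")
Utterance = Tuple[str, str]  # (audio reference, corrected transcript); placeholder type


def consecutive_training(
    utterances: Iterable[Utterance],
    checkpoint: Checkpoint,
    train_step: Callable[[Checkpoint, List[Utterance]], Checkpoint],
    n_per_round: int = 5,   # N: assumed value for illustration
    epochs: int = 4,        # E: assumed value for illustration
) -> Tuple[Checkpoint, int]:
    """Train in rounds of N utterances, E epochs each, with batch size N."""
    buffer: List[Utterance] = []
    rounds = 0
    for utterance in utterances:
        buffer.append(utterance)
        if len(buffer) == n_per_round:
            for _ in range(epochs):
                # One update per epoch, because the whole round is a single batch.
                checkpoint = train_step(checkpoint, buffer)
            rounds += 1
            # The new checkpoint is kept; the user data of this round is dropped.
            buffer = []
    return checkpoint, rounds


if __name__ == "__main__":
    # Toy stand-in: the "checkpoint" is just a counter of applied updates.
    data = [(f"clip_{i}.wav", "turn on the lights") for i in range(50)]
    final, rounds = consecutive_training(data, 0, lambda ckpt, batch: ckpt + 1)
    print(rounds, "rounds,", final, "updates")
```

In a real system, `train_step` would run a gradient update on the trainable layers and persist the checkpoint, and the updated model would be used to transcribe the utterances of the next round.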
1. Word Error Rate Improvements: Figure 2 shows the average WER as a function of the number of training utterances, across all 100 speakers. The graph shows that WER improves as more training rounds are performed on the device. However, further training with additional data beyond 50 utterances would likely lead to further but much smaller gains.
Furthermore, the graph also highlights the importance of the seed model, which is a model that has been pre-adapted to a large pool of speakers with speech impairments. It was found to be more effective (resulting in larger WER improvements) and more efficient (requiring fewer transcript corrections) than personalizing from a base model trained on typical speech. Notably, relative to the seed model, speakers with more severe speech disorders saw higher improvement rates than speakers with mildly disordered speech (22%).
Figure 2: Average WER with the number of utterances for training for on-device personalization from seed or base model.
(Source: Arxiv)
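For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference and the predicted transcript, divided by the number of reference words. The sketch below is a plain implementation of that standard definition; the function name and the whitespace tokenization are choices made here, not details from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("turn on the lights", "turn on lights"))      # one deletion
    print(word_error_rate("turn on the lights", "turn off the light"))  # two substitutions
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.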
2. Assistant Task Success Rate: In the experiments discussed above, the researchers focused on the home automation domain. Aside from WER, a pertinent extrinsic evaluation metric for this domain is the success rate of correct query intent recognition on the generated transcript.
Assistant Task Success Rate (ATSR) is the proportion of test utterances for which the intent derived from the predicted transcript matches the intent derived from the true transcript. An ATSR of 80% is usually considered to provide a satisfactory user experience.
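The definition translates directly into code. In the sketch below, `assistant_task_success_rate` follows the definition above, while `toy_intent` is a made-up keyword matcher that stands in for a real intent recognizer; it is purely hypothetical and only there to make the example runnable.

```python
from typing import Callable, Hashable, Optional, Sequence, Tuple


def assistant_task_success_rate(
    true_transcripts: Sequence[str],
    predicted_transcripts: Sequence[str],
    intent_of: Callable[[str], Hashable],
) -> float:
    """Share of utterances whose predicted-transcript intent equals the true-transcript intent."""
    if not true_transcripts or len(true_transcripts) != len(predicted_transcripts):
        raise ValueError("need two non-empty transcript lists of equal length")
    matches = sum(
        intent_of(truth) == intent_of(pred)
        for truth, pred in zip(true_transcripts, predicted_transcripts)
    )
    return matches / len(true_transcripts)


def toy_intent(transcript: str) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical stand-in for an intent recognizer: (action, device) by keyword."""
    words = set(transcript.lower().split())
    action = next((a for a in ("on", "off", "play") if a in words), None)
    device = next((d for d in ("lights", "tv", "music") if d in words), None)
    return action, device


if __name__ == "__main__":
    truth = ["turn on the lights", "turn off the tv", "play some music"]
    predicted = ["turn on lights", "turn of the tv", "play some music"]
    print(assistant_task_success_rate(truth, predicted, toy_intent))
```

Note that the first prediction contains a word error but still yields the right intent, which is why ATSR can be high even when WER is not zero.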
ATSR was measured for 20 randomly chosen speakers out of the 100, using the base model, the seed model, and the on-device personalized model after all 50 utterances had been used for training. The on-device personalized models attain a median ATSR above the 80% threshold for all severity groups. It should also be noted that the seed model already has a much better ATSR than the base model, which underperforms with a median ATSR of just 40%. Median ATSRs are shown in Table 1.
Table 1: Median Assistant Task Success Rate for base models (BM), seed models (SM), and personalized models (pers).
(Source: Arxiv)
3. User benefits of consecutive training: One key advantage of consecutive training is that users can already benefit from improved models during their recording sessions, resulting in fewer transcription errors and less manual correction effort.
When comparing on-device personalization with consecutive training against a single training round, transcript corrections decreased by 27% across all speakers, with speakers with moderate speech disorders seeing the biggest gain (31%). This analysis emphasizes that using seed models dramatically improves the user experience while recording training utterances by massively reducing the need for transcript corrections.
The researchers also tested their on-device personalization with one deaf speaker with severely impaired speech. The on-device training was carried out exactly as in the simulations, with the speaker recording 50 training phrases over 10 consecutive sessions and the model being updated on-device after each session. Transcript accuracy increased dramatically, from 48% (seed model) to 63% (after personalization).
1. Enhances Privacy: Performs personalization on user devices without having to ever upload sensitive user voice data to a server for processing.
2. Reduces communication cost with a server: In a scenario with frequent model adaptation, on-device training is a solution to reduce communication costs with a server.
3. Faster Turnaround: It can offer a faster turnaround for users to experience improved personalized models.
Mobile devices have limited memory and computation capabilities. The availability and dependability of on-device training data differ greatly from server-side training, where data may be carefully collected and annotated.
1. To include a larger user study to ensure that the promising results obtained in the simulation are replicable.
2. To investigate on-device personalization for a domain other than home automation, ideally more challenging tasks such as open conversations with longer phrases, which will likely require more training data and longer training times.
To summarize, we learned the following in this post:
1. On-device personalization of ASR models with a focus on acoustic adaptation can be leveraged to deal with disordered speech and other atypical speech patterns.
2. The outcomes of the experiments highlight the importance of starting from a seed model. ASR personalization using a seed model was more effective and more efficient than personalization from a base model trained on typical speech.
3. On-device personalized models attain a median ATSR above the 80% threshold for all severity groups. On testing, it was found that with as few as 50 utterances, the proposed on-device training process can decrease the median WER by 71% and boost the median Assistant Task Success Rate (ATSR) from 40% to over 80%.
4. This approach enables voice-controlled services, like Google Assistant, to be useful for people with severe speech disorders.
Thanks for reading. If you have any questions or concerns, please leave them in the comments section below. Happy Learning!
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.