Gradio App for Translating Spanish Audio

Drishti Sharma 06 Jun, 2022

This article was published as a part of the Data Science Blogathon.

Introduction to Gradio App

Language is an important part of our lives: it connects us and gives us a sense of belonging. Unfortunately, many languages have already become extinct, and others are on the verge of extinction as a result of accelerating globalization, urbanization, acculturation, and cultural, political, and economic alienation. Considering the gravity of the situation, we must devise solutions that aid the revitalization and preservation of indigenous languages. In light of this, here is an attempt to create an app that transcribes Spanish audio and translates the transcription into Nahuatl text.


We will use the following tools and models:

  1. Hugging Face pipeline function
  2. Model for ASR (Module 1) – jonatasgrosman/wav2vec2-xls-r-1b-spanish
  3. Model for translating Spanish transcript to Nahuatl (Module 2) – hackathon-pln-es/t5-small-spanish-nahuatl
  4. Hugging Face Spaces
  5. Gradio

Why is it Important to Protect Indigenous Languages?

A language is a tool for expressing oneself and bringing people together. People who speak the same language and share a cultural identity tend to bond more closely. When a language dies, its speakers lose one of their main means of connecting with one another. We also lose access to a wealth of traditional knowledge that could be valuable. Furthermore, the preservation of indigenous languages may be critical to decreasing discrimination against indigenous peoples and to strengthening the link between culture, language, and identity.

Core Idea

The key concept is that we feed audio (speech) input into the ASR module (Module 1), which converts Spanish speech into text. The resulting transcription is then passed to the Nahuatl translator module (Module 2), which translates the Spanish text to Nahuatl.

Figure 1: Diagram depicting the flow of the application
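
The flow in Figure 1 is simply a composition of two functions. As a minimal sketch (the stub functions below are hypothetical placeholders standing in for the real ASR and translation models, not part of the app):

```python
from typing import Callable, Dict


def make_pipeline(
    asr_fn: Callable[[str], str],
    translate_fn: Callable[[str], str],
) -> Callable[[str], Dict[str, str]]:
    """Chain Module 1 (speech -> Spanish text) and Module 2 (Spanish -> Nahuatl)."""

    def run(audio_path: str) -> Dict[str, str]:
        transcript = asr_fn(audio_path)         # Module 1: ASR
        translation = translate_fn(transcript)  # Module 2: translation
        return {"transcript": transcript, "translation": translation}

    return run


# Hypothetical stand-ins so the sketch runs without downloading any model
def fake_asr(audio_path: str) -> str:
    return "hola"


def fake_translate(text: str) -> str:
    return f"<Nahuatl translation of '{text}'>"


pipeline_fn = make_pipeline(fake_asr, fake_translate)
result = pipeline_fn("audio1.wav")
print(result)
```

Keeping the two modules behind plain functions like this also makes it easy to swap either model later without touching the rest of the app.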

Module 1 (ASR module):

To convert speech to the corresponding text, we will use the “jonatasgrosman/wav2vec2-xls-r-1b-spanish” pre-trained model from the Hugging Face Hub, which was trained and contributed by Jonatas Grosman. This model is a fine-tuned version of facebook/wav2vec2-xls-r-1b on the Spanish Common Voice 8.0 dataset. To use this model, the speech input must be sampled at 16 kHz.

Module 2 (Nahuatl Translator):

For translation, we will use hackathon-pln-es/t5-small-spanish-nahuatl, a t5-small model fine-tuned by Emilio Alejandro Morales, Rodrigo Martínez Arzate, Luis Armando Mercado, and Jacobo del Valle on Spanish and Nahuatl sentences in two stages:

1. In the first stage, they fine-tuned the t5-small model on the Anki dataset, which consists of 118,964 English-Spanish text pairs.

2. In the second stage, to train the model for the Nahuatl language, the t5-small model fine-tuned on the Spanish dataset was further trained on the best examples from the Axolotl corpus, that is, those that did not exhibit misalignments. In addition, the Nahuatl orthographies were normalized with the help of py-elotl.

Furthermore, we will use Gradio's Interface class to build a UI for the machine learning models and deploy our app on Hugging Face Spaces.

Step-by-Step Implementation

The steps below will walk you through developing a Gradio app for Spanish ASR and then translating the resulting transcription to Nahuatl.

Step 1: Creating a Hugging Face Account and Repository for the App

If you don’t already have a Hugging Face account, visit the website and create one. Once you have an account, click the profile icon at the top-right of the page and then the ‘New Space’ button. You’ll be directed to a new page where you’ll be asked to name the repository you want to create. Give the Space a name, choose ‘Gradio’ from the SDK options, and click the ‘create new space’ button. The repository for your app will then be created. For your convenience, I’ve provided a demonstration video below.

Step 2: Creating a requirements.txt File

Now we will create a requirements.txt file that lists the Python packages our app needs to run. These dependencies are installed with pip install -r requirements.txt.

We will need to add transformers, torch, librosa==0.8.1, pyctcdecode, and pypi-kenlm.
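
Based on the list above, the requirements.txt file would contain one package per line. Only librosa is pinned in this article; leaving the other packages unpinned is an assumption, and you may prefer to pin them for reproducible builds:

```text
transformers
torch
librosa==0.8.1
pyctcdecode
pypi-kenlm
```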

Step 3: Creating the app.py File

For the sake of clarity and to make things easier to understand, I’ve broken the code into sections. We’ll go over each code block one by one.

1. Import necessary libraries

We will start with importing the required dependencies.

import gradio as gr   
import librosa 
from transformers import AutoFeatureExtractor, AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

2. Defining a function that makes sure that the speech input has a sampling rate of 16 kHz

The ASR model expects audio sampled at 16 kHz, so this helper loads the audio file and resamples it whenever its sampling rate differs from the one the model expects.

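def load_and_fix_data(input_file, model_sampling_rate):
    # librosa.load returns mono audio by default, resampled to 22,050 Hz
    speech, sample_rate = librosa.load(input_file)
    # Defensive check: collapse multi-channel audio (channels first) to mono
    if len(speech.shape) > 1:
        speech = librosa.to_mono(speech)
    # Resample only when the rate differs from what the model expects
    if sample_rate != model_sampling_rate:
        speech = librosa.resample(speech, orig_sr=sample_rate, target_sr=model_sampling_rate)
    return speech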
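
To see what resampling does to the data, it helps to count samples: a clip keeps its duration, so the number of samples scales with the ratio of the two rates. The helper below is a hypothetical illustration of that arithmetic only; it does not process any audio:

```python
import math


def expected_num_samples(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Approximate sample count after resampling from orig_sr to target_sr."""
    return math.ceil(num_samples * target_sr / orig_sr)


# One second of audio loaded at librosa's default rate of 22,050 Hz
one_second = expected_num_samples(22050, orig_sr=22050, target_sr=16000)
print(one_second)
```

A quick check like this can catch a forgotten resampling step, since a clip that is still at the wrong rate will have noticeably more (or fewer) samples than its duration implies.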

3. Specifying the model name, loading the feature extractor, and setting up the pipeline for ASR

To convert the Spanish speech to text, we will use the “jonatasgrosman/wav2vec2-xls-r-1b-spanish” model. We will also load its feature extractor and read the expected sampling rate from it. Then, as illustrated below, we’ll instantiate a pipeline by calling pipeline() for automatic speech recognition:

#Loading the model and feature extractor for ASR
model_name1 = "jonatasgrosman/wav2vec2-xls-r-1b-spanish"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name1)
sampling_rate = feature_extractor.sampling_rate
asr = pipeline("automatic-speech-recognition", model=model_name1)

4. Loading the model and tokenizer for translating Spanish transcriptions to Nahuatl

#Loading the model and tokenizer for Spanish-to-Nahuatl translation
model_name2 = 'hackathon-pln-es/t5-small-spanish-nahuatl'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name2)
tokenizer = AutoTokenizer.from_pretrained(model_name2)
#Separator placed between the transcription and the translation in the output
new_line = '\n\n\n'

5. Defining a function for ASR

Now we will define a function for converting speech to text.

#Defining the function for ASR
def speech_to_text(input_file):
    speech = load_and_fix_data(input_file, sampling_rate)
    transcribed_text = asr(speech, chunk_length_s=15, stride_length_s=1)
    transcribed_text = transcribed_text["text"]
    return transcribed_text
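
The chunk_length_s and stride_length_s arguments tell the pipeline to split long recordings into overlapping windows so that memory use stays bounded. The sketch below is a simplified illustration of how such windows could be laid out, assuming the stride is applied on both sides of each chunk; chunk_windows is a hypothetical helper, not the library's actual implementation:

```python
from typing import List, Tuple


def chunk_windows(
    total_s: float,
    chunk_length_s: float = 15.0,
    stride_length_s: float = 1.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of overlapping chunks covering the audio."""
    # Consecutive chunks advance by the chunk length minus the overlap on both sides
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than twice stride_length_s")

    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:  # the last chunk reaches the end of the audio
            break
        start += step
    return windows


windows = chunk_windows(40.0)
print(windows)
```

With the values used in this article, neighbouring chunks share two seconds of audio, which gives the model some context around each chunk boundary.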

6. Defining a function for translating Spanish audio transcriptions to Nahuatl

#Defining a function for translating the Spanish audio transcription to Nahuatl
def spanish_transcription_to_nahuatl(transcribed_text):
    input_ids = tokenizer('translate Spanish to Nahuatl: ' + transcribed_text, return_tensors='pt').input_ids
    outputs = model.generate(input_ids, max_length=512)
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    return outputs
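
T5 models are steered with a text prefix that names the task, which is why the function above prepends 'translate Spanish to Nahuatl: ' to the transcript. A small hypothetical helper makes that step explicit (the whitespace cleanup is an added assumption, not something the model requires):

```python
TASK_PREFIX = "translate Spanish to Nahuatl: "


def build_translation_prompt(transcribed_text: str) -> str:
    """Prepend the task prefix used in this article to a Spanish transcript."""
    cleaned = " ".join(transcribed_text.split())  # collapse repeated whitespace
    return TASK_PREFIX + cleaned


prompt = build_translation_prompt("  hola   mundo ")
print(prompt)
```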

7. Defining a function that will generate a Spanish audio transcript and translate that to Nahuatl

Finally, we’ll write a function that outputs the generated transcript for the Spanish input audio, and the translated transcript in Nahuatl.

#Defining a function that will generate Spanish audio transcripts and translate them to Nahuatl
def asr_and_nahuatl_translation(input_file):
    transcribed_text = speech_to_text(input_file)
    outputs = spanish_transcription_to_nahuatl(transcribed_text)
    return f"Spanish Audio Transcription: {transcribed_text} {new_line} Nahuatl Translation :{outputs}"
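
To check the output format without loading either model, the formatting can be tried on its own. This sketch assumes new_line holds three newline characters; format_output and the sample strings are hypothetical placeholders:

```python
new_line = "\n\n\n"  # assumption: blank lines separating the two results


def format_output(transcribed_text: str, translation: str) -> str:
    """Mirror the string returned by asr_and_nahuatl_translation."""
    return (
        f"Spanish Audio Transcription: {transcribed_text} {new_line} "
        f"Nahuatl Translation :{translation}"
    )


text = format_output("hola", "<translation>")
print(text)
```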

8. Creating an interface to the model using gr.Interface

Next, we will use Gradio’s Interface class to build an interface for the machine learning models by providing (1) the function, (2) the desired input components, and (3) the desired output components, which lets us quickly prototype and test our models. In our case, the function is asr_and_nahuatl_translation. The audio input can be recorded with a microphone or dropped in as an audio file from a local directory, so we will use gr.inputs.Audio(source="microphone", type="filepath", label="Record your audio") for the input. Since the intended output is a string, we will use gr.outputs.Textbox(label="Output Text") to display it. Finally, to launch the demo, call the launch() method.

⚠️If you wish to test audio files stored locally, make sure they have been uploaded to the Space and that their paths are listed in the examples (as shown in the code snippet below). It’s worth mentioning that the components can be specified as either instantiated objects or string shortcuts.

To upload audio files, go to “Files and versions” –> “Contribute” –> “Upload Files” in the order stated here.

inputs = [gr.inputs.Audio(source="microphone", type="filepath", label="Record your audio")]
outputs = gr.outputs.Textbox(label="Output Text")
examples = [["audio1.wav"], ["travel.wav"], ["sample_audio.wav"]]
description = """This is a Gradio demo of Spanish Audio Transcriptions to Nahuatl Translation. To use this, simply provide an audio input (audio recording or via microphone), which will subsequently be transcribed and translated to the Nahuatl language.
Pre-trained model used for Spanish ASR: jonatasgrosman/wav2vec2-xls-r-1b-spanish
Pre-trained model used for translating Spanish audio transcription to the Nahuatl language: hackathon-pln-es/t5-small-spanish-nahuatl"""
#Wiring the function, input, and output components together, then launching the demo
gr.Interface(
    asr_and_nahuatl_translation,
    inputs = inputs,
    outputs = outputs,
    examples = examples,
    description = description,
).launch()
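
As noted above, a component can be given either as an instantiated object or as a string shortcut. A minimal sketch mixing both styles follows. It assumes Gradio 3 or later, where gr.Audio is available; describe_audio is a hypothetical placeholder function, and gradio is imported lazily so that the placeholder can be tested without launching anything:

```python
def describe_audio(audio_path: str) -> str:
    """Placeholder for asr_and_nahuatl_translation; just echoes the file path."""
    return f"Received audio file: {audio_path}"


def build_interface(fn):
    import gradio as gr  # imported here so the helper above has no dependencies

    return gr.Interface(
        fn=fn,
        inputs=gr.Audio(type="filepath", label="Record your audio"),  # instantiated object
        outputs="text",  # string shortcut for a default text box
    )


# To try it locally: build_interface(describe_audio).launch()
```

The shortcut is convenient for quick prototypes, while an instantiated component is the way to go when you need to set options such as the label.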

Step 4: Debugging

If you encounter an error, go to the “See log” tab, right next to the spot where Runtime Error is shown, and use the error log to track down and fix the problem.

The app should function like this once the Space is up and running without errors:

Link to the Space:

Challenges


1. The pre-trained model used for ASR was fine-tuned on the Common Voice dataset, a crowdsourced corpus in which volunteers read prompted sentences aloud. Although this model performs well in most circumstances, it may struggle in a noisy setting or with audio input that contains background sound.

2. Nahuatl is a widely spoken indigenous language in Mexico, but the availability of structured data for training a model for translation tasks is sparse. In addition to that, there are multiple varieties of Nahuatl, which makes this task even more difficult.

3. Challenging audio – low pitch, a low signal-to-noise ratio (SNR), and similar factors can make speech harder to recognize

4. Data protection – consent and privacy

5. Different phonation styles and dialects

6. Speech Modality

7. Demographic Bias
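
Since low SNR is listed among the challenges, it may help to recall one common definition: the signal-to-noise ratio in decibels is ten times the base-10 logarithm of the ratio of signal power to noise power. A minimal sketch with made-up sample values (the helper names are hypothetical):

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average of the squared sample values."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


clean = [1.0, -1.0, 1.0, -1.0]   # toy "speech" samples
hiss = [0.1, -0.1, 0.1, -0.1]    # toy "noise" samples, ten times smaller in amplitude
value = snr_db(clean, hiss)
print(round(value, 2))
```

A tenfold difference in amplitude corresponds to roughly 20 dB; the lower the value, the harder the recording is likely to be for an ASR model.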

Limitations and Solutions

1. Pre-trained model used for ASR – The ASR model plays a crucial role: if the audio is not recognized correctly, the transcription will be erroneous, and so will the output of the Nahuatl translation module. Although the pre-trained model behind this app performs well in many situations, it may struggle with a wide range of voice patterns (different accents, pitches, speaking speeds, and background audio conditions).

👉Solution: The pre-trained ASR model should be further trained on example audios that closely reflect the acoustic environment in which the app will be used or tested.

2. Pre-trained model used for translating Spanish transcriptions to Nahuatl – Even though this model performs well in the majority of instances, we cannot rule out that it produces erroneous results for some specific kinds of text input.

👉Solution: After determining which kinds of text input the Nahuatl translation module handles poorly, the pre-trained model can be further trained on examples of that kind of text to produce more reliable translations.


Applications

This app could be used to devise solutions that require Spanish ASR as well as translation of the resulting audio transcriptions into Nahuatl.

Things to Try

1. How about creating an app that translates Spanish transcriptions to the language of your choice?

2. If you find that either of the two modules (i.e., the Spanish ASR or the Spanish-to-Nahuatl translator) does not perform well for the application you’re most interested in, consider fine-tuning that model on the example audios or texts it often gets wrong.

3. Try building this app using Gradio Blocks.
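
As a starting point for the last suggestion, here is a minimal Blocks sketch. It assumes Gradio 3 or later; the layout, the button label, and the placeholder function are illustrative choices rather than this article's app:

```python
def placeholder_translate(audio_path: str) -> str:
    """Stand-in for asr_and_nahuatl_translation so the layout can be tried alone."""
    return f"Would transcribe and translate: {audio_path}"


def build_blocks_demo(fn):
    import gradio as gr  # imported lazily; only needed when building the UI

    with gr.Blocks() as demo:
        gr.Markdown("Spanish audio transcription to Nahuatl translation")
        audio = gr.Audio(type="filepath", label="Record your audio")
        output = gr.Textbox(label="Output Text")
        button = gr.Button("Transcribe and translate")
        # Run fn on the recorded file path and show the result in the text box
        button.click(fn=fn, inputs=audio, outputs=output)
    return demo


# To try it locally: build_blocks_demo(placeholder_translate).launch()
```

Unlike Interface, Blocks lets you arrange components and events yourself, so it is a natural next step if you want, for example, separate boxes for the transcription and the translation.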


Conclusion

To sum it up, in this blog post we learned:

1. How to create a Gradio app for translating Spanish audio transcripts to Nahuatl.

2. The challenges encountered in developing robust ASR solutions and text-to-text translation for indigenous languages.

3. How certain limitations can be circumvented.

4. The potential applications of such an app.

Thanks for reading my article on Gradio App. If you have any questions or concerns please post them in the comments section below. Happy learning!

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

