Hacking Into Nvidia Nemo Script (Download Common Voice)

Purnendu Shukla 15 Mar, 2022 • 6 min read

This article was published as a part of the Data Science Blogathon.

Introduction to Nvidia Nemo Script

Hey all👋

I am sure you must have heard of Nvidia Nemo in recent times. It’s a great library for creating NLP models using just a few lines of code, and needless to say, the team has done a great job.

So as with all others, I wanted to try it out for myself and create something unique. This article covers a few current snippets of my journey, along with code creations from scratch. Have a good time reading it😀.

Nvidia Nemo Script: The Problem Statement

Like every proficient Data Scientist, I picked the problem statement of creating an ASR (Automatic Speech Recognition) model.

The goal here was to create a model which works similarly to the actual Google Assistant / YT Auto Captioning services but only in a single language, English.

To achieve this, I planned to use the Mozilla Common Voice Dataset 7.0, a 65 GB corpus of spoken English sentences. Now the question was how to download such a huge file and process it in one go. This is where Google helps, and a quick search landed me on a script that was doing the heavy lifting, which I quickly used, and suddenly everything changed👀

As the dilemma above suggests, the problem statement is clear: make the script work. So let’s dive into the exact walkthrough of how it was fixed.

Understanding Script

The Nvidia Nemo Script which we are modifying is originally by SeanNaren and is hosted at this link. So before changing it, let’s define what we are supposed to do.

👉 Download, Store & Unzip: We start by downloading the dataset using mozilla_voice_bundler, storing it in the directory specified by data_root, and finally unzipping the tar file.

👉 Processing Data: After extraction, the next part parses the data by converting the mp3 files (listed in the tsv files) to wav ones and then passing them to the sox library to get the duration of each voice sample. This step also captures the path where the new files are stored, along with the text.

👉 Creating Manifest: Finally, with all the info in hand, the last part appends the extracted values to create the manifests that are passed to the Nemo models.

Having defined the explicit goals, we can now move to the fun part, Coding!

Editing The Nvidia Nemo Script

Here are a few goals to keep in mind:

Our script should work in exactly the same way as the previous one.
There should be no ambiguity in the code; it should follow an industry-standard approach.

Let’s start

🌟 Some Imports

import argparse
import csv
import json
import logging
import multiprocessing
import os
import tarfile
from multiprocessing.pool import ThreadPool
from pathlib import Path
from typing import List
import sox
from sox import Transformer
import wget
from tqdm import tqdm

Pretty straightforward here, simple imports! tqdm and logging are optional.

🌟 Command Line Arguments

After the imports, the next step is to edit the command line args:

parser = argparse.ArgumentParser(description='Downloads and processes Mozilla Common Voice dataset.')
parser.add_argument("--data_root", default='./', type=str, help="Directory to store the dataset.")
parser.add_argument('--manifest_dir', default='manifest_dir/', type=str, help='Output directory for manifests')
parser.add_argument("--num_workers", default=multiprocessing.cpu_count(), type=int, help="Workers to process dataset.")
parser.add_argument('--sample_rate', default=16000, type=int, help='Sample rate')
parser.add_argument('--files_to_process', nargs='+', default=['test.tsv', 'dev.tsv', 'train.tsv'],
                    type=str, help='list of *.csv file names to process')
parser.add_argument('--version', default='cv-corpus-7.0-2021-07-21',
                    type=str, help='Version of the dataset (obtainable via https://commonvoice.mozilla.org/en/datasets)')
parser.add_argument('--language', default='hi',
                    type=str, help='Which language to download (default: hi); '
                                   'check https://commonvoice.mozilla.org/en/datasets for more language codes')
args = parser.parse_args()

Two things worth noting here are the defaults: default='cv-corpus-7.0-2021-07-21' for the version and default='hi' for the language. For general readers, the above code gives the script command-line options, and if nothing is passed, the default values are used. To learn more, use the --help / -h option.
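
To see the default behaviour in isolation, here is a minimal sketch with a cut-down parser. It reuses only two of the options above, and passing an explicit list to parse_args is just a convenience so the snippet does not read the real command line.

```python
import argparse

# A cut-down parser with two of the options from the script above.
parser = argparse.ArgumentParser(description="Demo of argparse defaults.")
parser.add_argument("--version", default="cv-corpus-7.0-2021-07-21", type=str)
parser.add_argument("--language", default="hi", type=str)

# No flags given: every option falls back to its default.
defaults = parser.parse_args([])

# One flag given: only that option changes.
english = parser.parse_args(["--language", "en"])

print(defaults.version, defaults.language)
print(english.version, english.language)
```

So if you want the English corpus rather than the default, passing --language en on the command line should be enough.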

🌟 Changing URL format

One key thing to change is the URL format used to download the dataset from the Amazon S3 bucket, which changes from time to time. Currently, the link looks similar to:

https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-7.0-2021-07-21/cv-corpus-7.0-2021-07-21-en.tar.gz

Since the original script can no longer fetch the dataset, we must match the current format. It can be structured as base_url/{}/{}-{}.tar.gz, where the first two {} placeholders are the version and the last {} is the language code.

The below format does just that:

# Example of the resulting link:
# https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-7.0-2021-07-21/cv-corpus-7.0-2021-07-21-en.tar.gz
# The parentheses keep the two string pieces in one expression, so .format() fills the full URL.
COMMON_VOICE_URL = (
    "https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/"
    "{}/{}-{}.tar.gz".format(args.version, args.version, args.language)
)
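
The URL pattern described above can be checked against the example link with a small helper. This is only a sketch: build_url and BASE_URL are hypothetical names used for illustration, and the base address is simply the one from the example link, which may change again.

```python
# Base address taken from the example link above (an assumption that it stays valid).
BASE_URL = "https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com"


def build_url(version: str, language: str) -> str:
    """Fill the base_url/{version}/{version}-{language}.tar.gz pattern."""
    return f"{BASE_URL}/{version}/{version}-{language}.tar.gz"


url = build_url("cv-corpus-7.0-2021-07-21", "en")
print(url)
```

Keeping the version and language as parameters means only the base address needs touching the next time the bucket moves.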

🌟 Working On Processing & Manifest Helper Function

A good practice borrowed from functional programming is to keep the helper functions separate and call them later from the main function. So in this section, we will edit our helper functions in process_files.py and manifest.py, which do the heavy lifting.

process_files.py📜

def process_files(csv_file, data_root, num_workers):
    """ Read *.csv file description, convert mp3 to wav, process text.
        Save results to data_root.
    Args:
        csv_file: str, path to *.csv file with data description, usually start from 'cv-'
        data_root: str, path to dir to save results; wav/ dir will be created
    """
    wav_dir = os.path.join(data_root, 'wav/')
    os.makedirs(wav_dir, exist_ok=True)
    audio_clips_path = os.path.dirname(csv_file) + '/clips/'

    def process(x):
        file_path, text = x
        file_name = os.path.splitext(os.path.basename(file_path))[0]
        text = text.lower().strip()
        audio_path = os.path.join(audio_clips_path, file_path)
        output_wav_path = os.path.join(wav_dir, file_name + '.wav')

        tfm = Transformer()
        tfm.rate(samplerate=args.sample_rate)
        tfm.build(
            input_filepath=audio_path,
            output_filepath=output_wav_path
        )
        duration = sox.file_info.duration(output_wav_path)
        return output_wav_path, duration, text

    logging.info('Converting mp3 to wav for {}.'.format(csv_file))
    with open(csv_file) as csvfile:
        reader = csv.DictReader(csvfile, delimiter='\t')
        # DictReader already consumes the header row, so no manual skip is needed
        data = [(row['path'], row['sentence']) for row in reader]
        with ThreadPool(num_workers) as pool:
            data = list(tqdm(pool.imap(process, data), total=len(data)))
    return data

This function takes the path to a tsv file, the output directory data_root, and the number of workers to use. Its main job is to read the CSV/tsv file description, navigate to the given file path, perform the mp3 -> wav conversion, process the text, and save the result to data_root, the given directory.

For simplicity, let’s split it into pieces/lines:

Line 8–10: Sets default paths such as wav_dir (created if not present) and audio_clips_path, which points to the clips folder in the same directory as the tsv files.
Line 12–26: def process(x), a sub-function of process_files whose job is to return output_wav_path, duration, and text given an input path.
Line 28–35: This part opens the CSV/tsv file, reads the contents of the path and sentence columns, processes them using the process function defined above while displaying a progress bar using tqdm, and returns the processed data.
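
The read-then-map pattern from the last step can be tried without any audio files. The sketch below is an illustration only: SAMPLE_TSV is made-up data with just the two columns the script uses, and fake_process is a hypothetical stand-in for the real conversion step.

```python
import csv
import io
from multiprocessing.pool import ThreadPool

# Made-up rows in the shape of a Common Voice tsv file (only two columns shown).
SAMPLE_TSV = "path\tsentence\nclip_1.mp3\tHello World \nclip_2.mp3\tGood Morning\n"


def fake_process(item):
    """Stand-in for the real conversion step: only normalises the text."""
    file_path, text = item
    return file_path, text.lower().strip()


# DictReader uses the first line as column names, so only data rows are yielded.
reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
rows = [(row["path"], row["sentence"]) for row in reader]

# imap hands the rows to worker threads and yields results in input order.
with ThreadPool(2) as pool:
    results = list(pool.imap(fake_process, rows))

print(results)
```

Because imap keeps the input order, the results line up with the rows of the tsv file even though the work is spread across threads.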

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Extras:

Here is a quick breakdown of the sub-function process:

👉 Line 13–17: Extracts the filename from the tail of file_path, converts the text to lower case, and finally defines the output path for the wav files.
👉 Line 19–26: Passes the audio and output paths to the sox Transformer class, sets the sample rate given by sample_rate, finds the duration of the audio using sox.file_info.duration(), and finally returns the required values.
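
The path and text handling from the first of these steps needs only the standard library, so it can be sketched on its own. The function name derive_output and the sample values are hypothetical; the sox conversion itself is left out here.

```python
import os


def derive_output(file_path: str, text: str, wav_dir: str):
    """Mirror the path and text handling of process(), without the audio conversion."""
    # 'clip.mp3' -> 'clip' (drop any directory part and the extension)
    file_name = os.path.splitext(os.path.basename(file_path))[0]
    # Lower-case the transcript and trim surrounding whitespace
    clean_text = text.lower().strip()
    # The wav file keeps the clip name but lives in wav_dir
    output_wav_path = os.path.join(wav_dir, file_name + ".wav")
    return output_wav_path, clean_text


out_path, clean = derive_output("common_voice_en_1.mp3", "  Hello There ", "wav")
print(out_path, "|", clean)
```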

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — –

manifest.py📜

Having returned our data, we can now write it out in JSON format, and that’s what the create_manifest function does:

def create_manifest(
        data: List[tuple],
        output_name: str,
        manifest_path: str):
    output_file = Path(manifest_path) / output_name
    output_file.parent.mkdir(exist_ok=True, parents=True)

Pretty straightforward, all we do here is pass data (a list of tuples holding the wav path, duration, and text), output_name (the name of the output file), and manifest_path (the path to store the manifest/created files).

Note: As an edge case, if the manifest folder does not exist, it is created before the files are stored there (Line 6).
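
The snippet above stops after creating the folder, so here is a hedged sketch of how the writing step could look. Nemo ASR manifests are JSON-lines files with audio_filepath, duration, and text keys; the name write_manifest, the temporary folder, and the sample entries are assumptions for illustration and may differ from the author’s actual implementation.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Tuple


def write_manifest(data: List[Tuple[str, float, str]], output_name: str, manifest_path: str) -> Path:
    """Write one JSON object per line, one line per audio clip."""
    output_file = Path(manifest_path) / output_name
    # Create the manifest folder (and any parents) if it is missing
    output_file.parent.mkdir(exist_ok=True, parents=True)
    with output_file.open("w", encoding="utf-8") as f:
        for wav_path, duration, text in data:
            entry = {"audio_filepath": wav_path, "duration": duration, "text": text}
            f.write(json.dumps(entry) + "\n")
    return output_file


# Made-up entries written to a throwaway folder that does not exist yet.
demo_dir = tempfile.mkdtemp()
manifest = write_manifest(
    [("wav/a.wav", 1.5, "hello"), ("wav/b.wav", 2.0, "world")],
    "demo_manifest.json",
    str(Path(demo_dir) / "nested"),
)
lines = manifest.read_text(encoding="utf-8").splitlines()
print(lines[0])
```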

🌟 Combining All Functionality

So now that we have all the functionality ready, let’s combine it in the main function, the actual backbone of the script, which ties together all the steps defined at the start:

def main():
    data_root = args.data_root
    os.makedirs(data_root, exist_ok=True)
    target_unpacked_dir = os.path.join(data_root, "CV_unpacked")

    if os.path.exists(target_unpacked_dir):
        logging.info('Find existing folder {}'.format(target_unpacked_dir))
    else:
        logging.info("Could not find Common Voice, Downloading corpus...")
        filename = wget.download(COMMON_VOICE_URL, data_root)
        target_file = os.path.join(data_root, os.path.basename(filename))
        os.makedirs(target_unpacked_dir, exist_ok=True)
        logging.info("Unpacking corpus to {} ...".format(target_unpacked_dir))
        tar = tarfile.open(target_file)
        tar.extractall(target_unpacked_dir)
        tar.close()

    folder_path = os.path.join(target_unpacked_dir, args.version + f'/{args.language}/')
    for csv_file in args.files_to_process:
        data = process_files(
            csv_file=os.path.join(folder_path, csv_file),
            data_root=os.path.join(data_root, os.path.splitext(csv_file)[0]),
            num_workers=args.num_workers
        )
        logging.info('Creating manifests...')
        create_manifest(
            data=data,
            output_name=f'commonvoice_{os.path.splitext(csv_file)[0]}_manifest.json',
            manifest_path=args.manifest_dir,
        )


if __name__ == "__main__":
    main()

I hope it’s pretty self-explanatory after reading the Understanding Script section. However, a few things to add are:

By default, the unpacked files are stored in the CV_unpacked folder (to keep things simple). To extract them into the present working directory instead, remove it.
We have added a check for the unpacked path. If it is present, the script just processes the data and creates the desired files. Otherwise, the full download, unzip, process, and store route is taken.
Finally, a boilerplate to run the script automatically: if __name__ == "__main__", call main.
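
The skip-if-already-unpacked check can be tried on a tiny archive built on the fly. This is a sketch under stated assumptions: unpack_if_missing is a hypothetical helper, the archive holds a single made-up text file, and the download step is left out entirely.

```python
import os
import tarfile
import tempfile


def unpack_if_missing(archive_path: str, target_dir: str) -> bool:
    """Extract archive_path into target_dir unless target_dir already exists.

    Returns True if extraction happened, False if it was skipped.
    """
    if os.path.exists(target_dir):
        return False
    os.makedirs(target_dir, exist_ok=True)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(archive_path) as tar:
        tar.extractall(target_dir)
    return True


# Build a tiny demo archive in a throwaway folder.
work = tempfile.mkdtemp()
src = os.path.join(work, "hello.txt")
with open(src, "w") as f:
    f.write("hi")
archive = os.path.join(work, "demo.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(src, arcname="hello.txt")

target = os.path.join(work, "CV_unpacked")
first = unpack_if_missing(archive, target)   # folder missing, so it extracts
second = unpack_if_missing(archive, target)  # folder present, so it skips
print(first, second)
```

One thing to keep in mind with this kind of check is that it only looks at whether the folder exists, so a half-finished extraction would also be treated as done.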

✨Win Or Lose Time

Ok, so what’s the proof the scripts actually work?

Well, below is a small clip showing the working of the file:)

Link to the video: https://youtu.be/SrKhromAdoI

Working Proof — Sorry For Water Mark – Run at 2x and max res – Video By Author

Note — The script is used with default settings.

Summary

So that ends our coding and evaluation part. If you have followed along, you have learned how to: recreate an entire script from scratch, understand different components, & write modularised and production-ready code.

Along the way, you may also have figured out how to use argparse to turn any function into a command-line tool.

However, it will be much more beneficial if you apply these concepts in real projects and cement your learning. I would really love to see them😍.

Hope you liked my article on Nvidia Nemo Script. Below are some of the resources for advanced readers.

Extra Resources

Github: For downloading and usage, click here.

Contact Links: You can contact me on Twitter, LinkedIn, and GitHub.

Must Read: Nvidia Nemo ASR.

Finally, if you like the article, support my efforts by sharing and passing on your suggestions. To read more articles like these, kindly visit my author page & make sure to follow and get notified🔔. You are welcome to comment, too⏬.

Thanks😀



Hey All✋, My name is Purnendu Shukla a.k.a Harsh. I am a passionate individual who likes exploring & learning new technologies, creating real-life projects, and returning to the community as blogs. My Blogs range from various topics, including Data Science, Machine Learning, Deep Learning, Optimization Problems, Excel and Python Guides, MLOps, Cloud Technologies, Crypto Mining, Quantum Computing.
