Hugging Face Releases New NLP ‘Tokenizers’ Library Version (v0.8.0)

Prateek Joshi 01 Mar, 2024 • 5 min read

Introduction

Hugging Face is at the forefront of a lot of updates in the NLP space. They have released one groundbreaking NLP library after another in the last few years. Honestly, I have learned and improved my own NLP skills a lot thanks to the work open-sourced by Hugging Face.

And today, they’ve released another big update – a brand new version of their popular Tokenizers library.


A Quick Introduction to Tokenization

So, what is tokenization? Tokenization is a crucial cog in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP methods, like the Count Vectorizer, and advanced deep learning-based architectures, like Transformers.

Tokens are the building blocks of Natural Language.

What is Hugging Face?

Hugging Face is a company that makes tools for understanding and working with language. They create software that helps computers understand and generate human language better. They also provide a platform where people can share and use these tools for free.

What is Tokenization?

Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.

For example, consider the sentence: “Never give up”.

The most common way of forming tokens is based on space. Assuming space as a delimiter, the tokenization of the sentence results in 3 tokens – Never-give-up. As each token is a word, it becomes an example of Word tokenization.
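
In Python, this kind of space-based word tokenization is a one-liner; here is a minimal illustration using the built-in split() method:

# split the sentence on whitespace to get word-level tokens
print("Never give up".split())

Output: ['Never', 'give', 'up']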

Why is Tokenization Required?

As tokens are the building blocks of Natural Language, the most common way of processing the raw text happens at the token level. The sentences or phrases of a text dataset are first tokenized and then those tokens are converted into integers which are then fed into the deep learning models.

For example, Transformer-based models – the State-of-the-Art (SOTA) Deep Learning architectures in NLP – process the raw text at the token level. Similarly, the most popular deep learning architectures for NLP like RNN, GRU, and LSTM also process the raw text at the token level.

Hugging Face’s Tokenizers Library

We all know about Hugging Face thanks to their Transformers library, which provides a high-level API to state-of-the-art transformer-based models such as BERT, GPT-2, ALBERT, RoBERTa, and many more.

The Hugging Face team also maintains another highly efficient and super-fast library for text tokenization called Tokenizers. Recently, they released v0.8.0 of the library.

Key Highlights of Tokenizers v0.8.0

  • Now both pre-tokenized sequences and raw text strings can be encoded.
  • Training a custom tokenizer is now five to ten times faster.
  • Saving a tokenizer is easier than ever. It takes just one line of code to save a tokenizer as a JSON file.
  • And many other improvements and fixes.

To see the entire list of updates and changes, refer to this link. In this article, I’ll show you how you can easily get started with this latest version of the Tokenizers library for NLP tasks.

Getting Started with Tokenizers

I’ll be using Google Colab for this demo. However, you are free to use any other platform or IDE of your choice. So, first of all, let’s quickly install the tokenizers library:

!pip install tokenizers

You can check the version of the library by executing the command below:

import tokenizers

tokenizers.__version__

Let’s import some required libraries and the BertWordPieceTokenizer from the tokenizers library:
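
# BertWordPieceTokenizer implements the WordPiece tokenization scheme used by BERT
from tokenizers import BertWordPieceTokenizer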

Other tokenization schemes are available as well, such as ByteLevelBPETokenizer, CharBPETokenizer, and SentencePieceBPETokenizer. In this article, I will be using BertWordPieceTokenizer only. This is the tokenization scheme used in the BERT model.

Tokenization

Next, we have to download a vocabulary set for our tokenizer:

# Bert Base Uncased Vocabulary
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt

Now, let’s tokenize a sample sentence:
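
The snippet below is a minimal sketch of this step. It initializes the tokenizer with the vocabulary file downloaded above (lowercase=True is assumed here to match the uncased vocabulary) and encodes the sample sentence whose outputs are shown in the rest of this section:

# initialize the tokenizer with the downloaded BERT (uncased) vocabulary
# lowercase=True is an assumption to match the uncased vocab file
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

# the sample sentence used throughout this article
sentence = "Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect."

# encode the sentence
encoded_output = tokenizer.encode(sentence)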

The three main components of “encoded_output” are:

  1. ids – The integer values assigned to the tokens of the input sentence.
  2. tokens – The tokens after tokenization.
  3. offsets – The position (start and end character indices) of each token in the input sentence.

print(encoded_output.ids)

Output: [101, 2653, 2003, 1037, 2518, 1997, 5053, 1012, 2021, 11495, 1037, 2047, 2653, 2013, 11969, 2003, 3243, 1037, 4830, 16671, 2075, 9824, 1012, 102]

print(encoded_output.tokens)

Output: ['[CLS]', 'language', 'is', 'a', 'thing', 'of', 'beauty', '.', 'but', 'mastering', 'a', 'new', 'language', 'from', 'scratch', 'is', 'quite', 'a', 'da', '##unt', '##ing', 'prospect', '.', '[SEP]']

print(encoded_output.offsets)

Output: [(0, 0), (0, 8), (9, 11), (12, 13), (14, 19), (20, 22), (23, 29), (29, 30), (31, 34), (35, 44), (45, 46), (47, 50), (51, 59), (60, 64), (65, 72), (73, 75), (76, 81), (82, 83), (84, 86), (86, 89), (89, 92), (93, 101), (101, 102), (0, 0)]

Saving and Loading Tokenizer

The tokenizers library also allows us to easily save our tokenizer as a JSON file and load it for later use. This is helpful for large text datasets. We won’t have to initialize the tokenizer again and again.
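
Here is a minimal sketch of what this looks like, assuming the save/from_file serialization API introduced in this release (the file name tokenizer.json is just an example; note that from_file returns a plain Tokenizer object rather than the BertWordPieceTokenizer wrapper):

from tokenizers import Tokenizer

# save the full tokenizer (vocabulary + configuration) as a single JSON file
tokenizer.save("tokenizer.json")

# load it back later in one line
loaded_tokenizer = Tokenizer.from_file("tokenizer.json")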

Encode Pre-Tokenized Sequences

While working with text data, there are often situations where the data is already tokenized. However, it is not tokenized as per the desired tokenization scheme. In such a case, the tokenizers library can come in handy as it can encode pre-tokenized text sequences as well.

So, instead of the input sentence, we will pass the tokenized form of the sentence as input. Here, we have tokenized the sentence based on the space between two consecutive words:

print(sentence.split())

Output: ['Language', 'is', 'a', 'thing', 'of', 'beauty.', 'But', 'mastering', 'a', 'new', 'language', 'from', 'scratch', 'is', 'quite', 'a', 'daunting', 'prospect.']
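
We then pass this list of words to encode() with the is_pretokenized flag set. Here is a minimal sketch (encoded_pretokenized is just an illustrative variable name):

# encode the already-split sequence instead of the raw string
encoded_pretokenized = tokenizer.encode(sentence.split(), is_pretokenized=True)

print(encoded_pretokenized.tokens)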

Output: ['[CLS]', 'language', 'is', 'a', 'thing', 'of', 'beauty', '.', 'but', 'mastering', 'a', 'new', 'language', 'from', 'scratch', 'is', 'quite', 'a', 'da', '##unt', '##ing', 'prospect', '.', '[SEP]']

It turns out that this output is identical to the output we got when the input was a text string.

Speed Testing Tokenizers

As I mentioned above, tokenizers is a fast tokenization library. Let’s test it out on a large text corpus.

I will use the WikiText-103 dataset (181 MB in size). Let’s first download it and then unzip it:

!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip
!unzip wikitext-103-v1.zip

The unzipped data contains three files – wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens. We will use only the wiki.train.tokens file for benchmarking.
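
Here is a minimal sketch of reading the file in, assuming the archive extracted to a wikitext-103/ folder (wiki_text is just an illustrative variable name):

# read the training file into a list of text sequences (one per line)
with open("wikitext-103/wiki.train.tokens", "r", encoding="utf-8") as f:
    wiki_text = f.read().splitlines()

print(len(wiki_text))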

Output: 1801350

There are over 1.8 million sequences of text in the train set. That is quite a huge number. Let’s see how the tokenizers library deals with this much data. We will use encode_batch instead of encode because now we are going to tokenize more than one sequence.
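
The following is a minimal sketch of the timing code, assuming the sequences are in the wiki_text list created above (the exact time you get will depend on your hardware):

import time

# tokenize all the sequences in a single batched call and time it
start = time.time()
encoded_batch = tokenizer.encode_batch(wiki_text)
end = time.time()

print(end - start)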

Output: 218.2345

This is mind-blowing! It took just 218 seconds, or close to 3.5 minutes, to tokenize over 1.8 million text sequences. Many other tokenization methods would struggle to handle a corpus of this size, even on Colab.

Go ahead, try it out and let me know your experience using Hugging Face’s Tokenizers NLP library!

Conclusion

Tokenization is the process of breaking down text into smaller units called tokens, and it is a prerequisite for almost every NLP task. Hugging Face’s Tokenizers library offers fast, efficient tools for this: it is easy to get started with, and as the speed test above shows, it can handle even very large corpora in a matter of minutes.


Data Scientist at Analytics Vidhya with multidisciplinary academic background. Experienced in machine learning, NLP, graphs & networks. Passionate about learning and applying data science to solve real world problems.
