What is Tokenization in NLP? Here’s All You Need To Know

Aravindpai Pai 17 Mar, 2024 • 12 min read

Introduction

Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect. If you’ve ever picked up a language that wasn’t your mother tongue, you’ll relate to this! There are so many layers to peel off and syntaxes to consider – it’s quite a challenge.

And that’s exactly the way with our machines. In order to get our computer to understand any text, we need to break that text down in a way that our machine can understand. That’s where the concept of tokenization in Natural Language Processing (NLP) comes in.

Simply put, we can’t work with text data if we don’t perform tokenization. Yes, it’s really that important!


And here’s the intriguing thing about tokenization – it’s not just about breaking down the text. Tokenization plays a significant role in dealing with text data. So in this article, we will explore the depths of tokenization in Natural Language Processing and how you can implement it in Python.


A Quick Rundown of Tokenization

Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP methods like Count Vectorizer and Advanced Deep Learning-based architectures like Transformers.

Tokens are the building blocks of Natural Language.

Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.

For example, consider the sentence: “Never give up”.

The most common way of forming tokens is based on space. Assuming space as a delimiter, the tokenization of the sentence results in 3 tokens – Never-give-up. As each token is a word, it becomes an example of Word tokenization.

Similarly, tokens can be either characters or subwords. For example, let us consider “smarter”:

  1. Character tokens: s-m-a-r-t-e-r
  2. Subword tokens: smart-er
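The three kinds of tokens above can be sketched in a few lines of plain Python. This is only an illustration: the function names are my own, and the subword split is written by hand here, because a real subword tokenizer has to learn its splits from data.

```python
def word_tokenize_on_space(text):
    """Word tokenization, assuming whitespace is the only delimiter."""
    return text.split()


def char_tokenize(text):
    """Character tokenization: every character becomes a token."""
    return list(text)


word_tokens = word_tokenize_on_space("Never give up")
char_tokens = char_tokenize("smarter")

# Hand-written for illustration; subword splits are normally learned.
subword_tokens = ["smart", "er"]

print(word_tokens)
print(char_tokens)
print(subword_tokens)
```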

But then is this necessary? Do we really need tokenization to do all of this?


What is tokenization?

Tokenization is the process of breaking down a piece of text, like a sentence or a paragraph, into individual words or “tokens.” These tokens are the basic building blocks of language, and tokenization helps computers understand and process human language by splitting it into manageable units.

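For example, tokenizing the sentence “I love ice cream” on whitespace would result in four tokens: “I,” “love,” “ice,” and “cream.” A tokenizer that knows about multi-word expressions could instead keep “ice cream” together as a single token. It’s a fundamental step in natural language processing and text analysis tasks.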

The True Reasons behind Tokenization

As tokens are the building blocks of Natural Language, the most common way of processing the raw text happens at the token level.

For example, Transformer based models – the State of The Art (SOTA) Deep Learning architectures in NLP – process the raw text at the token level. Similarly, the most popular deep learning architectures for NLP like RNN, GRU, and LSTM also process the raw text at the token level.

Figure: Working of a Recurrent Neural Network

As the figure shows, an RNN receives and processes one token at each timestep.

Hence, tokenization is the foremost step while modeling text data. Tokenization is performed on the corpus to obtain tokens. These tokens are then used to prepare a vocabulary. Vocabulary refers to the set of unique tokens in the corpus. Remember that the vocabulary can be constructed by considering each unique token in the corpus or by considering only the top K most frequent words.

Creating Vocabulary is the ultimate goal of Tokenization.

One of the simplest tricks that often helps an NLP model is to build the vocabulary out of only the top K most frequent words.
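Here is a minimal sketch of building a vocabulary, either from every unique token or from the top K most frequent ones. The tiny corpus and the `build_vocab` helper are made up for illustration, and tokens are formed by simply splitting on whitespace.

```python
from collections import Counter


def build_vocab(corpus, top_k=None):
    """Return unique tokens, most frequent first; keep only top_k if given."""
    counts = Counter(
        token for sentence in corpus for token in sentence.lower().split()
    )
    return [token for token, _ in counts.most_common(top_k)]


# Hypothetical toy corpus.
corpus = ["never give up", "never give in", "give it time"]

full_vocab = build_vocab(corpus)           # every unique token
top2_vocab = build_vocab(corpus, top_k=2)  # only the 2 most frequent tokens

print(full_vocab)
print(top2_vocab)
```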

Now, let’s understand the usage of the vocabulary in Traditional and Advanced Deep Learning-based NLP methods.

  • Traditional NLP approaches such as Count Vectorizer and TF-IDF use vocabulary as features. Each word in the vocabulary is treated as a unique feature:
Traditional NLP: Count Vectorizer
  • In Advanced Deep Learning-based NLP architectures, vocabulary is used to create the tokenized input sentences. Finally, the tokens of these sentences are passed as inputs to the model
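Both uses of the vocabulary can be sketched without any library. The first function mimics the idea behind a count vectorizer (one count per vocabulary word), and the second turns a sentence into the token ids a deep learning model would consume. The vocabulary and helper names are assumptions for this example, and every token is assumed to be in the vocabulary.

```python
def count_vector(sentence, vocab):
    """Bag-of-words features: how often each vocabulary word occurs."""
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocab]


def to_token_ids(sentence, vocab):
    """Map each token to its index in the vocabulary (all tokens assumed known)."""
    token_to_id = {token: idx for idx, token in enumerate(vocab)}
    return [token_to_id[token] for token in sentence.lower().split()]


vocab = ["give", "never", "up"]  # hypothetical vocabulary
sentence = "Never never give up"

features = count_vector(sentence, vocab)
token_ids = to_token_ids(sentence, vocab)

print(features)
print(token_ids)
```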

Which Tokenization Should you use?

As discussed earlier, tokenization can be performed on word, character, or subword level. It’s a common question – which Tokenization should we use while solving an NLP task? Let’s address this question here.

Word Tokenization

Word Tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a certain delimiter. Depending upon delimiters, different word-level tokens are formed. Pretrained word embeddings such as Word2Vec and GloVe work on word-level tokens.

But there are a few drawbacks to this.

Drawbacks of Word Tokenization

One of the major issues with word tokens is dealing with Out Of Vocabulary (OOV) words. OOV words refer to the new words which are encountered at testing. These new words do not exist in the vocabulary. Hence, these methods fail in handling OOV words.

But wait – don’t jump to any conclusions yet!

  • A small trick can rescue word tokenizers from OOV words. The trick is to form the vocabulary with the top K frequent words and replace the rare words in the training data with an unknown token (UNK). This helps the model learn a representation for OOV words in terms of the UNK token.
  • So, during test time, any word that is not present in the vocabulary will be mapped to the UNK token. This is how we can tackle the problem of OOV in word tokenizers.
  • The problem with this approach is that the entire information of the word is lost, as we are mapping every OOV word to the UNK token. The structure of the word might be helpful in representing the word accurately. Another issue is that every OOV word gets the same representation.
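The UNK trick, and its main weakness, can be seen in a short sketch. The `<UNK>` string, the toy training sentences, and the helper names are all assumptions made for this example.

```python
from collections import Counter

UNK = "<UNK>"  # assumed spelling of the unknown token


def build_vocab_with_unk(train_sentences, top_k):
    """Keep the top_k most frequent words and reserve one slot for UNK."""
    counts = Counter(tok for s in train_sentences for tok in s.split())
    vocab = {tok for tok, _ in counts.most_common(top_k)}
    vocab.add(UNK)
    return vocab


def map_to_vocab(sentence, vocab):
    """Replace every word outside the vocabulary with UNK."""
    return [tok if tok in vocab else UNK for tok in sentence.split()]


train = ["the cat sat", "the dog sat", "the cat ran"]
vocab = build_vocab_with_unk(train, top_k=3)

seen = map_to_vocab("the cat sat", vocab)
unseen_a = map_to_vocab("the bird sat", vocab)
unseen_b = map_to_vocab("the fish sat", vocab)

print(seen)
print(unseen_a)
print(unseen_b)
```

Notice that “bird” and “fish” end up with exactly the same token, which is the loss of information described in the last bullet.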

Another issue with word tokens is connected to the size of the vocabulary. Generally, pre-trained models are trained on a large volume of the text corpus. So, just imagine building the vocabulary with all the unique words in such a large corpus. This explodes the vocabulary!

This opens the door to Character Tokenization.

Character Tokenization

Character Tokenization splits a piece of text into a set of characters. It overcomes the drawbacks we saw above about Word Tokenization.

  • Character Tokenizers handle OOV words coherently by preserving the information of the word. They break down the OOV word into characters and represent the word in terms of these characters
  • It also limits the size of the vocabulary. Want to take a guess on the size of the vocabulary? For lowercase English text it is just 26 letters (plus any digits, punctuation, and whitespace you choose to keep), since the vocabulary contains only the unique set of characters

Drawbacks of Character Tokenization

Character tokens solve the OOV problem but the length of the input and output sentences increases rapidly as we are representing a sentence as a sequence of characters. As a result, it becomes challenging to learn the relationship between the characters to form meaningful words.
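A quick sketch makes the trade-off concrete: the character vocabulary stays tiny, but the sequence gets much longer than its word-level counterpart. The example sentence is arbitrary, and the space is kept as a token here, which is an assumption of this sketch.

```python
text = "never give up"

word_tokens = text.split()
char_tokens = list(text)              # the space is kept as a token too
char_vocab = sorted(set(char_tokens))

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(char_vocab), "unique characters in the vocabulary")
```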

This brings us to another tokenization known as Subword Tokenization which is in between a Word and Character tokenization.


Tokenization Libraries and Tools in Python

Python provides several powerful libraries and tools that make it easy to perform tokenization and text preprocessing for natural language processing tasks. Here are some of the most popular ones:

NLTK (Natural Language Toolkit)

NLTK is a suite of libraries and programs for symbolic and statistical natural language processing. It includes a wide range of tokenizers for different needs:

  • word_tokenize: Tokenizes a string into word tokens.
  • sent_tokenize: Tokenizes a string into sentence tokens.
  • WordPunctTokenizer: Tokenizes a string into word and punctuation tokens.
  • TweetTokenizer: Tokenizer designed specifically for tokenizing tweets.

NLTK tokenizers cover different token types such as words, sentences, and punctuation. Once the text is tokenized, stopwords can be filtered out using the stopword lists that ship with NLTK’s corpus data.
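To get a feel for what a word-and-punctuation tokenizer does without installing anything, here is a standard-library sketch. It is similar in spirit to splitting text into runs of word characters and runs of punctuation; it is not NLTK’s own code, and the function name is made up.

```python
import re

# Runs of word characters, or runs of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return WORD_OR_PUNCT.findall(text)


tokens = word_punct_tokenize("Don't give up!")
print(tokens)
```

Note how the apostrophe splits “Don't” into three tokens. Whether that is acceptable depends on the task, which is one reason libraries offer several tokenizers.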

spaCy

spaCy is a popular open-source library for advanced natural language processing in Python. It provides highly efficient tokenization that accounts for linguistic structure and context:

  • Multi-lingual tokenization support for over 50 languages.
  • Rule-based tokenization driven by language-specific rules for prefixes, suffixes, infixes, and special-case exceptions.
  • Tokenization that preserves URLs, emails, emoticons as single tokens.
  • Easy customization to add new rules for tokenizing domain-specific text.

spaCy’s tokenization forms the base for its advanced NLP capabilities like named entity recognition, part-of-speech tagging, etc.

Hugging Face Tokenizers

The Hugging Face Tokenizers library provides access to tokenizers from popular transformer models used for tasks like text generation, summarization, translation, etc. It includes:

  • BERT’s WordPiece tokenizer
  • GPT-2’s Byte-Level BPE tokenizer
  • T5’s SentencePiece tokenizer
  • And tokenizers from many other transformers

This library allows you to use the same tokenization as pre-trained models, ensuring consistency between tokenization during pre-training and fine-tuning.

Other Libraries

There are also tokenization utilities in other Python data science and NLP libraries like:

  • Gensim: Has basic tokenizers as part of its data preprocessing tools.
  • Polyglot: Provides word, line, and character tokenizers for over 165 languages.
  • PyThaiNLP: Library for tokenizing and processing Thai text.

The choice of tokenization library depends on the specific NLP task, performance requirements, and whether you need special handling for languages, domains or data types.

Subword Tokenization

Subword Tokenization splits the piece of text into subwords (or n-gram characters). For example, words like lower can be segmented as low-er, smartest as smart-est, and so on.

Transformer-based models – the SOTA in NLP – rely on Subword Tokenization algorithms for preparing vocabulary. Now, I will discuss one of the most popular Subword Tokenization algorithms, known as Byte Pair Encoding (BPE).

Welcome to Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) is a widely used tokenization method among transformer-based models. BPE addresses the issues of Word and Character Tokenizers:

  • BPE tackles OOV effectively. It segments OOV as subwords and represents the word in terms of these subwords
  • The input and output sentences after BPE are shorter than with character tokenization

BPE is a word segmentation algorithm that iteratively merges the most frequent pair of adjacent characters or character sequences. Here is a step by step guide to learn BPE.

Steps to learn BPE

  1. Split the words in the corpus into characters after appending </w>
  2. Initialize the vocabulary with unique characters in the corpus
  3. Compute the frequency of each pair of adjacent characters or character sequences in the corpus
  4. Merge the most frequent pair in corpus
  5. Save the best pair to the vocabulary
  6. Repeat steps 3 to 5 for a certain number of iterations

We will understand the steps with an example.

Consider a corpus.

1a) Append the end of the word (say </w>) symbol to every word in the corpus.

1b) Tokenize words in a corpus into characters.

2. Initialize the vocabulary.

Iteration 1:

3. Compute frequency.

4. Merge the most frequent pair.

5. Save the best pair.

Repeat steps 3-5 for every iteration from now. Let me illustrate for one more iteration.

Iteration 2:

3. Compute frequency.

4. Merge the most frequent pair.

5. Save the best pair.

After 10 iterations, the BPE merge operations are complete.

Pretty straightforward, right?

Applying BPE to OOV words

But, how can we represent the OOV word at test time using BPE learned operations? Any ideas? Let’s answer this question now.

At test time, the OOV word is split into sequences of characters. Then the learned operations are applied to merge the characters into larger known symbols.

– Neural Machine Translation of Rare Words with Subword Units, 2016

Here is a step by step procedure for representing OOV words:

  1. Split the OOV word into characters after appending </w>
  2. List the pairs of adjacent characters or character sequences in the word
  3. Select the pairs present in the learned operations
  4. Merge the selected pair that was learned earliest (it was the most frequent one during training)
  5. Repeat steps 2 to 4 as long as a merge is possible

Let’s see all this in action next!

Implementing Tokenization – Byte Pair Encoding in Python

We are now aware of how BPE works – learning and applying to the OOV words. So, it’s time to implement our knowledge in Python.

The Python code for BPE is already available in the original paper itself (Neural Machine Translation of Rare Words with Subword Units, 2016).

Reading Corpus

We’ll consider a simple corpus to illustrate the idea of BPE. Nevertheless, the same idea applies to any other corpus as well:

Text Preparation

Tokenize the words into characters in the corpus and append </w> at the end of every word:


Learning BPE

Compute the frequency of each word in the corpus:
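The text preparation and the word frequency count might look like the sketch below. Since the original corpus file is not reproduced here, I am assuming the small toy corpus used in the BPE paper (“low”, “lower”, “newest”, “widest” with the counts shown); the `prepare_corpus` helper is my own name.

```python
from collections import Counter


def prepare_corpus(text):
    """Count each word, then store it as space-separated characters plus </w>."""
    word_counts = Counter(text.split())
    return {
        " ".join(list(word) + ["</w>"]): freq
        for word, freq in word_counts.items()
    }


# Assumed toy corpus: low x5, lower x2, newest x6, widest x3.
text = "low " * 5 + "lower " * 2 + "newest " * 6 + "widest " * 3

vocab = prepare_corpus(text)
for word, freq in vocab.items():
    print(word, freq)
```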


Let’s define a function to compute the frequency of each pair of adjacent characters or character sequences. It accepts the corpus and returns every pair with its frequency:

Now, the next task is to merge the most frequent pair in the corpus. We will define a function that accepts the corpus, best pair, and returns the modified corpus:
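A sketch of the two functions described above follows. It keeps the general shape of the listing in the BPE paper, but the names, comments, and the toy corpus (the same assumed one: low, lower, newest, widest) are mine.

```python
import re
from collections import defaultdict


def get_stats(vocab):
    """Return the frequency of every adjacent symbol pair in the corpus."""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_vocab(pair, vocab):
    """Merge every occurrence of `pair` and return the modified corpus."""
    merged = "".join(pair)
    # Match the two symbols only when they are whole, space-delimited symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


# Assumed toy corpus, already split into characters with </w> appended.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

pair_stats = get_stats(vocab)
best_pair = max(pair_stats, key=pair_stats.get)  # ties: first pair seen wins
merged_vocab = merge_vocab(best_pair, vocab)

print(best_pair, pair_stats[best_pair])
print(merged_vocab)
```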

Next, it’s time to learn BPE operations. As BPE is an iterative procedure, we will carry out and understand the steps for one iteration. Let’s compute the frequency of bigrams:

Find the most frequent pair:

Output: (‘e’, ‘s’)

Finally, merge the best pair and save to the vocabulary:


We will follow similar steps for certain iterations:
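Putting the iteration into a loop could look like this compact sketch. To keep it self-contained, words are stored as tuples of symbols instead of space-separated strings, which is a design choice of this sketch rather than the paper’s listing. The toy word counts and the choice of 10 merges are assumptions.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merge operations from word counts."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Step 3: frequency of every adjacent symbol pair.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Steps 4 and 5: pick the most frequent pair and save it.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word with the best pair merged into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab


# Assumed toy corpus counts.
merges, final_vocab = learn_bpe(
    {"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10
)
for step, pair in enumerate(merges, start=1):
    print(step, pair)
```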


The most interesting part is yet to come! That’s applying BPE to OOV words.

Applying BPE to OOV word

Now, we will see how to segment the OOV word into subwords using learned operations. Consider the OOV word to be “lowest”:

Applying BPE to an OOV word is also an iterative process. We will implement the steps discussed earlier in the article:
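Here is one way this could be sketched. The list of merge operations is hard-coded and assumed to be what 10 iterations of BPE learn on the toy corpus of low, lower, newest, and widest. When several learned pairs are present in the word, the sketch merges the one that was learned earliest.

```python
# Assumed merge operations, in the order they were learned.
MERGES = [
    ("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w"),
    ("n", "e"), ("ne", "w"), ("new", "est</w>"), ("low", "</w>"), ("w", "i"),
]


def apply_bpe(word, merges):
    """Segment a word into subwords using learned merge operations."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word) + ["</w>"]
    while len(symbols) > 1:
        # Pairs in the word that also appear in the learned operations.
        candidates = [p for p in zip(symbols, symbols[1:]) if p in ranks]
        if not candidates:
            break
        best = min(candidates, key=ranks.get)  # earliest learned merge first
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


oov_segments = apply_bpe("lowest", MERGES)
print(oov_segments)
```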


As you can see here, the unknown word “lowest” is segmented as low-est.

Advanced Tokenization Techniques

While basic word and character level tokenization are common, there are several advanced tokenization algorithms and methods designed to handle the complexities of natural language:

Byte-Level Byte-Pair Encoding (BPE)

An extension of the original BPE, Byte-Level BPE operates on a byte-level rather than character-level. It encodes each token as a sequence of bytes rather than characters. This allows it to:

  • Better handle Unicode characters and multi-lingual text
  • Avoid maintaining separate vocabularies for each language
  • Achieve open-vocabulary by representing any unseen word as a sequence of subword tokens

Byte-Level BPE is used by models like GPT-2 for text generation.
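The byte-level starting point is easy to see in plain Python. This sketch only shows the base representation (UTF-8 bytes) that byte-level BPE would start merging from; it does not perform any merges, and the example words are arbitrary.

```python
def to_byte_tokens(text):
    """Represent text as its UTF-8 byte values (each in the range 0-255)."""
    return list(text.encode("utf-8"))


ascii_bytes = to_byte_tokens("cafe")
accent_bytes = to_byte_tokens("café")  # 'é' takes two bytes in UTF-8

print(ascii_bytes)
print(accent_bytes)

# Any string, in any language, maps onto the same 256 possible byte values,
# so the base vocabulary never needs an unknown token.
assert all(0 <= b <= 255 for b in accent_bytes)
```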

SentencePiece Tokenization

SentencePiece is a language-independent tokenizer that works directly on raw text. It treats the input as a plain sequence of characters, whitespace included, so no language-specific pre-tokenization is needed. The vocabulary of pieces (words, subwords, or even characters) is learned from the training text, using either BPE or a unigram language model.

Key features of SentencePiece include:

  • Trains directly from raw sentences, with the vocabulary size fixed in advance
  • Supports encoding/decoding for any arbitrary language
  • Provides lossless tokenization: the original text, including whitespace, can be recovered from the pieces

SentencePiece tokenization is used in models like T5, ALBERT and XLNet.

WordPiece Tokenization

Developed at Google and used in their BERT model, WordPiece is a subword tokenization technique that iteratively creates a vocabulary of “wordpieces” – common words and subwords occurring in the training data.

The WordPiece algorithm starts with a single wordpiece for each character and iteratively:

  1. Scores the pairs of adjacent wordpieces, favoring the pair whose merge most increases the likelihood of the training data rather than simply the most frequent pair
  2. Merges that pair to create a new wordpiece
  3. Repeats until reaching the desired vocabulary size

This allows representing rare/unknown words as sequences of common wordpieces.
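Once a wordpiece vocabulary exists, a word is typically split greedily, longest match first, with continuation pieces marked by a “##” prefix as in BERT. The sketch below shows only this encoding side, not the training. The tiny vocabulary is hypothetical and the function name is my own.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into wordpieces."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical vocabulary for illustration.
vocab = {"smart", "low", "##er", "##est"}

smartest_pieces = wordpiece_tokenize("smartest", vocab)
lowest_pieces = wordpiece_tokenize("lowest", vocab)
unknown_pieces = wordpiece_tokenize("xyz", vocab)

print(smartest_pieces, lowest_pieces, unknown_pieces)
```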

Unigram Language Model Tokenization

Used in models like XLNet, this is a data-driven subword tokenization method that creates tokens based on the statistics of the training data. It constructs a vocabulary of tokens (words/subwords) that maximizes the likelihood of the training data.

Some key aspects are:

  • Likelihood-based tokenization using unigram language model
  • Constructs a vocabulary tailored to the target task/data
  • Better handles intra-word splitting compared to BPE
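The likelihood-based idea can be sketched as follows: given a probability for each token, choose the segmentation of a word whose tokens have the highest combined probability. The probabilities below are invented for illustration, and learning them from data (the actual training procedure) is not shown.

```python
import math


def unigram_segment(word, log_probs):
    """Most likely segmentation under a unigram model (dynamic programming)."""
    n = len(word)
    best_score = [0.0] + [-math.inf] * n  # best log-probability up to each position
    back = [0] * (n + 1)                  # where the last piece of the best path starts
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end], back[end] = score, start
    if best_score[n] == -math.inf:
        return None  # the word cannot be built from the vocabulary
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Made-up token probabilities.
probs = {"low": 0.2, "est": 0.2, "lo": 0.1, "west": 0.1,
         "l": 0.05, "o": 0.05, "w": 0.05, "e": 0.05, "s": 0.05, "t": 0.05}
log_probs = {token: math.log(p) for token, p in probs.items()}

segmentation = unigram_segment("lowest", log_probs)
print(segmentation)
```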

These advanced techniques aim to strike the right balance between vocabulary size and handling rare/unknown words for robust language modeling.

Conclusion

Tokenization is a powerful way of dealing with text data. We saw a glimpse of that in this article and also implemented tokenization using Python.

Go ahead and try this out on any text-based dataset you have. The more you practice, the better your understanding of how tokenization works (and why it’s such a critical NLP concept). Feel free to reach out to me in the comments below if you have any queries or thoughts on this article.

Frequently Asked Questions

Q1. What do you mean by tokenization?

A. Tokenization is the process of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols, and other meaningful elements called tokens. The process allows structured data to be created from unstructured data like bodies of text. Tokenization is a fundamental step in many natural language processing and text mining tasks.

Q2. What is an example of tokenization?

A. Let’s take the sentence “The quick brown fox jumps over the lazy dog.” Tokenizing this sentence could result in the following tokens:
[“The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “.”]

Q3. What is tokenization in NLP?

A. In natural language processing, tokenization is a critical first step in preparing text for further processing. Different tokenizers handle different cases – some split only on whitespace, others are more advanced and separate punctuation, numbers, hashtags, and more. The tokens produced are then used as input for tasks like part-of-speech tagging, parsing, named entity recognition, etc.

Q4. What is tokenization in digital banking?

A. In digital banking and finance, tokenization means something different from tokenization in NLP. It refers to the process of substituting a sensitive data element (like a credit card number) with a non-sensitive equivalent (a token) that has no extrinsic or exploitable meaning or value. This token maps to and can be de-tokenized to retrieve the original sensitive data element. Tokenization helps increase data security by allowing sensitive information to be replaced when storing or transmitting it.


Responses From Readers

Satpal 26 May, 2020

It would be good if you also provide a link to download the "sample.txt" file.

Inhyeok Yoo 16 Jul, 2020

Hi. Thanks for the wonderful posting. May I translate this article into Korean and post it? I will clarify that I just translate it and URL of original post and the author's name.

Inhyeok Yoo 16 Jul, 2020

Hi. I want to know what I should choose between subword tokenization and character-level tokenization. Which one is SOTA?
