- Tokenization is a key (and mandatory) aspect of working with text data
- We’ll discuss the various nuances of tokenization, including how to handle Out-of-Vocabulary words (OOV)
Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect. If you’ve ever picked up a language that wasn’t your mother tongue, you’ll relate to this! There are so many layers to peel off and syntaxes to consider – it’s quite a challenge.
And that’s exactly the way with our machines. In order to get our computer to understand any text, we need to break that word down in a way that our machine can understand. That’s where the concept of tokenization in Natural Language Processing (NLP) comes in.
Simply put, we can’t work with text data if we don’t perform tokenization. Yes, it’s really that important!
And here’s the intriguing thing about tokenization – it’s not just about breaking down the text. Tokenization plays a significant role in dealing with text data. So in this article, we will explore the depths of tokenization in Natural Language Processing and how you can implement it in Python.
I recommend taking some time to go through the below resource if you’re new to NLP:
Table of Contents
- A Quick Rundown of Tokenization
- The True Reasons behind Tokenization
- Which Tokenization (Word, Character, or Subword) Should we Use?
- Implementing Tokenization– Byte Pair Encoding in Python
A Quick Rundown of Tokenization
Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP methods like Count Vectorizer and Advanced Deep Learning-based architectures like Transformers.
Tokens are the building blocks of Natural Language.
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.
For example, consider the sentence: “Never give up”.
The most common way of forming tokens is based on space. Assuming space as a delimiter, the tokenization of the sentence results in 3 tokens – Never-give-up. As each token is a word, it becomes an example of Word tokenization.
Similarly, tokens can be either characters or subwords. For example, let us consider “smarter”:
- Character tokens: s-m-a-r-t-e-r
- Subword tokens: smart-er
But then is this necessary? Do we really need tokenization to do all of this?
Note: If you are new to NLP, check out our NLP Course Online
The True Reasons behind Tokenization
As tokens are the building blocks of Natural Language, the most common way of processing the raw text happens at the token level.
For example, Transformer based models – the State of The Art (SOTA) Deep Learning architectures in NLP – process the raw text at the token level. Similarly, the most popular deep learning architectures for NLP like RNN, GRU, and LSTM also process the raw text at the token level.
As shown here, RNN receives and processes each token at a particular timestep.
Hence, Tokenization is the foremost step while modeling text data. Tokenization is performed on the corpus to obtain tokens. The following tokens are then used to prepare a vocabulary. Vocabulary refers to the set of unique tokens in the corpus. Remember that vocabulary can be constructed by considering each unique token in the corpus or by considering the top K Frequently Occurring Words.
Creating Vocabulary is the ultimate goal of Tokenization.
One of the simplest hacks to boost the performance of the NLP model is to create a vocabulary out of top K frequently occurring words.
Now, let’s understand the usage of the vocabulary in Traditional and Advanced Deep Learning-based NLP methods.
- Traditional NLP approaches such as Count Vectorizer and TF-IDF use vocabulary as features. Each word in the vocabulary is treated as a unique feature:
- In Advanced Deep Learning-based NLP architectures, vocabulary is used to create the tokenized input sentences. Finally, the tokens of these sentences are passed as inputs to the model
Which Tokenization Should you use?
As discussed earlier, tokenization can be performed on word, character, or subword level. It’s a common question – which Tokenization should we use while solving an NLP task? Let’s address this question here.
Word Tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a certain delimiter. Depending upon delimiters, different word-level tokens are formed. Pretrained Word Embeddings such as Word2Vec and GloVe comes under word tokenization.
But, there are few drawbacks to this.
Drawbacks of Word Tokenization
One of the major issues with word tokens is dealing with Out Of Vocabulary (OOV) words. OOV words refer to the new words which are encountered at testing. These new words do not exist in the vocabulary. Hence, these methods fail in handling OOV words.
But wait – don’t jump to any conclusions yet!
- A small trick can rescue word tokenizers from OOV words. The trick is to form the vocabulary with the Top K Frequent Words and replace the rare words in training data with unknown tokens (UNK). This helps the model to learn the representation of OOV words in terms of UNK tokens
- So, during test time, any word that is not present in the vocabulary will be mapped to a UNK token. This is how we can tackle the problem of OOV in word tokenizers.
- The problem with this approach is that the entire information of the word is lost as we are mapping OOV to UNK tokens. The structure of the word might be helpful in representing the word accurately. And another issue is that every OOV word gets the same representation
Another issue with word tokens is connected to the size of the vocabulary. Generally, pre-trained models are trained on a large volume of the text corpus. So, just imagine building the vocabulary with all the unique words in such a large corpus. This explodes the vocabulary!
This opens the door to Character Tokenization.
Character Tokenization splits apiece of text into a set of characters. It overcomes the drawbacks we saw above about Word Tokenization.
- Character Tokenizers handles OOV words coherently by preserving the information of the word. It breaks down the OOV word into characters and represents the word in terms of these characters
- It also limits the size of the vocabulary. Want to talk a guess on the size of the vocabulary? 26 since the vocabulary contains a unique set of characters
Drawbacks of Character Tokenization
Character tokens solve the OOV problem but the length of the input and output sentences increases rapidly as we are representing a sentence as a sequence of characters. As a result, it becomes challenging to learn the relationship between the characters to form meaningful words.
This brings us to another tokenization known as Subword Tokenization which is in between a Word and Character tokenization.
Subword Tokenization splits the piece of text into subwords (or n-gram characters). For example, words like lower can be segmented as low-er, smartest as smart-est, and so on.
Transformed based models – the SOTA in NLP – rely on Subword Tokenization algorithms for preparing vocabulary. Now, I will discuss one of the most popular Subword Tokenization algorithm known as Byte Pair Encoding (BPE).
Welcome to Byte Pair Encoding (BPE)
Byte Pair Encoding (BPE) is a widely used tokenization method among transformer-based models. BPE addresses the issues of Word and Character Tokenizers:
- BPE tackles OOV effectively. It segments OOV as subwords and represents the word in terms of these subwords
- The length of input and output sentences after BPE are shorter compared to character tokenization
BPE is a word segmentation algorithm that merges the most frequently occurring character or character sequences iteratively. Here is a step by step guide to learn BPE.
Steps to learn BPE
- Split the words in the corpus into characters after appending </w>
- Initialize the vocabulary with unique characters in the corpus
- Compute the frequency of a pair of characters or character sequences in corpus
- Merge the most frequent pair in corpus
- Save the best pair to the vocabulary
- Repeat steps 3 to 5 for a certain number of iterations
We will understand the steps with an example.
Consider a corpus:
1a) Append the end of the word (say </w>) symbol to every word in the corpus:
1b) Tokenize words in a corpus into characters:
2. Initialize the vocabulary:
3. Compute frequency:
4. Merge the most frequent pair:
5. Save the best pair:
Repeat steps 3-5 for every iteration from now. Let me illustrate for one more iteration.
3. Compute frequency:
4. Merge the most frequent pair:
5. Save the best pair:
After 10 iterations, BPE merge operations looks like:
Pretty straightforward, right?
Applying BPE to OOV words
But, how can we represent the OOV word at test time using BPE learned operations? Any ideas? Let’s answer this question now.
At test time, the OOV word is split into sequences of characters. Then the learned operations are applied to merge the characters into larger known symbols.
– Neural Machine Translation of Rare Words with Subword Units, 2016
Here is a step by step procedure for representing OOV words:
- Split the OOV word into characters after appending </w>
- Compute pair of character or character sequences in a word
- Select the pairs present in the learned operations
- Merge the most frequent pair
- Repeat steps 2 and 3 until merging is possible
Let’s see all this in action next!
Implementing Tokenization – Byte Pair Encoding in Python
We are now aware of how BPE works – learning and applying to the OOV words. So, its time to implement our knowledge in Python.
The python code for BPE is already available in the original paper itself (Neural Machine Translation of Rare Words with Subword Units, 2016)
We’ll consider a simple corpus to illustrate the idea of BPE. Nevertheless, the same idea applies to another corpus as well:
Tokenize the words into characters in the corpus and append </w> at the end of every word:
Compute the frequency of each word in the corpus:
Let’s define a function to compute the frequency of a pair of character or character sequences. It accepts the corpus and returns the pair with its frequency:
Now, the next task is to merge the most frequent pair in the corpus. We will define a function that accepts the corpus, best pair, and returns the modified corpus:
Next, its time to learn BPE operations. As BPE is an iterative procedure, we will carry out and understand the steps for one iteration. Let’s compute the frequency of bigrams:
Find the most frequent pair:
Output: (‘e’, ‘s’)
Finally, merge the best pair and save to the vocabulary:
We will follow similar steps for certain iterations:
The most interesting part is yet to come! That’s applying BPE to OOV words.
Applying BPE to OOV word
Now, we will see how to segment the OOV word into subwords using learned operations. Consider OOV word to be “lowest”:
Applying BPE to an OOV word is also an iterative process. We will implement the steps discussed earlier in the article:
As you can see here, the unknown word “lowest” is segmented as low-est.
Tokenization is a powerful way of dealing with text data. We saw a glimpse of that in this article and also implemented tokenization using Python.
Go ahead and try this out on any text-based dataset you have. The more you practice, the better your understanding of how tokenization works (and why it’s such a critical NLP concept). Feel free to reach out to me in the comments below if you have any queries or thoughts on this article.You can also read this article on our Mobile APP