Are you fascinated by the amount of text data available on the internet? Are you looking for ways to work with this text data but aren’t sure where to begin? Machines, after all, recognize numbers, not the letters of our language. And that can be a tricky landscape to navigate in machine learning. One fundamental step in working with text data in Python is tokenization in Python.
So how can we manipulate and clean this text data to build a model? The answer lies in the wonderful world of Natural Language Processing (NLP).
Solving an NLP problem is a multi-stage process. We need to clean the unstructured text data first before we can even think about getting to the modeling stage. Cleaning the data consists of a few key steps:
In this article, we will talk about the very first step – tokenization. We will first see what tokenization is and why it’s required in NLP. We will then look at six unique ways to perform tokenization in Python.
This article has no prerequisites. Anyone with an interest in NLP or data science will be able to follow along. If you’re looking for an end-to-end resource for learning NLP, you should check out our comprehensive course:
Tokenization is one of the most common tasks when it comes to working with text data. But what does the term ‘tokenization’ actually mean?
Tokenization in Python is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units are called tokens.
Check out the below image to visualize this definition:
The tokens could be words, numbers or punctuation marks. In tokenization, smaller units are created by locating word boundaries. Wait – what are word boundaries?
These are the ending point of a word and the beginning of the next word. These tokens are considered as a first step for stemming and lemmatization (the next stage in text preprocessing which we will cover in the next article).
Difficult? Do not worry! The 21st century has made learning and knowledge accessibility easy. Any Natural Language Processing Course can be used to learn them easily.
Three simple types of tokenization in Python:
I want you to think about the English language here. Pick up any sentence you can think of and hold that in your mind as you read this section. This will help you understand the importance of tokenization in a much easier manner.
Before processing a natural language, we need to identify the words that constitute a string of characters. That’s why tokenization is the most basic step to proceed with NLP (text data). This is important because the meaning of the text could easily be interpreted by analyzing the words present in the text.
Let’s take an example. Consider the below string:
“This is a cat.”
What do you think will happen after we perform tokenization on this string? We get [‘This’, ‘is’, ‘a’, cat’].
There are numerous uses of doing this. We can use this tokenized form to:
And so on. We can extract a lot more information which we’ll discuss in detail in future articles. For now, it’s time to dive into the meat of this article – the different methods of performing tokenization in NLP.
We are going to look at six unique ways we can perform tokenization in python on text data. I have provided the Python code for each method so you can follow along on your own machine.
Let’s start with the split() method as it is the most basic one. It returns a list of strings after breaking the given string by the specified separator. By default, split() breaks a string at each space. We can change the separator to anything. Let’s check it out.
Word Tokenization
Output : ['Founded', 'in', '2002,', 'SpaceX’s', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars.', 'In', '2008,', 'SpaceX’s', 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth.']
Sentence Tokenization
This is similar to word tokenization. Here, we study the structure of sentences in the analysis. A sentence usually ends with a full stop (.), so we can use “.” as a separator to break the string:
Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars', 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']
One major drawback of using Python’s split() method is that we can use only one separator at a time. Another thing to note – in word tokenization, split() did not consider punctuation as a separate token.
First, let’s understand what a regular expression is. It is basically a special character sequence that helps you match or find other strings or sets of strings using that sequence as a pattern.
We can use the re library in Python to work with regular expression. This library comes preinstalled with the Python installation package.
Now, let’s perform word tokenization and sentence tokenization keeping RegEx in mind.
Word Tokenization
Python Code:
The re.findall() function finds all the words that match the pattern passed on it and stores it in the list.
The “\w
” represents “any word character” which usually means alphanumeric (letters, numbers) and underscore (_). ‘+’ means any number of times. So [\w’]+ signals that the code should find all the alphanumeric characters until any other character is encountered.
Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.', 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']
Here, we have an edge over the split() method as we can pass multiple separators at the same time. In the above code, we used the re.compile() function wherein we passed [.?!]. This means that sentences will split as soon as any of these characters are encountered.
Interested in reading more about RegEx? The below resources will get you started with Regular Expressions in NLP:
Now, this is a library you will appreciate the more you work with text data. NLTK, short for Natural Language ToolKit, is a library written in Python for symbolic and statistical Natural Language Processing.
You can install NLTK using the below code:
pip install --user -U nltk
NLTK contains a module called tokenize() which further classifies into two sub-categories:
Let’s see both of these one-by-one.
Word Tokenization
Output: ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’', 's', 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']
Sentence Tokenization
Output: ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.', 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']
I love the spaCy library. I can’t remember the last time I didn’t use it when I was working on an NLP project. It is just that useful.
spaCy is an open-source library for advanced Natural Language Processing (NLP). It supports over 49+ languages and provides state-of-the-art computation speed.
To install Spacy in Linux:
pip install -U spacy python -m spacy download en
To install it on other operating systems, go through this link.
So, let’s see how we can utilize the awesomeness of spaCy to perform tokenization. We will use spacy.lang.en which supports the English language.
Word Tokenization
Sentence Tokenization
Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.', 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']
spaCy is quite fast as compared to other libraries while performing NLP tasks (yes, even NLTK). I encourage you to listen to the below DataHack Radio podcast to know the story behind how spaCy was created and where you can use it:
And here’s an in-depth tutorial to get you started with spaCy:
Keras! One of the hottest deep learning frameworks in the industry right now. It is an open-source neural network library for Python. Keras is super easy to use and can also run on top of TensorFlow.
In the NLP context, we can use Keras for cleaning the unstructured text data that we typically collect.
You can install Keras on your machine using just one line of code:
pip install Keras
Let’s get cracking. To perform word tokenization using Keras, we use the text_to_word_sequence method from the keras.preprocessing.text class.
Let’s see Keras in action.
Word Tokenization
Output : ['founded', 'in', '2002', 'spacex’s', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', 'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on', 'mars', 'in', '2008', 'spacex’s', 'falcon', '1', 'became', 'the', 'first', 'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'earth']
Keras lowers the case of all the alphabets before tokenizing them. That saves us quite a lot of time as you can imagine!
The final tokenization method we will cover here is using the Gensim library. It is an open-source library for unsupervised topic modeling and natural language processing and is designed to automatically extract semantic topics from a given document.
Here’s how you can install Gensim:
pip install gensim
We can use the gensim.utils class to import the tokenize method for performing word tokenization.
Word Tokenization
Outpur : ['Founded', 'in', 'SpaceX', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', 'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on', 'Mars', 'In', 'SpaceX', 's', 'Falcon', 'became', 'the', 'first', 'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth']
Sentence Tokenization
To perform sentence tokenization, we use the split_sentences method from the gensim.summerization.texttcleaner class:
Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet ', 'species by building a self-sustaining city on Mars.', 'In 2008, SpaceX’s Falcon 1 became the first privately developed ', 'liquid-fuel launch vehicle to orbit the Earth.']
You might have noticed that Gensim is quite strict with punctuation. It splits whenever a punctuation is encountered. In sentence splitting as well, Gensim tokenized the text on encountering “\n” while other libraries ignored it.
Tokenization is a critical step in the overall NLP pipeline. We cannot simply jump into the model building part without cleaning the text first.
In this article, we saw six different methods of tokenization in Python (word as well as a sentence) from a given text. There are other ways as well but these are good enough to get you started on the topic.
I’ll be covering other text cleaning steps like removing stopwords, part-of-speech tagging, and recognizing named entities in my future posts. Till then, keep learning!
A. In Python, tokenization in NLP can be accomplished using various libraries such as NLTK, SpaCy, or the tokenization module in the Transformers library. These libraries offer functions to split text into tokens, such as words or subwords, based on different rules and language-specific considerations. Tokenization plays a crucial role in various NLP tasks, including text preprocessing and feature extraction.
A. To create tokens in Python, you can use the split()
method available for strings, which splits a string into a list of substrings based on a specified delimiter. For example, to tokenize a sentence into individual words:
sentence = “Hello, how are you?”
tokens = sentence.split()
print(tokens)
This will output:
[‘Hello,’, ‘how’, ‘are’, ‘you?’]
You can further preprocess the tokens by removing punctuation, converting to lowercase, or applying other transformations as per your requirements.
In NLTK, tokenization means splitting text into smaller parts like words or sentences.
In coding, tokenization is breaking down source code into smaller elements like keywords or punctuation.