Tutorial on Text Classification (NLP) using ULMFiT and fastai Library in Python

Prateek Joshi 30 Apr, 2020
9 min read


Natural Language Processing (NLP) needs no introduction in today’s world. It’s one of the most important fields of study and research, and has seen a phenomenal rise in interest in the last decade. The basics of NLP are widely known and easy to grasp. But things start to get tricky when the text data becomes huge and unstructured.

That’s where deep learning becomes so pivotal. Yes, I’m talking about deep learning for NLP tasks – a still relatively less trodden path. DL has proven its usefulness in computer vision tasks like image detection, classification and segmentation, but NLP applications like text generation and classification have long been considered fit for traditional ML techniques.

Source: Tryolabs

And deep learning has certainly made a very positive impact in NLP, as you’ll see in this article. We will focus on the concept of transfer learning and how we can leverage it in NLP to build incredibly accurate models using the popular fastai library. I will introduce you to the ULMFiT framework as well in the process.

Note: This article assumes basic familiarity with neural networks, deep learning and transfer learning. If you are new to deep learning, I would strongly recommend reading the following articles first:

  1. An Introductory Guide to Deep Learning and Neural Networks
  2. A Complete Guide on Getting Started with Deep Learning in Python


If you are a beginner in NLP, check out this video course with 7 real life projects.


Table of Contents

  1. The Advantage of Transfer Learning
  2. Pre-trained Models in NLP
  3. Overview of ULMFiT
  4. Understanding the Problem Statement
  5. System Setup: Google Colab
  6. Implementation in Python
  7. What’s Next?


The Advantage of Transfer Learning

I praised deep learning in the introduction, and deservedly so. However, everything comes at a price, and deep learning is no different. The biggest challenge in deep learning is the massive data requirements for training the models. It is difficult to find datasets of such huge sizes, and it is way too costly to prepare such datasets. It’s simply not possible for most organizations to come up with them.

Another obstacle is the high cost of GPUs needed to run advanced deep learning algorithms.

Thankfully, we can use pre-trained state-of-the-art deep learning models and tweak them to work for us. This is known as transfer learning. It is not as resource intensive as training a deep learning model from scratch and produces decent results even on small amounts of training data. This concept will be expanded upon later in the article when we implement our learning on quite a small dataset.


Pre-trained Models in NLP

Pre-trained models help data scientists start off on a new problem by providing an existing framework they can leverage. You don’t always have to build a model from scratch, especially when someone else has already put in their hard work and effort! And these pre-trained models have proven to be truly effective and useful in the field of computer vision (check out this article to see our pick of the top 10 pre-trained models in CV).

Their success is popularly attributed to the ImageNet dataset. It has over 14 million labeled images, more than 1 million of which also come with bounding boxes. This dataset was first published in 2009 and has since become one of the most sought-after image datasets ever. It led to several breakthroughs in deep learning research for computer vision, with transfer learning being one of them.

However, in NLP, transfer learning has not been as successful (as compared to computer vision, anyway). Of course we have pre-trained word embeddings like word2vec, GloVe, and fastText, but they are primarily used to initialize only the first layer of a neural network. The rest of the model still needs to be trained from scratch, and it requires a huge number of examples to perform well.

What do we really need in this case? Like the aforementioned computer vision models, we require a pre-trained model for NLP which can be fine-tuned and used on different text datasets. One of the contenders for pre-trained natural language models is the Universal Language Model Fine-tuning for Text Classification, or ULMFiT (Howard and Ruder, 2018).

How does it work? How widespread are its applications? How can we make it work in Python? In the rest of this article, we will put ULMFiT to the test by solving a text classification problem and check how well it performs.


Overview of ULMFiT

Proposed by fast.ai’s Jeremy Howard and NUI Galway Insight Center’s Sebastian Ruder, ULMFiT is essentially a method to enable transfer learning for any NLP task and achieve great results. All this, without having to train models from scratch. That got your attention, didn’t it?

ULMFiT achieves state-of-the-art results using novel techniques like:

  • Discriminative fine-tuning
  • Slanted triangular learning rates, and
  • Gradual unfreezing

This method involves fine-tuning a pre-trained language model (LM), trained on the WikiText-103 dataset, to a new dataset in such a manner that it does not forget what it previously learned.
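
To make the three techniques listed above a little more concrete, here is a minimal sketch in plain Python. It follows the formulas described in the ULMFiT paper (a slanted triangular schedule controlled by cut_frac, ratio and lr_max, and a learning rate that is divided by 2.6 for each lower layer), but the function names, the guard against a zero-length warm-up and the way layer groups are indexed are my own illustrative assumptions. This is not fastai code.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t under a slanted triangular schedule.

    The rate rises linearly for the first cut_frac of training,
    then decays linearly back towards lr_max / ratio.
    """
    # max(1, ...) is an added guard (assumption) so very short runs do not divide by zero
    cut = max(1, math.floor(total_steps * cut_frac))
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio


def discriminative_lrs(lr_top, n_groups, factor=2.6):
    """One learning rate per layer group, lowest group first.

    The top group uses lr_top; each group below it uses the rate
    of the group above divided by factor.
    """
    return [lr_top / factor ** (n_groups - 1 - i) for i in range(n_groups)]


def gradual_unfreezing(n_groups):
    """Trainable layer groups per epoch: start with the last group, add one lower group each epoch."""
    return [list(range(n_groups - 1 - epoch, n_groups)) for epoch in range(n_groups)]


if __name__ == "__main__":
    for step in (0, 100, 1000):
        print(step, slanted_triangular_lr(step, total_steps=1000))
    print(discriminative_lrs(0.01, n_groups=3))
    print(gradual_unfreezing(3))
```

The short warm-up followed by a long decay lets the model move quickly to a good region of the parameter space and then refine slowly, while the smaller rates for lower layers protect the general language knowledge stored there.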

Language modeling (covered in this course) can be considered a counterpart of ImageNet for NLP. It captures general properties of a language and provides an enormous amount of data which can be fed to other downstream NLP tasks. That is why language modeling has been chosen as the source task for ULMFiT.

I highly encourage you to go through the original ULMFiT paper to understand more about how it works, the way Jeremy and Sebastian went about deriving it, and parse through other interesting details.


Understanding the Problem Statement

Alright, enough theoretical concepts – let’s get our hands dirty by implementing ULMFiT on a dataset and see what the hype is all about.

Our objective here is to fine-tune a pre-trained model and use it for text classification on a new dataset. We will implement ULMFiT in this process. The interesting thing here is that this new data is quite small in size (<1000 labeled instances). A neural network model trained from scratch would overfit on such a small dataset. Hence, I would like to see whether ULMFiT does a great job at this task as promised in the paper.

Dataset: We will use the 20 Newsgroup dataset available in sklearn.datasets. As the name suggests, it includes text documents from 20 different newsgroups.


System Setup: Google Colab

We will perform the Python implementation on Google Colab instead of our local machines. If you have never worked on Colab before, then consider this a bonus! Colab, or Google Colaboratory, is a free cloud service for running Python. One of the best things about it is that it provides GPUs and TPUs for free and hence, it is pretty handy for training deep learning models.

Some major benefits of Colab:

  • Completely free of cost
  • Comes with pretty decent hardware configuration
  • Connected to your Google Drive
  • Very well integrated with Github
  • And many more features you’ll discover as you play around with it

So, it doesn’t matter even if you have a system with pretty ordinary hardware specs – as long as you have a steady internet connection, you are good to go. The only other requirement is that you must have a Google account. Let’s get started!


Implementation in Python

First, sign in to your Google account. Then select ‘NEW PYTHON 3 NOTEBOOK’. This notebook is similar to your typical Jupyter Notebook, so you won’t have much trouble working on it if you are familiar with the Jupyter environment. A Colab notebook looks something like the screenshot below:

Then go to Runtime, select Change runtime type, then select GPU as the hardware accelerator to utilise GPU for free.


Import Required Libraries

Most of the popular libraries like pandas, numpy, matplotlib, nltk, and keras, come preinstalled with Colab. However, 2 libraries, PyTorch and fastai v1 (which we need in this exercise), will need to be installed manually. So, let’s load them into our Colab environment:

!pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu92/torch_nightly.html
!pip install fastai
# import libraries
import fastai
from fastai import *
from fastai.text import * 
import pandas as pd
import numpy as np
from functools import partial
import io
import os

Load the dataset. The fetch_20newsgroups function downloads it the first time it is called.

from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

Let’s create a dataframe consisting of the text documents and their corresponding labels (newsgroup names).

df = pd.DataFrame({'label':dataset.target, 'text':dataset.data})
df.shape

(11314, 2)

We’ll convert this into a binary classification problem by selecting only 2 out of the 20 labels present in the dataset. We will select labels 1 and 10 which correspond to ‘comp.graphics’ and ‘rec.sport.hockey’, respectively.

df = df[df['label'].isin([1,10])]
df = df.reset_index(drop = True)

Let’s have a quick look at the target distribution.

df['label'].value_counts()

10    600
1     584
Name: label, dtype: int64

The distribution looks pretty even. Accuracy would be a good evaluation metric to use in this case.
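
To see why accuracy is a reasonable metric here, it helps to compute the accuracy of a naive baseline that always predicts the larger class. The sketch below uses only the two counts printed above; the function name is a hypothetical helper.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# Counts taken from the target distribution shown above
counts = {10: 600, 1: 584}
baseline = majority_baseline_accuracy(counts)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

With nearly balanced classes the baseline sits close to 50%, so an accuracy well above that reflects real learning rather than class imbalance.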


Data Preprocessing

It’s always a good practice to feed clean data to your models, especially when the data comes in the form of unstructured text. Let’s clean our text by retaining only alphabets and removing everything else.

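# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['text'] = df['text'].str.replace("[^a-zA-Z]", " ", regex=True)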
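
As a quick sanity check of what this pattern does, the same regular expression can be applied to a single string with Python’s built-in re module. The sample sentence is made up for illustration.

```python
import re

sample = "Hockey: 3-2 win!"  # made-up example sentence

# Every character that is not an English letter is replaced by a space
cleaned = re.sub("[^a-zA-Z]", " ", sample)

# Splitting on whitespace afterwards collapses the runs of spaces
tokens = cleaned.split()
print(tokens)
```

Note that digits are dropped too, so scores and version numbers disappear from the text.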

Now, we will get rid of the stopwords from our text data. If you have never used stopwords before, then you will have to download them from the nltk package as I’ve shown below:

import nltk
nltk.download('stopwords')  # one-time download of the stopword lists

from nltk.corpus import stopwords 
stop_words = stopwords.words('english')
# tokenization 
tokenized_doc = df['text'].apply(lambda x: x.split())

# remove stop-words 
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

# de-tokenization 
detokenized_doc = [] 
for i in range(len(df)): 
    t = ' '.join(tokenized_doc[i]) 
    detokenized_doc.append(t)

df['text'] = detokenized_doc
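
The same three steps can be tried on a single sentence without pandas. The stopword set below is a tiny hand-picked stand-in for the NLTK list (an assumption for illustration), and the helper name is hypothetical. Because the membership test is case-sensitive and the stopwords in this sketch are lowercase, a capitalised word such as ‘The’ is kept unless the text is lowercased first.

```python
def remove_stopwords(text, stop_words):
    """Split on whitespace, drop stopwords, and join the tokens back together."""
    tokens = text.split()                                    # tokenization
    kept = [tok for tok in tokens if tok not in stop_words]  # stopword removal
    return " ".join(kept)                                    # de-tokenization


# Tiny stand-in for stopwords.words('english'); not the real NLTK list
toy_stop_words = {"the", "is", "in", "a"}

as_is = remove_stopwords("The puck is in the net", toy_stop_words)
lowered = remove_stopwords("The puck is in the net".lower(), toy_stop_words)
print(as_is)
print(lowered)
```

Whether removing stopwords helps at all is worth checking on your own data, since a pre-trained language model has learned from text that contains them.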

Now let’s split our cleaned dataset into training and validation sets in a 60:40 ratio.

from sklearn.model_selection import train_test_split

# split data into training and validation set
df_trn, df_val = train_test_split(df, stratify = df['label'], test_size = 0.4, random_state = 12)
df_trn.shape, df_val.shape
((710, 2), (474, 2))
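
The two shapes follow directly from the 60:40 split of the 1,184 filtered documents. As a worked check, the sketch below reproduces the arithmetic, assuming (as scikit-learn does for a fractional test_size) that the validation size is rounded up and the training set takes the remainder. The function name is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_val) for a fractional test_size."""
    n_val = math.ceil(test_size * n_samples)  # validation size is rounded up
    n_train = n_samples - n_val               # training set gets the rest
    return n_train, n_val


n_total = 600 + 584  # documents left after keeping labels 1 and 10
sizes = split_sizes(n_total, test_size=0.4)
print(sizes)
```

So the model is fine-tuned on only 710 labeled documents, which is what makes this a good test of transfer learning.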


Before proceeding further, we’ll need to prepare our data for the language model and for the classification model separately. The good news? This can be done quite easily using the fastai library:

# Language model data
data_lm = TextLMDataBunch.from_df(train_df = df_trn, valid_df = df_val, path = "")

# Classifier model data
data_clas = TextClasDataBunch.from_df(path = "", train_df = df_trn, valid_df = df_val, vocab=data_lm.train_ds.vocab, bs=32)


Fine-Tuning the Pre-Trained Model and Making Predictions

We can use the data_lm object we created earlier to fine-tune a pre-trained language model. We can create a learner object, ‘learn’, that will directly create a model, download the pre-trained weights, and be ready for fine-tuning:

learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.7)

The one cycle policy, together with cyclical momentum, allows the model to be trained at higher learning rates and to converge faster. It also provides some form of regularisation. We won’t go into the depth of how this works as this article is about learning the implementation. However, if you wish to know more about the one cycle policy, then feel free to refer to this excellent paper by Leslie Smith: “A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay”.

# train the learner object with learning rate = 1e-2
learn.fit_one_cycle(1, 1e-2)

Total time: 00:09

epoch train_loss valid_loss accuracy
1 7.803613 6.306118 0.139369
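
For intuition about what fit_one_cycle does with the learning rate, here is a simplified, self-contained sketch of a one cycle schedule: the rate warms up from a small value to the maximum and then anneals down to a much smaller value. The cosine interpolation and the default values (warm-up over 30% of the steps, start at lr_max / 25, end at lr_max / 250000) are assumptions chosen to resemble fastai v1’s defaults. This is not the library’s implementation, and the cyclical momentum part is left out.

```python
import math


def cosine_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)


def one_cycle_lr(step, total_steps, lr_max=1e-2, pct_start=0.3,
                 div_factor=25.0, final_div=25.0 * 1e4):
    """Learning rate at a given step of a simplified one cycle schedule."""
    warmup_steps = max(1, int(total_steps * pct_start))
    if step < warmup_steps:
        # Phase 1: warm up from lr_max / div_factor to lr_max
        return cosine_anneal(lr_max / div_factor, lr_max, step / warmup_steps)
    # Phase 2: anneal from lr_max down to lr_max / final_div
    decay_steps = max(1, total_steps - warmup_steps)
    return cosine_anneal(lr_max, lr_max / final_div, (step - warmup_steps) / decay_steps)


if __name__ == "__main__":
    for step in (0, 30, 100):
        print(step, one_cycle_lr(step, total_steps=100))
```

The second argument we passed to fit_one_cycle (1e-2) plays the role of lr_max in this sketch.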


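We will save this encoder to use it for classification later.

# 'ft_enc' is just a file name for the saved encoder; any name works
learn.save_encoder('ft_enc')

Let’s now use the data_clas object we created earlier to build a classifier with our fine-tuned encoder.

learn = text_classifier_learner(data_clas, drop_mult=0.7)
# load the encoder saved above into the classifier
learn.load_encoder('ft_enc')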

We will again try to fit our model.

learn.fit_one_cycle(1, 1e-2)

Total time: 00:32

epoch train_loss valid_loss accuracy
1 0.534962 0.377784 0.907173


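Wow! The classifier reaches an accuracy of about 91% on the validation set after a single epoch, and the validation loss is even lower than the training loss. (The 14% accuracy reported earlier was for the language model’s next-word prediction, so the two numbers measure different tasks.) This is a pretty outstanding performance on a small dataset. You can even get the predictions for the validation set out of the learner object by using the below code: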

# get predictions
preds, targets = learn.get_preds()

predictions = np.argmax(preds, axis = 1)
pd.crosstab(predictions, targets)
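
col_0    0    1
row_0
0      181    1
1       53  239

Here the rows are the predicted classes and the columns are the actual classes. Class 0 is the original label 1 (comp.graphics) and class 1 is the original label 10 (rec.sport.hockey).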
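
Taking the rows of this table as predicted classes and the columns as actual classes, which is the layout pd.crosstab(predictions, targets) produces, we can derive a few metrics by hand. The sketch below copies the four counts and computes accuracy and per-class recall; the function names are hypothetical. The accuracy computed from these counts need not match the epoch log exactly, since the outputs shown in this article may come from different runs.

```python
# Counts copied from the crosstab above: confusion[predicted][actual]
confusion = [
    [181, 1],    # predicted class 0
    [53, 239],   # predicted class 1
]


def accuracy(matrix):
    """Share of all examples on the diagonal (predicted == actual)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall(matrix, cls):
    """Share of examples whose actual class is cls that were predicted as cls."""
    actual_total = sum(row[cls] for row in matrix)
    return matrix[cls][cls] / actual_total


print(f"accuracy: {accuracy(confusion):.3f}")
print(f"recall for class 0: {recall(confusion, 0):.3f}")
print(f"recall for class 1: {recall(confusion, 1):.3f}")
```

Most of the errors are documents whose actual class is 0 but which were predicted as class 1.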



What’s Next?

With the emergence of methods like ULMFiT, we are moving towards more generalizable NLP systems. These models would be able to perform multiple tasks at once. Moreover, these models would not be limited just to the English language, but to several other languages spoken across the globe.

We also have upcoming techniques like ELMo, a new word embedding technique, and BERT, a new language representation model designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. These techniques have already achieved state-of-the-art results on many NLP tasks. Hence, the golden period for NLP has just arrived and it is here to stay.


End Notes

I hope you found this article helpful. However, there are still a lot more things to explore in ULMFiT using the fastai library which I encourage you guys to go after. If you have any recommendations/suggestions, then feel free to let me know in the comments section below. Also, try to use ULMFiT on different problems and domains of your choice and see how the results pan out.

Code: You can find the complete code here.

Thanks for reading and happy learning!


Data Scientist at Analytics Vidhya with multidisciplinary academic background. Experienced in machine learning, NLP, graphs & networks. Passionate about learning and applying data science to solve real world problems.

Responses From Readers


Abhay Kumar 29 Nov, 2018

Nice article. Thanks for sharing.

Ashwin Perti 29 Nov, 2018

really such a nice article

Chris 30 Nov, 2018

Nice tutorial. I just walked through it, but I wondered why you removed stop words? I think there is a belief in NLP that it's always good to remove stop words, but this is often not true. I tried re-running the tutorial but skipped the remove stop words part and I got a 2.4% increase in accuracy. I thought you might want to try that and see if you see the same increase.

Abhay Kumar 30 Nov, 2018

Hey, I was able to run successfully on Google Colab. But I am not able to run the same code with the required libraries installed on my local machine. It gives the following error for the line below:

learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.7)

Traceback (most recent call last):
  File "transfer_learning_classification_nlp_rir_classification.py", line 114, in <module>
    learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.7)
  File "/home/abhay/mml/venv-nerapi/lib/python3.6/site-packages/fastai/text/learner.py", line 135, in language_model_learner
    model_path = untar_data(pretrained_model, data=False)
  File "/home/abhay/mml/venv-nerapi/lib/python3.6/site-packages/fastai/datasets.py", line 108, in untar_data
    tarfile.open(fname, 'r:gz').extractall(dest.parent)
  File "/home/abhay/mml/venv-nerapi/lib/python3.6/tarfile.py", line 1587, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/home/abhay/mml/venv-nerapi/lib/python3.6/tarfile.py", line 1641, in gzopen
    t = cls.taropen(name, mode, fileobj, **kwargs)
  File "/home/abhay/mml/venv-nerapi/lib/python3.6/tarfile.py", line 1617, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/home/abhay/mml/venv-nerapi/lib/python3.6/tarfile.py", line 1480, in __init__
    self.firstmember = self.next()
  File "/home/abhay/mml/venv-nerapi/lib/python3.6/tarfile.py", line 2310, in next
    raise ReadError("empty file")
tarfile.ReadError: empty file

What could be the possible reason? Please help me with this. Thanks

Gasto 02 Dec, 2018

congrats on the post! have you seen a more advance example with text classification? other pre-train model, or a grid search? All the best!

Ravichandran Annaswamy 02 Dec, 2018

Awesome tutorial

M Jahangeer Qureshi 03 Dec, 2018

It runs fine for the given data but I seem to be running into problems when I try a different dataset, can you help me out here?

Sachin Kalsi 03 Dec, 2018

How we can extend this to multi-label classification problems?

Sandhiya 05 Dec, 2018

Hi, Thanks for the article. I am trying to do a text classification problem. Is there a way in which I can initialize some other pre-trained model apart from the WikiText103 or any language model that I have trained? learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.7) Please help.

anass nasserallah 07 Dec, 2018

so helpful, thank you a lot for helping people. I have a question about: 1- how can I detect the language of a phrase 2- the semantic similarity between two sentences or two words in different languages 3- measuring the importance of a sentence or a word in the text.


vary new learning today ...keep sharing

Doug 06 Feb, 2019

As everyone has already stated...great article! It's amazing how transfer learning is changing NLP. Jeremy and Rachel have done some great work on ULMFit and your clear example is a great demonstration Prateek! What I'm missing, and I don't know why I can't understand this but, is on your last bit of code on getting the predictions. It's unclear to me on what you're returning here. Those are supposed to be prediction values from the validation set, but I don't know how to read that. Can you clarify what col and rows 0,1 represent? What are those numbers? Also, what if I wanted to test on a new entry and get a prediction back? For example, what if I wanted to pass in: "I have to get new skates as the season is about to begin", which would predict rec.sport.hockey as the correct class. Could you please give a code example of this Prateek?

doug 06 Feb, 2019

I can't believe I didn't try: learn.predict('I have to get new skates as the season is about to begin') Please disregard the last portion of my previous post. However, I am still confused on your last bit of code on getting the predictions per my earlier question: What I'm missing, and I don't know why I can't understand this but, is on your last bit of code on getting the predictions. It's unclear to me on what you're returning here. Those are supposed to be prediction values from the validation set, but I don't know how to read that. Can you clarify what col and rows 0,1 represent? What are those numbers?

Shubham 21 Feb, 2019

# Language model data data_lm = TextLMDataBunch.from_df(train_df = df_trn, valid_df = df_val, path = "") is showing following error- NameError: name 'TextLMDataBunch' is not defined. Can you please help me out??

Jerry 02 Mar, 2019

Great article. I've been trying to run similar code for the full 20 newsgroup dataset but have been getting subpar results. Are there any pointers on how I can improve the accuracy? I've been running the language model fit_one_cycle one layer at a time and similarly with the classifier model. That got me close to 70% accuracy. Do you know any other ways to improve the accuracy? Any ideas would be great. Thanks again for such an awesome article.

Philomene 13 May, 2019

Thanks a lot for the article ! Maybe one question, once your classifier is fine tuned, how can you save it and load it later to apply the model to a new dataset? Thanks !

Emile 17 May, 2019

Thanks so much for this great article ! Any ideas of how to find the tokens that contribute the most to the classification of each class ? Thanks again !

Himanshu Kriplani 27 May, 2019

Nice article. Can you explain me about LM fine tuning? According to Paper, 1. General Pretuning [Used a pretrained model] 2. LM fine tuning 3. Classifier fine tuning For 2 and 3, you have used the same data. I guess LM fine tuning wont require a labelled dataset as its not a classification task. The data you have used is fine for 3rd task . But Why would Language modelling require Labels? Can you explain a bit about this?

Parul Mishra 17 Jun, 2019

@Prateek Everytime I run the program ,the epochs run again.Please tell me how to use the learn.predict to predict the classification for a given text using the above pretrained model for testing point of view.I have already done the training in pycharm and don't want to train again.It doesn't provide checkpoints so how I can use your model for testing only