Tutorial on Text Classification (NLP) using ULMFiT and fastai Library in Python

Prateek Joshi — Updated On April 30th, 2020


Natural Language Processing (NLP) needs no introduction in today’s world. It’s one of the most important fields of study and research, and has seen a phenomenal rise in interest in the last decade. The basics of NLP are widely known and easy to grasp. But things start to get tricky when the text data becomes huge and unstructured.

That’s where deep learning becomes so pivotal. Yes, I’m talking about deep learning for NLP tasks, a still relatively less trodden path. DL has proven its usefulness in computer vision tasks like object detection, image classification and segmentation, but NLP applications like text generation and classification have long been considered the domain of traditional ML techniques.

Source: Tryolabs

And deep learning has certainly made a very positive impact in NLP, as you’ll see in this article. We will focus on the concept of transfer learning and how we can leverage it in NLP to build incredibly accurate models using the popular fastai library. I will introduce you to the ULMFiT framework as well in the process.

Note: This article assumes basic familiarity with neural networks, deep learning and transfer learning. If you are new to deep learning, I would strongly recommend reading the following articles first:

  1. An Introductory Guide to Deep Learning and Neural Networks
  2. A Complete Guide on Getting Started with Deep Learning in Python


If you are a beginner in NLP, check out this video course with 7 real life projects.


Table of Contents

  1. The Advantage of Transfer Learning
  2. Pre-trained Models in NLP
  3. Overview of ULMFiT
  4. Understanding the Problem Statement
  5. System Setup: Google Colab
  6. Implementation in Python
  7. What’s Next?


The Advantage of Transfer Learning

I praised deep learning in the introduction, and deservedly so. However, everything comes at a price, and deep learning is no different. Its biggest challenge is the massive amount of data required to train models. Datasets of that size are hard to find and far too costly to prepare, and most organizations simply cannot come up with them.

Another obstacle is the high cost of GPUs needed to run advanced deep learning algorithms.

Thankfully, we can use pre-trained state-of-the-art deep learning models and tweak them to work for us. This is known as transfer learning. It is not as resource intensive as training a deep learning model from scratch and produces decent results even on small amounts of training data. This concept will be expanded upon later in the article when we implement our learning on quite a small dataset.


Pre-trained Models in NLP

Pre-trained models help data scientists start off on a new problem by providing an existing framework they can leverage. You don’t always have to build a model from scratch, especially when someone else has already put in their hard work and effort! And these pre-trained models have proven to be truly effective and useful in the field of computer vision (check out this article to see our pick of the top 10 pre-trained models in CV).

Their success is popularly attributed to the ImageNet dataset. It has over 14 million labeled images, more than a million of which also come with bounding boxes. The dataset was first published in 2009 and has since become one of the most sought-after image datasets ever. It led to several breakthroughs in deep learning research for computer vision, transfer learning being one of them.

However, transfer learning in NLP has not been as successful (as compared to computer vision, anyway). Of course, we have pre-trained word embeddings like word2vec, GloVe, and fastText, but they are primarily used to initialize only the first layer of a neural network. The rest of the model still has to be trained from scratch, which requires a huge number of examples to perform well.
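To make the "first layer only" point concrete, here is a minimal, hypothetical sketch: the toy vectors below stand in for real word2vec/GloVe files, and only the embedding matrix receives pre-trained values while everything downstream of it would still start from random weights.

```python
import numpy as np

# Hypothetical pre-trained vectors standing in for real word2vec/GloVe entries
pretrained = {"cat": np.array([0.1, 0.2]), "dog": np.array([0.3, 0.4])}
vocab = ["cat", "dog", "xyzzy"]   # "xyzzy" is out-of-vocabulary
emb_dim = 2

rng = np.random.default_rng(0)
embedding = np.zeros((len(vocab), emb_dim))
for i, word in enumerate(vocab):
    # copy the pre-trained vector when available, otherwise random-init
    embedding[i] = pretrained.get(word, rng.normal(scale=0.1, size=emb_dim))

print(embedding.shape)  # (3, 2): only this first layer starts pre-trained
```

Every layer above this embedding matrix still sees random initial weights, which is exactly why so many labeled examples are needed.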

What do we really need in this case? Like the aforementioned computer vision models, we require a pre-trained model for NLP which can be fine-tuned and used on different text datasets. One of the contenders for pre-trained natural language models is Universal Language Model Fine-tuning for Text Classification, or ULMFiT (arXiv:1801.06146 [cs.CL]).

How does it work? How widespread are its applications? How can we make it work in Python? In the rest of this article, we will put ULMFiT to the test by solving a text classification problem and see how well it performs.


Overview of ULMFiT

Proposed by fast.ai’s Jeremy Howard and NUI Galway Insight Center’s Sebastian Ruder, ULMFiT is essentially a method to enable transfer learning for any NLP task and achieve great results. All this, without having to train models from scratch. That got your attention, didn’t it?

ULMFiT achieves state-of-the-art results using novel techniques like:

  • Discriminative fine-tuning
  • Slanted triangular learning rates, and
  • Gradual unfreezing
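Of these, the slanted triangular learning rate is the easiest to sketch numerically: a short linear warm-up followed by a long linear decay. Below is a small, self-contained implementation of the schedule as the paper describes it (the defaults cut_frac=0.1 and ratio=32 come from the paper; fastai’s internals may differ in detail):

```python
import numpy as np

def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular schedule from the ULMFiT paper: a brief
    linear increase, then a long linear decay.
    t: current training step, T: total number of steps."""
    cut = int(np.floor(T * cut_frac))
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

T = 100
lrs = [slanted_triangular_lr(t, T) for t in range(T)]
# the rate peaks at step cut = 10 and then decays toward lr_max / ratio
```

The idea is to let the model quickly converge toward a suitable region of parameter space, then refine its parameters at a gently decaying rate.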

This method involves fine-tuning a pre-trained language model (LM), trained on the WikiText-103 dataset, on a new dataset in such a manner that it does not forget what it previously learned.

Language modeling can be considered the ImageNet counterpart for NLP: it captures general properties of a language and provides an enormous amount of data which can be fed to other downstream NLP tasks. That is why language modeling was chosen as the source task for ULMFiT.
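To make "language modeling" concrete: a language model simply learns to predict the next token given the preceding context. The toy bigram counter below illustrates the task itself (ULMFiT uses an AWD-LSTM, not counts; this is only to show what is being predicted):

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# count bigrams: how often each word follows each context word
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Most likely next word under the bigram counts."""
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # cat  ('cat' follows 'the' twice, 'mat' once)
```

Because next-word prediction needs no labels, any large text corpus (like WikiText-103) provides effectively unlimited training signal for the source task.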

I highly encourage you to go through the original ULMFiT paper to understand more about how it works, the way Jeremy and Sebastian went about deriving it, and parse through other interesting details.


Understanding the Problem Statement

Alright, enough theoretical concepts – let’s get our hands dirty by implementing ULMFiT on a dataset and see what the hype is all about.

Our objective here is to fine-tune a pre-trained model and use it for text classification on a new dataset. We will implement ULMFiT in this process. The interesting thing here is that this new data is quite small in size (<1000 labeled instances). A neural network model trained from scratch would overfit on such a small dataset. Hence, I would like to see whether ULMFiT does a great job at this task as promised in the paper.

Dataset: We will use the 20 Newsgroup dataset available in sklearn.datasets. As the name suggests, it includes text documents from 20 different newsgroups.


System Setup: Google Colab

We will run the Python implementation on Google Colab instead of our local machines. If you have never worked on Colab before, consider this a bonus! Colab, or Google Colaboratory, is a free cloud service for running Python. One of the best things about it is that it provides GPUs and TPUs for free, which makes it pretty handy for training deep learning models.

Some major benefits of Colab:

  • Completely free of cost
  • Comes with pretty decent hardware configuration
  • Connected to your Google Drive
  • Very well integrated with Github
  • And many more features you’ll discover as you play around with it.

So, it doesn’t matter even if you have a system with pretty ordinary hardware specs – as long as you have a steady internet connection, you are good to go. The only other requirement is that you must have a Google account. Let’s get started!


Implementation in Python

First, sign in to your Google account and select ‘NEW PYTHON 3 NOTEBOOK’. This notebook behaves like a typical Jupyter Notebook, so you won’t have much trouble working in it if you are familiar with the Jupyter environment.

Then go to Runtime, select Change runtime type, and choose GPU as the hardware accelerator to use the free GPU.


Import Required Libraries

Most of the popular libraries, like pandas, numpy, matplotlib, nltk, and keras, come preinstalled with Colab. However, two libraries we need for this exercise, PyTorch and fastai v1, have to be installed manually. So, let’s load them into our Colab environment:

# install PyTorch and fastai (this article uses the fastai v1 API)
!pip install torch
!pip install fastai
# import libraries
import fastai
from fastai import *
from fastai.text import * 
import pandas as pd
import numpy as np
from functools import partial
import io
import os

Next, fetch the 20 Newsgroup dataset from sklearn:

from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

Let’s create a dataframe consisting of the text documents and their corresponding labels (the newsgroup names).

df = pd.DataFrame({'label': dataset.target, 'text': dataset.data})
df.shape

(11314, 2)

We’ll convert this into a binary classification problem by selecting only 2 of the 20 labels present in the dataset. We will select labels 1 and 10, which correspond to ‘comp.graphics’ and ‘rec.sport.hockey’, respectively.

df = df[df['label'].isin([1,10])]
df = df.reset_index(drop = True)

Let’s have a quick look at the target distribution.

df['label'].value_counts()

10    600
1     584
Name: label, dtype: int64

The distribution looks pretty even. Accuracy would be a good evaluation metric to use in this case.


Data Preprocessing

It’s always good practice to feed clean data to your models, especially when the data comes in the form of unstructured text. Let’s clean our text by retaining only alphabetic characters and removing everything else.

df['text'] = df['text'].str.replace("[^a-zA-Z]", " ")

Now, we will get rid of the stopwords from our text data. If you have never used stopwords before, then you will have to download them from the nltk package as I’ve shown below:

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords 
stop_words = stopwords.words('english')
# tokenization 
tokenized_doc = df['text'].apply(lambda x: x.split())

# remove stop-words 
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

# de-tokenization
detokenized_doc = []
for i in range(len(df)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)

df['text'] = detokenized_doc

Now let’s split our cleaned dataset into training and validation sets in a 60:40 ratio.

from sklearn.model_selection import train_test_split

# split data into training and validation set
df_trn, df_val = train_test_split(df, stratify = df['label'], test_size = 0.4, random_state = 12)
df_trn.shape, df_val.shape
((710, 2), (474, 2))


Before proceeding further, we’ll need to prepare our data for the language model and for the classification model separately. The good news? This can be done quite easily using the fastai library:

# Language model data
data_lm = TextLMDataBunch.from_df(train_df = df_trn, valid_df = df_val, path = "")

# Classifier model data
data_clas = TextClasDataBunch.from_df(path = "", train_df = df_trn, valid_df = df_val, vocab=data_lm.train_ds.vocab, bs=32)


Fine-Tuning the Pre-Trained Model and Making Predictions

We can use the data_lm object we created earlier to fine-tune a pre-trained language model. We can create a learner object, ‘learn’, that will directly create a model, download the pre-trained weights, and be ready for fine-tuning:

learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.7)

Note: in more recent fastai v1 releases this signature changed; there you would pass an architecture instead of a URL, e.g. learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.7).

The one cycle policy and cyclic momentum allow the model to be trained at higher learning rates and to converge faster; the one cycle policy also provides a form of regularisation. We won’t go into the depths of how this works, as this article focuses on the implementation. However, if you wish to know more about the one cycle policy, feel free to refer to this excellent paper by Leslie Smith, “A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay”.
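For intuition, here is a rough, illustrative sketch of a one-cycle-style schedule: the learning rate warms up and then anneals back down while momentum moves in the opposite direction. The shape and constants below (div=25, momentum between 0.85 and 0.95) are common choices for illustration, not fastai’s exact internals:

```python
import numpy as np

def one_cycle(t, T, lr_max=1e-2, div=25, mom_max=0.95, mom_min=0.85):
    """Illustrative one-cycle-style schedule: the learning rate rises
    from lr_max/div up to lr_max and back, while momentum mirrors it."""
    pct = t / T
    p = pct / 0.5 if pct < 0.5 else (1 - pct) / 0.5
    cos = (1 - np.cos(np.pi * p)) / 2      # smooth 0 -> 1 -> 0 over the cycle
    lr = lr_max / div + cos * (lr_max - lr_max / div)
    mom = mom_max - cos * (mom_max - mom_min)
    return lr, mom

# the learning rate peaks mid-cycle exactly when momentum bottoms out
lr_mid, mom_mid = one_cycle(50, 100)
```

Lowering momentum while the learning rate is high keeps updates from overshooting, which is part of why the cycle tolerates such large rates.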

# train the learner object with learning rate = 1e-2
learn.fit_one_cycle(1, 1e-2)

Total time: 00:09

epoch train_loss valid_loss accuracy
1 7.803613 6.306118 0.139369


We will save this fine-tuned encoder to use it for classification later (‘ft_enc’ is simply the file name chosen here):

learn.save_encoder('ft_enc')

Let’s now use the data_clas object we created earlier to build a classifier with our fine-tuned encoder.

learn = text_classifier_learner(data_clas, drop_mult=0.7)
learn.load_encoder('ft_enc')

We will again try to fit our model.

learn.fit_one_cycle(1, 1e-2)

Total time: 00:32

epoch train_loss valid_loss accuracy
1 0.534962 0.377784 0.907173


Wow! We got a whopping increase in accuracy, and the validation loss is even lower than the training loss. That is a pretty outstanding performance on such a small dataset. You can also get predictions for the validation set out of the learner object using the code below:

# get predictions
preds, targets = learn.get_preds()

predictions = np.argmax(preds, axis = 1)
pd.crosstab(predictions, targets)
col_0    0    1
0      181    1
1       53  239
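The crosstab above is a confusion matrix: rows are the predicted classes and columns are the actual classes. Standard metrics can be read straight off such a matrix; using the counts printed above:

```python
import numpy as np

# rows = predicted class, columns = actual class (counts from the crosstab)
cm = np.array([[181,   1],
               [ 53, 239]])

accuracy = np.trace(cm) / cm.sum()       # correct predictions / total
precision_1 = cm[1, 1] / cm[1].sum()     # of predicted 1s, share correct
recall_1 = cm[1, 1] / cm[:, 1].sum()     # of actual 1s, share found
print(round(accuracy, 3))  # 0.886
```

The off-diagonal cells show where the classifier errs: here most mistakes are class-1 documents predicted when the true class was 0.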



What’s Next?

With the emergence of methods like ULMFiT, we are moving towards more generalizable NLP systems. Such models will be able to perform multiple tasks at once, and they will not be limited to English but will extend to several other languages spoken across the globe.

We also have upcoming techniques like ELMo, a new word embedding technique, and BERT, a new language representation model designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. These techniques have already achieved state-of-the-art results on many NLP tasks. Hence, the golden period for NLP has just arrived and it is here to stay.


End Notes

I hope you found this article helpful. There is still a lot more to explore in ULMFiT using the fastai library, and I encourage you to go after it. If you have any recommendations or suggestions, feel free to let me know in the comments section below. Also, try ULMFiT on different problems and domains of your choice and see how the results pan out.

Code: You can find the complete code here.

Thanks for reading and happy learning!

About the Author

Prateek Joshi

Data Scientist at Analytics Vidhya with multidisciplinary academic background. Experienced in machine learning, NLP, graphs & networks. Passionate about learning and applying data science to solve real world problems.
