A Step-by-Step NLP Guide to Learn ELMo for Extracting Features from Text

Prateek Joshi 21 Feb, 2024
11 min read


I work on different Natural Language Processing (NLP) problems (the perks of being a data scientist!). Each NLP problem is a unique challenge in its own way. That’s just a reflection of how complex, beautiful and wonderful the human language is.

But one thing that has always been a thorn in an NLP practitioner’s side is the inability (of machines) to understand the true meaning of a sentence. Yes, I’m talking about context. Traditional NLP techniques and frameworks were great when asked to perform basic tasks. Things quickly went south when we tried to add context to the situation.

The NLP landscape has significantly changed in the last 18 months or so. NLP frameworks like Google’s BERT and Zalando’s Flair are able to parse through sentences and grasp the context in which they were written.

Embeddings from Language Models (ELMo)

One of the biggest breakthroughs in this regard came thanks to ELMo, a state-of-the-art NLP framework developed by AllenNLP. By the time you finish this article, you too will have become a big ELMo fan – just as I did.

In this article, we will explore ELMo (Embeddings from Language Models) and use it to build a mind-blowing NLP model using Python on a real-world dataset.

Note: This article assumes you are familiar with the different types of word embeddings and LSTM architecture. You can refer to the below articles to learn more about the topics:

What is ELMo?

No, the ELMo we are referring to isn’t the character from Sesame Street! A classic example of the importance of context.

ELMo is a way to represent words as deep contextualized vectors, or embeddings. These word embeddings are helpful in achieving state-of-the-art (SOTA) results in several NLP tasks.

NLP scientists globally have started using ELMo for various NLP tasks, both in research as well as the industry. You must check out the original ELMo research paper here – https://arxiv.org/pdf/1802.05365.pdf.

I don’t usually ask people to read research papers because they can often come across as heavy and complex but I’m making an exception for ELMo. This one is a really cool explanation of how ELMo was designed.

How Does ELMo Work?

Let’s get an intuition of how ELMo works underneath before we implement it in Python. Why is this important?

Well, picture this. You’ve successfully copied the ELMo code from GitHub into Python and managed to build a model on your custom text data. You get average results so you need to improve the model. How will you do that if you don’t understand the architecture of ELMo? What parameters will you tweak if you haven’t studied it?

This line of thought applies to all machine learning algorithms. You need not get into their derivations but you should always know enough to play around with them and improve your model.

Now, let’s come back to how ELMo works.

ELMo word vectors are computed on top of a two-layer bidirectional language model (biLM). This biLM has two layers stacked together, and each layer has 2 passes, a forward pass and a backward pass:

ELMo structure
  • The architecture above uses a character-level convolutional neural network (CNN) to convert the words of a text string into raw word vectors
  • These raw word vectors act as inputs to the first layer of biLM
  • ELMo enhances the semantic understanding of sentences
  • The forward pass contains information about a certain word and the context (other words) before that word
  • The backward pass contains information about the word and the context after it
  • This pair of information, from the forward and backward pass, forms the intermediate word vectors
  • These intermediate word vectors are fed into the next layer of biLM
  • The final representation (ELMo) is the weighted sum of the raw word vectors and the 2 intermediate word vectors
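The last bullet can be sketched in a few lines of NumPy. This is a minimal illustration, not the AllenNLP implementation: the function name elmo_combine and the toy shapes are made up, and the softmax-normalized layer weights and the scaling factor gamma follow the description in the ELMo paper linked above.

```python
import numpy as np

def elmo_combine(layers, scalar_logits, gamma=1.0):
    """Collapse the biLM layers into one vector per token.

    layers: array of shape (num_layers, num_tokens, dim), e.g. the raw word
            vectors plus the two intermediate layers (num_layers = 3).
    scalar_logits: one unnormalized weight per layer.
    gamma: overall scaling factor applied to the weighted sum.
    """
    layers = np.asarray(layers, dtype=float)
    logits = np.asarray(scalar_logits, dtype=float)
    # Softmax turns the per-layer logits into weights that sum to 1.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Weighted sum over the layer axis -> shape (num_tokens, dim).
    return gamma * np.tensordot(weights, layers, axes=1)

# Toy input: 3 layers, 5 tokens, 4-dimensional vectors (real ELMo uses 1024).
toy_layers = np.random.default_rng(0).normal(size=(3, 5, 4))
vectors = elmo_combine(toy_layers, scalar_logits=[0.0, 0.0, 0.0])
print(vectors.shape)  # (5, 4)
```

With equal logits the result is simply the average of the three layers. A downstream task can learn different logits to emphasize whichever layer helps most.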

As the input to the biLM is computed from characters rather than words, it captures the inner structure of the word. For example, the biLM will be able to figure out that terms like beauty and beautiful are related at some level without even looking at the context they often appear in. Sounds incredible!

How is ELMo Different from Other Word Embeddings?

Unlike traditional word embeddings such as word2vec and GloVe, the ELMo vector assigned to a token or word is actually a function of the entire sentence containing that word. Therefore, the same word can have different word vectors under different contexts.

I can imagine you asking – how does knowing that help me deal with NLP problems? Let me explain this using an example.

Suppose we have a couple of sentences:

  1. I read the book yesterday.
  2. Can you read the letter now?

Take a moment to ponder the difference between these two. The verb “read” in the first sentence is in the past tense, while the same verb is in the present tense in the second sentence. This is a case of polysemy, wherein a word can have multiple meanings or senses.

Language is such a wonderfully complex thing.

Traditional word embeddings come up with the same vector for the word “read” in both the sentences. Hence, the system would fail to distinguish between the polysemous words. These word embeddings just cannot grasp the context in which the word was used.

ELMo word vectors successfully address this issue. ELMo word representations take the entire input sentence into account when calculating the word embeddings. Hence, the term “read” would have different ELMo vectors in different contexts. To see how ELMo compares to models like GPT, it would be beneficial to refer to additional sources or articles that specifically address the comparison between these NLP models.
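The difference between a static lookup and a sentence-dependent vector can be shown with a toy example. Everything here is made up for illustration: the two-dimensional vectors are arbitrary, and toy_contextual_vector is not how ELMo works internally. It only mimics the one property that matters here, namely that the output depends on the whole sentence.

```python
# Toy static embedding table: one fixed vector per word (values are arbitrary).
STATIC = {
    "i": [0.1, 0.3], "read": [0.9, 0.2], "the": [0.0, 0.1],
    "book": [0.4, 0.8], "yesterday": [0.7, 0.7],
    "can": [0.2, 0.9], "you": [0.3, 0.3], "letter": [0.5, 0.6], "now": [0.8, 0.1],
}

def static_vector(word, sentence):
    """Static lookup: the surrounding sentence is ignored."""
    return STATIC[word]

def toy_contextual_vector(word, sentence):
    """Blend the word's vector with the mean vector of its sentence."""
    vectors = [STATIC[w] for w in sentence]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    return [(a + b) / 2 for a, b in zip(STATIC[word], mean)]

s1 = "i read the book yesterday".split()
s2 = "can you read the letter now".split()

# Same vector for "read" in both sentences with the static table.
print(static_vector("read", s1) == static_vector("read", s2))  # True
# Different vectors once the sentence is taken into account.
print(toy_contextual_vector("read", s1) == toy_contextual_vector("read", s2))  # False
```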

Implementation: ELMo for Text Classification in Python

And now the moment you have been waiting for – implementing ELMo in Python! Let’s take this step-by-step.

Understanding the Problem Statement

The first step towards dealing with any data science challenge is defining the problem statement. It forms the base for our future actions.

For this article, we already have the problem statement in hand:

Sentiment analysis remains one of the key problems that has seen extensive application of natural language processing (NLP). This time around, given the tweets from customers about various tech firms who manufacture and sell mobiles, computers, laptops, etc., the task is to identify if the tweets have a negative sentiment towards such companies or products.

It is clearly a binary text classification task wherein we have to predict the sentiments from the extracted tweets.

About the Dataset

Here’s a breakdown of the dataset we have:

  • The train set contains 7,920 tweets
  • The test set contains 1,953 tweets

You can download the dataset from this page. Note that you will have to register or sign in to do so.

Caution: Most profane and vulgar terms in the tweets have been replaced with “$&@*#”. However, please note that the dataset might still contain text that could be considered profane, vulgar, or offensive.

Alright, let’s fire up our favorite Python IDE and get coding!

Import Libraries

Import the libraries we’ll be using throughout our notebook:

Read and Inspect the Data

# read data
train = pd.read_csv("train_2kmZucJ.csv")
test = pd.read_csv("test_oJQbWVk.csv")

train.shape, test.shape

Output: ((7920, 3), (1953, 2))

The train set has 7,920 tweets while the test set has only 1,953. Now let’s check the class distribution in the train set:

train['label'].value_counts(normalize = True)


0    0.744192
1    0.255808
Name: label, dtype: float64

Here, 1 represents a negative tweet while 0 represents a non-negative tweet.

Let’s take a quick look at the first 5 rows in our train set by calling train.head().

We have three columns to work with. The column ‘tweet’ is the independent variable while the column ‘label’ is the target variable.

Text Cleaning and Preprocessing

We would have a clean and structured dataset to work with in an ideal world. But things are not that simple in NLP (yet).

We need to spend a significant amount of time cleaning the data to make it ready for the model building stage. With clean text, feature extraction becomes easier and the features themselves carry more information. The better your data quality becomes, the more your model’s performance is likely to improve.

So let’s clean the text we’ve been given and explore it.

There seem to be quite a few URL links in the tweets. They are not telling us much (if anything) about the sentiment of the tweet so let’s remove them.

We have used Regular Expressions (or RegEx) to remove the URLs.
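The original cell is not reproduced here, so the following is only a minimal sketch of URL removal with Python’s re module. The pattern and the helper name remove_urls are assumptions, not necessarily what was used for the results in this article.

```python
import re

# Matches "http://..." or "https://..." up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")

def remove_urls(text):
    """Delete URLs from a tweet and collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())

print(remove_urls("Loving my new phone https://example.com/abc #happy"))
# Loving my new phone #happy
```

With pandas, the same helper could be applied to the whole column, for example train['clean_tweet'] = train['tweet'].apply(remove_urls), and likewise for the test set.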

Note: You can learn more about Regex in this article.

We’ll go ahead and do some routine text cleaning now.
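The cleaning cell itself is not shown, so here is a hedged sketch of what routine cleaning often means for tweets: lowercasing, then dropping punctuation, digits and extra whitespace. The exact steps and the helper name basic_clean are assumptions.

```python
import re
import string

# Translation table that deletes ASCII punctuation and digits.
_DROP = str.maketrans("", "", string.punctuation + string.digits)

def basic_clean(text):
    """Lowercase, drop punctuation and digits, and squeeze whitespace."""
    text = text.lower().translate(_DROP)
    return re.sub(r"\s+", " ", text).strip()

print(basic_clean("  My NEW iPhone 7 is #awesome!!!  "))
# my new iphone is awesome
```

Whether hashtags and numbers should really be dropped depends on the task, so treat each step as optional.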

I’d also like to perform text normalization. This helps in reducing a word to its base form. For example, the base form of the words ‘produces’, ‘produced’, and ‘producing’ is ‘produce’. It happens quite often that the multiple forms of the same word are not really that important and we only need to know the base form of that word.

We will lemmatize (normalize) the text by leveraging the popular spaCy library.
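The helper quoted in the reader responses loads the model with spacy.load('en'), a shortcut that newer spaCy releases no longer accept. Below is a minimal sketch of an equivalent lemmatization function. The optional nlp parameter and the en_core_web_sm model name are assumptions added for illustration.

```python
def lemmatization(texts, nlp=None):
    """Return each text with every token replaced by its lemma.

    nlp: a spaCy pipeline (any callable that yields tokens with a `.lemma_`
         attribute works). If omitted, a small English model is loaded.
    """
    if nlp is None:
        # Assumption: the model was installed beforehand with
        # `python -m spacy download en_core_web_sm`.
        import spacy
        nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    return [" ".join(token.lemma_ for token in nlp(text)) for text in texts]
```

Calling the function without nlp loads the model on every call. To avoid loading it twice, load the pipeline once and pass it in for both the train and test sets.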

Lemmatize tweets in both the train and test sets:

train['clean_tweet'] = lemmatization(train['clean_tweet'])
test['clean_tweet'] = lemmatization(test['clean_tweet'])

Let’s have a quick look at the original tweets vs our cleaned ones:


Check out the above columns closely. The tweets in the ‘clean_tweet’ column appear to be much more legible than the original tweets.

However, I feel there is still plenty of scope for cleaning the text. I encourage you to explore the data as much as you can and find more insights or irregularities in the text.

Brief Intro to TensorFlow Hub

Wait, what does TensorFlow have to do with our tutorial?

TensorFlow Hub is a library that enables transfer learning by giving access to many pre-trained machine learning models for different tasks. ELMo is one such example. That’s why we will access ELMo via TensorFlow Hub in our implementation.

TensorFlow Hub

Before we do anything else though, we need to install TensorFlow Hub. The hub.Module interface used below belongs to the TensorFlow 1.x API, so you need a 1.x release of TensorFlow, version 1.7 or later:

$ pip install "tensorflow>=1.7.0,<2.0"
$ pip install tensorflow-hub

Preparing ELMo Vectors

We will now import the pretrained ELMo model. A note of caution – the model is over 350 MB in size so it might take you a while to download this.

import tensorflow_hub as hub
import tensorflow as tf

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

I will first show you how we can get ELMo vectors for a sentence. All you have to do is pass a list of string(s) to the elmo object. For a single eight-word sentence, the resulting embeddings have the following shape:


TensorShape([Dimension(1), Dimension(8), Dimension(1024)])

The output is a 3 dimensional tensor of shape (1, 8, 1024):

  • The first dimension of this tensor represents the number of input strings (samples). This is 1 in our case
  • The second dimension represents the number of tokens in the longest string in the input list. Since we have only 1 string in our input list, the size of the 2nd dimension is equal to the number of tokens in that string, which is 8
  • The third dimension is equal to the length of the ELMo vector

Hence, every word in the input sentence has an ELMo vector of size 1024.

Let’s go ahead and extract ELMo vectors for the cleaned tweets in the train and test datasets. However, to arrive at the vector representation of an entire tweet, we will take the mean of the ELMo vectors of constituent terms or tokens of the tweet.

Let’s define a function for doing this:
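The original function cell is not reproduced here. As a rough sketch of the averaging step only, assume the token vectors for a batch arrive as an array of shape (batch, max_len, dim), padded to the longest tweet as the shape description above suggests. The name mean_pool and the lengths argument are made up for illustration.

```python
import numpy as np

def mean_pool(token_vectors, lengths):
    """Average token vectors per sentence, ignoring padded positions.

    token_vectors: array of shape (batch, max_len, dim).
    lengths: true number of tokens in each sentence, shape (batch,).
    """
    token_vectors = np.asarray(token_vectors, dtype=float)
    lengths = np.asarray(lengths)
    max_len = token_vectors.shape[1]
    # mask[i, t] is True for real tokens and False for padding.
    mask = np.arange(max_len)[None, :] < lengths[:, None]
    summed = (token_vectors * mask[:, :, None]).sum(axis=1)
    return summed / lengths[:, None]

# Two "sentences" of 3 and 1 tokens, padded to length 3, with dim = 2.
batch = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 2.0], [0.0, 0.0], [0.0, 0.0]],
])
print(mean_pool(batch, [3, 1]))
# [[3. 4.]
#  [2. 2.]]
```

A plain mean over axis 1 would also average in whatever sits at the padded positions, so the vector for a short tweet would depend on how long the other tweets in its batch are. The mask avoids that.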

You might run out of computational resources (memory) if you use the above function to extract embeddings for the tweets in one go. As a workaround, split both the train and test sets into batches of 100 samples each. Then, pass these batches sequentially to the function elmo_vectors().

I will keep these batches in a list:

list_train = [train[i:i+100] for i in range(0,train.shape[0],100)]
list_test = [test[i:i+100] for i in range(0,test.shape[0],100)]

Now, we will iterate through these batches and extract the ELMo vectors. Let me warn you, this will take a long time.

# Extract ELMo embeddings
elmo_train = [elmo_vectors(x['clean_tweet']) for x in list_train]
elmo_test = [elmo_vectors(x['clean_tweet']) for x in list_test]

Once we have all the vectors, we can concatenate them back to a single array:

elmo_train_new = np.concatenate(elmo_train, axis = 0)
elmo_test_new = np.concatenate(elmo_test, axis = 0)

I would advise you to save these arrays, as it took us a long time to get the ELMo vectors for them. We will save them as pickle files:

Use the following code to load them back:
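The save and load cells are missing from this copy of the article, so here is a minimal sketch using the standard pickle module. The helper names and the file names in the usage comments are hypothetical.

```python
import pickle

def save_pickle(obj, path):
    """Serialize `obj` to `path` in binary mode."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage with the arrays built above (file names are hypothetical):
# save_pickle(elmo_train_new, "elmo_train.pickle")
# save_pickle(elmo_test_new, "elmo_test.pickle")
# elmo_train_new = load_pickle("elmo_train.pickle")
# elmo_test_new = load_pickle("elmo_test.pickle")
```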

Model Building and Evaluation

Let’s build our NLP model with ELMo!

We will use the ELMo vectors of the train dataset to build a classification model. Then, we will use the model to make predictions on the test set. But before all of that, split elmo_train_new into training and validation sets to evaluate our model prior to the testing phase.

Since our objective is to set a baseline score, we will build a simple logistic regression model using ELMo vectors as features:
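The split and model-building cells are not reproduced here. The sketch below shows the two steps with scikit-learn on synthetic features that stand in for elmo_train_new and train['label']. The 80/20 split, the random seed, max_iter and the names xtrain and ytrain are assumptions. Only lreg, xvalid and yvalid appear in the code that follows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for elmo_train_new and train['label'] (assumption:
# the real features would be the 1024-dimensional mean ELMo vectors).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))
labels = (features[:, 0] > 0).astype(int)

# Hold out 20% of the rows for validation.
xtrain, xvalid, ytrain, yvalid = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# Plain logistic regression as the baseline classifier.
lreg = LogisticRegression(max_iter=1000)
lreg.fit(xtrain, ytrain)
```

On the real data, features and labels would simply be replaced by elmo_train_new and train['label'].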

Prediction time! First, on the validation set:

preds_valid = lreg.predict(xvalid)

We will evaluate our model by the F1 score metric since this is the official evaluation metric of the contest.

f1_score(yvalid, preds_valid)

Output: 0.789976
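To make the metric concrete, here is a small pure-Python sketch of what a binary F1 score computes. The function name f1_binary is made up. By default, scikit-learn’s f1_score scores the class labelled 1, which in this dataset is the negative-tweet class.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 for the `positive` class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: report 0 instead of dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 2 true positives, 1 false positive, 1 false negative.
print(f1_binary([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

In this example precision and recall are both 2/3, so the F1 score is 2/3 as well. Because roughly three quarters of the tweets are labelled 0, F1 on class 1 is a more informative number here than plain accuracy.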

The F1 score on the validation set is pretty impressive. Now let’s proceed and make predictions on the test set:

# make predictions on test set
preds_test = lreg.predict(elmo_test_new)

Prepare the submission file which we will upload on the contest page:
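The submission cell is missing from this copy, so the following is a hedged sketch using the standard csv module. The id and label column names and the output file name are assumptions, so check the contest’s sample submission for the exact format.

```python
import csv

def write_submission(ids, predictions, path):
    """Write one `id,label` row per test tweet."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])
        for tweet_id, label in zip(ids, predictions):
            writer.writerow([tweet_id, int(label)])

# Usage (column and file names are assumptions):
# write_submission(test["id"], preds_test, "sub_lreg.csv")
```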

These predictions give us a score of 0.875672 on the public leaderboard. That is frankly pretty impressive given that we only did fairly basic text preprocessing and used a very simple model. Imagine what the score could be with more advanced techniques. Try them out on your end and let me know the results!

What Else Can We Do with ELMo?

We just saw firsthand how effective ELMo can be for text classification. Coupled with a more sophisticated model, it would likely perform even better. The application of ELMo is not limited just to the task of text classification. You can use it whenever you have to vectorize text data.

Below are a few more NLP tasks where we can utilize ELMo:

  • Machine Translation
  • Language Modeling
  • Text Summarization
  • Named Entity Recognition
  • Question-Answering Systems


ELMo is undoubtedly a significant step forward in NLP and is here to stay. Given the sheer pace at which research in NLP is progressing, other new state-of-the-art word embeddings have also emerged in the last few months, like Google’s BERT and Zalando’s Flair. Exciting times ahead for NLP practitioners!

I strongly encourage you to use ELMo on other datasets and experience the performance boost yourself. If you have any questions or want to share your experience with me and the community, please do so in the comments section below. You should also check out the below NLP related resources if you’re starting out in this field:

Frequently Asked Questions

Q1. What is a bag of words in NLP?

A. A bag of words is a representation technique in Natural Language Processing (NLP) where the text is represented as an unordered set of words, disregarding grammar and word order.

Q2. How does a bidirectional LSTM differ from a unidirectional LSTM in NLP?

A. A bidirectional Long Short-Term Memory (LSTM) processes input data in both forward and backward directions, capturing context from both past and future, whereas a unidirectional LSTM processes data only in one direction.

Q3. What is an encoder in the context of NLP?

A. In NLP, an encoder is a component in sequence-to-sequence models that transforms input data into a fixed-dimensional representation, often used in tasks like machine translation.

Q4. Explain fine-tuning in NLP.

A. Fine-tuning in NLP refers to the process of adjusting a pre-trained model on a specific task or domain to enhance its performance for a particular application.

Q5. What does model architecture refer to in NLP?

A. Model architecture in NLP refers to the overall structure and design of a neural network, specifying the arrangement of layers, connections, and operations within the model.


Data Scientist at Analytics Vidhya with multidisciplinary academic background. Experienced in machine learning, NLP, graphs & networks. Passionate about learning and applying data science to solve real world problems.


Responses From Readers


Sanjoy Datta 11 Mar, 2019

This line in the lemmatization(texts) function is not working: s = [token.lemma_ for token in nlp(i)] name 'nlp is not defined' Have run all the code upto this function. Pls advise.

Sangamesh K S 11 Mar, 2019


Subash 11 Mar, 2019

Wonderful article. Thanks. Can you point me to a resource like yours where ELMo/BERT/ULMFiT/or any others is used in NER and /or Text Summarization?

Shan 18 Mar, 2019

Hi.. Thanks for introducing to a concept. Its a nice and interesting article. I am getting the following errors, while executing: # Extract ELMo embeddings elmo_train = [elmo_vectors(x['clean_tweet']) for x in list_train] elmo_test = [elmo_vectors(x['clean_tweet']) for x in list_test **Errors** UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node module_2_apply_default_1/bilm/CNN_1/Conv2D_6 (defined at /usr/local/lib/python3.6/dist- packages/tensorflow_hub/native_module.py:517) ]] May be its version compatibilty issue. I was wondering, if you can guide regarding exact pointers and code to resolve the issue. Thanks

Saumit 20 Mar, 2019

# import spaCy's language model nlp = spacy.load('en', disable=['parser', 'ner']) # function to lemmatize text def lemmatization(texts): output = [] for i in texts: s = [token.lemma_ for token in nlp(i)] output.append(' '.join(s)) return output Here error occured : OSError Traceback (most recent call last) in 1 # import spaCy's language model ----> 2 nlp = spacy.load('en', disable=['parser', 'ner']) 3 4 # function to lemmatize text 5 def lemmatization(texts): ~\Anaconda3\lib\site-packages\spacy\__init__.py in load(name, **overrides) 20 if depr_path not in (True, False, None): 21 deprecation_warning(Warnings.W001.format(path=depr_path)) ---> 22 return util.load_model(name, **overrides) 23 24 ~\Anaconda3\lib\site-packages\spacy\util.py in load_model(name, **overrides) 134 elif hasattr(name, "exists"): # Path or Path-like to model data 135 return load_model_from_path(name, **overrides) --> 136 raise IOError(Errors.E050.format(name=name)) 137 138 OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

vamsi 25 Mar, 2019

Thanks for the post. I have a doubt in the output from the pretrained elmo model. The output vectors depend on the text you want to get elmo vectors for. I mean , considering the above example, you split the data into 100 batches each. Consider only 1st batch whose output might be Y. If you split this batch into two batches , whose output will be Y1 and Y2. let Y3 be after concatenation of Y1 and Y2. Now Y3 won't be equal to Y. Why is it like this ? If I had taken 1000 batches each in the above example, I would have got an another result. Please explain

Nazish 26 Mar, 2019

Hey, sorry to be so plain, I need help regarding data set. When I browse that page shared in content, that page doesn't show any data set. Help me fix this Thanks

bharath 02 Apr, 2019

Great Presentation !!!!

Bikash Gyawali 17 Apr, 2019

Hi, Thanks for the nice article. Does the embeddings obtained for a word within a sentence different based on the how earlier/later that sentence was seen in a batch? In other words, does elmo keep the history (context) of previous sentences in predicting the embeddings of any words in the current sentence? If so, do you need to initialise elmo each time for a new sentence -- i.e. if you wanted to get word embeddings using only the context of the sentence it appears in??

Swapnil 02 May, 2019

In your clean text I see -PRON- .... so I guess you are replacing pronouns with this tag. But is it used to make vectors as well ?? Doesn't make sense that way. Other question I have is about need of Lemmatisation and Stemming. Do we really need to do this as ELMo can interpret them differently.

Richa Sharma 09 May, 2019

Hi Thanks for sharing such a great post. Is there any ELMO pretrained model to work for Hindi text.

Richa 15 May, 2019

Hi Thanks for prompt reply. Can we use "weight file" and "option file", which is trained on pytorch framework, into our model which is built on tensorflow framework.

N.B.Phuoc 30 May, 2019

Hi! Your post is really useful for me!! But do you have any demo about datasets that have 2 sentences and one label like MRPC . I'm just a newbie and i'm trying to learn ELMo (at basic) for my thesis . Thanks anyway!!

Malik A. Rumi 02 Jun, 2019

Is the custom text data (that is the same as a corpus, isn't it?) a collection of documents to be iterated through (it doesn't seem like it, given what I've read, but I'm not sure, so I'm asking), or a single document composed of the text extracted from a large collection? If it is the latter, does pre-processing require teaching the algorithm any kind of distinction between, or recognizing the boundaries of, each original document? Do you throw out headings, subheadings, tables of contents, etc? On the one hand it seems you are maintaining the integrity of a sentence in order to create the vectors, but on the other, you are stripping out the punctuation, right? Please explain that process detail. How is that different from 'bag of words' ? Thank you.

Nguyễn Bá Phước 04 Jun, 2019

Do you have any demo using ELMo with 2 sentence datasets like MRPC .!!!

Nazim Shaikh 20 Jun, 2019

Hi Prateek, Great practical explanation on Elmo. Thanks a lot Although while trying to extract elmo vectors batch wise, i seem to get below error "Can't convert 'text': Expected string, got 0 object of type 'Series' instead." May you please with above I haven't made in changes to your code. I am trying to run as it is. I am stuck at this.

Pranav Hari 26 Jun, 2019

Hey, can we find most similar words using Elmo Word Embeddings. Similar to how gensim provides a most_similar() in their word2vec package? And this was a great and lucid tutorial on ELMo

Badal 26 Jun, 2019

Can we train the model on our own corpus?

bharath 28 Jun, 2019

Great Post !!!

Dan 26 Jul, 2019

Hi Prateek, Great post, thanks a lot for taking your time to write it up! I have a somewhat naive question. I’m using ELMo vectors for a classification task, so my pipeline is roughly clean text, extract vectors, apply feedforward nn. What I’m worried about is whether the “extracting vectors” step will be constant, as it seems like we’re tweaking the weights of the ELMo model every time we’re extracting vectors? If I, say, set Trainable = False when loading in the ELMo model, would I then be able to use it freely on new data without any fear of training my model on test data? Thanks in advance.

Harshali Patil 26 Jul, 2019

Hi, this post really helped. Thanks. How can i use this elmo vectors with lstm model. Do you have any example?

Chetan Ambi 04 Aug, 2019

Hi Prateek - Thank you for this article. I am trying this in Kaggle kernels, but when running below code, kernels getting restarted. Any thoughts? # Extract ELMo embeddings elmo_train = [elmo_vectors(x['clean_tweet']) for x in list_train] elmo_test = [elmo_vectors(x['clean_tweet']) for x in list_test]

poornima 08 Aug, 2019

sir, can we train the model using NLP to access the folder which consists of different format files such as .png, .jpeg, .csv, by passing a text message as 'show me the image', the model should such all the image format files and gives me the output. how can we do that please give me advice

Atiq 06 Sep, 2019

can we find most similar words using Elmo Word Embeddings pretrained model. Similar to how gensim provides a most_similar() in their word2vec package? Good tutorial on ELMo

Sujoy Sarkar 25 Sep, 2019

Hi, Can we use the word embeddings directly for NLP task instead of taking mean to prepare sentence level embedding?

Elli Valla 21 Oct, 2019

Hey! Thank you for this great article. Tried to load the ELMo model, but got stuck with this error: RuntimeError: variable_scope module_2/ was unused but the corresponding name_scope was already taken. It seems that there's something broken in the TF2 version. Any kind of help is much appreciated.