We tend to look through language and not realize how much power language has.
Language is such a powerful medium of communication. We have the ability to build projects from scratch using the nuances of language. It’s what drew me to Natural Language Processing (NLP) in the first place.
I’m amazed by the vast array of tasks I can perform with NLP – text summarization, generating completely new pieces of text, predicting what word comes next (Google’s autofill), among others. Do you know what is common among all these NLP tasks?
They are all powered by language models! Honestly, these language models are a crucial first step for most of the advanced NLP tasks.
In this article, you will learn about the bigram model, a foundational concept in natural language processing. We will explore what a bigram is, how it functions within the bigram language model, and walk through an example that illustrates its practical application. By the end, you'll have a clear understanding of how bigrams contribute to language prediction and text analysis.
So, fasten your seatbelts and brush up on your linguistic skills – we are heading into the wonderful world of Natural Language Processing!
“You shall know the nature of a word by the company it keeps.” – John Rupert Firth
A language model learns to predict the probability of a sequence of words. But why do we need to learn the probability of words? Let’s understand that with an example.
I’m sure you have used Google Translate at some point. We all use it to translate one language to another for varying reasons. This is an example of a popular NLP application called Machine Translation.
In Machine Translation, you take in a sequence of words in one language and convert it into another language. A system might produce several candidate translations, and you want to compute the probability of each candidate to decide which one is the most accurate. For example, a fluent candidate such as "the cat sat on the mat" should be assigned a higher probability than a jumbled one like "the cat the sat mat on". That is how a language model helps us arrive at the right translation.
This ability to model the rules of a language as probabilities gives language models great power for NLP tasks. Language models are used in speech recognition, machine translation, part-of-speech tagging, parsing, Optical Character Recognition, handwriting recognition, information retrieval, and many other everyday applications.
There are primarily two types of language models:

Statistical Language Models: These models use traditional statistical techniques like N-grams, Hidden Markov Models (HMM), and certain linguistic rules to learn the probability distribution of words.

Neural Language Models: These are newer players in the NLP town and have surpassed statistical language models in effectiveness. They use different kinds of neural networks to model language.
Now that you have a pretty good idea about Language Models, let’s start building one!
An N-gram is a sequence of N tokens (or words).
Let’s understand N-gram with an example. Consider the following sentence:
“I love reading blogs about data science on Analytics Vidhya.”
A 1-gram (or unigram) is a one-word sequence. For the above sentence, the unigrams would simply be: “I”, “love”, “reading”, “blogs”, “about”, “data”, “science”, “on”, “Analytics”, “Vidhya”.
A bigram language model is a type of statistical language model that predicts the probability of a word in a sequence based on the previous word. It considers pairs of consecutive words (bigrams) and estimates the likelihood of encountering a specific word given the preceding word in a text or sentence.
A 2-gram (or bigram) is a two-word sequence of words, like “I love”, “love reading”, or “Analytics Vidhya”. And a 3-gram (or trigram) is a three-word sequence of words like “I love reading”, “about data science” or “on Analytics Vidhya”.
Fairly straightforward stuff!
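If you would like to see this in code, here is a minimal sketch that uses NLTK's ngrams utility on the example sentence; the simple whitespace tokenization is an assumption kept deliberately basic for illustration:

from nltk.util import ngrams

sentence = "I love reading blogs about data science on Analytics Vidhya"
tokens = sentence.split()  # simple whitespace tokenization, just for illustration

# n = 1, 2, 3 give us the unigrams, bigrams, and trigrams respectively
for n in [1, 2, 3]:
    print(f"{n}-grams:", [" ".join(gram) for gram in ngrams(tokens, n)])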
An N-gram language model predicts the probability of a given N-gram within any sequence of words in the language. If we have a good N-gram model, we can predict p(w | h) – what is the probability of seeing the word w given a history of previous words h – where the history contains n-1 words.
We must estimate this probability to construct an N-gram model.
We compute this probability in two steps: first we apply the chain rule of probability, and then we simplify it using a Markov assumption.
The chain rule of probability is:
p(w1 w2 ... wn) = p(w1) · p(w2 | w1) · p(w3 | w1 w2) · p(w4 | w1 w2 w3) · ... · p(wn | w1 ... wn-1)
So what is the chain rule? It tells us how to compute the joint probability of a sequence by using the conditional probability of a word given previous words.
But we do not have access to these conditional probabilities with complex conditions of up to n-1 words. So how do we proceed?
This is where we introduce a simplification assumption. We can assume for all conditions, that:
p(wk | w1 ... wk-1) ≈ p(wk | wk-1)
Here, we approximate the history (the context) of the word wk by looking only at the last word of the context. This assumption is called the Markov assumption. (We used it here with a simplified context of length 1 – which corresponds to a bigram model – we could use larger fixed-sized histories in general).
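To make this concrete, here is a small sketch that estimates bigram probabilities from raw counts and scores a sentence with the approximated chain rule; the three-sentence toy corpus is made up purely for illustration:

from collections import Counter

# a tiny made-up corpus, purely for illustration
corpus = ["i love reading blogs", "i love data science", "i am reading blogs"]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    tokens = sent.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # maximum-likelihood estimate: count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# score "i love reading blogs" as p(love|i) * p(reading|love) * p(blogs|reading)
# (the probability of the first word is ignored here for brevity)
sentence = "i love reading blogs".split()
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= bigram_prob(prev, word)
print(prob)  # 2/3 * 1/2 * 1 ≈ 0.333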
Now that we understand what an N-gram is, let's build a basic language model using trigrams of the Reuters corpus. The Reuters corpus is a collection of 10,788 news documents totaling 1.3 million words. We can build a language model in a few lines of code using the NLTK package:
Python Code:
# code courtesy of https://nlpforhackers.io/language-models/
from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
import nltk

nltk.download('reuters')
nltk.download('punkt')

# Create a placeholder for the model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of co-occurrence
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

# Transform the counts into probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

print(dict(model['today', 'the']))
The code above is pretty straightforward. We first split our text into trigrams with the help of NLTK and then count how frequently each trigram occurs in the dataset.
We then use it to calculate probabilities of a word, given the previous two words. That’s essentially what gives us our Language Model!
Let’s make simple predictions with this language model. We will start with two simple words – “today the”. We want our model to tell us what will be the next word:
So we get predictions of all the possible words that can come next with their respective probabilities. Now, if we pick up the word “price” and again make a prediction for the words “the” and “price”:
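If you want to inspect these distributions yourself, a quick sketch like the one below sorts the candidate next words for a two-word context by probability; the context ("the", "price") is simply the one from the example above:

# inspect the most likely continuations for the two-word context ("the", "price")
candidates = model["the", "price"]
for word, prob in sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(word, round(prob, 4))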
If we keep following this process iteratively, we will soon have a coherent sentence! Here is a script to play around with generating a random piece of text using our n-gram model:
# code courtesy of https://nlpforhackers.io/language-models/
import random

# starting words
text = ["today", "the"]
sentence_finished = False

while not sentence_finished:
    # select a random probability threshold
    r = random.random()
    accumulator = .0

    for word in model[tuple(text[-2:])].keys():
        accumulator += model[tuple(text[-2:])][word]
        # select words that are above the probability threshold
        if accumulator >= r:
            text.append(word)
            break

    if text[-2:] == [None, None]:
        sentence_finished = True

print(' '.join([t for t in text if t]))
And here is some of the text generated by our model:
Pretty impressive! Even though the sentences feel slightly off (maybe because the Reuters dataset is mostly news), they are very coherent given the fact that we just created a model in 17 lines of Python code and a really small dataset.
This is the same underlying principle which the likes of Google, Alexa, and Apple use for language modeling.
N-gram based language models do have a few drawbacks. The higher the N, the more context the model can capture, but the counts become increasingly sparse and require a much larger corpus to estimate reliably. They also cannot capture dependencies between words that are far apart, and the number of possible N-grams (and hence the storage and computation required) grows rapidly with N.
“Deep Learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences.” – Dr. Christopher D. Manning
Deep Learning has been shown to perform really well on many NLP tasks like Text Summarization, Machine Translation, etc. Since these tasks are essentially built on top of Language Modeling, there has been a tremendous research effort, with great results, in using Neural Networks for Language Modeling.
We can essentially build two kinds of neural language models: character level and word level. Even within each category, there are many variants depending on how we frame the learning problem. We will take the most straightforward approach and build a character-level language model.
The dataset we will use is the text of the US Declaration of Independence, reproduced as a Python string below.
This is a historically important document in which the American colonies declared their independence from Britain. I used it because it covers a lot of different topics in a single document. It is also the right size to experiment with, since training a character-level language model is more computationally intensive than training a word-level one.
The problem statement is to train a language model on this text and then generate new text from a given input, in such a way that the output looks like it came straight out of the document and is grammatically correct and legible.
You can download the dataset from here. Let’s begin!
import numpy as np
import pandas as pd
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import LSTM, Dense, GRU, Embedding
from keras.callbacks import EarlyStopping, ModelCheckpoint
You can directly read the dataset as a string in Python:
data_text = """The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.
We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.--That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, --That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security.--Such has been the patient sufferance of these Colonies; and such is now the necessity which constrains them to alter their former Systems of Government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world.
He has refused his Assent to Laws, the most wholesome and necessary for the public good.
He has forbidden his Governors to pass Laws of immediate and pressing importance, unless suspended in their operation till his Assent should be obtained; and when so suspended, he has utterly neglected to attend to them.
He has refused to pass other Laws for the accommodation of large districts of people, unless those people would relinquish the right of Representation in the Legislature, a right inestimable to them and formidable to tyrants only.
He has called together legislative bodies at places unusual, uncomfortable, and distant from the depository of their public Records, for the sole purpose of fatiguing them into compliance with his measures.
He has dissolved Representative Houses repeatedly, for opposing with manly firmness his invasions on the rights of the people.
He has refused for a long time, after such dissolutions, to cause others to be elected; whereby the Legislative powers, incapable of Annihilation, have returned to the People at large for their exercise; the State remaining in the mean time exposed to all the dangers of invasion from without, and convulsions within.
He has endeavoured to prevent the population of these States; for that purpose obstructing the Laws for Naturalization of Foreigners; refusing to pass others to encourage their migrations hither, and raising the conditions of new Appropriations of Lands.
He has obstructed the Administration of Justice, by refusing his Assent to Laws for establishing Judiciary powers.
He has made Judges dependent on his Will alone, for the tenure of their offices, and the amount and payment of their salaries.
He has erected a multitude of New Offices, and sent hither swarms of Officers to harrass our people, and eat out their substance.
He has kept among us, in times of peace, Standing Armies without the Consent of our legislatures.
He has affected to render the Military independent of and superior to the Civil power.
He has combined with others to subject us to a jurisdiction foreign to our constitution, and unacknowledged by our laws; giving his Assent to their Acts of pretended Legislation:
For Quartering large bodies of armed troops among us:
For protecting them, by a mock Trial, from punishment for any Murders which they should commit on the Inhabitants of these States:
For cutting off our Trade with all parts of the world:
For imposing Taxes on us without our Consent:
For depriving us in many cases, of the benefits of Trial by Jury:
For transporting us beyond Seas to be tried for pretended offences
For abolishing the free System of English Laws in a neighbouring Province, establishing therein an Arbitrary government, and enlarging its Boundaries so as to render it at once an example and fit instrument for introducing the same absolute rule into these Colonies:
For taking away our Charters, abolishing our most valuable Laws, and altering fundamentally the Forms of our Governments:
For suspending our own Legislatures, and declaring themselves invested with power to legislate for us in all cases whatsoever.
He has abdicated Government here, by declaring us out of his Protection and waging War against us.
He has plundered our seas, ravaged our Coasts, burnt our towns, and destroyed the lives of our people.
He is at this time transporting large Armies of foreign Mercenaries to compleat the works of death, desolation and tyranny, already begun with circumstances of Cruelty & perfidy scarcely paralleled in the most barbarous ages, and totally unworthy the Head of a civilized nation.
He has constrained our fellow Citizens taken Captive on the high Seas to bear Arms against their Country, to become the executioners of their friends and Brethren, or to fall themselves by their Hands.
He has excited domestic insurrections amongst us, and has endeavoured to bring on the inhabitants of our frontiers, the merciless Indian Savages, whose known rule of warfare, is an undistinguished destruction of all ages, sexes and conditions.
In every stage of these Oppressions We have Petitioned for Redress in the most humble terms: Our repeated Petitions have been answered only by repeated injury. A Prince whose character is thus marked by every act which may define a Tyrant, is unfit to be the ruler of a free people.
Nor have We been wanting in attentions to our Brittish brethren. We have warned them from time to time of attempts by their legislature to extend an unwarrantable jurisdiction over us. We have reminded them of the circumstances of our emigration and settlement here. We have appealed to their native justice and magnanimity, and we have conjured them by the ties of our common kindred to disavow these usurpations, which, would inevitably interrupt our connections and correspondence. They too have been deaf to the voice of justice and of consanguinity. We must, therefore, acquiesce in the necessity, which denounces our Separation, and hold them, as we hold the rest of mankind, Enemies in War, in Peace Friends.
We, therefore, the Representatives of the united States of America, in General Congress, Assembled, appealing to the Supreme Judge of the world for the rectitude of our intentions, do, in the Name, and by Authority of the good People of these Colonies, solemnly publish and declare, That these United Colonies are, and of Right ought to be Free and Independent States; that they are Absolved from all Allegiance to the British Crown, and that all political connection between them and the State of Great Britain, is and ought to be totally dissolved; and that as Free and Independent States, they have full Power to levy War, conclude Peace, contract Alliances, establish Commerce, and to do all other Acts and Things which Independent States may of right do. And for the support of this Declaration, with a firm reliance on the protection of divine Providence, we mutually pledge to each other our Lives, our Fortunes and our sacred Honor."""
We perform only basic text preprocessing since this data does not have much noise: we lowercase the text to maintain uniformity, strip punctuation, and remove words shorter than three characters:
import re

def text_cleaner(text):
    # lowercase the text
    newString = text.lower()
    # remove possessive "'s"
    newString = re.sub(r"'s\b", "", newString)
    # remove punctuation and digits
    newString = re.sub("[^a-zA-Z]", " ", newString)
    # remove short words
    long_words = []
    for i in newString.split():
        if len(i) >= 3:
            long_words.append(i)
    return (" ".join(long_words)).strip()
# preprocess the text
data_new = text_cleaner(data_text)
Once the preprocessing is complete, it is time to create training sequences for the model.
The way this problem is modeled is that we take in 30 characters as context and ask the model to predict the next character. The value 30 is a number I got by trial and error, and you can experiment with it too; you essentially need enough characters in the input sequence for the model to pick up the context. A sketch of how these sequences can be built is shown below.
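The sequence-creation code itself is not reproduced above, so here is a minimal sketch of how such sequences can be built under the scheme just described (a sliding window of 30 context characters plus the character to be predicted); the function name create_seq is just an illustrative choice:

def create_seq(text, length=30):
    # slide a window over the text: 'length' characters of context
    # plus one more character that the model will learn to predict
    sequences = list()
    for i in range(length, len(text)):
        seq = text[i - length:i + 1]
        sequences.append(seq)
    print('Total sequences: %d' % len(sequences))
    return sequences

# create sequences of 31 characters (30 of context + 1 target)
sequences = create_seq(data_new)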
Let's see what our training sequences look like:
Once the sequences are generated, the next step is to encode each character. This would give us a sequence of numbers.
# create a character mapping index
chars = sorted(list(set(data_new)))
mapping = dict((c, i) for i, c in enumerate(chars))

def encode_seq(seq):
    sequences = list()
    for line in seq:
        # integer encode the line
        encoded_seq = [mapping[char] for char in line]
        # store
        sequences.append(encoded_seq)
    return sequences

# encode the sequences
sequences = encode_seq(sequences)
So now, we have sequences like this:
Once we are ready with our sequences, we split the data into training and validation sets. This is because, while training, I want to keep track of how well the language model performs on unseen data.
from sklearn.model_selection import train_test_split
# vocabulary size
vocab = len(mapping)
sequences = np.array(sequences)
# create X and y
X, y = sequences[:,:-1], sequences[:,-1]
# one hot encode y
y = to_categorical(y, num_classes=vocab)
# create train and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=42)
print('Train shape:', X_tr.shape, 'Val shape:', X_val.shape)
Time to build our language model!
I have used the embedding layer of Keras to learn a 50-dimensional embedding for each character. This helps the model understand relationships between characters. I have also used a GRU layer with 150 units as the base model, fed with the 30-character input sequences. Finally, a Dense layer with a softmax activation is used for prediction.
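The model-definition and training code does not appear above, so here is a minimal sketch matching that description; the dropout rates, optimizer, and number of epochs are assumptions for illustration rather than the article's exact settings:

# define the model: 50-dim character embeddings -> 150-unit GRU -> softmax over the vocabulary
model = Sequential()
model.add(Embedding(vocab, 50, input_length=30, trainable=True))
model.add(GRU(150, recurrent_dropout=0.1, dropout=0.1))
model.add(Dense(vocab, activation='softmax'))
print(model.summary())

# compile and train, tracking performance on the validation split
model.compile(loss='categorical_crossentropy', metrics=['acc'], optimizer='adam')
model.fit(X_tr, y_tr, epochs=100, verbose=2, validation_data=(X_val, y_val))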
Once the model has finished training, we can generate text from the model given an input sequence using the below code:
# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the next character
        yhat = model.predict_classes(encoded, verbose=0)
        # reverse map the integer back to a character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append the predicted character to the input
        in_text += out_char
    return in_text
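For instance, you can call the function with the 30-character context length used during training and any seed text drawn from the cleaned document; the seed phrase below is just one arbitrary choice:

# generate 100 characters following an arbitrary lowercase seed phrase
print(generate_seq(model, mapping, 30, "we have appealed to their", 100))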
Let's put our model to the test by giving it a few different inputs and seeing how it performs.
Notice just how sensitive our language model is to the input text! Small changes, like adding a space after "of" or "for", completely change the probabilities of the next characters, because a space signals that a new word should start.
Additionally, when we do not add a space, the model tries to predict a word that starts with those characters (for example, "for" could be completed as "foreign").
Also, note that almost none of the combinations predicted by the model exist in the original training data. So our model is actually building words based on its understanding of the rules of the English language and the vocabulary it has seen during training.
We have so far trained our own models to generate text, be it predicting the next word or generating some text with starting words. But that is just scratching the surface of what language models are capable of!
Leading research labs have trained much more complex language models on humongous datasets that have led to some of the biggest breakthroughs in the field of Natural Language Processing.
In February 2019, OpenAI started quite a storm with the release of GPT-2, a new transformer-based generative language model trained on 40GB of curated text from the internet.
You can read more about GPT-2 here:
So, let’s see GPT-2 in action!
Before we start using GPT-2, let's learn a bit about the PyTorch-Transformers library, which we will use to load the pre-trained models.
PyTorch-Transformers provides state-of-the-art pre-trained models for Natural Language Processing (NLP).
Most of the State-of-the-Art models require tons of training data and days of training on expensive GPU hardware which is something only the big technology companies and research labs can afford. But by using PyTorch-Transformers, now anyone can utilize the power of State-of-the-Art models!
Installing PyTorch-Transformers is pretty straightforward in Python. You can simply use pip:
pip install pytorch-transformers
or if you are working on Colab:
!pip install pytorch-transformers
Since most of these models are GPU-heavy, I would suggest working with Google Colab for this part of the article.
Let’s build our own sentence completion model using GPT-2. We’ll try to predict the next word in the sentence:
“what is the fastest car in the _________”
I chose this example because this is the first suggestion that Google’s text completion gives. Here is the code for doing the same:
# Import required libraries
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode a text input
text = "What is the fastest car in the"
indexed_tokens = tokenizer.encode(text)

# Convert the indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])

# Load the pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model to evaluation mode to deactivate the DropOut modules
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# Get the predicted next sub-word
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

# Print the predicted word
print(predicted_text)
Here, we tokenize and index the text as a sequence of numbers and pass it to the GPT2LMHeadModel. This is the GPT2 model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings).
Awesome! The model successfully predicts the next word as “world”. This is pretty amazing as this is what Google was suggesting. I recommend you try this model with different input sentences and see how it performs while predicting the next word in a sentence.
Now, we have played around by predicting the next word and the next character so far. Let’s take text generation to the next level by generating an entire paragraph from an input piece of text!
Let’s see what our models generate for the following input text:
Two roads diverged in a yellow wood, And sorry I could not travel both And be one traveler, long I stood And looked down one as far as I could To where it bent in the undergrowth;
This is the first paragraph of the poem “The Road Not Taken” by Robert Frost. Let’s put GPT-2 to work and generate the next paragraph of the poem.
We will be using the readymade script that PyTorch-Transformers provides for this task. Let’s clone their repository first:
!git clone https://github.com/huggingface/pytorch-transformers.git
Now, we just need a single command to start the model!
!python pytorch-transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=100 \
    --model_name_or_path=gpt2
Let’s see what output our GPT-2 model gives for the input text:
And with my little eyes full of hearth and perfumes, I saw the blue of Scotland, And this powerful lieeth close By wind's with profit and grief, And at this time came and passed by, At how often thro' places And always this path was fresh Through one winter down. And, stung by the wild storm, Appeared half-blind, yet in that gloomy castle.
Isn’t that crazy?! The output almost perfectly fits in the context of the poem and appears as a good continuation of the first paragraph of the poem.
Quite a comprehensive journey, wasn't it? We discussed what language models are and how to use them with the latest state-of-the-art NLP frameworks. And the end result was so impressive!
You should consider this as the beginning of your ride into language models. I encourage you to play around with the code I’ve showcased here. This will really help you build your own knowledge and skillset while expanding your opportunities in NLP.
Let me know if you have any queries or feedback related to this article in the comments section below. Happy learning!
Q. What is an example of a bigram language model?
A. Here's an example of a bigram language model predicting the next word in a sentence: given the phrase "I am going to", a bigram model conditions only on the previous word "to" and may predict "the" with a high probability if, in the training data, "to" is frequently followed by "the".
Q. What is the formula for bigram probability?
A. The formula for a bigram probability is: P(word | previous_word) = Count(previous_word, word) / Count(previous_word), where Count(previous_word, word) is the number of occurrences of the bigram (previous_word, word) and Count(previous_word) is the count of previous_word in the training data.