Top 4 Sentence Embedding Techniques using Python

Purva Huilgol Last Updated : 22 Oct, 2024
12 min read

Humans’ ability to understand nuances in a language is unmatched. The perceptive human brain can easily pick up humour, sarcasm, negative sentiment, and much more in a given sentence. The only requirement is that we know the language of the sentence. For instance, if someone commented on my article in Japanese, I certainly wouldn’t understand what the person was trying to say. This is the general rule: for effective communication, we must interact with the listener in a language they understand best.

For a machine to process and understand any text, we must represent this text in a language that the machine can understand. What language do you think machines understand best? Yes, it is that of numbers. A machine can only work with numbers, no matter what data we provide: video, audio, image, or text. That is why representing text as numbers or embedding text, as it is called, is one of the most actively researched topics.

In this article, I will cover the top four sentence embedding techniques with Python Code. Further, I limit the scope of this article to providing an overview of their architecture and how to implement these techniques in Python. We will take the basic use case of finding similar sentences given a sentence and demonstrate how to use such techniques for the same. I will begin with an overview of word and sentence embeddings.

What is Word Embedding?

The initial embedding techniques dealt with only words. Given a set of words, you would generate an embedding for each word. The simplest method was to one-hot encode the words: each word is represented by a vector with a 1 in its own position and 0 everywhere else. While this was adequate for representing words and for simple text-processing tasks, it broke down on more complex ones, such as finding similar words.

For example, if we search for a query: Best Italian restaurant in Delhi, we would like to get search results corresponding to Italian food, restaurants in Delhi and best. However, if we get a result saying: Top Italian food in Delhi, our simple method would fail to detect the similarity between ‘Best’ and ‘Top’ or between ‘food’ and ‘restaurant’.
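To see why, here is a minimal sketch with a made-up toy vocabulary: with one-hot vectors, every pair of distinct words has a cosine similarity of exactly 0, so ‘best’ and ‘top’ look no more related than ‘best’ and ‘delhi’.

import numpy as np

# Toy vocabulary, purely for illustration: each word gets a one-hot vector
vocab = ["best", "top", "italian", "restaurant", "food", "delhi"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cos_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Any two different one-hot vectors are orthogonal, so their similarity is 0
print(cos_sim(one_hot["best"], one_hot["top"]))         # 0.0
print(cos_sim(one_hot["food"], one_hot["restaurant"]))  # 0.0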

This issue gave rise to what we now call word embeddings. A word embedding does not just map a word to a number; it captures the word’s semantics and syntactic context in a vector representation. Some popular word embedding techniques include Word2Vec, GloVe, ELMo, and fastText.

The underlying concept is to use information from the words adjacent to the word. There has been path-breaking innovation in Word Embedding techniques, with researchers finding better ways to represent more information on the words and possibly scaling these to represent not only words but entire sentences and paragraphs.

Also Read: An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec

What is Sentence Embedding?

In NLP, sentence embedding refers to a numeric representation of a sentence in the form of a vector of real numbers, which encodes meaningful semantic information. It enables comparisons of sentence similarity by measuring the distance or similarity between these vectors. Techniques like Universal Sentence Encoder (USE) use deep learning models trained on large corpora to generate these embeddings, which find applications in tasks like text classification, clustering, and similarity matching.

What if we could work directly with individual sentences instead of dealing with individual words? In the case of large text, using only words would be very tedious, and we would be limited by the information we can extract from the word embeddings.

Suppose we encounter a sentence like ‘I don’t like crowded places’ and, a few sentences later, read ‘However, I like one of the world’s busiest cities, New York’. How can the machine relate ‘crowded places’ to ‘busy cities’?

Clearly, word embedding would fall short here, so we use Sentence Embedding. Sentence embedding techniques represent entire sentences and their semantic information as vectors. This helps the machine understand the context, intention, and other nuances in the entire text.

Sentence Embedding Models

Sentence embedding models are designed to encapsulate a sentence’s semantic essence within a fixed-length vector. Unlike traditional Bag-of-Words (BoW) representations or one-hot encoding, sentence embeddings capture context, meaning, and relationships between words. This transformation is crucial for enabling machines to grasp the subtleties of human language.

Methods of Sentence Embedding

Several methods are employed to generate sentence embeddings:

  1. Averaging Word Embeddings: This approach averages the word embeddings of the words in a sentence (see the sketch after this list). While simple, it may not capture complex contextual nuances.
  2. Pre-trained Models like BERT: Models like BERT (Bidirectional Encoder Representations from Transformers) have revolutionized sentence embeddings. BERT-based models consider each word’s context in a sentence, resulting in rich and contextually aware embeddings.
  3. Neural Network-Based Approaches: Skip-Thought vectors and InferSent are examples of neural network-based sentence embedding models. They are trained to predict the surrounding sentences, encouraging them to understand sentence semantics.
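A minimal sketch of the averaging approach, using made-up 3-dimensional word vectors purely for illustration (in practice, the vectors would come from Word2Vec, GloVe, fastText, etc.):

import numpy as np

# Assumed toy word vectors, purely for illustration
word_vectors = {
    "i":      np.array([0.1, 0.3, 0.0]),
    "ate":    np.array([0.7, 0.2, 0.5]),
    "dinner": np.array([0.6, 0.1, 0.4]),
}

def average_embedding(tokens, word_vectors):
    # Average the vectors of the tokens we have embeddings for
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vectors, axis=0)

print(average_embedding(["i", "ate", "dinner"], word_vectors))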

Noteworthy Sentence Embedding Models

  1. BERT (Bidirectional Encoder Representations from Transformers): BERT has set a benchmark in sentence embeddings, offering pre-trained models for various NLP tasks. Its bidirectional attention and contextual understanding make it a prominent choice.
  2. RoBERTa: An evolution of BERT, RoBERTa fine-tunes its training methodology, achieving state-of-the-art performance in multiple NLP tasks.
  3. USE (Universal Sentence Encoder): Developed by Google, USE generates embeddings for text that can be used for various applications, including cross-lingual tasks.

Sentence Embedding Libraries

Like word embedding, sentence embedding is a popular research area, and several libraries now make these techniques straightforward to use. We will cover four of them:

  1. Doc2Vec
  2. SentenceBERT
  3. InferSent
  4. Universal Sentence Encoder

We assume you have prior knowledge of word embeddings and other fundamental NLP concepts. If you need a refresher, the word embeddings article linked above is a good place to start.

Now, let us begin!

We will first set up some basic libraries and define our list of sentences. The following steps will help you do so:

Step 1:

Firstly, import the libraries and download the ‘punkt’ tokenizer models:

import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize
from nltk.tokenize import word_tokenize
import numpy as np

Step 2:

Then, we define our list of sentences. You can use a larger list; keeping the sentences in a Python list makes it easy to process each one individually.

sentences = ["I ate dinner.", 
       "We had a three-course meal.", 
       "Brad came to dinner with us.",
       "He loves fish tacos.",
       "In the end, we all felt like we ate too much.",
       "We all agreed; it was a magnificent evening."]

Step 3:

We will also keep a tokenized version of these sentences:


# Tokenization of each document
tokenized_sent = []
for s in sentences:
    tokenized_sent.append(word_tokenize(s.lower()))
print(tokenized_sent)

Step 4:

Finally, we define a function that returns the cosine similarity between two vectors:

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the vector norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
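As a quick sanity check (with made-up vectors), identical vectors should give a similarity of 1.0 and orthogonal vectors 0.0:

print(cosine(np.array([1.0, 2.0]), np.array([1.0, 2.0])))  # 1.0
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0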

Let us start by exploring the Sentence Embedding techniques one by one.

Doc2Vec

An extension of Word2Vec, Doc2Vec embedding is one of the most popular techniques. Introduced in 2014, it is an unsupervised algorithm that adds to the Word2Vec model by introducing another ‘paragraph vector’. There are two ways to add the paragraph vector to the model.

1. PV-DM (Distributed Memory version of Paragraph Vector): We assign a paragraph vector to each sentence while sharing word vectors among all sentences. We then either average or concatenate the paragraph vector with the context word vectors to get the final representation. If you notice, it is an extension of the Continuous Bag-of-Words flavour of Word2Vec, where we predict a target word given its context words; in PV-DM, we predict the target word given the context words plus the paragraph vector.

2. PV-DBOW (Distributed Bag of Words version of Paragraph Vector): Just like PV-DM, PV-DBOW is an extension, this time of the Skip-gram flavour. Here, we sample random words from the sentence and train the model to predict them given only the paragraph vector (a classification task).

The paper’s authors recommend combining both, but state that PV-DM alone is usually sufficient for most tasks (in Gensim, the choice between the two is controlled by the dm parameter, as shown after the training step below).

Step 1:

We will use Gensim to demonstrate how to use Doc2Vec. We already have a list of sentences. We will first import the model and other libraries and then build a tagged sentence corpus. Each sentence is now represented as a TaggedDocument containing a list of the words in it and a tag associated with it.

# import
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_sent)]
tagged_data

Step 2:

We then train the model with the parameters:

## Train doc2vec model
model = Doc2Vec(tagged_data, vector_size = 20, window = 2, min_count = 1, epochs = 100)

'''
vector_size = Dimensionality of the feature vectors.
window = The maximum distance between the current and predicted word within a sentence.
min_count = Ignores all words with total frequency lower than this.
epochs = Number of passes over the corpus during training.
'''

## Print model vocabulary (use model.wv.vocab in Gensim < 4.0)
model.wv.key_to_index
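Note that the call above uses Gensim’s default dm=1, which corresponds to the PV-DM variant described earlier. A minimal variation with the same parameters trains PV-DBOW instead by setting dm=0:

## PV-DBOW variant: dm=0 (dm=1, the default, gives PV-DM)
model_dbow = Doc2Vec(tagged_data, vector_size = 20, window = 2, min_count = 1, epochs = 100, dm = 0)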

Step 3:

We now take a new test sentence and find the most similar sentences in our data, displayed in order of decreasing similarity. The infer_vector method returns the vectorized form of the test sentence (its inferred paragraph vector), and the most_similar method returns the most similar sentences.

test_doc = word_tokenize("I had pizza and pasta".lower())
test_doc_vector = model.infer_vector(test_doc)
model.dv.most_similar(positive = [test_doc_vector])  # use model.docvecs in Gensim < 4.0

'''
positive = List of vectors that contribute positively to the similarity.
'''

SentenceBERT

The leader of the pack, SentenceBERT, was introduced in 2019 and immediately took pole position for sentence embeddings. At the heart of this BERT-based model are four key concepts:

  • Attention
  • Transformers
  • BERT
  • Siamese Network

Sentence-BERT uses a Siamese network-like architecture that takes two sentences as input. These sentences are passed through identical BERT models and a pooling layer to generate their embeddings, and the two embeddings are then compared, for example with cosine similarity.
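To make the pooling step concrete, here is a minimal sketch (not the sentence-transformers implementation itself) of mean pooling over BERT token embeddings, using the Hugging Face transformers library; the 'bert-base-uncased' checkpoint is only an example:

import torch
from transformers import AutoTokenizer, AutoModel

# Example encoder; Sentence-BERT models are fine-tuned variants of encoders like this
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def mean_pooled_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_embeddings = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1).float()       # (1, seq_len, 1)
    # Mean pooling: average the token vectors, ignoring padding positions
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

print(mean_pooled_embedding("I had pizza and pasta").shape)  # torch.Size([1, 768])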

We can install Sentence BERT using:

!pip install sentence-transformers

Step 1:

We will then load a pre-trained Sentence-BERT model. Many other pre-trained models are available; the full list can be found in the sentence-transformers documentation.

from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

Step 2:

We will then encode the provided sentences. We can also display the sentence vectors (just uncomment the code below):

sentence_embeddings = sbert_model.encode(sentences)

#print('Sample BERT embedding vector - length', len(sentence_embeddings[0]))
#print('Sample BERT embedding vector - note includes negative values', sentence_embeddings[0])

 

Step 3:

Then we will define a test query and encode it as well:

query = "I had pizza and pasta"
query_vec = sbert_model.encode([query])[0]

Step 4:

We will then compute the cosine similarity using the cosine function we defined earlier, retrieving the similarity values between each sentence and our test query:

for sent in sentences:
  sim = cosine(query_vec, sbert_model.encode([sent])[0])
  print("Sentence = ", sent, "; similarity = ", sim)

There you go; we have obtained the similarity between the sentences in our text and our test sentence. A crucial point to note is that SentenceBERT is pretty slow if you want to train it from scratch.

InferSent

Presented by Facebook AI Research in 2017, InferSent is a supervised sentence embedding technique. The main feature of this model is that it is trained on Natural Language Inference (NLI) data, specifically the SNLI (Stanford Natural Language Inference) dataset. This dataset consists of 570k human-generated English sentence pairs, manually labelled with one of three categories: entailment, contradiction, or neutral.

Just like Sentence-BERT, the model encodes a pair of sentences to generate their embeddings. It then extracts the relations between these embeddings using:

  • concatenation
  • element-wise product
  • absolute element-wise difference.

The output vector of these operations is then fed to a classifier that assigns it one of the three categories defined above (a minimal sketch of this feature combination follows below). The paper proposes various encoder architectures, mainly centred around GRUs, LSTMs, and BiLSTMs.
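A minimal sketch of that feature combination (not the authors’ code; u and v stand for the two sentence embeddings, and 4096 is the embedding size of the released BiLSTM-max encoders):

import torch

# u, v: premise and hypothesis sentence embeddings (random here, purely for illustration)
u = torch.randn(1, 4096)
v = torch.randn(1, 4096)

# Concatenation, absolute element-wise difference, and element-wise product
features = torch.cat([u, v, torch.abs(u - v), u * v], dim=1)
print(features.shape)  # torch.Size([1, 16384]) -- this vector is fed to the classifier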

Another essential feature is that InferSent uses GloVe vectors for pre-trained word embeddings. A more recent version of InferSent, InferSent2, uses fastText.

Let’s see how the Sentence Similarity task works using InferSent. We will use PyTorch for this, so please make sure that you have the latest version installed.

Step 1:

As mentioned above, there are two versions of InferSent: version 1 uses GloVe, while version 2 uses fastText vectors. You can work with either; since we download GloVe vectors below, we use version 1 here. Thus, we download the InferSent model and the pre-trained word vectors. Please save the models.py file from the InferSent GitHub repository and store it in your working directory.

We also need to save the trained model and the pre-trained GloVe word vectors. According to the code below, our working directory should have an ‘encoder’ folder and a ‘GloVe’ folder. The encoder folder will hold our model, while the GloVe folder should hold the word vectors:

! mkdir encoder
! curl -Lo encoder/infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl
  
! mkdir GloVe
! curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
! unzip GloVe/glove.840B.300d.zip -d GloVe/

Then we load our model and our word embeddings:

from models import InferSent
import torch

V = 1  # version 1 (GloVe); use V = 2 only with infersent2.pkl and fastText vectors
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = 'GloVe/glove.840B.300d.txt'  # adjust if your GloVe folder lives elsewhere (e.g. /content/GloVe on Colab)
model.set_w2v_path(W2V_PATH)

Step 2:

Then, we build the vocabulary from the list of sentences that we defined at the beginning:

model.build_vocab(sentences, tokenize=True)

Step 3:

Like before, we encode the test query with InferSent to generate its embedding. Note that encode expects a list of sentences:

query = "I had pizza and pasta"
query_vec = model.encode([query])[0]
query_vec

Step 4:

Finally, we compute the cosine similarity of this query with each sentence in our text:

for sent in sentences:
  sim = cosine(query_vec, model.encode([sent])[0])
  print("Sentence = ", sent, "; similarity = ", sim)

Universal Sentence Encoder

The Universal Sentence Encoder is one of the best-performing sentence embedding techniques. It was proposed by Google, so that should come as no surprise to anybody. Its key feature is that it is trained with multi-task learning.

This means that the sentence embeddings it generates can be used for multiple tasks, such as sentiment analysis, text classification, and sentence similarity, and the results of these tasks are fed back into the model to produce even better sentence vectors than before.

The most interesting part is that this encoder is based on two encoder models, and we can use either of the two:

  • Transformer
  • Deep Averaging Network (DAN)

Both of these models can take a word or a sentence as input and generate embeddings for the same. The following is the basic flow:

  1. Tokenize the sentences after converting them to lowercase
  2. Depending on the type of encoder, the sentence gets converted to a 512-dimensional vector
    • If we use the transformer, it is similar to the encoder module of the transformer architecture and uses the self-attention mechanism.
    • The DAN option first computes the unigram and bigram embeddings and averages them to get a single embedding. This is then passed to a deep neural network for a final sentence embedding of 512 dimensions.
  3. These sentence embeddings are then used for various unsupervised and supervised training tasks, such as Skip-Thought-style prediction and NLI. The trained model is then reused to generate new 512-dimensional sentence embeddings.

USE Embedding Process

To start using the USE embedding, we first need to install TensorFlow and TensorFlow hub:

!pip3 install --upgrade tensorflow
# Install TF-Hub.
!pip3 install tensorflow-hub

Step 1: Firstly, we will import the following necessary libraries:

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np

Step 2: The model is available to us via the TFHub. Let’s load the model:

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" 
model = hub.load(module_url)
print ("module %s loaded" % module_url)

Step 3: We will then generate embeddings for our sentence list and query. This is as simple as just passing the sentences to the model:

sentence_embeddings = model(sentences)
query = "I had pizza and pasta"
query_vec = model([query])[0]

Step 4: Finally, we will compute the similarity between our test query and the list of sentences:

for sent in sentences:
  sim = cosine(query_vec, model([sent])[0])
  print("Sentence = ", sent, "; similarity = ", sim)

Conclusion

To conclude, we saw the top 4 sentence embedding techniques in NLP and basic code for using them to find text similarity. I urge you to take up a larger dataset and try these models out on it for other NLP tasks. Also, this is just basic code for calculating sentence similarity; for a production-grade model, you would want to preprocess the sentences more carefully before generating their embeddings.

I have also given an overview of the architecture, and I can’t wait to explore more about how sentence embedding techniques will help machines understand our language better!

Moreover, this is not to say that there are no other popular models. Some honourable mentions include FastSent, Skip-Thought, Quick-Thought, Word Mover’s Embedding, etc. If you have tried these or any other model, please share it with us in the comments below!

Frequently Asked Questions

Q1. What are the methods of sentence embedding?

A. Sentence embedding methods include averaging word embeddings, using pre-trained models like BERT, and neural network-based approaches like Skip-Thought vectors.

Q2. What is an example of a word embedding?

A. An example of word embedding is Word2Vec, which represents words as continuous vector values, capturing semantic relationships.

Q3. What is the best sentence embedding model?

A. The best sentence embedding model can vary based on the task, but BERT-based models like RoBERTa and T5 are often considered among the top choices.

Q4. What is the difference between sentence encoding and sentence embedding?

A. Sentence encoding focuses on representing a sentence in a fixed-size vector, while sentence embedding aims to capture contextual and semantic information in a continuous vector representation.


