Measuring Text Similarity Using BERT

Mrinal Singh Walia 29 May, 2021 • 6 min read

This article was published as a part of the Data Science Blogathon

BERT is too kind — so this article will be touching on BERT and sequence relationships!

Abstract

A significant portion of NLP relies on the connection in highly-dimensional spaces. Typically an NLP processing will take any text, prepare it to generate a tremendous vector/array rendering said text — then make certain transformations.

It’s a highly-dimensional charm. At an exceptional level, there’s not much extra to it. We require to understand what is following in detail and execute this in Python too! So, let’s get incited.

Introduction

Sentence similarity is one of the most explicit examples of how compelling a highly-dimensional spell can be.

The thesis is this:

Take a line of sentence, transform it into a vector.
Take various other penalties, and change them into vectors.
Spot sentences with the shortest distance (Euclidean) or tiniest angle (cosine similarity) among them.
We instantly get a standard of semantic similarity connecting sentences.

How BERT Helps?

BERT, as we previously stated — is a special MVP of NLP. And a massive part of this is underneath BERTs capability to embed the essence of words inside densely bound vectors.

We call them dense vectors because each value inside the vector has a value and has a purpose for holding that value — this is in contradiction to sparse vectors. Hence, as one-hot encoded vectors where the preponderance of proceedings is 0.

BERT is skilled at generating those dense vectors, and all encoder layer (there are numerous) outputs a collection of dense vectors.

Text similarity | BERT — https://arxiv.org/pdf/1810.04805.pdf

For the BERT support, this will be a vector comprising 768 digits. Those 768 values have our mathematical representation of a particular token — which we can practice as contextual message embeddings.

Unit vector denoting each token (product by each encoder) is indeed watching tensor (768 by the number of tickets).

We can use these tensors and convert them to generate semantic designs of the input sequence. We can next take our similarity metrics and measure the corresponding similarity linking separate lines.

The easiest and most regularly extracted tensor is the last_hidden_state tensor, conveniently yield by the BERT model.

Of course, this is a moderately large tensor — at 512×768 — and we need a vector to implement our similarity measures.

To do this, we require to turn our last_hidden_states tensor to a vector of 768 tensors.

Building The Vector

For us to transform our last_hidden_states tensor into our desired vector — we use a mean pooling method.

Each of these 512 tokens has separate 768 values. This pooling work will take the average of all token embeddings and consolidate them into a unique 768 vector space, producing a ‘sentence vector’.

At the very time, we can’t just exercise the mean activation as is. We lack to estimate null padding tokens (which we should not hold).

Implementation

That’s noted on the theory and logic following the process, but how do we employ this in certainty?

We’ll describe two approaches — the comfortable way and the slightly more complicated way.

Method1: Sentence-Transformers

The usual straightforward approach for us to perform everything we just included is within the sentence; transformers library, which covers most of this rule into a few lines of code.

First, we install sentence-transformers utilizing pip install sentence-transformers. This library uses HuggingFace’s transformers behind the pictures — so we can genuinely find sentence-transformers models here.

We’ll be getting used to the best-base-no-mean-tokens model, which executes the very logic we’ve reviewed so far.

(It also utilizes 128 input tokens, willingly than 512).

Let’s generate some sentences, initialize our representation, and encode the lines of words:

#Write some lines to encode (sentences 0 and 2 are both ideltical):
sen = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell."
]
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
#Encoding:
sen_embeddings = model.encode(sen)
sen_embeddings.shape

Output: (4, 768)

Great, we now own four-sentence embeddings, each holding 768 values.

Now, something we do is use those embeddings and discover the cosine similarity linking each. So for line 0:

Three years later, the coffin was still full of Jello.

We can locate the most comparable sentence applying:

from sklearn.metrics.pairwise import cosine_similarity
#let's calculate cosine similarity for sentence 0:
cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)

Output: array([[0.33088914, 0.7219258 , 0.5548363 ]], dtype=float32)

Index Sentence Similarity

1 “The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.” 0.3309

2 “The person box was packed with jelly many dozens of months later.” 0.7219

3 “He found a leprechaun in his walnut shell.” 0.5547

Index	Sentence	Similarity
1	“The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.”	0.3309
2	“The person box was packed with jelly many dozens of months later.”	0.7219
3	“He found a leprechaun in his walnut shell.”	0.5547

Now, here is the extra convenient and more intellectual approach.

Method2: Transformers And PyTorch

Before arriving at the second strategy, it is worth seeing that it does the identical thing as the above, but at one level more below.

We want to achieve our transformation to the last_hidden_state to produce the sentence embedding with this plan. For this, we work the mean pooling operation.

Additionally, before the mean pooling operation, we need to design last_hidden_state; here is the code for it:

from transformers import AutoTokenizer, AutoModel
import torch
#nitialize our model and tokenizer:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
###Tokenize the sentences like before:
sent = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell."
]
# initialize dictionary: stores tokenized sentences
token = {'input_ids': [], 'attention_mask': []}
for sentence in sent:
    # encode each sentence, append to dictionary
    new_token = tokenizer.encode_plus(sentence, max_length=128,
                                       truncation=True, padding='max_length',
                                       return_tensors='pt')
    token['input_ids'].append(new_token['input_ids'][0])
    token['attention_mask'].append(new_token['attention_mask'][0])
# reformat list of tensors to single tensor
token['input_ids'] = torch.stack(token['input_ids'])
token['attention_mask'] = torch.stack(token['attention_mask'])

#Process tokens through model:
output = model(**token)
output.keys()

Output: odict_keys([‘last_hidden_state’, ‘pooler_output’])

#The dense vector representations of text are contained within the outputs 'last_hidden_state' tensor
embeddings = outputs.last_hidden_state
embeddings

embeddings.shape

Output: torch.Size([4, 128, 768])

After writing our dense vector embeddings, we want to produce a mean pooling operation to form a single vector encoding, i.e., sentence embedding).

To achieve this mean pooling operation, we will require multiplying all values in our embeddings tensor by its corresponding attention_mask value to neglect non-real tokens.

# To perform this operation, we first resize our attention_mask tensor:
att_mask = tokens['attention_mask']
att_mask.shape

output: torch.Size([4, 128])

mask = att_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape

Output: torch.Size([4, 128, 768])

mask_embeddings = embeddings * mask
mask_embeddings.shape

Output: torch.Size([4, 128, 768])

#Then we sum the remained of the embeddings along axis 1:
summed = torch.sum(mask_embeddings, 1)
summed.shape

Output: torch.Size([4, 768])

#Then sum the number of values that must be given attention in each position of the tensor:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape

Output: torch.Size([4, 768])

mean_pooled = summed / summed_mask
mean_pooled

Once we possess our dense vectors, we can compute the cosine similarity among each — which is the likewise logic we used previously:

from sklearn.metrics.pairwise import cosine_similarity
#Let's calculate cosine similarity for sentence 0:
# convert from PyTorch tensor to numpy array
mean_pooled = mean_pooled.detach().numpy()
# calculate
cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)

Output: array([[0.3308891 , 0.721926 , 0.55483633]], dtype=float32)

Index Sentence Similarity

1 “The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.” 0.3309

2 “The person box was packed with jelly many dozens of months later.” 0.7219

3 “He found a leprechaun in his walnut shell.” 0.5548

Index	Sentence	Similarity
1	“The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.”	0.3309
2	“The person box was packed with jelly many dozens of months later.”	0.7219
3	“He found a leprechaun in his walnut shell.”	0.5548

We return around the identical results — the only distinction being that the cosine similarity for index three has slipped from 0.5547 to 0.5548 — an insignificant variation due to rounding.

End Notes

That’s all for this introduction to mapping the semantic similarity of sentences using BERT reviewing sentence-transformers and a lower-level explanation with Python-PyTorch and transformers.
I hope you’ve relished the article. Let me know if you hold any questions or suggestions via LinkedIn or in the remarks below.
Thanks for reading!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Mrinal Singh Walia 29 May 2021

Advanced Deep Learning NLP Text Unstructured Data

Measuring Text Similarity Using BERT

Abstract

Introduction

How BERT Helps?

Building The Vector

Implementation

Method1: Sentence-Transformers

Method2: Transformers And PyTorch

End Notes

Frequently Asked Questions

Responses From Readers

Write for us

Natural Language Processing

Measuring Text Similarity Using BERT

Abstract

Introduction

How BERT Helps?

Building The Vector

Implementation

Method1: Sentence-Transformers

Method2: Transformers And PyTorch

End Notes

Frequently Asked Questions

Responses From Readers

Write for us

Natural Language Processing

Introduction to NLP

Text Pre-processing

NLP Libraries

Regular Expressions

String Similarity

Spelling Correction

Topic Modeling

Text Representation

Information Retrieval System

Word Vectors

Word Senses

Dependency Parsing

Language Modeling

Getting Started with RNN

Different Variants of RNN

Machine Translation and Attention

Self Attention and Transformers

Transfomers and Pretraining

Question Answering

Text Summarization

Named Entity Recognition

Coreference Resolution

Audio Data

ASR

Audio Separation

Chatbot

Auto NLP