# Measuring Text Similarity Using BERT

*This article was published as a part of the Data Science Blogathon*

**BERT is too kind — so this article will be touching on BERT and sequence relationships!**

__Abstract__

__Abstract__*A significant portion of **NLP** relies on the connection in highly-dimensional spaces. Typically an* NLP* processing will take any text, prepare it to generate a tremendous vector/array rendering said text — then make certain transformations. *

*It’s a highly-dimensional charm.** At an exceptional level, there’s not much extra to it. We require to understand what is following in detail and execute this in Python too! So, let’s get incited. *

__Introduction__

__Introduction__Sentence similarity is one of the most explicit examples of how compelling a highly-dimensional spell can be.

__The thesis is this:__

*Take a line of**sentence**, transform it into a**vector**.**Take various other**penalties**, and change them into**vectors**.**Spot sentences with the**shortest distance (Euclidean)**or**tiniest angle (cosine similarity)**among them.**We instantly get a standard of**semantic similarity connecting sentences.*

__How BERT Helps?__

__How BERT Helps?__BERT, as we previously stated — is a special MVP of NLP. And a massive part of this is underneath BERTs capability to **embed the essence of words inside densely bound vectors.**

We call them ** dense vectors** because

*each value inside the vector has a value and has a purpose for holding that value*— this is in contradiction to

**. Hence, as**

*sparse*vectors**one-hot encoded vectors**where the preponderance of proceedings is

**0**.

BERT is skilled at generating those dense vectors, and all encoder layer (there are numerous) outputs a collection of dense vectors.

For the **BERT** support, this will be a vector comprising **768 digits.** Those **768 values** have our mathematical representation of a particular **token **— which we can *practice as contextual message embeddings.*

Unit vector denoting each token (*product by each encoder*) is indeed watching tensor (*768 by the number of tickets)*.

**We can use these tensors** and convert them to **generate semantic designs** of the **input sequence**. We can next take our *similarity metrics and measure the corresponding similarity linking separate lines.*

The *easiest *and most *regularly extracted tensor *is the ** last_hidden_state tensor**, conveniently yield by the

**BERT model**.

Of course, this is a moderately large tensor — at** 512×768** — and we *need a vector to implement our similarity measures.*

To do this, we require to turn our ** last_hidden_states tensor** to a

**vector of 768 tensors.**

__Building The Vector__

__Building The Vector__For us to transform our ** last_hidden_states tensor** into our desired vector — we use a

**mean pooling**method.

Each of these 512 tokens has separate 768 values. This pooling work will take the average of all token embeddings and consolidate them into a unique 768 vector space, producing a ‘sentence vector’.

At the very time, we can’t just exercise the **mean activation** as is. We lack to *estimate null padding tokens* (which we should not hold).

__Implementation__

__Implementation__

That’s noted on the theory and logic following the process, but how do we employ this in certainty?

We’ll describe two approaches — the comfortable way and the slightly more complicated way.

__Method1: Sentence-Transformers__

__Method1: Sentence-Transformers__

The usual straightforward approach for us to perform everything we just included is within the sentence; **transformers library**, which covers most of this rule into a* few lines of code.*

- First, we
*install***sentence-transformers**utilizing. This library uses__pip install sentence-transformers__behind the pictures — so we can genuinely find*HuggingFace’s transformers**sentence-transformers*models**here****.**

- We’ll be getting used to the
model, which*best-base-no-mean-tokens**executes the very logic we’ve reviewed so far.*

*(It also utilizes 128 input tokens, willingly than 512).*

Let’s generate some sentences, initialize our representation, and encode the lines of words:

#Write some lines to encode (sentences 0 and 2 are both ideltical): sen = [ "Three years later, the coffin was still full of Jello.", "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.", "The person box was packed with jelly many dozens of months later.", "He found a leprechaun in his walnut shell." ] from sentence_transformers import SentenceTransformer model = SentenceTransformer('bert-base-nli-mean-tokens') #Encoding: sen_embeddings = model.encode(sen) sen_embeddings.shape

(4, 768)Output:

Great, we now own **four-sentence embeddings**, each holding *768 values.*

Now, something we do is *use those embeddings* and **discover the cosine similarity linking each**. *So for line 0:*

*Three years later, the coffin was still full of Jello.*

We can locate the *most comparable sentence applying:*

from sklearn.metrics.pairwise import cosine_similarity #let's calculate cosine similarity for sentence 0: cosine_similarity( [sentence_embeddings[0]], sentence_embeddings[1:])

array([[0.33088914, 0.7219258 , 0.5548363 ]], dtype=float32)Output:

Index Sentence Similarity 1“The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.”0.33092“The person box was packed with jelly many dozens of months later.”0.72193“He found a leprechaun in his walnut shell.”0.5547

Now, here is the extra convenient and more intellectual approach.

__Method2: Transformers And PyTorch__

__Method2: Transformers And PyTorch__

Before arriving at the second strategy, it is worth seeing that it does the identical thing as the above, but at one level more below.

- We want to achieve our transformation to the
to produce the sentence embedding with this plan. For this,*last_hidden_state**we work the mean pooling operation.*

- Additionally, before the
**mean pooling operation**, we need to design;*last_hidden_state**here is the code for it:*

from transformers import AutoTokenizer, AutoModel import torch #nitialize our model and tokenizer: tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens') model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens') ###Tokenize the sentences like before: sent = [ "Three years later, the coffin was still full of Jello.", "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.", "The person box was packed with jelly many dozens of months later.", "He found a leprechaun in his walnut shell." ] # initialize dictionary: stores tokenized sentences token = {'input_ids': [], 'attention_mask': []} for sentence in sent: # encode each sentence, append to dictionary new_token = tokenizer.encode_plus(sentence, max_length=128, truncation=True, padding='max_length', return_tensors='pt') token['input_ids'].append(new_token['input_ids'][0]) token['attention_mask'].append(new_token['attention_mask'][0]) # reformat list of tensors to single tensor token['input_ids'] = torch.stack(token['input_ids']) token['attention_mask'] = torch.stack(token['attention_mask'])

#Process tokens through model: output = model(**token) output.keys()

odict_keys([‘last_hidden_state’, ‘pooler_output’])Output:

#The dense vector representations of text are contained within the outputs 'last_hidden_state' tensor embeddings = outputs.last_hidden_state embeddings

`embeddings.shape`

torch.Size([4, 128, 768])Output:

After writing our **dense vector embeddings**, we want to produce a ** mean pooling operation** to form a

**single vector encoding**, i.e.,

**sentence embedding**).

To achieve this **mean pooling operation**, we will *require multiplying all values in our **embeddings tensor **by its corresponding **attention_mask **value* to neglect non-real tokens.

```
# To perform this operation, we first resize our attention_mask tensor:
att_mask = tokens['attention_mask']
att_mask.shape
```

torch.Size([4, 128])output:

mask = att_mask.unsqueeze(-1).expand(embeddings.size()).float() mask.shape

torch.Size([4, 128, 768])Output:

mask_embeddings = embeddings * mask mask_embeddings.shape

torch.Size([4, 128, 768])Output:

#Then we sum the remained of the embeddings along axis 1: summed = torch.sum(mask_embeddings, 1) summed.shape

torch.Size([4, 768])Output:

#Then sum the number of values that must be given attention in each position of the tensor: summed_mask = torch.clamp(mask.sum(1), min=1e-9) summed_mask.shape

torch.Size([4, 768])Output:

mean_pooled = summed / summed_mask mean_pooled

Once we possess our **dense vectors**, we can compute the **cosine similarity** among each — *which is the likewise logic we used previously:*

from sklearn.metrics.pairwise import cosine_similarity #Let's calculate cosine similarity for sentence 0: # convert from PyTorch tensor to numpy array mean_pooled = mean_pooled.detach().numpy() # calculate cosine_similarity( [mean_pooled[0]], mean_pooled[1:] )

array([[0.3308891 , 0.721926 , 0.55483633]], dtype=float32)Output:

Index Sentence Similarity 1“The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.”0.33092“The person box was packed with jelly many dozens of months later.”0.72193“He found a leprechaun in his walnut shell.”0.5548

*We return around the identical results — the only distinction being that the cosine similarity for index three has slipped from 0.5547 to 0.5548 — an insignificant variation due to rounding.*

__End Notes__

__End Notes__

That’s all for this introduction to mapping the semantic similarity of sentences using BERT reviewing sentence-transformers and a lower-level explanation with Python-PyTorch and transformers.

I hope you’ve relished the article. Let me know if you hold any questions or suggestions via LinkedIn or in the remarks below.

Thanks for reading!

*The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.*