Neha Seth — Updated On August 1st, 2022
Beginner NLP Python Technique Text Unsupervised

Introduction

One of the applications of Natural Language Processing is auto-correction and spellings checks. All of us have encountered this that if we type an incorrect or typo in the Google search engine, then the engine automatically corrects it and suggests the right word in its place. How does the engine do that? How does it know what word we wanted to write or ask? That is what we will be covering in this article. The methods available to check this, and the implementation of each method in Python.

String similarity image

 

Table of Contents

  • String Similarity
  • Hamming Distance
  • Normalized Hamming Distance
  • Levenshtein Distance
  • Matrix Method for Levenshtein Distance
  • Summary

String Similarity

The search engine is able to autocorrect the spellings by checking the similarity between the strings. The way to check the similarity between any data point or groups is by calculating the distance between those data points. In textual data as well, we check the similarity between the strings by calculating the distance between one text to another text.

There are various algorithms available to calculate the distance between texts. Here, we will be looking at two such methods: Hamming Distance and Levenshtein Distance, which fall under the category of edit distance-based. The string similarity is also used for speech recognition and language translation.

 

Hamming Distance

Hamming Distance, named after the American mathematician, is the simplest algorithm for calculating string similarity. It checks the similarity by comparing the changes in the number of positions between the two strings. The method compares each and every letter of one string with the corresponding letter of the other string.

Let’s take these two words: TIME and MINE. Every letter of the two strings will be compared in the following way:

String similarity hamming distance
  • The first two letters T and M are different so the number of positions in which the two strings are different is 1 up until now.

  • The next two letters ‘I’ are the same, hence the number of positions at which two strings are different is 0.

  • Now, the next two letters M and N are again different so the number of positions of two strings being different is 1, and

  • The last two letters E are the same so the number of positions of two strings different is 0.

This way the Hamming distance is 2 = 1 + 0 + 1 + 0. It is the total number of positions different between two strings at each character’s place. In short, the number of unequal characters is equal to the Hamming distance.

The Hamming distance can range anywhere between 0 and any integer value, even equal to the length of the string. For this, we can also normalize the value by taking the ratio of the Hamming distance to the length of the string in the following manner:

Normalized Hamming Distance = Hamming Distance/ length of the string

 

Normalized Hamming distance gives the percentage to which the two strings are dissimilar. The normalized Hamming distance for the above TIME and MINE example is: 2/4 = 0.50, hence 50% of these two characters are not similar. A lower value of Normalized Hamming distance means the two strings are more similar.

In Python, we can implement using the Scipy library, which has the direct Hamming distance function. The codes are available in my GitHub Repository.

Python Code:

As we have seen above, this metric can only be used when the strings are of the same length. Hence, we need a more robust algorithm that can be used for words or strings having different lengths and therefore, will move to the next method: Levenshtein Distance.

 

Levenshtein Distance

Levenshtein distance is the most frequently used algorithm. It was founded by the Russian scientist, Vladimir Levenshtein to calculate the similarities between two strings. This is also known as the Edit distance-based algorithm as it computes the number of edits required to transform one string to another. The edits count the following as an operation:

  • Insertion of a character
  • Deletion of a character
  • Substitution of a character

More the number of operations, less is the similarity between the two strings.

Levenshtein distance works in the following manner, let’s take the two words CLOCK and CLONE:

Here, the number of edits required are two as the first three letters: C, L, O of the two words are the same. Now, we need to substitute C in place of N and K in place of E to transform the word CLOCK to CLONE. Therefore, here the Levenshtein distance is 2.

Let’s check if we can find the distance using Levenshtein’s algorithm for our original example of MAN and WOMAN, where the lengths of the string are unequal:

String similarity levenshtein distance

To convert the word MAN to WOMAN, we would need to do two edits of inserting the first two letters W and O. Hence, the Levenshtein distance here is 2. The Levenshtein algorithm can be used to measure the similarity between the strings of different lengths as well.

Up until now, we have calculated the edits directly by looking at the words. However, it would be difficult at times to get the minimum number of edits required to transform the words just by looking at them. Hence, there is another method to compute the Levenshtein distance by the matrix method.

 

Matrix Method for Levenshtein Distance

The matrix form to calculate Levenshtein distance follow as:

Firstly, we’ll calculate for the CLOCK-CLONE example, where the length of the strings is equal:

Here, I have written the words CLOCK and CLONE in row and column, respectively. We can interchange this order of the words as well, which will not impact the end result. We would need to write each of the letters of the respective words as a separate element of the matrix.

 

Step 1:

As an initial step, we first mark the letters starting from 1 to the length of the string. So here, both CLOCK and CLONE are marked from 1 to 5:

Matrix Method for Levenshtein Distance

The number 0 represents a null string meaning it has no alphabets or letters in it. It reflects the number of edits required to convert the word to the null string. 

Here, to convert CLOCK to a null string the number of edits needed is five. Similarly, the number of edits required to convert CLONE to a null string will also be five in this case.

To further break it down, on a granular level, if looking at the word CLOCK:

  • The number of edits required to transform the first letter C to a null string (that is 0) will be 1, hence the number under the letter C is marked as 1.

  • The number of edits required to convert the second letter L to the null string will be 2, so the number under L is 2.

  • Similarly, the number 3 under the letter O represents that the number of edits required to convert the third letter O to the null string will be 3 and so on.

In the same manner, the word CLONE can be broken from 1 to 5.

 

Step 2:

We will fill the other values of the matrix depending on how many edits are required to convert each letter of the string to another or by the surrounding values.

Method 1:

We have looked above at how to convert each letter of the word to another word. Just to quickly recap this, we’ll take each letter of the row and column in the following way:

i) Starting from the letter C of both the row and column, as both these letters are the same, we would not need to do any edits. Therefore, the value corresponding to this will be 0.

ii) Moving in the same row having the letter C:

  • CL of the first word CLOCK -> to convert to C of the second word CLONE: need one edit. That is to delete the letter L from CL to make it C. Hence, the Levenshtein distance between CL to C is 1.

  • CLO of the first word CLOCK -> to convert to C of the second word CLONE would need two edits.

  • In the same manner, CLOC of the first word CLOCK -> to convert to C of the second word CLONE would need three edits and so on.

In this way, by looking at each letter of the first-word corresponding to the letters of the next word, we can fill the first row like below:

method 2

iii) In a similar fashion, moving on we’ll take the next rows CL of the word CLONE to transform to the word CLOCK:

  • CL of the word CLONE -> to convert to C of the word CLOCK: need one edit. That is to delete the letter L from CL to make it C. Hence, the Levenshtein distance between CL to C is 1.

  • CL of the word CLONE -> to convert to CL of the word CLOCK, would not need any edits. So, the Levenshtein distance between CL and CL is 1.

  • CL of the word CLONE -> to convert to CLO of the word CLOCK, would need one edit.

  • CL of the word CLONE -> to convert to CLOC of the word CLOCK, would need two edits and so on.

In this way, the second row gets filled in the following manner:

clock

The remaining rows can be filled in the above way as well resulting in:

clock method 1

 

Method 2:

Now, there is another way to find the number of edits required by looking at the surrounding values:

  1. When the last letter of both the strings are the same: then we take the

minimum of (the diagonal value, first corresponding element +1, second corresponding element + 1)

  1. When the letters of the strings are not the same: then we take the minimum of the surrounding values + 1

We’ll look at the workings of each of these two below:

Let’s say in the first case: when we take the first letter of both the words C, which is the same. Here,

string similarity method 2
  • The diagonal element is 0.
  • The first corresponding element is 1, and
  • The second corresponding element is also 1.

So, the output becomes: min(0, 1+1, 1+1) = 0

Hence, the number of edits or the Levenshtein distance from C to C is 0.

string similarity levnshtein

Now, we will see how many edits are required to convert the first two letters CL of the first word CLOCK and to the first letter C of the word string CLONE:

  • Here, the last letters of the words CL and C are not the same.
  • The surrounding values here are: 0,1,2
  • Here, will take the minimum of the surrounding values + 1:

Hence, the output becomes = min(0,1,2) + 1 = 1

Therefore, we need one edit from CL to make it C and hence, the Levenshtein distance between CL and C is 1.

CL C

Let’s do it a couple of more times, where the last letters of the words are not the same to drive the point home:

To convert the first three letters CLO of the first word CLOCK and to the first letter C of the word string CLONE:

  • The surrounding values here are: 1,2,3
  • Here, will take the minimum of the surrounding values + 1:

So, the output becomes = min(1,2,3) + 1 = 1 + 1 = 2

Therefore, the number of edits required to transform CLO to C is 2. Hence, the Levenshtein distance between CLO and C is 2.

Levenshtein distance between CLO and C

Now, to convert the first four letters CLOC of the first word CLOCK and to the first letter C of the word string CLONE:

  • The surrounding values here are: 2,3,4
  • Again, will take the minimum of the surrounding values + 1:

Hence, the output becomes = min(2,3,4) + 1 = 2 + 1 = 3

The number of edits required to transform CLOC to C is 3. Therefore, the Levenshtein distance between CLOC and C is 3.

We’ll look at it one more time via the surrounding values method, where the last letters of the words are the same. Let’s say, we have reached the matrix point for CLO at both the respective row and column:

CLO at both the respective row and column

Now, will be using:

minimum of (the diagonal value, first corresponding element +1, second corresponding element + 1) to calculate the distance.

Here,

  • The diagonal element is 0.
  • The first corresponding element is 1, and
  • The second corresponding element is 1.

Hence, the output becomes: min(0, 1+1, 1+1) = 0

Therefore, the Levenshtein distance from CLO to CLO is 0.

In this way, the entire matrix can be filled as below. The last element represents the Levenshtein distance to convert the word CLOCK to CLONE, which is 2.

CLOCK to CLONE

Another noticeable thing about the matrix form of computation is we can know what all edits are required to convert the words:

  1. Whenever the current element is greater than 0, then we look at how we got this value. Here, the Levenshtein distance between CLOCK and CLONE is 2, which we got by taking the minimum of the surrounding values + 1 which is min(1,2,2) + 1 = 1 + 1 = 2.

  2. This value is dependent on the diagonal values and when moving upwards on the diagonal values then it represents Substitution. Here, we have to substitute K with E.

  3. Moving up the diagonal, the value is 1, which is greater than 0. We got 1 by taking the minimum of the surrounding values + 1 which is min(0,1,1) + 1 = 0 + 1 = 1. Here, we would need to substitute C with N.

  4. Again moving upwards on the diagonal, the elements are CLO on both the sides and the diagonal is 0, which means no edits are required. Similarly, for CL and C.

  5. On moving in the left direction from the last element 2, then that is Deletion. We would need to delete two letters from CLOCK to make it CLONE.

Hence, the total number of edits required to transform CLOCK to CLONE is 2.

The Levenshtein matrix for the second example of MAN-WOMAN, where the length of the string is not equal:

MAN-WOMAN

The Levenshtein distance to transform the word WOMAN to word MAN is 2.

In Python, there is no pre-written function to compute Levenshtein distance, so we define a custom function to implement it.

Step 1:

Using the NumPy library, define the matrix, its shape, and the initial values in the matrix are all 0. We will fill the matrix based on the distance calculation going forward.

Length of the matrix =  length of the strings + 1 because we add an extra row and column for the null string.

python code

Step 2:

We fill the first column and the first row with the index of the characters from the first string and second string, respectively:

first string and second string

Step 3:

Fill the values of the matrix-based upon if the characters of the strings are the same or different:

matrix-based upon

The Levenshtein distance for the above examples are as follows:

Levenshtein distance code

Summary

The Hamming distance tells the number of positions in which the two strings are different. Normalized Hamming distance is the ratio of the Hamming distance to the length of the string. A lower value of Normalized Hamming distance suggests the two strings are more similar.

The Levenshtein distance can be computed to find the distance between the two strings and also find the number of edits. It can be found using the surrounding values in a matrix form and also be used for strings with different lengths. To summarize:

  1. Firstly, mark the number of words from 1 to the length of the string. The number 0 is the null string reflecting the number of edits required to convert the word to the null string.

  2. When the last letter of both the strings are the same: then we take the

minimum of (the diagonal value, first corresponding element +1, second corresponding element + 1)

  1. When the letters of the strings are not the same: then we take the minimum of the surrounding values + 1

The last element of the Levenshtein matrix represents the Levenshtein distance between the strings. Moving upwards on the diagonal represents the Substitution of the elements and moving in the left direction is the Deletion of the elements.

Hope now you have a good idea of how these distance methods work. You may reach out to me on my LinkedIn: linkedin.com/in/neha-seth-69771111.

Thank You.

Happy Learning!

One thought on "A Simple Guide to Metrics for Calculating String Similarity"

Anurag Patel
Anurag Patel says: February 24, 2021 at 7:11 pm
Nice article. Very well explained. Keep it up. Reply

Leave a Reply Your email address will not be published. Required fields are marked *