Beginner’s Guide to Image and Text Similarity

This article was published as a part of the Data Science Blogathon.

Introduction

After my latest article on satellite image analysis, “Image Analysis and Mapping in Earth Engine Using NDVI”, this is another article about image analysis. Unlike the previous one, it discusses general image analysis rather than satellite image analysis. The goal is to detect whether two products are the same. Each product has an image and a text name. If a pair of products has similar or identical images or text names, the two products are treated as the same. The data comes from a competition held on Kaggle.

This script uses four basic packages: NumPy, pandas, matplotlib, and seaborn. It also uses several more specific ones. “Image” (from PIL) loads and shows image data. “imagehash” computes the similarity of two images. “fuzzywuzzy” measures the similarity of two texts. “DecisionTreeClassifier” (from scikit-learn) builds the classification model, and “metrics” computes the accuracy score from the true and predicted labels.

# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import imagehash
from fuzzywuzzy import fuzz
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

 

Image Similarity

The similarity of two images is measured with the package “imagehash”. If two images are identical or almost identical, the imagehash difference is 0. The closer the imagehash difference is to 0, the more similar the two images are.

Comparing two images with imagehash consists of 5 steps. (1) The images are converted to greyscale. (2) The images are shrunk, by default to 8×8 pixels. (3) The average value of the 64 pixels is computed. (4) Each of the 64 pixels is checked against the average, so every pixel gets a boolean value: true if it is greater than the average, false otherwise. (5) The imagehash difference is the number of positions at which the boolean values of the two images differ (the Hamming distance). Please observe the illustration below.

Image_1 (average: 71.96875)

Pixel values:
48  20  34  40  40  32  30  32
34  210 38  50  42  41  230 40
47  230 33  44  34  50  245 50
43  230 46  50  36  34  250 30
30  200 190 38  41  240 39  39
38  7   200 210 220 240 50  48
48  8   45  43  47  37  37  47
10  8   6   5   6   6   5   5

Pixel value greater than the average:
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE TRUE  FALSE FALSE FALSE FALSE TRUE  FALSE
FALSE TRUE  FALSE FALSE FALSE FALSE TRUE  FALSE
FALSE TRUE  FALSE FALSE FALSE FALSE TRUE  FALSE
FALSE TRUE  TRUE  FALSE FALSE TRUE  FALSE FALSE
FALSE FALSE TRUE  TRUE  TRUE  TRUE  FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

 

Image_2 (average: 78.4375)

Pixel values:
41  20  39  43  34  39  30  32
35  195 44  46  35  48  232 40
30  243 38  31  34  46  213 50
49  227 44  33  35  224 230 30
46  203 225 44  46  181 184 40
38  241 247 220 228 210 36  38
42  8   35  39  47  31  41  21
3   12  10  18  24  21  6   17

Pixel value greater than the average:
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE TRUE  FALSE FALSE FALSE FALSE TRUE  FALSE
FALSE TRUE  FALSE FALSE FALSE FALSE TRUE  FALSE
FALSE TRUE  FALSE FALSE FALSE TRUE  TRUE  FALSE
FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE  FALSE
FALSE TRUE  TRUE  TRUE  TRUE  TRUE  FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

The imagehash difference of the two images/matrices above is 3. It means that there are 3 pixels with different boolean values. The two images are relatively similar.
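Steps 3 to 5 can be traced on the two 8×8 matrices without any image files. The following is a minimal sketch in plain Python, assuming that the greyscale conversion and resizing (steps 1 and 2) have already produced the pixel values shown in the illustration. The function names average_hash_bits and hash_difference are illustrative and are not part of the imagehash package.

```python
# Pixel values of the two 8x8 greyscale matrices from the illustration.
image_1 = [
    [48, 20, 34, 40, 40, 32, 30, 32],
    [34, 210, 38, 50, 42, 41, 230, 40],
    [47, 230, 33, 44, 34, 50, 245, 50],
    [43, 230, 46, 50, 36, 34, 250, 30],
    [30, 200, 190, 38, 41, 240, 39, 39],
    [38, 7, 200, 210, 220, 240, 50, 48],
    [48, 8, 45, 43, 47, 37, 37, 47],
    [10, 8, 6, 5, 6, 6, 5, 5],
]
image_2 = [
    [41, 20, 39, 43, 34, 39, 30, 32],
    [35, 195, 44, 46, 35, 48, 232, 40],
    [30, 243, 38, 31, 34, 46, 213, 50],
    [49, 227, 44, 33, 35, 224, 230, 30],
    [46, 203, 225, 44, 46, 181, 184, 40],
    [38, 241, 247, 220, 228, 210, 36, 38],
    [42, 8, 35, 39, 47, 31, 41, 21],
    [3, 12, 10, 18, 24, 21, 6, 17],
]


def average_hash_bits(pixels):
    """Steps 3 and 4: compare every pixel with the mean of all pixels."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [value > mean for value in flat]


def hash_difference(bits_a, bits_b):
    """Step 5: count the positions where the two boolean lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


bits_1 = average_hash_bits(image_1)
bits_2 = average_hash_bits(image_2)
print(hash_difference(bits_1, bits_2))  # 3
```

The three differing positions are row 4, column 6; row 5, column 7; and row 6, column 2, where Image_2 is brighter than its average but Image_1 is not.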

For more clarity, let’s examine imagehash applied to the following 3 pairs of images. The first pair consists of two identical images, and the imagehash difference is 0. The second pair compares two similar images: the second image (image_b) is an edited version of the first image (image_a), and the imagehash difference is 6. The last pair compares two totally different images. The imagehash difference is 30, the farthest from 0 of the three pairs.

Fig. 1 imagehash
# First pair: the same image twice
hash1 = imagehash.average_hash(Image.open('D:/image_a.jpg'))
hash2 = imagehash.average_hash(Image.open('D:/image_a.jpg'))
diff = hash1 - hash2
print(diff)
# 0
# Second pair: an image and its edited version
hash1 = imagehash.average_hash(Image.open('D:/image_a.jpg'))
hash2 = imagehash.average_hash(Image.open('D:/image_b.jpg'))
diff = hash1 - hash2
print(diff)
# 6
# Third pair: two totally different images
hash1 = imagehash.average_hash(Image.open('D:/image_a.jpg'))
hash2 = imagehash.average_hash(Image.open('D:/image_c.jpg'))
diff = hash1 - hash2
print(diff)
# 30

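Here is what the average hash of each image looks like. The boolean array is stored in the “hash” attribute of the object returned by average_hash.

>imagehash.average_hash(Image.open('D:/image_a.jpg')).hash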
array([[ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [False,  True, False, False, False, False, False, False],
       [ True,  True, False, False, False, False, False, False],
       [False, False, False,  True, False, False, False, False],
       [False, False, False,  True, False, False, False, False],
       [False, False, False, False, False, False, False, False]])

>imagehash.average_hash(Image.open('D:/image_b.jpg')).hash
array([[ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [False,  True,  True,  True,  True, False, False, False],
       [ True,  True,  True, False, False, False, False, False],
       [ True,  True, False, False, False, False, False, False],
       [False, False, False,  True, False, False, False, False],
       [False, False, False,  True, False, False, False, False],
       [False, False, False, False, False, False, False, False]])

>imagehash.average_hash(Image.open('D:/image_c.png')).hash
array([[False, False, False, False, False, False, False, False],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [False, False, False, False,  True, False, False, False],
       [False, False, False, False, False, False, False, False]])
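The printed arrays make it possible to check the reported differences by hand. As a minimal sketch, the three arrays are re-typed below as strings of 1 (True) and 0 (False). This string encoding and the helper count_differences are only for illustration; subtracting two imagehash objects performs the same count of differing positions.

```python
# The three 8x8 boolean arrays printed above, one string per row (1 = True).
image_a = ["11111111", "11111111", "11111111", "01000000",
           "11000000", "00010000", "00010000", "00000000"]
image_b = ["11111111", "11111111", "01111000", "11100000",
           "11000000", "00010000", "00010000", "00000000"]
image_c = ["00000000", "11111111", "11111111", "11111111",
           "11111111", "11111111", "00001000", "00000000"]


def count_differences(hash_x, hash_y):
    """Count the positions at which the two 64-bit hashes disagree."""
    bits_x = "".join(hash_x)
    bits_y = "".join(hash_y)
    return sum(x != y for x, y in zip(bits_x, bits_y))


print(count_differences(image_a, image_a))  # 0
print(count_differences(image_a, image_b))  # 6
print(count_differences(image_a, image_c))  # 30
```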

 

Text Similarity

Text similarity can be assessed using Natural Language Processing (NLP). The “fuzzywuzzy” package provides 4 ways to compare the similarity of a pair of texts. Each function returns an integer from 0 to 100. A higher value means higher similarity.

1. fuzz.ratio – is the simplest comparison of two texts. The fuzz.ratio value of “blue shirt” and “blue shirt.” is 95. It means that the two texts are almost the same, but the dot makes them slightly different.

from fuzzywuzzy import fuzz
fuzz.ratio('blue shirt','blue shirt.')
#95

The measurement is based on the Levenshtein distance (named after Vladimir Levenshtein). The Levenshtein distance measures how different two texts are: it is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one text into the other. The text “blue shirt” is only 1 edit away from “blue shirt.”, because adding a single dot makes them the same. Hence, the Levenshtein distance is 1. The fuzz.ratio is calculated with the equation (len(a) + len(b) – lev) / (len(a) + len(b)), where len(a) and len(b) are the lengths of the first and second text, and lev is the Levenshtein distance. The ratio is (10 + 11 – 1) / (10 + 11) = 0.95, or 95%.
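The arithmetic can be reproduced with a short dynamic-programming implementation of the Levenshtein distance. This is a minimal sketch of the formula as stated in the text, not the code fuzzywuzzy runs internally; the names levenshtein and ratio_from_formula are illustrative. As far as I know, the library's own ratio counts a substitution as two edits, which makes no difference for these examples because they only need insertions.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def ratio_from_formula(a, b):
    """(len(a) + len(b) - lev) / (len(a) + len(b)), scaled to 0..100."""
    total = len(a) + len(b)
    if total == 0:
        return 100  # two empty texts are treated as identical
    return round(100 * (total - levenshtein(a, b)) / total)


print(levenshtein('blue shirt', 'blue shirt.'))         # 1
print(ratio_from_formula('blue shirt', 'blue shirt.'))  # 95
```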

2. fuzz.partial_ratio – can detect whether a text is part of another text, but it cannot detect a match when the words are in a different order. The example below shows that “blue shirt” is part of “clean blue shirt.”, so the fuzz.partial_ratio is 100. The fuzz.ratio returns only 74 because it compares the two full texts and penalizes the extra characters.

print(fuzz.ratio('blue shirt','clean blue shirt.'))
#74
print(fuzz.partial_ratio('blue shirt','clean blue shirt.'))
#100

3. fuzz.token_sort_ratio – can detect that two texts contain the same words even when the words are in a different order. fuzz.token_sort_ratio returns 100 for the texts “clean hat and blue shirt” and “blue shirt and clean hat” because they mean the same thing, but are in reverse order.

print(fuzz.ratio('clean hat and blue shirt','blue shirt and clean hat'))
#42
print(fuzz.partial_ratio('clean hat and blue shirt','blue shirt and clean hat'))
#42
print(fuzz.token_sort_ratio('clean hat and blue shirt','blue shirt and clean hat'))
#100

4. fuzz.token_set_ratio – can detect text similarity while accounting for partial text, word order, and different text lengths. It can detect that the words of “clean hat and blue shirt” are all contained in the text “People want to wear blue shirt and clean hat”, although in a different order. In this study, we only use fuzz.token_set_ratio, as it is the most suitable.

print(fuzz.ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat'))
#53
print(fuzz.partial_ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat'))
#62
print(fuzz.token_sort_ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat'))
#71
print(fuzz.token_set_ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat'))
#100
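The idea behind the two token-based functions can be sketched with the standard library alone. The code below is a simplified illustration under stated assumptions, not fuzzywuzzy's implementation: it uses difflib.SequenceMatcher for the basic ratio, lowercases and splits on whitespace only, and the function names ending in _sketch are made up for this example. Scores for texts that do not match fully may therefore differ from the library's values.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def tokens(text):
    """Lowercase the text and split it into words."""
    return text.lower().split()


def token_sort_ratio_sketch(a, b):
    """Sort the words of each text alphabetically, then compare."""
    return simple_ratio(' '.join(sorted(tokens(a))),
                        ' '.join(sorted(tokens(b))))


def token_set_ratio_sketch(a, b):
    """Compare the shared words with each text's shared-plus-remaining words."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    shared = ' '.join(sorted(set_a & set_b))
    combined_a = (shared + ' ' + ' '.join(sorted(set_a - set_b))).strip()
    combined_b = (shared + ' ' + ' '.join(sorted(set_b - set_a))).strip()
    return max(simple_ratio(shared, combined_a),
               simple_ratio(shared, combined_b),
               simple_ratio(combined_a, combined_b))


short = 'clean hat and blue shirt'
print(token_sort_ratio_sketch(short, 'blue shirt and clean hat'))  # 100
print(token_set_ratio_sketch(
    short, 'People want to wear blue shirt and clean hat'))        # 100
```

Sorting removes the effect of word order, and comparing against the shared words removes the effect of the extra words in the longer text, which is why the last example reaches 100.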

The following cell loads the training dataset and adds the imagehash difference and the token set ratio as features.

# load training set
trainingSet = pd.read_csv('D:/new_training_set.csv', index_col=0).reset_index()
# folder holding the training images (assumed name; the original does not show it)
path_img = 'D:/train_img/'
# Compute imagehash difference for every product pair
hashDiff = []
for i in trainingSet.index:
    hash1 = imagehash.average_hash(Image.open(path_img + trainingSet.iloc[i,2]))
    hash2 = imagehash.average_hash(Image.open(path_img + trainingSet.iloc[i,4]))
    diff = hash1 - hash2
    hashDiff.append(diff)
trainingSet['hash'] = hashDiff
# Compute token_set_ratio for every product pair
tokenSet = []
for i in trainingSet.index:
    tokenSet.append(fuzz.token_set_ratio(trainingSet.iloc[i,1], trainingSet.iloc[i,3]))
# the column name must match the one used when fitting the model
trainingSet['tokenSet'] = tokenSet

Below is an illustration of the training dataset. It is not the original dataset, because the original is not in English; I created another dataset in English to make it easier to follow. Each row has two products. The columns “text_1” and “image_1” belong to the first product. The columns “text_2” and “image_2” belong to the second product. “Label” defines whether the paired products are the same (1) or not (0). Notice that there are two other columns: “hash” and “tokenSet”. These two columns do not come from the original dataset; they are generated by the code above.

index  text_1      image_1       text_2          image_2      Label  hash  tokenSet
0      Blue shirt  Gdsfdfs.jpg   Blue shirt.     Safsfs.jpg   1      6     100
1      Clean hat   Fsdfsa.jpg    Clean trousers  Yjdgfbs.jpg  0      25    71
2      mouse       Dfsdfasd.jpg  mouse           Fgasfdg.jpg  0      30    100
...    ...         ...           ...             ...          ...    ...   ...

 

Applying Machine Learning

Now we know that a lower imagehash difference and a higher Token_Set_Ratio indicate that a pair of products is more likely to be the same. The lowest possible imagehash difference is 0 and the highest possible Token_Set_Ratio is 100. The question is where to set the thresholds. To find them, we can use the Decision Tree Classifier.

A Decision Tree model is built from the training dataset. The algorithm finds the pattern in the imagehash difference and the token set ratio that separates identical products from different ones. The Decision Tree is visualized for the cover image of this article. The code below builds the Decision Tree model with Python. (The cover image, however, shows a Decision Tree generated using R because, in my opinion, R visualizes Decision Trees more nicely.) The model then predicts the training dataset again, which gives us the accuracy.

# Create decision tree classifier: hash and token set
Dtc = DecisionTreeClassifier(max_depth=4) 
Dtc = Dtc.fit(trainingSet.loc[:,['hash', 'tokenSet']],
              trainingSet.loc[:,'Label'])
Prediction2 = Dtc.predict(trainingSet.loc[:,['hash', 'tokenSet']])
metrics.accuracy_score(trainingSet.loc[:,'Label'], Prediction2)

The Decision Tree is used to predict the classification of the training dataset again. The accuracy is 0.728. In other words, 72.8% of the training dataset is predicted correctly.
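To read the thresholds off a fitted tree in Python, scikit-learn's export_text can print the splits as indented text. The sketch below uses a handful of made-up (hash, tokenSet) pairs, not the competition data, so the printed thresholds are not the ones of this article; the variable names are illustrative as well.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up (hash, tokenSet) pairs and labels, for illustration only.
features = [[0, 100], [3, 90], [6, 100], [8, 40],
            [20, 60], [25, 30], [28, 20], [30, 50]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
toy_tree.fit(features, labels)

# Print the learned splits as indented text rules.
rules = export_text(toy_tree, feature_names=['hash', 'tokenSet'])
print(rules)
```

Applied to the model Dtc from the code above, the same call would list the thresholds on “hash” and “tokenSet” that the tree has learned.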

From the Decision Tree, we can extract the following rules. If the imagehash difference is smaller than 12, the pair of products is classified as identical. If the imagehash difference is greater than or equal to 12, we need to check the Token_Set_Ratio. A Token_Set_Ratio lower than 97 confirms that the products are different. Otherwise, check the imagehash difference again: if it is greater than or equal to 22, the products are different; otherwise, they are identical.
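Written as a plain Python function, the rules look as follows. This is a minimal sketch that mirrors the nested np.where expression applied to the test set in this article, where an imagehash difference of 22 or more marks the products as different; the function name predict_match is illustrative.

```python
def predict_match(hash_diff, token_set):
    """Return 1 if the pair is predicted to be the same product, else 0."""
    if hash_diff < 12:
        return 1  # the images alone are similar enough
    if token_set < 97:
        return 0  # images differ and the names do not match
    if hash_diff >= 22:
        return 0  # names match, but the images are too different
    return 1      # names match and the images are moderately similar


print(predict_match(8, 33))    # 1
print(predict_match(20, 100))  # 1
print(predict_match(25, 25))   # 0
```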

Apply to Test Dataset

Now, we will load the test dataset, generate the Imagehash difference and Token_Set_Ratio, and finally predict whether each product pair matches.

# path to image
path_img = 'D:/test_img/'
# load test set
test = pd.read_csv('D:/new_test_set.csv', index_col=0).reset_index()
# hashDiff list
hashDiff = []
# Compute image difference for every product pair
for i in test.index:
    hash1 = imagehash.average_hash(Image.open(path_img + test.iloc[i,2]))
    hash2 = imagehash.average_hash(Image.open(path_img + test.iloc[i,4]))
    diff = hash1 - hash2
    hashDiff.append(diff)
test['hash'] = hashDiff
# tokenSet list
tokenSet = []
# Compute text similarity using token set ratio
for i in test.index:
    tokenSet.append(fuzz.token_set_ratio(test.iloc[i,1], test.iloc[i,3]))
# use the same column name as in the training set so the fitted tree accepts it
test['tokenSet'] = tokenSet

After computing the imagehash difference and the Token_Set_Ratio, the next step is to apply the Decision Tree to detect product matches.

# Detecting product match with the rules read from the tree
test['labelPredict'] = np.where(test['hash']<12, 1,
                               np.where(test['tokenSet']<97, 0,
                                        np.where(test['hash']>=22, 0, 1)))
# or, alternatively, predict with the fitted tree directly
test['labelPredict'] = Dtc.predict(test[['hash','tokenSet']])

index  text_1    image_1         text_2           image_2      hash  tokenSet  labelPredict
0      pen       Fdfgsdfhg.jpg   ballpoint        Adxsea.jpg   8     33        1
1      harddisk  Sgytueyuyt.jpg  a nice Harddisk  Erewbva.jpg  20    100       1
2      eraser    Sadssadad.jpg   stationary       Safdfgs.jpg  25    25        0
...    ...       ...             ...              ...          ...   ...       ...

The table above illustrates the final result. The focus of this article is to demonstrate how to predict whether two images and two texts are similar or the same. You may notice that the Machine Learning model used is quite simple, and that there is no hyperparameter tuning or splitting into training and test data. Applying other Machine Learning methods, such as tree-based ensemble methods, can increase the accuracy, but that is not the focus of this discussion. If you are interested in learning about other tree-based Machine Learning methods that are more accurate than a Decision Tree, please find an article here.

 

About Author

Connect with me here https://www.linkedin.com/in/rendy-kurnia/

The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion. 
