Beginner’s Guide to Image and Text Similarity

Rendyk 20 Oct, 2022 • 8 min read

This article was published as a part of the Data Science Blogathon.

Introduction

My previous article, “Image Analysis and Mapping in Earth Engine Using NDVI”, dealt with satellite image analysis. This article is again about image analysis, but of general images rather than satellite images. The goal is to detect whether two products are the same. Each product has an image and a text name. If a pair of products has similar or identical images or text names, the two products are treated as the same. The data comes from a competition held on Kaggle.

Four basic packages are used in this script: NumPy, pandas, matplotlib, and seaborn. There are also more specific packages. “Image” (from PIL) loads and shows image data. “imagehash” computes the similarity of two images. “fuzzywuzzy” computes the similarity of two texts. “DecisionTreeClassifier” from scikit-learn builds the classification model, and “metrics” computes the accuracy score from the true and predicted labels.

# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import imagehash
from fuzzywuzzy import fuzz
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

Image Similarity

The similarity of two images is measured with the package “imagehash”. If two images are identical or almost identical, the imagehash difference is 0. The closer the imagehash difference is to 0, the more similar the two images are.

Comparing two images with the average hash in imagehash consists of 5 steps. (1) The images are converted to greyscale. (2) The images are shrunk, by default to 8×8 pixels. (3) The average value of the 64 pixels is computed. (4) Each of the 64 pixels is checked against the average: a pixel gets the boolean value true if it is greater than the average and false otherwise. (5) The imagehash difference is the number of positions where the boolean values of the two images differ. Please observe the illustration below.

Image_1 (average: 71.96875)

48	20	34	40	40	32	30	32
34	210	38	50	42	41	230	40
47	230	33	44	34	50	245	50
43	230	46	50	36	34	250	30
30	200	190	38	41	240	39	39
38	7	200	210	220	240	50	48
48	8	45	43	47	37	37	47
10	8	6	5	6	6	5	5

FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE
FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE
FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE
FALSE	TRUE	TRUE	FALSE	FALSE	TRUE	FALSE	FALSE
FALSE	FALSE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE
FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE

Image_2 (average: 79.875)

41	20	39	43	34	39	30	32
35	195	44	46	35	48	232	40
30	243	38	31	34	46	213	50
49	227	44	33	35	224	230	30
46	203	225	44	46	181	184	40
38	241	247	220	228	210	36	38
42	8	35	39	47	31	41	21
3	12	10	18	24	21	6	17

FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE
FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE
FALSE	TRUE	FALSE	FALSE	FALSE	TRUE	TRUE	FALSE
FALSE	TRUE	TRUE	FALSE	FALSE	TRUE	TRUE	FALSE
FALSE	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE
FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE

The imagehash difference of the two images/matrices above is 3. It means that there are 3 pixels with different boolean values. The two images are relatively similar.
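The steps above can be reproduced by hand on the two example matrices. The following is a minimal pure-Python sketch, not the imagehash implementation itself: it assumes the matrices are already 8×8 greyscale images, so the conversion and resizing steps are skipped, and the helper names `average_hash_bits` and `hamming_distance` are made up for illustration.

```python
# The two 8x8 greyscale matrices from the illustration above.
image_1 = [
    [48, 20, 34, 40, 40, 32, 30, 32],
    [34, 210, 38, 50, 42, 41, 230, 40],
    [47, 230, 33, 44, 34, 50, 245, 50],
    [43, 230, 46, 50, 36, 34, 250, 30],
    [30, 200, 190, 38, 41, 240, 39, 39],
    [38, 7, 200, 210, 220, 240, 50, 48],
    [48, 8, 45, 43, 47, 37, 37, 47],
    [10, 8, 6, 5, 6, 6, 5, 5],
]
image_2 = [
    [41, 20, 39, 43, 34, 39, 30, 32],
    [35, 195, 44, 46, 35, 48, 232, 40],
    [30, 243, 38, 31, 34, 46, 213, 50],
    [49, 227, 44, 33, 35, 224, 230, 30],
    [46, 203, 225, 44, 46, 181, 184, 40],
    [38, 241, 247, 220, 228, 210, 36, 38],
    [42, 8, 35, 39, 47, 31, 41, 21],
    [3, 12, 10, 18, 24, 21, 6, 17],
]


def average_hash_bits(pixels):
    """Return a flat list of booleans: True where a pixel exceeds the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [value > mean for value in flat]


def hamming_distance(bits_a, bits_b):
    """Count the positions where the two boolean lists disagree."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


diff = hamming_distance(average_hash_bits(image_1), average_hash_bits(image_2))
print(diff)  # 3
```

The three mismatches are the positions where only Image_2 has a bright pixel, which matches the count read off the two boolean tables.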

For more clarity, let’s examine imagehash applied to the following 3 pairs of images. The first pair consists of two identical images, and the imagehash difference is 0. The second pair compares two similar images: the second image (image_b) is an edited version of the first image (image_a). The imagehash difference is 6. The last pair compares two totally different images. The imagehash difference is 30, which is the farthest from 0.

Fig. 1 imagehash

# First pair: the same image twice
hash1 = imagehash.average_hash(Image.open('D:/image_a.jpg'))
hash2 = imagehash.average_hash(Image.open('D:/image_a.jpg'))
diff = hash1 - hash2
print(diff)
# 0

# Second pair: an image and its edited version
hash1 = imagehash.average_hash(Image.open('D:/image_a.jpg'))
hash2 = imagehash.average_hash(Image.open('D:/image_b.jpg'))
diff = hash1 - hash2
print(diff)
# 6

# Third pair: two totally different images
hash1 = imagehash.average_hash(Image.open('D:/image_a.jpg'))
hash2 = imagehash.average_hash(Image.open('D:/image_c.jpg'))
diff = hash1 - hash2
print(diff)
# 30

Here is what the average hash of each image looks like:

>imagehash.average_hash(Image.open('D:/image_a.jpg'))
array([[ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [False,  True, False, False, False, False, False, False],
       [ True,  True, False, False, False, False, False, False],
       [False, False, False,  True, False, False, False, False],
       [False, False, False,  True, False, False, False, False],
       [False, False, False, False, False, False, False, False]])

>imagehash.average_hash(Image.open('D:/image_b.jpg'))
array([[ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [False,  True,  True,  True,  True, False, False, False],
       [ True,  True,  True, False, False, False, False, False],
       [ True,  True, False, False, False, False, False, False],
       [False, False, False,  True, False, False, False, False],
       [False, False, False,  True, False, False, False, False],
       [False, False, False, False, False, False, False, False]])

>imagehash.average_hash(Image.open('D:/image_c.png'))
array([[False, False, False, False, False, False, False, False],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [False, False, False, False,  True, False, False, False],
       [False, False, False, False, False, False, False, False]])
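Subtracting two hashes counts the positions where the boolean arrays differ. As a quick check, the sketch below retypes the three arrays printed above as rows of bits (1 for True, 0 for False) and counts the mismatches in plain Python. The variable and function names are only illustrative and are not part of imagehash.

```python
# Hash arrays printed above, one string per row (1 = True, 0 = False).
hash_a = ["11111111", "11111111", "11111111", "01000000",
          "11000000", "00010000", "00010000", "00000000"]
hash_b = ["11111111", "11111111", "01111000", "11100000",
          "11000000", "00010000", "00010000", "00000000"]
hash_c = ["00000000", "11111111", "11111111", "11111111",
          "11111111", "11111111", "00001000", "00000000"]


def count_differences(rows_a, rows_b):
    """Number of bit positions where two 8x8 hashes disagree."""
    bits_a = "".join(rows_a)
    bits_b = "".join(rows_b)
    return sum(x != y for x, y in zip(bits_a, bits_b))


print(count_differences(hash_a, hash_a))  # 0
print(count_differences(hash_a, hash_b))  # 6
print(count_differences(hash_a, hash_c))  # 30
```

The counts agree with the differences reported for the three pairs.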

Text Similarity

Text similarity can be assessed using Natural Language Processing (NLP). The “fuzzywuzzy” package provides 4 ways to compare a pair of texts. Each function returns an integer from 0 to 100; a higher value means higher similarity.

1. fuzz.ratio – is the simplest comparison of two texts. The fuzz.ratio value of “blue shirt” and “blue shirt.” is 95. It means that the two texts are almost the same, but the dot makes them a bit different.

The measurement is based on the Levenshtein distance (named after Vladimir Levenshtein). The Levenshtein distance measures how different two texts are: it is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one text into the other. The text “blue shirt” is only 1 edit away from “blue shirt.”: it needs just a single dot to be the same. Hence, the Levenshtein distance is 1. The fuzz.ratio is calculated with the equation (len(a) + len(b) – lev) / (len(a) + len(b)), where len(a) and len(b) are the lengths of the first and second text, and lev is the Levenshtein distance. The ratio is (10 + 11 – 1) / (10 + 11) = 0.95 or 95%.
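To make the arithmetic concrete, here is a minimal pure-Python sketch of the Levenshtein distance and of the ratio formula quoted in the text. It is not how fuzzywuzzy is implemented internally (the package delegates to difflib or python-Levenshtein, which may weight substitutions differently), and the function names `levenshtein` and `simple_ratio` are assumptions made for this illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def simple_ratio(a, b):
    """Similarity in percent, following the formula given in the text."""
    total = len(a) + len(b)
    if total == 0:
        return 100  # two empty texts are treated as identical
    return round(100 * (total - levenshtein(a, b)) / total)


print(levenshtein("blue shirt", "blue shirt."))   # 1
print(simple_ratio("blue shirt", "blue shirt."))  # 95
```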

2. fuzz.partial_ratio – can detect whether a text is part of another text, but it cannot handle words that appear in a different order. The example below shows that “blue shirt” is part of “clean blue shirt.”, so fuzz.partial_ratio is 100. fuzz.ratio returns only 74 because it compares the two texts as a whole and penalizes the extra characters.

print(fuzz.ratio('blue shirt','clean blue shirt.'))
#74
print(fuzz.partial_ratio('blue shirt','clean blue shirt.'))
#100
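The idea behind the partial ratio can be sketched with the standard library alone. The version below is a simplification and an assumption about the general approach, not fuzzywuzzy's actual code: it slides the shorter text over every same-length slice of the longer text and keeps the best difflib score. The names `ratio` and `partial_ratio_sketch` are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Character-level similarity in percent, based on difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a, b):
    """Best score of the shorter text against each same-length slice of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    window = len(shorter)
    return max(ratio(shorter, longer[start:start + window])
               for start in range(len(longer) - window + 1))


print(ratio("blue shirt", "clean blue shirt."))                 # 74
print(partial_ratio_sketch("blue shirt", "clean blue shirt."))  # 100
```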

3. Token_Sort_Ratio – can detect that two texts contain the same words even when the words are in a different order. fuzz.token_sort_ratio returns 100 for the texts “clean hat and blue shirt” and “blue shirt and clean hat” because they mean the same thing, only in reverse order.

print(fuzz.ratio('clean hat and blue shirt','blue shirt and clean hat'))
#42
print(fuzz.partial_ratio('clean hat and blue shirt','blue shirt and clean hat'))
#42
print(fuzz.token_sort_ratio('clean hat and blue shirt','blue shirt and clean hat'))
#100
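A rough sketch of why sorting the words helps: once both texts are lowercased, split into words, and sorted alphabetically, the two example texts become the same string. The code below uses difflib for the comparison and skips the punctuation clean-up that fuzzywuzzy also performs; `token_sort_ratio_sketch` is a hypothetical name used only here.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Character-level similarity in percent, based on difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio_sketch(a, b):
    """Lowercase, split on whitespace, sort the words, then compare."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return ratio(sorted_a, sorted_b)


text_1 = "clean hat and blue shirt"
text_2 = "blue shirt and clean hat"
# Both texts sort to "and blue clean hat shirt".
print(token_sort_ratio_sketch(text_1, text_2))  # 100
```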

4. Token_Set_Ratio – measures text similarity while accounting for partial text, word order, and different text lengths. It can detect that “clean hat” and “blue shirt” are part of the text “People want to wear blue shirt and clean hat” in a different order. In this study, we use only Token_Set_Ratio because it is the most suitable.

print(fuzz.ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat'))
#53
print(fuzz.partial_ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat'))
#62
print(fuzz.token_sort_ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat'))
#71
print(fuzz.token_set_ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat'))
#100
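The token set comparison can be sketched in the same spirit. The version below assumes the commonly described approach: split both texts into sets of words, build one string from the shared words and one string per text from the shared words plus its leftover words, and keep the best of the three pairwise scores. It uses difflib and omits fuzzywuzzy's punctuation handling, so it is an approximation; `token_set_ratio_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Character-level similarity in percent, based on difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(a, b):
    """Compare the shared words with each text's shared-plus-leftover words."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    return max(ratio(common, combined_a),
               ratio(common, combined_b),
               ratio(combined_a, combined_b))


score = token_set_ratio_sketch("clean hat and blue shirt",
                               "People want to wear blue shirt and clean hat")
print(score)  # 100
```

Because every word of the first text also appears in the second, the shared-words string equals the first text's combined string, which is why the score reaches 100 despite the extra words.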

The following cell loads the training dataset and adds the imagehash difference and the token set ratio as features.

# path to the training images (assumed folder name; adjust to your data)
path_img = 'D:/train_img/'
# load training set
trainingSet = pd.read_csv('D:/new_training_set.csv', index_col=0).reset_index()
# Compute imagehash difference (columns 2 and 4 hold the image file names)
hashDiff = []
for i in trainingSet.index:
    hash1 = imagehash.average_hash(Image.open(path_img + trainingSet.iloc[i,2]))
    hash2 = imagehash.average_hash(Image.open(path_img + trainingSet.iloc[i,4]))
    diff = hash1 - hash2
    hashDiff.append(diff)
trainingSet['hash'] = hashDiff
# Compute token_set_ratio (columns 1 and 3 hold the text names)
tokenSet = []
for i in trainingSet.index:
    tokenSet.append(fuzz.token_set_ratio(trainingSet.iloc[i,1], trainingSet.iloc[i,3]))
trainingSet['tokenSet'] = tokenSet

Below is an illustration of the training dataset. It is not the original dataset, because the original is not in English; I created another dataset in English for easier understanding. Each row has two products. The columns “text_1” and “image_1” belong to the first product. The columns “text_2” and “image_2” belong to the second product. “Label” defines whether the paired products are the same (1) or not (0). Notice that there are two other columns: “hash” and “tokenSet”. These two columns do not come from the original dataset; they are generated by the code above.

index	text_1	image_1	text_2	image_2	Label	hash	tokenSet
0	Blue shirt	Gdsfdfs.jpg	Blue shirt.	Safsfs.jpg	1	6	100
1	Clean hat	Fsdfsa.jpg	Clean trousers	Yjdgfbs.jpg	0	25	71
2	mouse	Dfsdfasd.jpg	mouse	Fgasfdg.jpg	0	30	100
. . .	. . .	. . .	. . .	. . .	. . .	. . .	. . .

Applying Machine Learning

Now we know that a lower Imagehash difference and a higher Token_Set_Ratio indicate that a pair of products is more likely to be the same. The lowest value of the imagehash difference is 0 and the highest value of Token_Set_Ratio is 100. The question is where to set the thresholds. To find them, we can use the Decision Tree Classifier.

A Decision Tree model is created using the training dataset. The Machine Learning algorithm finds the pattern of imagehash difference and token set ratio that separates identical from different products. The Decision Tree is visualized in the cover image of this article. The code below builds a Decision Tree model with Python. (The visualization in the cover image is a Decision Tree generated using R because, in my opinion, R visualizes Decision Trees more nicely.) The model then predicts the training dataset again, and finally we can compute the accuracy.

# Create decision tree classifier: hash and token set
Dtc = DecisionTreeClassifier(max_depth=4)
Dtc = Dtc.fit(trainingSet.loc[:,['hash', 'tokenSet']],
              trainingSet.loc[:,'Label'])
# Predict the training set again and print the accuracy
Prediction2 = Dtc.predict(trainingSet.loc[:,['hash', 'tokenSet']])
print(metrics.accuracy_score(trainingSet.loc[:,'Label'], Prediction2))

The Decision Tree is then used to classify the training dataset again. The accuracy is 0.728. In other words, 72.8% of the pairs in the training dataset are predicted correctly.

From the Decision Tree, we can extract the following rules. If the Imagehash difference is smaller than 12, the pair of products is categorized as identical. If the Imagehash difference is bigger than or equal to 12, we need to check the Token_Set_Ratio value. A Token_Set_Ratio lower than 97 confirms that the pair of products is different. Otherwise, check the Imagehash difference again: if it is bigger than or equal to 22, the products are different; otherwise, the products are identical.
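Written out for a single pair, the thresholds 12, 97, and 22 give a small function. This is only a sketch: `predict_match` is a hypothetical helper, and the order of the branches follows the nested np.where expression that the article applies to the test set.

```python
def predict_match(hash_diff, token_set):
    """Return 1 if the two products are predicted to be the same, else 0."""
    if hash_diff < 12:
        return 1  # images are nearly identical
    if token_set < 97:
        return 0  # images differ and the names differ too
    if hash_diff >= 22:
        return 0  # names match, but the images are too different
    return 1      # names match and the images are moderately similar


# (hash, tokenSet) pairs taken from the result table of the test set.
print(predict_match(8, 33))    # 1
print(predict_match(20, 100))  # 1
print(predict_match(25, 25))   # 0
```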

Applying to the Test Dataset

Now, we will load the test dataset, generate the Imagehash difference and Token_Set_Ratio, and finally predict whether each product pair matches.

# path to image
path_img = 'D:/test_img/'
# load test set
test = pd.read_csv('D:/new_test_set.csv', index_col=0).reset_index()
# hashDiff list
hashDiff = []
# Compute image difference for every row
for i in test.index:
    hash1 = imagehash.average_hash(Image.open(path_img + test.iloc[i,2]))
    hash2 = imagehash.average_hash(Image.open(path_img + test.iloc[i,4]))
    diff = hash1 - hash2
    hashDiff.append(diff)
test['hash'] = hashDiff
# Token_set list
Token_set = []
# Compute text similarity using token set ratio
for i in test.index:
    TokenSet = fuzz.token_set_ratio(test.iloc[i,1], test.iloc[i,3])
    Token_set.append(TokenSet)
# use the same column name as in the training set
test['tokenSet'] = Token_set

After computing the Imagehash difference and Token_Set_Ratio, the next step is to apply the Decision Tree to detect product matches.

# Detecting product match with the thresholds read from the tree
test['labelPredict'] = np.where(test['hash']<12, 1,
                               np.where(test['tokenSet']<97, 0,
                                        np.where(test['hash']>=22, 0, 1)))
# or let the fitted model predict directly
test['labelPredict'] = Dtc.predict(test[['hash','tokenSet']])

index	text_1	image_1	text_2	image_2	hash	tokenSet	labelPredict
0	pen	Fdfgsdfhg.jpg	ballpoint	Adxsea.jpg	8	33	1
1	harddisk	Sgytueyuyt.jpg	a nice Harddisk	Erewbva.jpg	20	100	1
2	eraser	Sadssadad.jpg	stationary	Safdfgs.jpg	25	25	0
. . .	. . .	. . .	. . .	. . .	. . .	. . .	. . .

The table above illustrates the final result. The focus of this article is to demonstrate how to predict whether two images and two texts are similar or the same. You may notice that the Machine Learning model used is quite simple and that there is no hyperparameter tuning or splitting into training and test data. Applying other Machine Learning methods, such as tree-based ensemble methods, may increase the accuracy, but that is not the focus of this discussion. If you are interested in learning about tree-based Machine Learning methods that are more accurate than a Decision Tree, please find an article here.