rendyk — May 5, 2021
This article was published as a part of the Data Science Blogathon.

## Introduction

After my recently published article on satellite image analysis, “Image Analysis and Mapping in Earth Engine Using NDVI”, here is another article about image analysis. Unlike the previous one, this article discusses general image analysis rather than satellite image analysis. The goal of this discussion is to detect whether two products are the same. Each product has an image and a text name. If a pair of products have similar or identical images or text names, the two products are considered the same. The data comes from a competition held on Kaggle.

There are 4 basic packages used in this script: NumPy, pandas, matplotlib, and seaborn. There are also task-specific packages: “Image” (from PIL) loads and displays image data, “imagehash” computes the similarity of two images, “fuzzywuzzy” measures the similarity of two texts, “DecisionTreeClassifier” builds the classification model, and “metrics” computes the accuracy score of the true and predicted labels.

```# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import imagehash
from fuzzywuzzy import fuzz
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics```

## Image Similarity

The similarity of two images is detected using the package “imagehash”. If two images are identical or almost identical, the imagehash difference will be 0. The closer the imagehash difference is to 0, the more similar the two images are.

Comparing the similarity of two images using imagehash consists of 5 steps. (1) The images are converted into greyscale. (2) The images are resized to a small fixed size, 8×8 pixels by default. (3) The average value of the 64 pixels is computed. (4) Each of the 64 pixels is checked against the average value, so each pixel gets a boolean value of true or false. (5) The imagehash difference is the number of positions where the boolean values of the two images differ. Please observe the illustration below.

Image_1 (average: 71.96875)

```
greyscale pixel values:
 48  20  34  40  40  32  30  32
 34 210  38  50  42  41 230  40
 47 230  33  44  34  50 245  50
 43 230  46  50  36  34 250  30
 30 200 190  38  41 240  39  39
 38   7 200 210 220 240  50  48
 48   8  45  43  47  37  37  47
 10   8   6   5   6   6   5   5

pixel > average:
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE TRUE  FALSE FALSE FALSE FALSE TRUE  FALSE
FALSE TRUE  FALSE FALSE FALSE FALSE TRUE  FALSE
FALSE TRUE  FALSE FALSE FALSE FALSE TRUE  FALSE
FALSE TRUE  TRUE  FALSE FALSE TRUE  FALSE FALSE
FALSE FALSE TRUE  TRUE  TRUE  TRUE  FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

Image_2 (average: 78.4375)

```
greyscale pixel values:
 41  20  39  43  34  39  30  32
 35 195  44  46  35  48 232  40
 30 243  38  31  34  46 213  50
 49 227  44  33  35 224 230  30
 46 203 225  44  46 181 184  40
 38 241 247 220 228 210  36  38
 42   8  35  39  47  31  41  21
  3  12  10  18  24  21   6  17

pixel > average:
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE TRUE  FALSE FALSE FALSE FALSE TRUE  FALSE
FALSE TRUE  FALSE FALSE FALSE FALSE TRUE  FALSE
FALSE TRUE  FALSE FALSE FALSE TRUE  TRUE  FALSE
FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE  FALSE
FALSE TRUE  TRUE  TRUE  TRUE  TRUE  FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

The imagehash difference of the two images/matrices above is 3. It means that there are 3 pixels with different boolean values. The two images are relatively similar.
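The five steps above can be sketched by hand with PIL and NumPy. This is a simplified illustration, not the exact internals of the imagehash library (which may differ in details such as the resampling filter); the function names are my own:

```python
import numpy as np
from PIL import Image

def average_hash_manual(img, hash_size=8):
    # (1) convert to greyscale, (2) shrink to hash_size x hash_size pixels
    small = img.convert('L').resize((hash_size, hash_size), Image.LANCZOS)
    pixels = np.asarray(small, dtype=np.float64)
    # (3) compute the average pixel value, (4) mark each pixel True/False
    return pixels > pixels.mean()

def hash_diff(h1, h2):
    # (5) count the positions where the two boolean matrices differ
    return int(np.count_nonzero(h1 != h2))
```

Usage would look like `hash_diff(average_hash_manual(Image.open('a.jpg')), average_hash_manual(Image.open('b.jpg')))`, mirroring the `hash1 - hash2` subtraction that imagehash provides.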

For more clarity, let’s examine imagehash applied to the following 3 pairs of images. The first pair consists of two identical images, so the imagehash difference is 0. The second pair compares two similar images: the second image (image_b) is an edited version of the first (image_a), and the imagehash difference is 6. The last pair compares two totally different images; the imagehash difference is 30, the farthest from 0 of the three.

```# First pair
hash1 = imagehash.average_hash(Image.open('D:/image_a.jpg'))
hash2 = imagehash.average_hash(Image.open('D:/image_a.jpg'))
diff = hash1 - hash2
print(diff)
# 0```
```# Second pair
hash1 = imagehash.average_hash(Image.open('D:/image_a.jpg'))
hash2 = imagehash.average_hash(Image.open('D:/image_b.jpg'))
diff = hash1 - hash2
print(diff)
# 6```
```# Third pair
hash1 = imagehash.average_hash(Image.open('D:/image_a.jpg'))
hash2 = imagehash.average_hash(Image.open('D:/image_c.jpg'))
diff = hash1 - hash2
print(diff)
# 30```

Here is what the average hash of each image looks like:

```>imagehash.average_hash(Image.open('D:/image_a.jpg'))
array([[ True,  True,  True,  True,  True,  True,  True,  True],
[ True,  True,  True,  True,  True,  True,  True,  True],
[ True,  True,  True,  True,  True,  True,  True,  True],
[False,  True, False, False, False, False, False, False],
[ True,  True, False, False, False, False, False, False],
[False, False, False,  True, False, False, False, False],
[False, False, False,  True, False, False, False, False],
[False, False, False, False, False, False, False, False]])```

```>imagehash.average_hash(Image.open('D:/image_b.jpg'))
array([[ True,  True,  True,  True,  True,  True,  True,  True],
[ True,  True,  True,  True,  True,  True,  True,  True],
[False,  True,  True,  True,  True, False, False, False],
[ True,  True,  True, False, False, False, False, False],
[ True,  True, False, False, False, False, False, False],
[False, False, False,  True, False, False, False, False],
[False, False, False,  True, False, False, False, False],
[False, False, False, False, False, False, False, False]])```

```>imagehash.average_hash(Image.open('D:/image_c.jpg'))
array([[False, False, False, False, False, False, False, False],
[ True,  True,  True,  True,  True,  True,  True,  True],
[ True,  True,  True,  True,  True,  True,  True,  True],
[ True,  True,  True,  True,  True,  True,  True,  True],
[ True,  True,  True,  True,  True,  True,  True,  True],
[ True,  True,  True,  True,  True,  True,  True,  True],
[False, False, False, False,  True, False, False, False],
[False, False, False, False, False, False, False, False]])```

## Text Similarity

Text similarity can be assessed using Natural Language Processing (NLP) techniques. The “fuzzywuzzy” package provides 4 ways to compare the similarity of a pair of texts. Each function returns an integer from 0 to 100; a higher value means higher similarity.

1. fuzz.ratio – is the simplest comparison of two texts. The fuzz.ratio value of “blue shirt” and “blue shirt.” is 95. It means that the two texts are almost the same, but the dot makes them slightly different.

```from fuzzywuzzy import fuzz
fuzz.ratio('blue shirt','blue shirt.')
#95```

The measurement is based on the Levenshtein distance (named after Vladimir Levenshtein), which measures how similar two texts are: the minimum number of edits (insertions, deletions, or substitutions) needed to turn one text into the other. The text “blue shirt” is only 1 edit away from “blue shirt.”; it needs only a single dot to be the same. Hence, the Levenshtein distance is 1. The fuzz.ratio is calculated as (len(a) + len(b) − lev) / (len(a) + len(b)), where len(a) and len(b) are the lengths of the first and second texts and lev is the Levenshtein distance. Here the ratio is (10 + 11 − 1)/(10 + 11) = 20/21 ≈ 0.95, or 95.
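The formula above can be checked with a small sketch. Note that fuzzywuzzy itself may compute the score slightly differently (it falls back to difflib unless python-Levenshtein is installed); this is a hand-rolled illustration of the distance and ratio described in the text:

```python
def levenshtein(a, b):
    # dynamic-programming edit distance: minimum number of insertions,
    # deletions, and substitutions needed to turn a into b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def ratio(a, b):
    # the fuzz.ratio formula from the text, scaled to 0-100
    lev = levenshtein(a, b)
    return round(100 * (len(a) + len(b) - lev) / (len(a) + len(b)))
```

For “blue shirt” versus “blue shirt.” this reproduces the distance of 1 and the score of 95.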

2. fuzz.partial_ratio – can detect whether a text is part of another text, but not when the shared words appear in a different order. The example below shows that “blue shirt” is part of “clean blue shirt.”, so the fuzz.partial_ratio is 100. The fuzz.ratio returns 74 because it only sees that the two texts differ considerably.

```print(fuzz.ratio('blue shirt','clean blue shirt.'))
#74
print(fuzz.partial_ratio('blue shirt','clean blue shirt.'))
#100```

3. fuzz.token_sort_ratio – can detect whether two texts share the same words even when they appear in a different order. fuzz.token_sort_ratio returns 100 for “clean hat and blue shirt” and “blue shirt and clean hat” because they mean the same thing, just in reverse order.

```print(fuzz.ratio('clean hat and blue shirt','blue shirt and clean hat'))
#42
print(fuzz.partial_ratio('clean hat and blue shirt','blue shirt and clean hat'))
#42
print(fuzz.token_sort_ratio('clean hat and blue shirt','blue shirt and clean hat'))
#100```

4. fuzz.token_set_ratio – detects text similarity accounting for partial matches, word order, and different text lengths. It can detect that the words of “clean hat and blue shirt” occur, in a different order, within “People want to wear a blue shirt and clean hat”. In this study, we only use token_set_ratio, as it is the most suitable.

```print(fuzz.ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat'))
#53
print(fuzz.partial_ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat'))
#62
print(fuzz.token_sort_ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat'))
#71
print(fuzz.token_set_ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat'))
#100```
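The idea behind token_set_ratio can be sketched as follows. This is a simplified reimplementation, not fuzzywuzzy’s exact algorithm (which does more preprocessing), and it uses difflib in place of fuzz.ratio:

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    # 0-100 similarity score, standing in for fuzz.ratio here
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def token_set_ratio_sketch(a, b):
    # split both texts into word sets, then compare the shared tokens
    # against each full sorted token string and keep the best score
    ta, tb = set(a.lower().split()), set(b.lower().split())
    inter = ' '.join(sorted(ta & tb))
    s1 = (inter + ' ' + ' '.join(sorted(ta - tb))).strip()
    s2 = (inter + ' ' + ' '.join(sorted(tb - ta))).strip()
    return max(simple_ratio(inter, s1),
               simple_ratio(inter, s2),
               simple_ratio(s1, s2))
```

When every word of the shorter text appears in the longer one, the intersection string equals one of the full strings, so the score is 100 regardless of word order or extra words.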

The following cell loads the training dataset and adds the imagehash and token set ratio features.

```# load training set (trainingSet) and set path_img to the image folder
# Compute imagehash difference
hashDiff = []
for i in trainingSet.index:
    hash1 = imagehash.average_hash(Image.open(path_img + trainingSet.iloc[i,2]))
    hash2 = imagehash.average_hash(Image.open(path_img + trainingSet.iloc[i,4]))
    diff = hash1 - hash2
    hashDiff.append(diff)
trainingSet['hash'] = hashDiff
# Compute token_set_ratio
tokenList = []
for i in trainingSet.index:
    TokenSet = fuzz.token_set_ratio(trainingSet.iloc[i,1], trainingSet.iloc[i,3])
    tokenList.append(TokenSet)
trainingSet['tokenSet'] = tokenList```

Below is an illustration of the training dataset. It is not the original dataset, because the original is not in English; I created English example data for clarity. Each row pairs two products. The columns “text_1” and “image_1” belong to the first product; “text_2” and “image_2” belong to the second. “Label” defines whether the paired products are the same (1) or not (0). Notice the two extra columns, “hash” and “tokenSet”: they are generated by the code above, not taken from the original dataset.

| index | text_1 | image_1 | text_2 | image_2 | Label | hash | tokenSet |
|---|---|---|---|---|---|---|---|
| 0 | Blue shirt | Gdsfdfs.jpg | Blue shirt. | Safsfs.jpg | 1 | 6 | 100 |
| 1 | Clean hat | Fsdfsa.jpg | Clean trousers | Yjdgfbs.jpg | 0 | 25 | 71 |
| 2 | mouse | Dfsdfasd.jpg | mouse | Fgasfdg.jpg | 0 | 30 | 100 |
| … | … | … | … | … | … | … | … |

## Applying Machine Learning

Now we know that a lower imagehash difference and a higher token_set_ratio indicate that a pair of products is more likely to be the same. The lowest possible imagehash difference is 0 and the highest possible token_set_ratio is 100. But the question is where to set the thresholds. To find them, we can use a Decision Tree classifier.

A Decision Tree model is trained on the training dataset. The algorithm finds the pattern relating the imagehash difference and the token set ratio to identical and different products. The Decision Tree visualized in the cover image of this article was generated with R because, in my opinion, R visualizes Decision Trees more nicely; the code below builds the model in Python. The model then predicts the training dataset again, so we can measure its accuracy.

```# Create decision tree classifier: hash and token set
Dtc = DecisionTreeClassifier(max_depth=4)
Dtc = Dtc.fit(trainingSet.loc[:,['hash', 'tokenSet']],
trainingSet.loc[:,'Label'])
Prediction2 = Dtc.predict(trainingSet.loc[:,['hash', 'tokenSet']])
metrics.accuracy_score(trainingSet.loc[:,'Label'], Prediction2)```

The Decision Tree is used to predict the classification of the training dataset again. The accuracy is 0.728. In other words, 72.8% of the training dataset is predicted correctly.
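To read thresholds off a fitted tree without plotting it, scikit-learn’s `export_text` prints the rules as indented text. The snippet below is a self-contained sketch on synthetic stand-in data (the real features would come from the training set):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical stand-in for the 'hash' and 'tokenSet' training features
rng = np.random.default_rng(42)
hash_diff = rng.integers(0, 40, size=300)
token_set = rng.integers(0, 101, size=300)
# label generated from a rule of the same shape as the article's
label = ((hash_diff < 12) | ((token_set >= 97) & (hash_diff < 22))).astype(int)

X = np.column_stack([hash_diff, token_set])
tree = DecisionTreeClassifier(max_depth=4).fit(X, label)
rules = export_text(tree, feature_names=['hash', 'tokenSet'])
print(rules)
```

The printed output shows one line per split (e.g. `|--- hash <= 11.50`) and the predicted class at each leaf, which is exactly the information summarized in the next paragraph.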

From the Decision Tree, we can extract the following rule: if the imagehash difference is smaller than 12, the pair of products is categorized as identical. If the imagehash difference is greater than or equal to 12, we check the token_set_ratio value: a token_set_ratio lower than 97 means the products are different. Otherwise, we check the imagehash difference again: if it is greater than or equal to 22, the products are different; otherwise, they are identical.
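Written out as a plain function (a sketch, with the thresholds taken from the rule just described), the extracted decision logic is:

```python
def predict_match(hash_diff, token_set):
    # thresholds read off the fitted Decision Tree
    if hash_diff < 12:
        return 1          # images similar enough: same product
    if token_set < 97:
        return 0          # texts too different: different products
    # similar texts but dissimilar images: decide on the image again
    return 0 if hash_diff >= 22 else 1
```

For example, a pair with imagehash difference 6 is matched immediately, while a pair with difference 20 is matched only if its token_set_ratio is at least 97.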

## Apply to test dataset

Now, we will load the test dataset, generate the Imagehash difference and Token_Set_Ratio, and finally predict whether each product pair matches.

```# path to image
path_img = 'D:/test_img/'
# Compute imagehash difference
hashDiff = []
for i in test.index:
    hash1 = imagehash.average_hash(Image.open(path_img + test.iloc[i,2]))
    hash2 = imagehash.average_hash(Image.open(path_img + test.iloc[i,4]))
    diff = hash1 - hash2
    hashDiff.append(diff)
test['hash'] = hashDiff
# Compute text similarity using token_set_ratio
Token_set = []
for i in test.index:
    TokenSet = fuzz.token_set_ratio(test.iloc[i,1], test.iloc[i,3])
    Token_set.append(TokenSet)
test['token'] = Token_set```

After computing the imagehash difference and token_set_ratio, the next step is to apply the Decision Tree for product match detection.

```# Detecting product match
test['labelPredict'] = np.where(test['hash']<12, 1,
np.where(test['token']<97, 0,
np.where(test['hash']>=22, 0, 1)))
# or
test['labelPredict'] = Dtc.predict(test[['hash','token']])```
| index | text_1 | image_1 | text_2 | image_2 | hash | tokenSet | labelPredict |
|---|---|---|---|---|---|---|---|
| 0 | pen | Fdfgsdfhg.jpg | ballpoint | Adxsea.jpg | 8 | 33 | 1 |
| 1 | harddisk | Sgytueyuyt.jpg | a nice Harddisk | Erewbva.jpg | 20 | 100 | 1 |
| 2 | eraser | Sadssadad.jpg | stationary | Safdfgs.jpg | 25 | 25 | 0 |
| … | … | … | … | … | … | … | … |

The table above illustrates the final result. The focus of this article is to demonstrate how to predict whether two images and two texts are similar or the same. You may notice that the Machine Learning model used is quite simple: there is no hyperparameter tuning or train/test splitting. Applying other Machine Learning methods, such as tree-based ensembles, can increase the accuracy, but that is not our focus here. If you are interested in tree-based Machine Learning methods more accurate than a Decision Tree, please find an article here.