Rishit Arya 4-Year B.Tech. Chemical Engineering — Published On April 4, 2022

This article was published as a part of the Data Science Blogathon.

## Overview

• Clustering is an unsupervised machine learning algorithm that basically groups similar things together.
• Recommendation Engines is a fundamental application of clustering.
• We will build a Collaborative filtering Book recommendation system and compare flat vs hierarchical clustering; which works better?

## Introduction

Unsupervised Machine learning algorithm has always fascinated me with their power and different fields of applications. Clustering, the most famous Unsupervised Machine learning algorithm, has seen some recent developments. With the recent development of HDBSCAN (Hierarchical Density-based spatial clustering with the application of noise), I always wondered why there was a need to develop a hierarchical and density-based algorithm, what are advantages of hierarchical over flat and vice-versa, and which works better for our machine learning projects like Recommendation systems.

So here is the python implementation of a Book Recommendation system comparing the two Flat and hierarchical clustering algorithms.

## What is Clustering?

Clustering is an unsupervised machine learning technique that finds similarities between data points and groups them into clusters. It is used to draw references and determine the underlying structure within the data. There are various clustering algorithms available, and they are mainly divided into two groups:

1. Centroid-based / parametric
2. Density-based

Centroid based algorithm identifies clusters by their respective centres(centroid). They make implicit and explicit assumptions about the shape of the cluster. Generally, the centroid based algorithms assume clusters to be a multi-dimensional spheres. These algorithms work well with small amounts of data.

Density-based algorithms determine clusters where there is a high density of data points. Unlike centroid-based methods, these do not assign every point to a cluster; instead, points in the less dense regions are grouped as noise.

Furthermore, these algorithms are divided as:

1. Flat

Flat clustering gives you a single grouping or partitioning of data. These require you to have a prior understanding of the clusters as we have to set the resolution parameter. Examples of flat clustering algorithms are K-means(resolution parameter k), DBSCAN(resolution parameter eps)

1. Hierarchical

Hierarchical clustering gives you a sort of nested relationship between the data. It doesn’t require you to have prior knowledge of the cluster as it creates a kind of natural hierarchy over the clusters. These algorithms assume each point as a cluster to group every point in a single cluster. A dendrogram generally represents this hierarchy.

There are two approaches to building such a hierarchy,

1. Agglomerative (Bottom-Up)
2. Divisive (Top-Down)

Applications of Clustering

Clustering is used in varied fields of applications. Following are some common application platforms that can be used to implement clustering:

• Customer and market segmentation
• Social Network Analysis
• Recommendation Systems

Here we will build a book recommendation engine and compare k-means(Flat) and Agglomerative Clustering(Hierarchical) clustering for the application.

## Book Recommendation System

We will build a collaborative filtering recommendation system where we will cluster users based on similarities in book likings. Step-wise implementation is given below.

Importing required libraries

```import pandas as pd
import numpy as np
from IPython.display import Image,display
from IPython.core.display import HTML
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score ,silhouette_samples
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.cm as cm
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram, cut_tree```

The data

We have three data frames containing information related to Books, Users, and ratings.

Data Source

```Books = pd.read_csv('Books.csv',low_memory=False) ```Users = pd.read_csv('Users.csv',low_memory=False) ```Ratings = pd.read_csv('Ratings.csv') Cleaning the data

On checking some features

`Books['Year-Of-Publication'].describe()`
```count    271357.000000
mean       1959.760817
std         257.994226
min           0.000000
25%        1989.000000
50%        1995.000000
75%        2000.000000
max        2050.000000
Name: Year-Of-Publication, dtype: float64```

The min values suggest we have some invalid values for the year of publication. Hence we have to trim some values.

`Books = Books[(Books['Year-Of-Publication']>=1950) & (Books['Year-Of-Publication']<=2016)]`

Similarly

`Users.Age.describe()`
```count    278858.000000
mean         34.751434
std          11.202003
min           0.000000
25%          29.000000
50%          34.751434
75%          35.000000
max         244.000000
Name: Age, dtype: float64```

We can see some invalid numbers there hence we will again trim it off.

`Users = Users[(Users.Age>=15) & (Users.Age<=100)]`

Now keeping only the books that are common in all the data frames

```print("shape before cleaning:",Ratings.shape)
Ratings = Ratings[Ratings['ISBN'].isin(list(Books['ISBN'].unique()))]
Ratings = Ratings[Ratings['User-ID'].isin(list(Users['User-ID'].unique()))]
print("shape after cleaning:",Ratings.shape)```
```shape before cleaning: (1149780, 3)
shape after cleaning: (1003113, 3)```

Preparing the data

We have to define whether a person likes or dislikes the book for our filtering. Criteria for that will be: If a person rates a book more than their average rating, they like the book. We are doing so because clustering users based on ratings can be problematic as we can’t expect users to maintain scale uniformity.

```# Taking the mean of rating given by each user
User_rating_mean = Ratings.groupby('User-ID')['Book-Rating'].mean()
user_rating = Ratings.set_index('User-ID')
user_rating['mean_rating'] = User_rating_mean
user_rating.reset_index(inplace=True)
# Keeping the books in which users "likes" the book
user_rating = user_rating[user_rating['Book-Rating'] > user_rating['mean_rating']]
# Initializing a dummy variable for future use
user_rating['is_fav'] = 1
print(user_rating.shape)
Now creating a crosstab so that each users ends up as a row and the books as columns
df = pd.pivot_table(user_rating,index='User-ID',columns='ISBN',values='is_fav')
df.fillna(value=0,inplace=True) This is how our data looks right now.

We will use PCA for dimensionality reduction as the data shape is impractical to use right now.

```pca = PCA(n_components=3)
pca.fit(df)
pca_fit = pca.transform(df)
pca_fit = pd.DataFrame(pca_fit,index=df.index)```

Applying the algorithm

Finding the Right k

We will use the elbow method to find the right k(no. of clusters). For that, we will make a line graph of k vs the Total sum of squared errors from the respective centres, and the point at we see an elbow forming is chosen as the optimal value of k.

```TSS = []
for i in range(2,26):
km = KMeans(n_clusters=i,random_state=0)
km.fit(pca_fit)
TSS.append(km.inertia_)
plt.plot(range(2,26),TSS,'-')``` As we can’t see a clear elbow, we have to use silhouette analysis to find the right k. Silhouette score is a metric for validation of the clustering. We will calculate the silhouette score for each trial of k.

```for n in [3,4,5,6,7,8]:
km = KMeans(n_clusters=n,random_state=0)
clusters = km.fit_predict(pca_fit)
silhouette_avg = silhouette_score(pca_fit, clusters)
print("For n_clusters =", n,
"The average silhouette_score is :", silhouette_avg)
silhouette_values = silhouette_samples(pca_fit, clusters)```
```For n_clusters = 3 The average silhouette_score is : 0.6995278360574746
For n_clusters = 4 The average silhouette_score is : 0.702669359125888
For n_clusters = 5 The average silhouette_score is : 0.5944974925329153
For n_clusters = 6 The average silhouette_score is : 0.6160073620527732
For n_clusters = 7 The average silhouette_score is : 0.5394956965235688
For n_clusters = 8 The average silhouette_score is : 0.5295789525646848```

Clearly, k=4 gives the best clustering.

Hence at k=4, we will get such clustering

```Kmeans_final = KMeans(n_clusters=4,random_state=0).fit(pca_fit)
df['cluster'] = Kmeans_final.labels_
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(pca_fit, pca_fit, pca_fit,c=df['cluster'],cmap=cmhot)
plt.title('Data points in 3D PCA axis', fontsize=20)
plt.show()``` 1. Agglomerative Clustering

Building the hierarchy

We will build the hierarchy and represent it using a dendrogram.

```mergings = linkage(pca_fit,method='ward')
dendrogram(mergings)
plt.show()``` The dendrogram suggests no. of clusters = 4 will give the best clustering hence we will cut the hierarchy tree where no. of clusters = 4 and check out the silhouette score for clustering.

```labels = cut_tree(mergings,n_clusters=4)
d = df.copy()
d['cluster'] = Kmeans_final.labels_
silhouette_avg = silhouette_score(pca_fit, labels)
print("For n_clusters =", 4,
"The average silhouette_score is :", silhouette_avg)
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(pca_fit, pca_fit, pca_fit,c=df['cluster'],cmap=cmhot)
plt.title('Data points in 3D PCA axis', fontsize=20)
plt.show()```
`For n_clusters = 4 The average silhouette_score is : 0.6643949568160475` ## Conclusion

K-means provide a better result here. We had around 5000 data points(users) in the cluster. Hence hierarchical clustering not only can become computationally expensive but does not necessarily provide better results. Also, for 1000 or more data points, the dendrogram becomes complex and hard to interpret. Though in K-means, we have to set the resolution parameter; once we get the right k, we can achieve the right clustering we need. • 