In the vast landscape of data exploration, where datasets sprawl like forests, hierarchical clustering acts as a guiding light, leading us through the dense thicket of information. Imagine a dendrogram, a visual representation of data relationships, branching out like a tree, revealing clusters and connections within the data. This is where machine learning meets the art of clustering, where Python serves as the wizard’s wand, casting spells of insight into the heart of datasets.
In this journey through the Python kingdom, we will unravel the mysteries of hierarchical clustering, exploring its intricacies and applications in data science. From dendrograms to distance matrices, from agglomerative to divisive clustering, we will delve deep into the techniques and methods that make hierarchical clustering a cornerstone of data analysis.
Join us as we embark on this adventure, where data points become nodes in a vast knowledge network, and clusters emerge like constellations in the night sky, guiding us toward the insights hidden within the data. Welcome to the world of hierarchical clustering in Python, where every cluster tells a story, and every dendrogram holds the key to unlocking the secrets of data science.
Quiz Time
Welcome to the quiz on Hierarchical Clustering! Test your knowledge about this clustering technique and its key concepts.
Hierarchical clustering is an unsupervised learning technique for grouping similar objects into clusters. It creates a hierarchy of clusters by merging or splitting them based on similarity measures. It uses a bottom-up approach or top-down approach to construct a hierarchical data clustering schema.
Clustering Hierarchical groups similar objects into a dendrogram. It merges similar clusters iteratively, starting with each data point as a separate cluster. This creates a tree-like structure that shows the relationships between clusters and their hierarchy.
The dendrogram from hierarchical clustering reveals the hierarchy of clusters at different levels, highlighting natural groupings in the data. It provides a visual representation of the relationships between clusters, helping to identify patterns and outliers, making it a valuable tool for exploratory data analysis. For example, let’s say we have the below points, and we want to cluster them into groups:
We can assign each of these points to a separate cluster:
Now, based on the similarity of these clusters, we can combine the most similar clusters together and repeat this process until only a single cluster is left:
We are essentially building a hierarchy of clusters. That’s why this algorithm is called hierarchical clustering. I will discuss how to decide the number of clusters later. For now, let’s look at the different types of hierarchical clustering.
Also Read: Python Interview Questions to Ace Your Next Job Interview in 2024
There are mainly two types of hierarchical clustering:
Let’s understand each type in detail.
We assign each point to an individual cluster in this technique. Suppose there are 4 data points. We will assign each of these points to a cluster and hence will have 4 clusters in the beginning:
Then, at each iteration, we merge the closest pair of clusters and repeat this step until only a single cluster is left:
We are merging (or adding) the clusters at each step, right? Hence, this type of clustering is also known as additive hierarchical clustering.
Divisive Clustering Hierarchical works in the opposite way. Instead of starting with n clusters (in case of n observations), we start with a single cluster and assign all the points to that cluster.
So, it doesn’t matter if we have 10 or 1000 data points. All these points will belong to the same cluster at the beginning:
Now, at each iteration, we split the farthest point in the cluster and repeat this process until each cluster only contains a single point:
We are splitting (or dividing) the clusters at each step, hence the name divisive hierarchical clustering.
Agglomerative Clustering is widely used in the industry and will be the article’s focus. Divisive hierarchical clustering will be a piece of cake once we have a handle on the agglomerative type
Also Read: Python Tutorial to Learn Data Science from Scratch
Here are some common applications of hierarchical clustering:
Here are some advantages and disadvantages of hierarchical clustering:
In Python, the scipy
and scikit-learn
libraries are often used to perform hierarchical clustering. Here’s how you can apply hierarchical clustering using Python:
numpy
for numerical operations, matplotlib
for plotting, and scipy.cluster.hierarchy
for hierarchical clustering.Here’s an example of hierarchical clustering using Python:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.hierarchy import fcluster
from sklearn.datasets import make_blobs
# Generate sample data
X, y = make_blobs(n_samples=100, centers=3, cluster_std=0.60, random_state=0)
# Compute the linkage matrix
Z = linkage(X, 'ward')
# Plot the dendrogram
plt.figure(figsize=(10, 7))
plt.title("Dendrogram")
plt.xlabel("Sample index")
plt.ylabel("Distance")
dendrogram(Z)
plt.show()
# Determine the clusters
max_d = 7.0 # this can be adjusted based on the dendrogram
clusters = fcluster(Z, max_d, criterion='distance')
# Plot the clusters
plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='prism')
plt.title("Hierarchical Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
Understanding the difference between supervised and unsupervised learning is important before we dive into the Clustering hierarchy. Let me explain this difference using a simple example.
Suppose we want to estimate the count of bikes that will be rented in a city every day:
Or, let’s say we want to predict whether a person on board the Titanic survived or not:
Let’s look at the figure below to understand this visually:
Here, y is our dependent or target variable, and X represents the independent variables. The target variable is dependent on X, also called a dependent variable. We train our model using the independent variables to supervise the target variable. Hence, the name supervised learning.
When training the model, we aim to generate a function that maps the independent variables to the desired target. Once the model is trained, we can pass new sets of observations, and the model will predict their target. This, in a nutshell, is supervised learning.
In these cases, we try to divide the entire data into a set of groups. These groups are known as clusters, and the process of making them is known as clustering.
This technique is generally used for clustering a population into different groups. A few common examples include segmenting customers, clustering similar documents, recommending similar songs or movies, etc.
There are many more applications of unsupervised learning. If you come across any interesting ones, feel free to share them in the comments section below!
Various algorithms help us make these clusters. The most commonly used clustering algorithms are K-means and Hierarchical clustering
We should first know how K-means works before we dive into hierarchical clustering. Trust me, it will make the concept of hierarchical clustering much easier.
Here’s a brief overview of how K-means works:
It is an iterative process. It will keep on running until the centroids of newly formed clusters do not change or the maximum number of iterations are reached.
But there are certain challenges with K-means. It always tries to make clusters of the same size. Also, we have to decide the number of clusters at the beginning of the algorithm. Ideally, we would not know how many clusters should we have, in the beginning of the algorithm and hence it a challenge with K-means.
This is a gap hierarchical clustering bridge with aplomb. It takes away the problem of having to pre-define the number of clusters. Sounds like a dream! So, let’s see what hierarchical clustering is and how it improves on K-means.
Hierarchical clustering and K-means are popular clustering algorithms but have different strengths and weaknesses. Here are some ways in which hierarchical clustering can improve on K-means:
Hierarchical Clustering:
K-means:
Hierarchical Clustering:
K-means:
Hierarchical Clustering:
K-means:
Hierarchical Clustering:
K-means:
Hierarchical Clustering:
K-means:
Hierarchical Clustering:
K-means:
Hierarchical Clustering:
K-means:
Let’s consider a practical example using hierarchical clustering and K-means on a simple dataset:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Generate sample data
X, y = make_blobs(n_samples=100, centers=3, cluster_std=0.60, random_state=0)
# Hierarchical Clustering
Z = linkage(X, 'ward')
plt.figure(figsize=(10, 7))
plt.title("Hierarchical Clustering Dendrogram")
dendrogram(Z)
plt.show()
# K-means Clustering
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
labels = kmeans.labels_
plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='prism')
plt.title("K-means Clustering")
plt.show()
We merge the most similar points or clusters in hierarchical clustering – we know this. Now the question is – how do we decide which points are similar and which are not? It’s one of the most important questions in clustering!
Here’s one way to calculate similarity – Take the distance between the centroids of these clusters. The points having the least distance are referred to as similar points and we can merge them. We can refer to this as a distance-based algorithm as well (since we are calculating the distances between the clusters).
In hierarchical clustering, we have a concept called a proximity matrix. This stores the distances between each point. Let’s take an example to understand this matrix and the steps to perform hierarchical clustering.
Suppose a teacher wants to divide her students into different groups. She has the marks scored by each student in an assignment and based on these marks, she wants to segment them into groups. There’s no fixed target here as to how many groups to have. Since the teacher does not know what type of students should be assigned to which group, it cannot be solved as a supervised learning problem. So, we will try to apply hierarchical clustering here and segment the students into different groups.
Let’s take a sample of 5 students:
First, we will create a proximity matrix which will tell us the distance between each of these points. Since we are calculating the distance of each point from each of the other points, we will get a square matrix of shape n X n (where n is the number of observations).
Let’s make the 5 x 5 proximity matrix for our example:
The diagonal elements of this matrix will always be 0 as the distance of a point with itself is always 0. We will use the Euclidean distance formula to calculate the rest of the distances. So, let’s say we want to calculate the distance between point 1 and 2:
√(10-7)^2 = √9 = 3
Similarly, we can calculate all the distances and fill the proximity matrix.
Step 1: First, we assign all the points to an individual cluster:
Different colors here represent different clusters. You can see that we have 5 different clusters for the 5 points in our data.
Step 2: Next, we will look at the smallest distance in the proximity matrix and merge the points with the smallest distance. We then update the proximity matrix:
Here, the smallest distance is 3 and hence we will merge point 1 and 2:
Let’s look at the updated clusters and accordingly update the proximity matrix:
Here, we have taken the maximum of the two marks (7, 10) to replace the marks for this cluster. Instead of the maximum, we can also take the minimum value or the average values as well. Now, we will again calculate the proximity matrix for these clusters:
Step 3: We will repeat step 2 until only a single cluster is left.
So, we will first look at the minimum distance in the proximity matrix and then merge the closest pair of clusters. We will get the merged clusters as shown below after repeating these steps:
We started with 5 clusters and finally had a single cluster. This is how agglomerative hierarchical clustering works. But the burning question remains—how do we decide the number of clusters? Let’s understand that in the next section.
Are you ready to finally answer this question that’s been hanging around since we started learning? To get the number of clusters for hierarchical clustering, we use an awesome concept called a Dendrogram.
A dendrogram is a tree-like diagram that records the sequences of merges or splits.
Let’s get back to the teacher-student example. Whenever we merge two clusters, a dendrogram will record the distance between them and represent it in graph form. Let’s see how a dendrogram looks:
We have the samples of the dataset on the x-axis and the distance on the y-axis. Whenever two clusters are merged, we will join them in this dendrogram, and the height of the join will be the distance between these points. Let’s build the dendrogram for our example:
Take a moment to process the above image. We started by merging sample 1 and 2 and the distance between these two samples was 3 (refer to the first proximity matrix in the previous section). Let’s plot this in the dendrogram:
Here, we can see that we have merged samples 1 and 2. The vertical line represents the distance between these samples. Similarly, we plot all the steps where we merged the clusters, and finally, we get a dendrogram like this:
We can visualize the steps of hierarchical clustering. The more the distance of the vertical lines in the dendrogram, the more the distance between those clusters.
Now, we can set a threshold distance and draw a horizontal line (Generally, we try to set the threshold so that it cuts the tallest vertical line). Let’s set this threshold as 12 and draw a horizontal line:
The number of clusters will be the number of vertical lines intersected by the line drawn using the threshold. In the above example, since the red line intersects 2 vertical lines, we will have 2 clusters. One cluster will have a sample (1,2,4) and the other will have a sample (3,5).
Time to get our hands dirty in Python!
We will be working on a wholesale customer segmentation problem. You can download the dataset using this link. The data is hosted on the UCI Machine Learning repository. This problem aims to segment the clients of a wholesale distributor based on their annual spending on diverse product categories, like milk, grocery, region, etc.
Let’s explore the data first and then apply Hierarchical Clustering to segment the clients.
Load the data and look at the first few rows:
There are multiple product categories – Fresh, Milk, Grocery, etc. The values represent the number of units each client purchases for each product. We aim to make clusters from this data to segment similar clients. We will, of course, use Hierarchical Clustering for this problem.
But before applying, we have to normalize the data so that the scale of each variable is the same. Why is this important? If the scale of the variables is not the same, the model might become biased towards the variables with a higher magnitude, such as fresh or milk (refer to the above table).
So, let’s first normalize the data and bring all the variables to the same scale:
Here, we can see that the scale of all the variables is almost similar. Now, we are good to go. Let’s first draw the dendrogram to help us decide the number of clusters for this particular problem:
The x-axis contains the samples and y-axis represents the distance between these samples. The vertical line with maximum distance is the blue line and hence we can decide a threshold of 6 and cut the dendrogram:
We have two clusters as this line cuts the dendrogram at two points. Let’s now apply hierarchical clustering for 2 clusters:
We can see the values of 0s and 1s in the output since we defined 2 clusters. 0 represents the points that belong to the first cluster and 1 represents points in the second cluster. Let’s now visualize the two clusters:
Awesome! We can visualize the two clusters here. This is how we can implement hierarchical clustering in Python.
In our journey, we’ve uncovered a powerful tool for unraveling the complexities of data relationships. From the conceptual elegance of dendrograms to their practical applications in diverse fields like biology, document analysis, cluster analysis, and customer segmentation, hierarchical cluster analysis emerges as a guiding light in the labyrinth of data exploration.
As we conclude this expedition, we stand at the threshold of possibility, where every cluster tells a story, and every dendrogram holds the key to unlocking the secrets of data science. In the ever-expanding landscape of Python and machine learning, hierarchical clustering stands as a stalwart companion, guiding us toward new horizons of discovery and understanding.
If you are still relatively new to data science, I highly recommend taking the Applied Machine Learning course. It is one of the most comprehensive end-to-end machine learning courses you will find anywhere. Hierarchical clustering is just one of the diverse topics we cover in the course.
What are your thoughts on hierarchical clustering? Do you feel there’s a better way to create clusters using less computational resources? Connect with me in the comments section below, and let’s discuss!
A. Hierarchical K clustering is a method of partitioning data into K clusters where each cluster contains similar data points organized in a hierarchical structure.
A. An example of a hierarchical cluster could be grouping customers based on their purchasing behavior, where clusters are formed based on similarities in purchasing patterns, leading to a hierarchical tree-like structure.
A. The two methods of hierarchical clustering are:
1. Agglomerative hierarchical clustering: It starts with each data point as a separate cluster and merges the closest clusters together until only one cluster remains.
2. Divisive hierarchical clustering: It begins with all data points in one cluster and recursively splits the clusters into smaller ones until each data point is in its cluster.
A. Hierarchical clustering of features involves clustering features or variables instead of data points. It identifies groups of similar features based on their characteristics, enabling dimensionality reduction or revealing underlying patterns in the data.