This article was published as a part of the Data Science Blogathon

**Hierarchical Clustering** is one of the most popular and useful clustering algorithms. It can be used as an alternative for partitioned clustering as there is no requirement of pre-specifying the number of clusters to be created.

Therefore it becomes necessary for every aspiring **Data Scientist **and **Machine Learning Engineer** to have a good knowledge of the Hierarchical Clustering Algorithm.

In this article, we will discuss the most important questions on the Hierarchical Clustering Algorithm** **which is helpful to get you a clear understanding of the Algorithm, and also for **Data Science Interviews,** which covers its very fundamental level to complex concepts.

** **

Clustering is an unsupervised machine learning technique that groups similar observations in a cluster(group of similar observations) such that the observations in the same group are more similar to each other than the observations in the other groups i.e observations in different clusters are dissimilar to each other.

** Image Source: Google Images**

** **

Hierarchical Clustering i.e, an unsupervised machine learning algorithm is used to group the unlabeled datasets into a single group, named, a cluster. Sometimes, it is also known as **Hierarchical cluster analysis (HCA)**.

In this algorithm, we try to create the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the **Dendrogram**.

The two different types of Hierarchical Clustering technique are as follows:

**Agglomerative:** It is a bottom-up approach, in which the algorithm starts with taking all data points as single clusters and merging them until one cluster is left.

**Divisive:** It is just the opposite of the agglomerative algorithm as it is a top-down approach.

** Image Source: Google Images**

Initially, each data point is considered as an individual cluster in this technique. After each iteration, the similar clusters merge with other clusters and the merging will stop until one cluster or K clusters are formed.**The steps of the agglomerative algorithm are as follows:**

- Compute the proximity matrix.
- Let each data point be a cluster.
- Repeat this step: Combine the two closest clusters and accordingly update the proximity matrix.
- Until only a single cluster remains.

**For Example,**

Let’s say we have six observations named {A,B,C,D,E,F}.

**Step- 1: **In the first step, we compute the proximity of individual observations and consider all the six observations as individual clusters.

**Step- 2: **In this step, similar clusters are merged together and result in a single cluster.

For our example, we consider B, C, and D, E are similar clusters that are merged in this step. Now, we are remaining with four clusters named A, BC, DE, F.

**Step- 3:** We again compute the proximity of new clusters and merge the similar clusters to form new clusters A, BC, DEF.

**Step- 4:** Again, compute the proximity of the newly formed clusters. Now, the clusters named DEF and BC are similar and combine together to form a new cluster. Therefore, now we are left with two clusters named A, BCDEF.

**Step- 5: **Finally, all the clusters are combined together and form a single cluster and our procedure is completed for the given algorithm.

Therefore, the pictorial representation of the above example is shown below:

In simple words, Divisive Hierarchical Clustering is working in exactly the opposite way as Agglomerative Hierarchical Clustering. In Divisive Hierarchical Clustering, we consider all the data points as a single cluster, and after each iteration, we separate the data points from the cluster which are not similar. Each data point that is separated is considered as an individual cluster. Finally, we’ll be remaining with n number of clusters.Since we’re dividing the single clusters into n clusters, it is named Divisive Hierarchical Clustering. This technique is not much used in real-world applications.

The popular linkage methods used in Hierarchical Clustering are as follows:**Complete-linkage: **In this method, the distance between two clusters is defined as the maximum distance between two data points from each cluster.

**Single-linkage: **In this method, the distance between two clusters is defined as the minimum distance between two data points in each cluster.

**Average-linkage:** In this method, the distance between two clusters is defined as the average distance between each data point in one cluster to every data point in the other cluster.

**Centroid-linkage: **In this method, we find the centroid of cluster 1 and the centroid of cluster 2 and then calculate the distance between the two before merging.

- This approach can differentiate between non-elliptical shapes as long as the gap between the two clusters is not small.

**Cons of Single-linkage:**

- This approach cannot separate clusters properly if there is noise between clusters.

**Pros of Complete-linkage:**

- This approach gives well-separating clusters if there is some kind of noise present between clusters.

**Cons of Complete-Linkage:**

- This approach is biased towards globular clusters.
- It tends to break large clusters.

In the group average method, we take all the pairs of data points and calculate their similarities and find the average of all the similarities.Mathematically this can be written as,

**sim(C _{1}, C_{2}) = ∑ sim(P_{i}, P_{j})/|C_{1}|*|C_{2}|**

**where P _{i} ∈ C_{1} & P_{j} ∈ C_{2}**

** Image Source: Google Images**

__Pros of Group Average:__

- This approach does well in separating clusters if there is noise present between the clusters.

__Cons of Group Average:__

- This approach is biased towards globular clusters.

In this method, we find the similarity between clusters by calculating the sum of the square of the distances P_{i }and P_{j}.Mathematically this can be written as,

**sim(C _{1}, C_{2}) = ∑ (dist(P_{i}, P_{j}))²/|C_{1}|*|C_{2}|**

__Pros of Ward’s method:__

- This approach also does well in separating clusters if there are some of the noisy points or outliers present between the clusters.

**Cons of Ward’s method:**

- This approach is biased towards globular clusters.

A dendrogram is defined as a **tree-like structure** that is mainly used to store each step as a memory that the Hierarchical clustering algorithm performs. In the dendrogram plot, the Y-axis represents the Euclidean distances between the observations, and the X-axis represents all the observations present in the given dataset.

** Image Source: Google Images**

Please read the Wikipedia article for a better understanding of Dendrogram:

The working of the dendrogram can be easily understood using the given below diagram:

** Image Source: Google Images**

In the above diagram, the left part is showing how clusters are formed during the execution of the agglomerative clustering algorithm, and the right part is showing the corresponding dendrogram formed.

**Step by step working of the algorithm:**

**Step-1:** Firstly, the data points P2 and P3 merged together and form a cluster, correspondingly a dendrogram is created, which connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.

**Step-2: **In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous, as the Euclidean distance between P5 and P6 is a little bit greater than the P2 and P3.

**Step-3:** Again, two new dendrograms are formed that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6, in another dendrogram.

**Step-4: **At last, the final dendrogram is formed that combines all the data points together.

__Note:__ As per our requirement according to the problem statement, we can cut the dendrogram at any level.

** Image Source: Google Images**

A dendrogram can be understood as a column graph or a row graph. Some dendrograms are circular or have a fluid shape, but the software will usually produce a row or column graph. No matter what the shape, the basic graph consists of the following same parts:

- For arranging the Clades (branch), take the help of how similar (or dissimilar) they are. Clades that are having or close to the same height are similar to each other; whereas clades with different heights are dissimilar from each other, which implies that the more the value of the difference in height, the more will be dissimilar.
- Each clade has one or more leaves.

Now for the given diagram, we have:

- Leaves A, B, and C are more similar to each other than they are to leaves D, E, or F.
- Leaves D and E are more similar to each other than they are to leaves A, B, C, or F.
- Leaf F is substantially different from all of the other leaves due to different heights in the dendrogram.

Theoretically, a clade can have an infinite amount of leaves but the more leaves we have, the harder the dendrogram will be to read and understand with the help of naked eyes.

In a dendrogram, look for the largest vertical line which doesn’t cross any horizontal line.With the help of this line, we can draw a horizontal line and then, the points where this horizontal line cross over the various vertical lines, we count all those intersecting points, and then count of intersecting points is the ideal answer for the number of clusters the dataset can have.

**Time complexity: **Since we have to perform n iterations and in each iteration, we need to update the proximity matrix and also restore that matrix, therefore the time complexity is also very high. So, the time complexity is the order of the cube of n.

**Time complexity = O(n³) where n is the number of observations.**

The given situation happens due to either of the following reasons:

- Change in Proximity function
- Change in the number of data points or variables.

These changes will lead to different Clustering results and hence different dendrograms.

To take the decision for the number of clusters that can best depict different clusters can be chosen by carefully observing the dendrogram formed during the algorithm run.

The best choice of the number of clusters is the number of vertical lines in the dendrogram intersect by a horizontal line that can transverse the maximum distance vertically without intersecting a cluster.

In the above example, the best choice of the number of clusters will be 4 since the red horizontal line in the dendrogram below covers the maximum vertical distance AB.

Some of the popular approaches to stop combining the clusters are as follows:

When we don’t want to form, **for example**, 250 clusters, then we choose the K value. Therefore, we decide the number of clusters (say, the first five or six) required in the beginning, and we complete the process when we reach the value K. This is done so that we can put a limit on the incoming information.

It becomes very important especially when we are feeding it into another algorithm that probably requires three or four values.

** Possible challenges: **This approach only makes sense when you know the data well or you have some domain knowledge when you’re clustering with K clusters. But if you’re analyzing a brand new dataset, then you may not know how many clusters we are required.

We keep clustering until the next combination of clusters gives a bad cluster with a low cohesion setup. This implies that the points are so close to being in both the clusters that it doesn’t make sense to keep them together.

The diameter of a cluster is defined as the maximum distance between any pair of observations in the cluster.

We stop combining the clusters when the diameter of a new cluster formed exceeds the threshold. Moreover, we don’t want the two clusters to overlap as the diameter increases.

The radius of a cluster is defined as the maximum distance of a point from the centroid.

We stop combining the clusters when the radius of a new cluster formed exceeds the threshold(decided by the user itself according to the problem statement).

To find the optimal number of clusters, Silhouette Score is considered to be one of the popular approaches. This technique measures how close each of the observations in a cluster is to the observation in its neighboring clusters.Let a_{i} be the mean distance between an observation i and other observations in the cluster to which observation i assigned.

Let b_{i} be the minimum mean distance between an observation i and observation in other clusters.

__Conclusion:__

- The range of the Silhouette Scores is from -1 to +1. Higher the value of the Silhouette Score indicates observations are well clustered.
- Silhouette Score = 1 describes that the data point (i) is correctly and well-matched in the cluster assignment.

There are many measures to find the goodness of clusters but the most popular one is the **Dunn’s Index. **Dunn’s index is defined as the ratio of the minimum inter-cluster distances to the maximum intra-cluster diameter and the diameter of a cluster is calculated as the distance between its two furthermost points i.e, maximum distance from each other.

__Conclusion:__

- In order to have well separated and compact clusters, we should aim for a higher value of Dunn’s index.

** Image Source: Google Images**

- We can obtain the optimal number of clusters from the model itself, human intervention not required.
- Dendrograms help us in clear visualization, which is practical and easy to understand.

**Disadvantages of Hierarchical Clustering:**

- Not suitable for large datasets due to high time and space complexity.
- There is no mathematical objective for Hierarchical clustering.
- All the approaches to calculate the similarity between clusters has their own disadvantages.

*Thanks for reading!*

I hope you enjoyed the questions and were able to test your knowledge about the Hierarchical Clustering Algorithm.

If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on** **the** Link**

Please feel free to contact me** **on** Linkedin, Email.**

Something not mentioned or want to share your thoughts? Feel free to comment below And I’ll get back to you.

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the **Indian Institute of Technology Jodhpur(IITJ). **I am very enthusiastic about Machine learning, Deep Learning, and Artificial Intelligence.

*The media shown in this article on Sign Language Recognition are not owned by Analytics Vidhya and are used at the Author’s discretion.*

Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Become a full stack data scientist
##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

Understanding Cost Function
Understanding Gradient Descent
Math Behind Gradient Descent
Assumptions of Linear Regression
Implement Linear Regression from Scratch
Train Linear Regression in Python
Implementing Linear Regression in R
Diagnosing Residual Plots in Linear Regression Models
Generalized Linear Models
Introduction to Logistic Regression
Odds Ratio
Implementing Logistic Regression from Scratch
Introduction to Scikit-learn in Python
Train Logistic Regression in python
Multiclass using Logistic Regression
How to use Multinomial and Ordinal Logistic Regression in R ?
Challenges with Linear Regression
Introduction to Regularisation
Implementing Regularisation
Ridge Regression
Lasso Regression

Introduction to Stacking
Implementing Stacking
Variants of Stacking
Implementing Variants of Stacking
Introduction to Blending
Bootstrap Sampling
Introduction to Random Sampling
Hyper-parameters of Random Forest
Implementing Random Forest
Out-of-Bag (OOB) Score in the Random Forest
IPL Team Win Prediction Project Using Machine Learning
Introduction to Boosting
Gradient Boosting Algorithm
Math behind GBM
Implementing GBM in python
Regularized Greedy Forests
Extreme Gradient Boosting
Implementing XGBM in python
Tuning Hyperparameters of XGBoost in Python
Implement XGBM in R/H2O
Adaptive Boosting
Implementing Adaptive Boosing
LightGBM
Implementing LightGBM in Python
Catboost
Implementing Catboost in Python

Introduction to Clustering
Applications of Clustering
Evaluation Metrics for Clustering
Understanding K-Means
Implementation of K-Means in Python
Implementation of K-Means in R
Choosing Right Value for K
Profiling Market Segments using K-Means Clustering
Hierarchical Clustering
Implementation of Hierarchial Clustering
DBSCAN
Defining Similarity between clusters
Build Better and Accurate Clusters with Gaussian Mixture Models

Introduction to Machine Learning Interpretability
Framework and Interpretable Models
model Agnostic Methods for Interpretability
Implementing Interpretable Model
Understanding SHAP
Out-of-Core ML
Introduction to Interpretable Machine Learning Models
Model Agnostic Methods for Interpretability
Game Theory & Shapley Values

Deploying Machine Learning Model using Streamlit
Deploying ML Models in Docker
Deploy Using Streamlit
Deploy on Heroku
Deploy Using Netlify
Introduction to Amazon Sagemaker
Setting up Amazon SageMaker
Using SageMaker Endpoint to Generate Inference
Deploy on Microsoft Azure Cloud
Introduction to Flask for Model
Deploying ML model using Flask