Adding Explainability to Clustering

Nibedita Dutta Last Updated : 26 May, 2022


This article was published as a part of the Data Science Blogathon.

Introduction

The ability to explain decisions is becoming increasingly important across businesses. Explainable AI is no longer an optional add-on when ML algorithms are used for corporate decision-making. While many techniques have been developed for explaining supervised algorithms, comparatively little work has gone into explaining unsupervised techniques.

Clustering is an unsupervised learning technique used to find the intrinsic groups present in unlabelled data. For instance, a B2C business might be interested in finding segments in its customer base. Clustering is therefore commonly used for use cases such as customer segmentation, market segmentation, pattern recognition and search result clustering. Standard clustering techniques include K-means, DBSCAN and hierarchical clustering.

Bringing Explainability To Clustering

Clusters created using techniques like K-means are often hard to interpret because it is difficult to determine why a particular row of data was assigned to a particular cluster. Knowing the boundary conditions that separate one cluster from another is an insight that businesses can use to move data items (such as customers) from one cluster to a more profitable one, which is usually useful for decision-makers. For example, if a business has an inactive customer segment and a fairly active customer segment, knowing the boundary values of the variables that would move a customer from the inactive to the active segment would be highly insightful.

Recently, work by Dasgupta et al. [1] has addressed this problem by exploring ways to make clustering explainable while preserving accuracy. The approach centres on partitioning a dataset into K clusters using decision trees. In this article, we dig into two algorithms from this line of work, IMM and ExKMC, and show how to use them in Python.

Figure 1. Partitioning a dataset into K clusters using K-means and decision trees (IMM & ExKMC) – Source

IMM Clustering

Iterative Mistake Minimization (IMM) clustering is a tree-based clustering algorithm that builds a decision tree with the same number of leaves as the number of clusters considered in K-means clustering.

The following steps describe how the algorithm works on a high level:

Finding a clustering solution using some non-explainable clustering algorithm (like K-means)
Labelling each example according to its cluster
Calling a supervised algorithm that learns a decision tree
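
The three steps above can be sketched with scikit-learn alone. This is only a rough approximation and not IMM itself: it assumes scikit-learn's bundled copy of Iris as the data, and it fits an ordinary `DecisionTreeClassifier` capped at K leaves, which chooses splits by impurity rather than by minimising mistakes. The variable names are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data stands in for a real dataset.
X = load_iris().data
k = 3

# Step 1: a reference clustering from a non-explainable algorithm.
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Step 2: label every example with its cluster.
labels = kmeans.labels_

# Step 3: learn a decision tree on those labels, capped at k leaves.
# Note: this is a generic CART tree, not the IMM splitting rule.
surrogate_tree = DecisionTreeClassifier(max_leaf_nodes=k, random_state=0)
surrogate_tree.fit(X, labels)

# Fraction of points the tree places in the same cluster as K-means.
agreement = (surrogate_tree.predict(X) == labels).mean()
print(f"leaves: {surrogate_tree.get_n_leaves()}, agreement: {agreement:.2%}")
```

Any point where the tree and K-means disagree corresponds to what the IMM literature calls a mistake.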

Figure 2. The split moves a red point into the cluster of blue points. In IMM clustering this is known as a mistake, because the split separates the red point from its original cluster – Source

Under the hood, the following steps happen while building the decision tree:

A reference set of K centres from a standard clustering algorithm is obtained for a dataset X.
Each data point Xj is assigned the label yj based on the centre it is closest to.
A decision tree is then built top-down using binary splits.
If a node contains two or more of the reference centres, it is split again. This is done by picking a feature and a threshold value such that the resulting split sends at least one reference centre to each side and produces the fewest mistakes, that is, separates the fewest points from their corresponding centres.
The optimal split is found by efficiently scanning all feature-threshold pairs using dynamic programming. The resulting node is then added to the tree.
The tree stops growing where each of the K centres is in its own leaf.
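
To make the notion of a mistake-minimising split concrete, here is a minimal sketch of the split search at a single node. It is a brute-force scan rather than the dynamic-programming search described above, and the function name `best_split` and the toy data are hypothetical.

```python
import numpy as np

def best_split(X, labels, centres):
    """Return (feature, threshold, mistakes) for the split with the fewest mistakes.

    A point counts as a mistake when the split sends it to the opposite
    side from the reference centre it is labelled with.
    """
    best = None
    for feature in range(X.shape[1]):
        centre_vals = centres[:, feature]
        lo, hi = centre_vals.min(), centre_vals.max()
        for t in np.unique(X[:, feature]):
            # Keep only thresholds that leave at least one centre on each side
            # (values <= t go left, values > t go right).
            if not (lo <= t < hi):
                continue
            point_left = X[:, feature] <= t
            centre_left = centre_vals[labels] <= t
            mistakes = int(np.sum(point_left != centre_left))
            if best is None or mistakes < best[2]:
                best = (feature, float(t), mistakes)
    return best

# Toy node: two clusters separated along feature 0; feature 1 cannot separate the centres.
X = np.array([[0.0, 5.0], [1.0, 4.0], [2.0, 6.0],
              [8.0, 5.0], [9.0, 4.0], [10.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
centres = np.array([[1.0, 5.0], [9.0, 5.0]])

print(best_split(X, labels, centres))  # (0, 2.0, 0): split on feature 0 at 2.0, no mistakes
```

In a full implementation this search would be repeated at every node that still holds two or more centres, using only the points and centres that reached that node.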

Growing a Bigger Tree – ExKMC Clustering

ExKMC is an extension of IMM where the authors [3] proposed growing the trees, such that the number of leaves exceeds the number of clusters (from K-means), to achieve better partitioning.

The algorithm takes as input a value K, a dataset X, and a number of leaves K’ > K.
ExKMC starts with a set of K reference centers taken from any clustering algorithm. This is followed by building a threshold tree with K leaves (IMM algorithm).
The best feature-threshold pair to expand the tree one node at a time is computed.
The tree is expanded by splitting the node with the most improvement to the surrogate cost. Here, the surrogate cost is the sum of squared distances between data points and their closest reference centre (as obtained from K-means).

Given reference centres μ₁, …, μ_k and a threshold tree T that defines the clustering (C¹, …, C^k), the surrogate cost is:
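
Written out, one reconstruction of this definition, following the formulation in the ExKMC paper [3] as best it can be recovered here (so the exact notation should be treated as an assumption), is:

```latex
\widetilde{\mathrm{cost}}^{\mu_1,\dots,\mu_k}(T)
  \;=\; \sum_{j} \; \min_{i \in \{1,\dots,k\}} \; \sum_{x \in C^{j}} \lVert x - \mu_i \rVert_2^{2}
```

Here the outer sum runs over the leaves C^j of T. In words, each leaf is matched to the single reference centre that serves its points best, and the squared distances to that centre are summed.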

Splitting of a node happens with the combination of a feature and threshold that leads to the most improvement to the surrogate cost.
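
The effect of a split on the surrogate cost can be illustrated with a small NumPy sketch. It assumes the reading in which each leaf is charged against its single best reference centre; the function name `surrogate_cost` and the toy data are hypothetical and not part of the ExKMC package.

```python
import numpy as np

def surrogate_cost(X, leaf_ids, centres):
    """Sum over leaves of the squared-distance cost under each leaf's best centre."""
    total = 0.0
    for leaf in np.unique(leaf_ids):
        points = X[leaf_ids == leaf]
        # Squared distance from every point in the leaf to every centre: shape (n_points, k).
        sq_dist = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        # The leaf is charged the cheapest single centre for all of its points.
        total += sq_dist.sum(axis=0).min()
    return float(total)

# Toy one-dimensional data with two reference centres.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
centres = np.array([[1.0], [11.0]])

one_leaf = np.zeros(4, dtype=int)        # before splitting: everything in one leaf
two_leaves = np.array([0, 0, 1, 1])      # after splitting between 2.0 and 10.0

before = surrogate_cost(X, one_leaf, centres)
after = surrogate_cost(X, two_leaves, centres)
print(before, after, before - after)     # 204.0 4.0 200.0
```

The difference `before - after` is the improvement that a split would be scored on when choosing which node to expand next.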

Implementing in Python

The Python package for IMM and ExKMC is publicly available and can be installed with the following command:

pip install ExKMC

We use the Iris dataset here to analyze how the two algorithms perform in terms of the ‘mistakes’ they produce with respect to the reference K-means clustering.

1) We start by importing the libraries that we need:

import numpy as np
import pandas as pd
from ExKMC.Tree import Tree
from IPython.display import Image

2) We then read the dataset:

df1 = pd.read_csv(r"/content/drive/MyDrive/iris.csv")

3) The tree we build in the next step requires a fitted K-means model, so we standardise the features and run K-means with default settings and k = 3.

from sklearn.cluster import KMeans

# Keep only the numeric features and standardise each one to zero mean and unit variance
X = df1.drop('variety', axis=1).astype(float)
X = (X - X.mean()) / X.std(ddof=0)

k = 3
kmeans = KMeans(n_clusters=k, random_state=43)
kmeans.fit(X)
p = kmeans.predict(X)  # reference cluster label for each row
class_names = np.array(['Setosa', 'Versicolor', 'Virginica'])

4) Following the IMM algorithm, we build the decision tree. It needs the number of clusters used in K-means (k) and the fitted K-means model:

tree = Tree(k=k)
tree.fit(X, kmeans)

5) Next, we plot the tree that we just built using the code below:

tree.plot(filename="test", feature_names=X.columns)
Image(filename='test.gv.png')

Output:

Because X was standardised before clustering, some of the thresholds in the decision tree are negative.
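
To read a threshold in the original measurement units, the scaling can be inverted. The sketch below shows the arithmetic on made-up numbers; the helper name `to_original_units` is hypothetical, and the values are illustrative rather than taken from the Iris data.

```python
import numpy as np

def to_original_units(threshold, mean, std):
    """Undo z-score scaling: z = (x - mean) / std, so x = z * std + mean."""
    return threshold * std + mean

# Illustrative (made-up) feature values, not the Iris measurements.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
mean, std = values.mean(), values.std()  # population std (ddof=0), matching the scaling step

scaled_threshold = -0.5  # a negative threshold as it might appear in the plotted tree
print(to_original_units(scaled_threshold, mean, std))
```

A negative scaled threshold simply means a cut-off below the feature's mean.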

6) Until now, we have used the IMM algorithm – which is why we got the decision tree with exactly K leaves. In order to have a decision tree with K’ > K leaves, we will use the ExKMC algorithm. To use this, we can pass another parameter called ‘max_leaves’ to our Tree object. This parameter would take the number of leaves (or K’) that we would like to have in our decision tree.

tree = Tree(k=k, max_leaves=6)
tree.fit(X, kmeans)
tree.plot(filename="test", feature_names=X.columns)
Image(filename='test.gv.png')

Output:

As the figure above shows, the tree has 6 leaves, matching the max_leaves value we passed to the Tree object.

Decisioning Through the Trees

We will start by analyzing the output of the tree built by the IMM algorithm.

The first output came from the IMM algorithm, with the number of leaves equal to the number of clusters (3). The rules in the resulting decision tree provide the explainability aspect. Because a decision tree with 3 leaves cannot reproduce the K-means clusters exactly, we see some mistakes at the bottom 2 nodes. As mentioned before, mistakes here are the points separated from their reference cluster centre (from K-means) at each node. For example, at the left-most bottom leaf, 8 samples belong to some other cluster.
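
Mistakes can also be counted directly by comparing the tree's cluster assignments with the K-means labels. The helper below is a hypothetical sketch on made-up labels. With the variables from this article it would presumably be called as count_mistakes(p, tree.predict(X)); that assumes the ExKMC Tree object exposes a predict method that returns labels on the same numbering as K-means, which is worth checking against the installed version.

```python
import numpy as np

def count_mistakes(reference_labels, tree_labels):
    """Total and per-cluster counts of points whose tree label differs from the reference."""
    reference_labels = np.asarray(reference_labels)
    tree_labels = np.asarray(tree_labels)
    wrong = reference_labels != tree_labels
    per_cluster = {int(c): int(wrong[reference_labels == c].sum())
                   for c in np.unique(reference_labels)}
    return int(wrong.sum()), per_cluster

# Made-up labels for illustration only.
reference = [0, 0, 0, 1, 1, 2, 2, 2]
from_tree = [0, 0, 1, 1, 1, 2, 0, 2]

total, per_cluster = count_mistakes(reference, from_tree)
print(total, per_cluster)  # 2 {0: 1, 1: 0, 2: 1}
```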

The second output came from ExKMC, which produced a decision tree with 6 leaves. Because this tree is deeper than the one produced by the IMM algorithm, we see fewer mistakes at the 6 leaves. As Table 1 below shows, some splits in the tree are redundant for explanation purposes, e.g. the node at the bottom right with the decision rule Sepal Length >= 1.280.

Table 1. Decision rules for explaining each cluster (from the output of the ExKMC algorithm)

Decision Rule	Cluster	Number of Samples (in Leaf)	Mistakes
Petal Length > 1.056 & Sepal Length <= 0.553 & Sepal Width <= -0.132	0	52	2
Petal Length <= 1.056	1	50	0
Petal Length > 1.056 & Sepal Length > 0.553 & Petal Width > 0.133	2	40	1
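
Rules like these translate directly into code, which is what makes them easy to hand to decision-makers. The sketch below copies the three rules from Table 1 as they are printed; the function name `cluster_from_rules` is hypothetical, the inputs are assumed to be standardised in the same way as X, and samples that land in a leaf the table does not list return None.

```python
def cluster_from_rules(petal_length, petal_width, sepal_length, sepal_width):
    """Apply the three rules from Table 1 to one standardised sample.

    Returns the cluster number, or None when the sample falls in a leaf
    that Table 1 does not list.
    """
    if petal_length <= 1.056:
        return 1
    if sepal_length <= 0.553 and sepal_width <= -0.132:
        return 0
    if sepal_length > 0.553 and petal_width > 0.133:
        return 2
    return None

# Made-up standardised samples for illustration.
print(cluster_from_rules(0.0, 0.0, 0.0, 0.0))    # 1
print(cluster_from_rules(1.5, 0.0, 0.2, -0.5))   # 0
print(cluster_from_rules(1.5, 0.5, 1.0, 0.0))    # 2
print(cluster_from_rules(1.5, 0.0, 1.0, 0.0))    # None
```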

Conclusion

Standard clustering techniques are difficult to interpret because they do not pinpoint the reasons behind the formation of the clusters. Knowing the rules or boundary conditions that place a data point in a certain cluster can be very insightful for decision-makers. Algorithms that bring explainability to clustering are therefore sought after across the industry.

We looked at two techniques that use decision trees for clustering in order to bring in explainability. The partitions used in the decision tree can be used to explain how the different clusters were formed.
IMM clustering creates a decision tree with as many leaves as there are clusters in the reference K-means clustering.
ExKMC extends IMM by growing a tree with more leaves than there are K-means clusters, to achieve better partitioning.

Connect with me on LinkedIn: Nibedita Dutta

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Nibedita Dutta

Nibedita completed her master’s in Chemical Engineering from IIT Kharagpur in 2014 and is currently working as a Senior Data Scientist. In her current capacity, she works on building intelligent ML-based solutions to improve business processes.
