
The Most Comprehensive Guide to K-Means Clustering You’ll Ever Need

• ARJUN CHAUDHURI says:

Hi Pulkit, Thank you for this excellent article on the subject – one of the most comprehensive ones I have read. My question: let's say I have 7 distinct clusters arrived at using the techniques you have mentioned. How can I come up with relevant criteria/rules, using some ML algorithm, such that any new observation can be assigned to one of the clusters by passing through the decision rule instead of running K-Means again?

• Pulkit Sharma says:

Hi Arjun,
Glad that you liked the article!
For a new observation, you first calculate the distance of that observation to each of the cluster centroids (7, as you have mentioned) and then assign it to the cluster whose centroid is closest. In this way you can assign new observations to the existing clusters without re-running K-Means.
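With sklearn, this nearest-centroid assignment is exactly what a fitted model's predict() does. A minimal sketch, assuming hypothetical toy data in place of the original dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data standing in for the original training set
rng = np.random.RandomState(0)
X = rng.rand(70, 2)

# Fit k-means with 7 clusters, as in the question
model = KMeans(n_clusters=7, n_init=10, random_state=0).fit(X)

# A new observation is assigned to the cluster with the nearest
# centroid, without re-running the whole algorithm
new_point = np.array([[0.5, 0.5]])
label = model.predict(new_point)
print(label[0])  # a cluster index between 0 and 6
```
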

• Rajiv says:

Hi Pulkit,

Thanks for the post. Kindly clarify me:

1. In the “Wholesale Customer Data” dataset, the variables Region and Channel are categorical. In mathematical terms, we cannot define a distance between different categories of a categorical variable. But we converted them to a numeric form here and the distances were calculated. How can we justify the use of these variables while clustering?

2. Usually, in most real-world problems, we have datasets of mixed form (containing both numerical and categorical features). Is it OK to apply the same k-means algorithm on such datasets?

-Rajiv

• Sumit says:

– It is not advisable to use the ordinal form of categorical variables in clustering; you have to convert them into numeric values that make more sense alongside the rest of the data points. You can use one of the following methods to convert them into numeric form:
1. Use one-hot encoding (so that no category numerically influences another)
2. If you have a classification problem, use target encoding to encode the categorical variables
3. If the categories are ordinal in nature, then you may use label encoding
4. Find the correlation between the categorical variable and all the numeric variables, then replace each category with the mean of the numeric variable that has the highest correlation with the categorical variable. Correlation can be found using a one-way ANOVA test.

I would recommend method 4 above.
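Method 1 above can be sketched with pandas, assuming a small hypothetical stand-in for the Wholesale Customers columns:

```python
import pandas as pd

# Hypothetical stand-in for a slice of the Wholesale Customers data
df = pd.DataFrame({
    "Channel": [1, 2, 1, 2],
    "Region":  [3, 1, 2, 3],
    "Milk":    [9656, 1762, 880, 1338],
})

# One-hot encode so no category is numerically "larger" than another;
# each category becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["Channel", "Region"])
print(encoded.columns.tolist())
```
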

• Pulkit Sharma says:

Hi Sumit,
Thanks for sharing these approaches to deal with categorical data while working with K-means algorithm.

• Joshua Larky says:

You may be interested in investigating the K-Modes clustering algorithm, which handles categorical data (and its K-Prototypes extension, which handles mixed numerical and categorical data), whereas K-Means clustering is strictly for numerical data.
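K-Modes replaces Euclidean distance with a simple matching dissimilarity (count of mismatched attributes) and replaces centroid means with per-attribute modes. A sketch of just that dissimilarity, with hypothetical records; the `kmodes` package provides a full implementation:

```python
def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records disagree --
    the distance K-Modes uses in place of Euclidean distance."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical categorical records
r1 = ("Retail", "Lisbon", "High")
r2 = ("Horeca", "Lisbon", "Low")

print(matching_dissimilarity(r1, r2))  # differs on 2 of 3 attributes
```
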

• Thothathiri S says:

Clustering is explained very well. Thanks for the article in Python.
Can you clarify the points below?
1) In the wholesale example, all the columns are considered for clustering. Do the Channel and Region columns also need to be included, given that there is little variation in them?
2) After identifying the cluster groups, how do we write the cluster group back into the raw data?

• Pulkit Sharma says:

Hi,
Thank you for your feedback on the article.
1) This is based on your exploration. I have created the model using all the available features. You can explore the data further and then include only the variables you think are useful.
2) Using K-Means, each point is assigned to a specific cluster. You can use model.predict() to find the cluster number for each observation.
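The write-back in point 2 can be sketched as follows, assuming a small hypothetical stand-in for the wholesale data:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for two columns of the wholesale data
raw = pd.DataFrame({"Fresh": [12669, 7057, 6353, 13265, 22615],
                    "Milk":  [9656, 9810, 8808, 1196, 5410]})

# Cluster on the scaled copy, then attach the labels back to the raw rows
scaled = StandardScaler().fit_transform(raw)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
raw["cluster"] = model.labels_  # same as model.predict(scaled) on training data
print(raw)
```
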

• Pon says:

Hi Pulkit,

1. C = []
2. for index, row in X.iterrows():
3.     min_dist = row[1]
4.     pos = 1
5.     for i in range(K):
6.         if row[i+1] < min_dist:
7.             min_dist = row[i+1]
8.             pos = i+1
9.     C.append(pos)

In line 3, I think it should be: min_dist=row[2],
and in line 6 it should be: if row[i+2] < min_dist:

• Pulkit Sharma says:

Hi Pon,
Have you tried running the code with the changes you have mentioned here? I tried it and it produced an error. Also, what is the logic behind the change you are suggesting?

• Saleem says:

Thanks for the article Pulkit. Can you please clarify my queries:
1. K-Means, by default, assigns the initial centroids through init='k-means++'. I hope this is taken care of by sklearn.
2. For imbalanced data with a class ratio of 100:1, can I generate labels through k-means and use them as a feature in my classification algorithm? Will it improve accuracy, as with KNN?

• Pulkit Sharma says:

Glad that you liked the article Saleem!
1. Yes! By default, the sklearn implementation of k-means initializes the centroids using the k-means++ algorithm, so even if you have not explicitly set the initialization to k-means++, it will automatically pick it.

2. You can cluster the points using k-means and use the cluster label as a feature for supervised learning. It is not always the case that accuracy will increase; it may increase or decrease. You can try it and check.
Also, when you have an imbalanced dataset, accuracy is not the right evaluation metric for your model. You can try the F1 score or AUC-ROC instead.
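A minimal sketch of using the cluster label as an extra feature, on hypothetical synthetic data (whether it helps depends on the dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical toy classification task
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Append the k-means cluster id as one additional feature column
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, clusters])

clf = LogisticRegression(max_iter=1000).fit(X_aug, y)
print(clf.score(X_aug, y))  # may or may not beat a model without the extra feature
```
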

Hope this will clarify your queries.

• Rishab Gupta says:

Hey Pulkit, this is a really great article and it really helps a lot to get a clear understanding of k-means.
I am trying to replicate the process in R and I had a question about multiple variables.
So given a similar dataset, if I have multiple observations and multiple variables, is there a way I can run k-means on multiple variables? If yes, then is there a limit?

• Pulkit Sharma says:

Hi Rishab,
Yes, you can apply k-means if you have multiple variables. In Python I use the sklearn library to implement k-means; you can look for something similar in R as well.
There is no limit on the number of variables as such. It's just that the more variables you have, the more the computation time will increase.
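For instance, a sketch on hypothetical data with 10 variables; each centroid simply gains one coordinate per variable:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 100 observations, 10 numeric variables
rng = np.random.RandomState(0)
X = rng.rand(100, 10)

# k-means works the same way in any number of dimensions;
# only the computation time grows with more variables
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(model.cluster_centers_.shape)  # one centroid per cluster, one coordinate per variable
```
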

• Vincent Kizza says:

Awesome! You have given me a real push. Many thanks for the article.

• Pulkit Sharma says:

Thank you for your feedback Vincent!!

• Sujay says:

Hi ,
can you provide more information on the code “model.predict()” used to find the cluster number for each observation?

• Pulkit Sharma says:

Hi Sujay,
It takes each observation, finds the distance of that observation from all the cluster centroids, and then, depending on the distances, assigns it to the closest cluster. This is how predictions are made in k-means clustering.
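That nearest-centroid logic can be checked against model.predict() on hypothetical toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# predict(): for each observation, measure the distance to every centroid
# and pick the cluster whose centroid is closest
obs = X[:5]
dists = np.linalg.norm(obs[:, None, :] - model.cluster_centers_[None, :, :], axis=2)
manual_labels = dists.argmin(axis=1)
print(manual_labels)  # matches model.predict(obs)
```
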

• Wasiq says:

Hi,
Great article, and well explained for someone who has little to no experience or formal education in the field. Very intuitive. My question is about how I can isolate a specific cluster to do further analysis on it, or to test some hypothesis about a cluster. I have a decent understanding of algorithms due to an engineering background, but I lack the intuition for programming languages and thus am relatively inexperienced in Python.

• Pulkit Sharma says:

Hi Wasiq,
Thank you for your valuable feedback.
If you look at the last code block from the article:
frame = pd.DataFrame(data_scaled)
frame['cluster'] = pred

This frame dataframe will have a new variable named 'cluster' which tells the cluster number for each observation. You can then separate the data based on the cluster value.
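A sketch of that separation step, assuming hypothetical scaled data in place of the article's:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the scaled data
df = pd.DataFrame({"f1": [0.10, 0.20, 0.90, 1.00, 0.15],
                   "f2": [0.10, 0.05, 0.95, 0.90, 0.20]})
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(df)

frame = df.copy()
frame["cluster"] = pred

# Isolate a single cluster for further analysis (e.g. cluster 0)
cluster_0 = frame[frame["cluster"] == 0]
print(cluster_0)
```
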

• Nikhil says:

Hi Pulkit ,
Can you share any code where we apply supervised learning after clustering? Because that is how the flow works, right?

• Pulkit Sharma says:

Hi Nikhil,
Right now I don’t have any resource for this. Will surely look for it and share with you once I find some relevant resource.

• Maneesha says:

Hi Pulkit,

Thanks a lot for this amazing and well-explained article on K-means.
I am just confused about how distances are calculated in K-means when choosing the centroids. What is the default method for calculating distances, and can we specify another method in place of the default if we want to?

• Pulkit Sharma says:

Hi Maneesha,
By default, we use Euclidean distance to calculate the distances. You can use other distances as well, but then you'll have to write custom code to implement the algorithm with them.
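A sketch of a single assignment step using Manhattan (L1) distance on hypothetical data; sklearn's KMeans does not accept a custom metric, so a custom loop like this is what "write custom code" means in practice:

```python
import numpy as np

# Hypothetical data and initial centroids
rng = np.random.RandomState(0)
X = rng.rand(20, 2)
centroids = X[:3]  # pick the first 3 points as initial centroids

# Assignment step with Manhattan (L1) distance instead of Euclidean:
# sum of absolute coordinate differences to each centroid
l1 = np.abs(X[:, None, :] - centroids[None, :, :]).sum(axis=2)
labels = l1.argmin(axis=1)
print(labels)  # points X[0], X[1], X[2] land in their own clusters first
```

A full custom k-means would alternate this step with a centroid-update step (for L1, the per-coordinate median rather than the mean) until the labels stop changing.
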

• Kiran Arya says:

Hi Pulkit,

Thanks for the superb article. This is by far the most comprehensive piece on clustering I have come across. It would be great if you could also share how to evaluate the clusters created, along with how to use this output.

Thanks,
Kiran

• Pulkit Sharma says:

Hi Kiran,
To evaluate the clusters, you can use evaluation metrics such as inertia or the Dunn index. These will tell you how good your clusters are.
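Inertia is available directly from a fitted sklearn model. A sketch on hypothetical toy data, checking it against its definition (the sum of squared distances from every point to its own cluster centroid):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data
rng = np.random.RandomState(0)
X = rng.rand(50, 2)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Inertia by hand: squared distances of each point to its own centroid
manual = sum(((X[model.labels_ == k] - c) ** 2).sum()
             for k, c in enumerate(model.cluster_centers_))
print(model.inertia_)  # same value (lower inertia = tighter clusters)
```
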

• Suat ATAN says:

Awesome article. Thanks a lot. It is very very explanative, exciting and useful.

• Pulkit Sharma says:

Thank you Suat!!

• Kurt Schulzke says:

Great article. However, this phrase is missing important information:

“. . . inertia actually calculates the sum of all the points within a cluster from the centroid of that cluster.”

I believe the correct statement is as follows:

“. . . inertia actually calculates the sum of the distances of all the points within a cluster from the centroid of that cluster.”

• Pulkit Sharma says:

Thank you for pointing that out, Kurt. That is what I meant by the statement. I have updated it in the article.

• Sunny says:

Pulkit, this is one of the most simplified introductions to K-means for new entrants to data science. Thanks very much.

If you have written any article on anomaly detection techniques using K-means, I would be interested. If you can share it, it will be much appreciated.

• Pulkit Sharma says:

Hi Sunny,
As of now, I have not covered anomaly detection in my articles. I will share it with you once I come across something relevant.

• Loanvenue says:

I really enjoyed your blog. Thanks for sharing such an informative post.

• Pulkit Sharma says:

• Fernando Santos says:

• Pulkit Sharma says:

Good to hear that Fernando!!