K-means clustering is a popular and powerful unsupervised machine learning technique that groups data points based on their similarity or closeness to one another. How exactly do we cluster them? Which methods does K-means use to form the clusters? This article answers these questions. Before we begin, take a close look at the clustering example below. It’s easy to interpret, right? The data points are grouped into 3 clusters based on their similarity or closeness.
Table Of Contents
1. Introduction to K-Means
2. K-Means++ Algorithm
3. How To Choose K Value in K-Means?
4. Practical Considerations in K-Means
Let’s understand K-means clustering with an everyday example. These days almost everybody loves to watch web series or movies on Amazon Prime or Netflix. Have you noticed that whenever you open Netflix, movies are grouped together by genre, such as crime, suspense, and so on? That genre grouping is an easy way to understand clustering. Now let’s look at the k-means clustering algorithm in more detail.
Definition: K-means groups data points based on their similarity or closeness to each other. In simple terms, the algorithm finds the data points whose values are similar to each other, and these points then belong to the same cluster.
So how does the algorithm measure how similar two points are? It uses a ‘Distance Measure’, and the distance measure used here is the ‘Euclidean Distance’.
Observations that are closer or more similar to each other have a low Euclidean distance and are therefore clustered together.
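As a small illustration, the Euclidean distance between two points is the square root of the summed squared differences of their coordinates. The sketch below is only an example: the function name `euclidean_distance` and the sample points are made up for illustration and are not taken from the article.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Illustrative points (assumed for this example)
a = (1.0, 2.0)
b = (4.0, 6.0)
c = (1.5, 2.5)

print(euclidean_distance(a, b))  # 5.0
print(euclidean_distance(a, c))  # roughly 0.71, so c is much closer to a than b is
```

Here `c` would be grouped with `a` long before `b` would, because its distance to `a` is far smaller.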
One more concept you need to know to understand K-means is the ‘Centroid’: the center of a cluster, computed as the mean of the points in it. The k-means algorithm uses centroids to create ‘k clusters’.
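A centroid can be computed by averaging each coordinate over the points of a cluster. This is a minimal sketch; the helper name `centroid` and the sample cluster are assumptions made for the example.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Illustrative cluster (assumed for this example)
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(centroid(cluster))  # (3.0, 4.0)
```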
So now you are ready to understand the steps in the k-means clustering algorithm.
Steps in K-Means:
step1: choose a value for k, for example k=2
step2: initialize the k centroids randomly
step3: calculate the Euclidean distance from each data point to every centroid and assign each point to the cluster of its nearest centroid
step4: recompute the centroid of each cluster and update the centroids
step5: repeat step3 and step4 until the centroids no longer change
Each time the clusters are formed, the centroids are updated; the updated centroid is the center of all the points that fall in the cluster. This process continues until the centroids no longer change, i.e. the solution converges.
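The steps above can be sketched in plain Python. This is only an illustrative version, not a reference implementation: the function names, the toy data, the `max_iters` and `seed` parameters, and the choice to use randomly picked data points as the initial centroids are all assumptions made for the example.

```python
import math
import random

def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # step2: pick k distinct data points as the initial centroids (an assumption)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # step3: assign every point to the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # step4: recompute each centroid; an empty cluster keeps its old centroid
        new_centroids = [
            centroid(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        # step5: stop once the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Toy data with two well-separated groups (assumed for this example)
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
for c, members in zip(centroids, clusters):
    print(c, members)
```

On this toy data the loop stops after a few iterations, with the three points near (1, 1) in one cluster and the three points near (8.5, 8.5) in the other.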
So what’s next? How should the initial centroids be chosen?
2. K-Means++ Algorithm:
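K-means++ answers that question by spreading the initial centroids apart: the first centroid is a data point chosen uniformly at random, and each following centroid is a data point sampled with probability proportional to its squared distance from the nearest centroid chosen so far. The sketch below shows only this seeding step; the function names, the `seed` parameter, and the toy data are assumptions made for the example, and it assumes the data contain at least k distinct points.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means_pp_init(points, k, seed=0):
    rng = random.Random(seed)
    # First centroid: a data point chosen uniformly at random
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest already-chosen centroid
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Next centroid: sampled with probability proportional to that squared distance,
        # so far-away points are more likely and already-chosen points have weight 0
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

# Toy data (assumed for this example)
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
print(k_means_pp_init(points, k=2))
```

The centroids returned here would replace the purely random initialization in step2; the remaining steps of k-means stay the same.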
3. How To Choose K Value In K-Means:
step1: run the clustering algorithm for different values of k,
for example k=[1,2,3,4,5,6,7,8,9,10].
step2: for each k, calculate the within-cluster sum of squares (WCSS).
step3: plot the curve of WCSS against the number of clusters k.
step4: the location of the bend (the “elbow”) in the plot is generally considered an indicator of the appropriate number of clusters.
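The procedure above can be sketched as follows. WCSS is computed here as the sum, over all points, of the squared distance to the nearest centroid. Everything named in the block (the helper functions, the toy data with three groups, the `seed` and `max_iters` parameters) is an assumption made for the example, and the plot from step3 is replaced by a printed table to keep the sketch free of external libraries.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iters=100, seed=0):
    """Basic k-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random data points as initial centroids
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

def wcss(points, centroids):
    """Within-cluster sum of squares: squared distance of each point to its nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Toy data with three visible groups (assumed for this example)
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (2.0, 1.0),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0), (8.0, 9.0),
    (1.0, 9.0), (1.5, 8.0), (2.0, 9.0), (1.0, 8.5),
]

# step1 and step2: run k-means for k = 1..10 and record the WCSS for each k
scores = {k: wcss(points, k_means(points, k)) for k in range(1, 11)}

# step3: inspect WCSS against k (plot these pairs to look for the bend)
for k, score in scores.items():
    print(k, round(score, 2))
```

Because the result depends on the random initialization, the curve can vary from run to run; on data like this the drop in WCSS typically flattens once k reaches the number of visible groups.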
4. Practical Considerations In K-Means:
- Choosing the number of clusters (K) in advance.
- Standardization of data (scaling).
- Categorical data (can be handled with K-Modes).
- Impact of initial centroids and outliers.
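To see why scaling matters: Euclidean distance adds up squared differences across features, so a feature measured in large units can dominate the result. One common remedy is z-score standardization, which rescales each feature to mean 0 and standard deviation 1. The sketch below is illustrative only; the function name `standardize` and the age and income values are made up for the example.

```python
import math

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-score)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col))
        for col, m in zip(columns, means)
    ]
    # A constant column (standard deviation 0) is mapped to 0.0 to avoid dividing by zero
    return [
        tuple((x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds))
        for row in rows
    ]

# Hypothetical data: (age in years, yearly income)
rows = [(25, 30000), (35, 60000), (45, 90000)]
for row in standardize(rows):
    print(row)
```

Before scaling, distances between these rows are driven almost entirely by income; after scaling, both features contribute on the same footing.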
5. Cluster Tendency:
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.