This article was published as a part of the Data Science Blogathon.
K-means clustering is a very popular and powerful unsupervised machine learning technique in which we group data points based on their similarity or closeness to one another. How exactly do we cluster them? Which methods does K-means use to form the clusters? This article answers these questions. Before we begin, take a close look at the clustering example below. What do you think? It is easy to interpret, right? The data points are grouped into 3 clusters based on their similarity or closeness.
1. Introduction to K-Means
2. K-Means++ Algorithm
3. How to Choose the K Value in K-Means?
4. Practical Considerations in K-Means
5. Cluster Tendency
Let's understand K-means clustering with an example from daily life. These days everybody loves to watch web series or movies on Amazon Prime or Netflix. Have you ever noticed what happens whenever you open Netflix? Movies are grouped together based on their genre, e.g. crime, suspense, etc. Netflix's genre grouping is an easy example for understanding clustering. Now let's look more closely at the K-means clustering algorithm.
Definition: K-means groups data points based on their similarity or closeness to each other. In simple terms, the algorithm finds the data points whose values are similar to each other, and these points then belong to the same cluster.
So how does the algorithm measure how close two points are? It uses a 'distance measure', and the distance measure here is the 'Euclidean distance': the square root of the sum of the squared differences between the coordinates of the two points.
Observations that are closer or more similar to each other have a low Euclidean distance and are therefore clustered together.
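To make the distance measure concrete, here is a minimal Python sketch of the Euclidean distance between two points. The function name euclidean_distance and the sample points are made up for illustration and do not come from any particular library.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Square the difference in each coordinate, add them up, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Illustrative points: the differences are 3 and 4, so the distance is 5.
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```

A small value means the two observations are similar, so they are good candidates for the same cluster.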
One more concept you need in order to understand K-means is the 'centroid': the center of a cluster, obtained by averaging the coordinates of all the points in it. The K-means algorithm uses centroids to create 'k clusters'.
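As a quick illustration, the centroid of a cluster can be computed as the coordinate-wise mean of its points. This is only a sketch; the function name centroid and the three sample points are assumptions made for the example.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points of equal dimension."""
    if not points:
        raise ValueError("cannot compute the centroid of an empty cluster")
    n = len(points)
    # zip(*points) groups all x values together, all y values together, and so on.
    return tuple(sum(coords) / n for coords in zip(*points))

# Illustrative cluster of three 2-D points.
print(centroid([(1, 1), (3, 1), (2, 4)]))  # (2.0, 2.0)
```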
Now you are ready to understand the steps of the K-means clustering algorithm.
Step 1: Choose a value for k, for example k = 2.
Step 2: Initialize the k centroids randomly.
Step 3: Calculate the Euclidean distance from each data point to every centroid, and assign each point to its closest centroid to form the clusters.
Step 4: Find the centroid of each cluster and update the centroids.
Step 5: Repeat steps 3 and 4 until the centroids stop changing.
Each time the clusters are formed, the centroids are updated. The updated centroid is the center of all the points that fall in the cluster. This process continues until the centroids no longer change, i.e. the solution converges.
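The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation. It assumes that "initialize centroids randomly" means picking k distinct data points at random, that an empty cluster simply keeps its old centroid, and that a max_iters cap is acceptable; the names kmeans, max_iters and seed are made up for this example.

```python
import math
import random

def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, max_iters=100, seed=0):
    """Return (centroids, clusters) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 2: assumption - use k randomly chosen data points as initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # Step 3: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # empty cluster: keep old centroid
        # Step 5: stop once the centroids no longer change (convergence).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Step 1: choose k. The toy data below has two obvious groups.
points = [(0, 0), (1, 1), (2, 2), (100, 100), (101, 101), (102, 102)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # [(1.0, 1.0), (101.0, 101.0)]
```

On data that is less cleanly separated, different random starting centroids can lead to different final clusters, which is why the choice of initial centroids matters.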
You can play around with the K-means algorithm using the link below. Try it.
https://stanford.edu/class/engr108/visualizations/kmeans/kmeans.html
So what next? How do you choose the initial centroids randomly?
1. Elbow method
Steps:
Step 1: Run the clustering algorithm for different values of k, for example k = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
Step 2: For each k, calculate the within-cluster sum of squares (WCSS), i.e. the sum of the squared distances from each point to the centroid of its cluster.
Step 3: Plot the curve of WCSS against the number of clusters k.
Step 4: The location of the bend (the "elbow") in the plot is generally considered an indicator of the appropriate number of clusters.
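The elbow steps can be sketched as below. This is a hedged, dependency-free illustration: it includes its own small k-means (assuming random data points as initial centroids), it prints the WCSS values instead of plotting them, and it only tries k = 1 to 5 because the toy data has just six points. The names sq_dist, wcss and elbow_curve are made up for this example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    clusters = []
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def wcss(centroids, clusters):
    """Step 2: sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, c) for c, cl in zip(centroids, clusters) for p in cl)

def elbow_curve(points, k_values, seed=0):
    """Step 1: run k-means for every k and record the WCSS for each."""
    return {k: wcss(*kmeans(points, k, seed=seed)) for k in k_values}

points = [(0, 0), (1, 1), (2, 2), (100, 100), (101, 101), (102, 102)]
curve = elbow_curve(points, range(1, 6))
# Step 3: print the curve; in practice you would plot k against WCSS.
for k, value in curve.items():
    print(k, value)
```

For this toy data the WCSS falls from 30008.0 at k = 1 to 8.0 at k = 2 and can only improve a little after that, so the bend sits at k = 2, which matches the two visible groups.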
The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.