K Means Clustering Simplified in Python
- What Is K Means Clustering
- Implementation of K means Clustering
- WCSS And Elbow Method To find No. Of clusters
- Python Implementation of K means Clustering
K means is one of the most popular Unsupervised Machine Learning Algorithms Used for Solving Classification Problems. K Means segregates the unlabeled data into various groups, called clusters, based on having similar features, common patterns.
Table of Contents
- What Is Clustering
- What Is K Means Algorithm
- Diagrammatic Implementation of KMeans Clustering
- Choosing The Right Number of Cluster
- Python Implementation
1. What Is Clustering?
Suppose we have N number of Unlabeled Multivariate Datasets of various Animals like Dogs, Cats, birds etc. The technique to segregate Datasets into various groups, on basis of having similar features and characteristics, is being called Clustering.
The groups being Formed are being known as Clusters. Clustering Technique is being used in various Field such as Image recognition, Spam Filtering
Clustering is being used in Unsupervised Learning Algorithm in Machine Learning as it can be segregated multivariate data into various groups, without any supervisor, on basis of common pattern hidden inside the datasets.
2. What Is K Means Algorithm
Kmeans Algorithm is an Iterative algorithm that divides a group of n datasets into k subgroups /clusters based on the similarity and their mean distance from the centroid of that particular subgroup/ formed.
K, here is the pre-defined number of clusters to be formed by the Algorithm. If K=3, It means the number of clusters to be formed from the dataset is 3
Algorithm steps Of K Means
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the value of K, to decide the number of clusters to be formed.
Step-2: Select random K points which will act as centroids.
Step-3: Assign each data point, based on their distance from the randomly selected points (Centroid), to the nearest/closest centroid which will form the predefined clusters.
Step-4: place a new centroid of each cluster.
Step-5: Repeat step no.3, which reassign each datapoint to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to Step 7.
3. Diagrammatic Implementation of K Means Clustering
STEP 1:Let’s choose number k of clusters, i.e., K=2, to segregate the dataset and to put them into different respective clusters. We will choose some random 2 points which will act as centroid to form the cluster.
STEP 2: Now we will assign each data point to a scatter plot based on its distance from the closest K-point or centroid. It will be done by drawing a median between both the centroids. Consider the below image:
STEP 3: points left side of the line is near to blue centroid, and points to the right of the line are close to the yellow centroid. The left one Form cluster with blue centroid and the right one with the yellow centroid.
STEP 4:repeat the process by choosing a new centroid. To choose the new centroids, we will find the new center of gravity of these centroids, which is depicted below :
STEP 5: Next, we will reassign each datapoint to the new centroid. We will repeat the same process as above (using a median line). The yellow data point on the blue side of the median line will be included in the blue cluster
STEP 6: As reassignment has taken place, so we will repeat the above step of finding new centroids.
STEP 7: We will repeat the above process of finding the center of gravity of centroids, as being depicted below
STEP 8: After Finding the new centroids we will again draw the median line and reassign the data points, like the above steps.
STEP 9: We will finally segregate points based on the median line, such that two groups are being formed and no dissimilar point to be included in a single group
The final Cluster being formed are as Follows
4. Choosing The Right Number Of Clusters
The number of clusters that we choose for the algorithm shouldn’t be random. Each and Every cluster is formed by calculating and comparing the mean distances of each data points within a cluster from its centroid.
We Can Choose the right number of clusters with the help of the Within-Cluster-Sum-of-Squares (WCSS) method.
WCSS Stands for the sum of the squares of distances of the data points in each and every cluster from its centroid.
The main idea is to minimize the distance between the data points and the centroid of the clusters. The process is iterated until we reach a minimum value for the sum of distances.
To find the optimal value of clusters, the elbow method follows the below steps:
1 Execute the K-means clustering on a given dataset for different K values (ranging from 1-10).
2 For each value of K, calculates the WCSS value.
3 Plots a graph/curve between WCSS values and the respective number of clusters K.
4 The sharp point of bend or a point( looking like an elbow joint ) of the plot like an arm, will be considered as the best/optimal value of K
5. Python Implementation
Importing relevant libraries
import numpy as np import pandas as pd import statsmodels.api as sm import matplotlib.pyplot as plt import seaborn as sns sns.set() from sklearn.cluster import KMeans
Loading the Data
data = pd.read_csv('Countryclusters.csv') data
Plotting the data
plt.scatter(data['Longitude'],data['Latitude']) plt.xlim(-180,180) plt.ylim(-90,90) plt.show()
Selecting the feature
x = data.iloc[:,1:3] # 1t for rows and second for columns x
kmeans = KMeans(3) means.fit(x)
identified_clusters = kmeans.fit_predict(x) identified_clusters
array([1, 1, 0, 0, 0, 2])
data_with_clusters = data.copy() data_with_clusters['Clusters'] = identified_clusters plt.scatter(data_with_clusters['Longitude'],data_with_clusters['Latitude'],c=data_with_clusters['Clusters'],cmap='rainbow')
Trying different method ( to find no .of clusters to be selected)
WCSS and Elbow Method
wcss= for i in range(1,7): kmeans = KMeans(i) kmeans.fit(x) wcss_iter = kmeans.inertia_ wcss.append(wcss_iter) number_clusters = range(1,7) plt.plot(number_clusters,wcss) plt.title('The Elbow title') plt.xlabel('Number of clusters') plt.ylabel('WCSS')
we can choose 3 as no. of clusters, this method shows what is the good number of clusters.
With this, I finish this blog.
Hello Everyone, Namaste
My name is Pranshu Sharma and I am a Data Science Enthusiast
Thank you so much for taking your precious time to read this blog. Feel free to point out any mistake(I’m a learner after all) and provide respective feedback or leave a comment.
Email: [email protected]
The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.