K Means Clustering | Step-by-Step Tutorials for Clustering in Data Analysis
K-means is one of the most popular unsupervised machine learning algorithms for solving clustering problems in data science, and it is worth knowing well if you are aiming for a data scientist role. K-means segregates unlabeled data into groups, called clusters, based on similar features and common patterns. This tutorial covers the definition and applications of clustering, focusing on the K-means clustering algorithm and its implementation in Python. It also explains how to choose the optimal number of clusters for a dataset.
- Understand what the K-means clustering algorithm is.
- Develop a good understanding of the steps involved in implementing the K-Means algorithm and finding the optimal number of clusters.
- Implement K-means clustering in Python with the scikit-learn library.
This article was published as a part of the Data Science Blogathon.
What Is Clustering?
Suppose we have a dataset of N unlabeled multivariate data points describing various animals such as dogs, cats, and birds. The technique of segregating these data points into groups on the basis of similar features and characteristics is called clustering.
The groups being formed are known as clusters. Clustering techniques are used in various fields, such as image recognition and spam filtering. Clustering is itself an unsupervised learning technique in machine learning: it can segregate multivariate data into groups, without any supervision, on the basis of common patterns hidden in the data.
What Is K-Means Clustering Algorithm?
The k-means clustering algorithm is an iterative algorithm that divides a dataset of n data points into k clusters, assigning each point to the cluster whose centroid (the mean of the points in that cluster) is closest.
Here, K is the predefined number of clusters to be formed by the algorithm. If K=3, the algorithm forms 3 clusters from the dataset.
Implementation of the K-Means Algorithm
The implementation and working of the K-Means algorithm are explained in the steps below:
Step 1: Select the value of K to decide the number of clusters (n_clusters) to be formed.
Step 2: Select random K points that will act as cluster centroids (cluster_centers).
Step 3: Assign each data point, based on their distance from the randomly selected points (Centroid), to the nearest/closest centroid, which will form the predefined clusters.
Step 4: Compute the new centroid of each cluster as the mean of the data points currently assigned to it.
Step 5: Repeat step no.3, which reassigns each datapoint to the new closest centroid of each cluster.
Step 6: If any reassignment occurs, then go to step 4; else, go to step 7.
Step 7: Finish
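The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not how scikit-learn implements it: the function name `kmeans`, its parameters `max_iter` and `seed`, and the toy data are all made up for this example, and an empty cluster simply keeps its previous centroid.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means following Steps 1-7; returns (centroids, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 6: stop once no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two well-separated toy groups (made-up data for illustration).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(points, k=2)
```

On data this well separated, the first three points should end up in one cluster and the last three in the other. Which cluster receives label 0 depends on the random initialization.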
Diagrammatic Implementation of K-Means Clustering
Step 1: Let’s choose the number k of clusters, i.e., K=2, to segregate the dataset and put them into different respective clusters. We will choose some random 2 points which will act as centroids to form the cluster.
Step 2: Now, we will assign each data point on the scatter plot to its closest K-point, or centroid. This is done by drawing a line that is equidistant from both centroids (their perpendicular bisector), referred to below as the median line.
Step 3: Points on the left side of the line are nearer to the blue centroid, and points on the right side of the line are nearer to the yellow centroid. The left side forms a cluster with the blue centroid, and the right side forms one with the yellow centroid.
Step 4: Repeat the process by choosing new centroids. To choose the new centroids, we will find the center of gravity (the mean) of the data points in each cluster, as depicted below.
Step 5: Next, we will reassign each data point to the new centroid. We will repeat the same process as above (using a median line). The yellow data point on the blue side of the median line will be included in the blue cluster.
Step 6: As reassignment has occurred, we will repeat the above step of finding new k centroids.
Step 7: We will repeat the above process of finding the center of gravity of the data points in each of the k clusters, as depicted below.
Step 8: After finding the new k centroids, we will again draw the median line and reassign the data points, like the above steps.
Step 9: Finally, we segregate the points based on the median line so that two groups are formed and no point ends up in a group with dissimilar points.
The final cluster formed is like this:
Choosing the Optimal Number of Clusters
The number of clusters that we choose for the algorithm shouldn’t be random. Each and every cluster is formed by calculating and comparing the mean distances of each data point within a cluster from its centroid.
We can choose the right number of clusters with the help of the Within-Cluster-Sum-of-Squares (WCSS) method. WCSS stands for the sum of the squares of distances of the data points in each and every cluster from its centroid.
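The main idea is to minimize the distance (e.g., Euclidean distance) between the data points and the centroid of their cluster. For a fixed K, the algorithm iterates until the sum of squared distances stops decreasing. Because WCSS always falls as K grows (it reaches zero when every point is its own cluster), we do not pick the K with the lowest WCSS. Instead, we look for the K beyond which adding clusters yields only a small improvement.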
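To make the WCSS definition concrete, here is a minimal NumPy sketch that computes it for a given assignment. The helper name `wcss` and the four toy points are assumptions made for this illustration. scikit-learn exposes the same quantity for a fitted model as `inertia_`.

```python
import numpy as np

def wcss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    diffs = points - centroids[labels]  # row i: offset of point i from its centroid
    return float(np.sum(diffs ** 2))

# Four made-up points forming two obvious pairs.
points = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])

# K = 1: a single centroid at the overall mean.
one_cluster = wcss(points, np.array([0, 0, 0, 0]), np.array([[2.0, 1.0]]))

# K = 2: one centroid per pair.
two_clusters = wcss(points, np.array([0, 0, 1, 1]),
                    np.array([[0.0, 1.0], [4.0, 1.0]]))
```

Here `one_cluster` is 20.0 and `two_clusters` is 4.0, which shows how WCSS drops when a second, well-placed centroid is added.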
Here are the steps to follow in order to find the optimal number of clusters using the elbow method:
Step 1: Execute the K-means clustering on a given dataset for different K values (ranging from 1-10).
Step 2: For each value of K, calculate the WCSS value.
Step 3: Plot a graph/curve between WCSS values and the respective number of clusters K.
Step 4: The point where the curve bends sharply (like the elbow joint of an arm) is taken as the best/optimal value of K.
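The elbow is usually read off the plot by eye. If you want to automate Step 4, one simple heuristic (an assumption of this sketch, not part of the K-means algorithm) is to pick the K where the drop in WCSS slows down the most, that is, where the second difference of the curve is largest. The function name `elbow_k` and the WCSS values below are hypothetical.

```python
def elbow_k(k_values, wcss_values):
    """Return the K at which the WCSS curve bends most sharply.

    Heuristic: the bend at K is (drop before K) minus (drop after K);
    the K with the largest bend is returned. Endpoints cannot be chosen.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(wcss_values) - 1):
        drop_before = wcss_values[i - 1] - wcss_values[i]
        drop_after = wcss_values[i] - wcss_values[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k

# Made-up WCSS values for K = 1..6, for illustration only.
k_values = [1, 2, 3, 4, 5, 6]
wcss_values = [1000.0, 600.0, 150.0, 120.0, 100.0, 90.0]
chosen = elbow_k(k_values, wcss_values)
```

For these values `chosen` is 3, because the curve falls steeply up to K=3 and flattens afterwards. On real data the bend can be ambiguous, so it is worth checking the plot as well.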
Importing relevant libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
Loading the data
data = pd.read_csv('Countryclusters.csv')
data
Plotting the data
Python Code for K-Means Clustering:
Selecting the feature
x = data.iloc[:, 1:3]  # first slice selects the rows, second selects the columns
x
kmeans = KMeans(3)  # create a model with 3 clusters
kmeans.fit(x)
identified_clusters = kmeans.fit_predict(x)
identified_clusters
array([1, 1, 0, 0, 0, 2])
data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters
plt.scatter(data_with_clusters['Longitude'], data_with_clusters['Latitude'], c=data_with_clusters['Clusters'], cmap='rainbow')
WCSS and Elbow Method
wcss = []  # WCSS (inertia) for each value of K
for i in range(1, 7):
    kmeans = KMeans(i)
    kmeans.fit(x)
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)

number_clusters = range(1, 7)
plt.plot(number_clusters, wcss)
plt.title('The Elbow title')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
This method shows that 3 is a good number of clusters.
To summarize everything that has been stated so far, k-means clustering is a widely used unsupervised machine learning technique that enables the grouping of data into clusters based on similarity. It is a simple algorithm that can be applied to various domains and data types, including image and text data. k-means can be used for a variety of purposes. We can use it to perform dimensionality reduction also, where each transformed feature is the distance of the point from a cluster center.
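The dimensionality-reduction use mentioned above can be sketched with scikit-learn's `KMeans.transform`, which returns the distance from each sample to every cluster center. The random data, the choice of 3 clusters, and the variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Made-up data: 100 samples with 5 features each.
X = rng.normal(size=(100, 5))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Each sample is re-expressed as its distances to the 3 cluster centers,
# so the 5 original features become 3 transformed features.
X_reduced = kmeans.transform(X)
```

The result has one column per cluster, so the number of clusters sets the dimensionality of the transformed data.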
- K-means is a widely used unsupervised machine learning algorithm for clustering data into groups (also known as clusters) of similar objects.
- The objective is to minimize the sum of squared distances between the objects and their respective cluster centroids.
- The k-means clustering algorithm is limited in that it works best on compact, roughly spherical clusters and cannot handle complex, non-linear cluster shapes.
Frequently Asked Questions
Q1. What is n_init in K-means clustering?
A. n_init is an integer that sets the number of times the k-means algorithm is run independently, each time with different initial centroids. The run with the lowest WCSS is kept as the final result.
Q2. What are the advantages and disadvantages of K-means clustering?
A. Advantages of K-means clustering include its simplicity, scalability, and versatility, as it can be applied to a wide range of data types. Disadvantages include its sensitivity to the initial placement of centroids and its limitations in handling complex, non-linear data. K-means is also sensitive to outliers.
Q3. What is random_state in K-means clustering?
A. In K-means, random_state controls the random number generation used for centroid initialization. Passing an integer fixes the randomness, which helps when we want to produce the same clusters on every run.
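As a small illustration of these two parameters, the sketch below fits scikit-learn's `KMeans` twice with the same `random_state` and checks that the results match. The random data and the parameter values are assumptions chosen for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Made-up data: 60 points in two dimensions.
X = rng.normal(size=(60, 2))

# n_init=10 runs the algorithm 10 times with different starting centroids
# and keeps the best run; random_state=0 makes those starts reproducible.
first = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
second = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

same_labels = np.array_equal(first.labels_, second.labels_)
```

With a fixed `random_state`, `same_labels` should be `True`. Leaving `random_state` unset can give different cluster labels from run to run.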