Shivani Sharma — August 23, 2021
Beginner Clustering Data Science Machine Learning Python Unsupervised

This article was published as a part of the Data Science Blogathon


Clustering


The very first clustering algorithm that most people are exposed to is k-Means clustering. This is probably because it is very simple to understand; however, it has several disadvantages, which I will discuss later.

Clustering is generally viewed as an unsupervised method, so it is difficult to establish a good performance metric.

However, a lot of useful information can be extracted from this algorithm. The problem is how to assign semantics to each cluster and thus measure the “performance” of your algorithm. In many cases, a good way to proceed is to visualize your clusters. Obviously, if your data has high-dimensional features, as in many cases, visualization is not that easy. Let me suggest two approaches: one using k-Means and one using a different clustering algorithm.

  • k-Means: In this case, you can reduce the dimensionality of your data using, for example, PCA. With such an approach, you can plot the data in a 2D plot and then visualize your clusters (see the sketch after this list). However, what you see in this graph is a 2D projection of your data, so it may not be very accurate, but it can still give you an idea of how your clusters are distributed.
  • A self-organizing map (SOM) is a clustering algorithm based on neural networks that creates a discretized representation of the input training space, called a map, and is therefore also a dimensionality reduction technique. You can find a very nice Python package called somoclu that implements this algorithm and provides an easy way to visualize the result. This algorithm is very good for clustering, but here you do not get much to interpret or analyze.
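
As a rough sketch of the first approach (the data here is hypothetical random noise, just to keep the snippet self-contained): cluster in the full feature space, then project to 2D with PCA purely for plotting.

import numpy as np
import matplotlib.pyplot as pl
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# hypothetical high-dimensional data: 300 samples, 10 features
X = np.random.rand(300, 10)

# cluster in the original 10-dimensional space
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# project to 2D for visualization only
X2d = PCA(n_components=2).fit_transform(X)
pl.scatter(X2d[:, 0], X2d[:, 1], c=labels)
pl.show()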


Implementation


Let’s begin with the implementation of k-Means.

k-Means clustering is based on the idea of a centroid. The centroid of a set of data points $S = \{x_1, x_2, \ldots, x_n\}$ can be calculated as
\[ C = \frac{\sum_i x_i}{|S|} \]
where $|S|$ is the number of points in the data set. The centroid is essentially the geometric centre of the data set.
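
As a quick illustration of the formula, the centroid is just the per-coordinate mean of the points; a minimal NumPy sketch:

import numpy as np

# centroid C = sum of the points divided by |S|
S = np.array([[1.0, 3.0], [2.0, 1.0], [4.0, 2.0]])
C = S.sum(axis=0) / len(S)   # equivalent to S.mean(axis=0)
print(C)   # [2.333... 2.]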

The basic procedure of the k-Means algorithm is shown in the following pseudocode.
# initialize the centroids to random data points
for (c_i in C)
{
    c_i = random(S)
}
while (!convergence)
{
    # assign each point to the cluster of its nearest centroid
    clear(clust)
    for (x_i in S)
    {
        minDist = INFINITY
        for (c_j in C)
        {
            if (dist(x_i, c_j) < minDist)
            {
                minDist = dist(x_i, c_j)
                nearest = j
            }
        }
        clust[nearest].append(x_i)
    }
    # recompute centroids
    for (i in numClust)
    {
        c_i = getCentroid(clust[i])
    }
}

So there are three main parts of the k-Means algorithm:


  • Initializing centroids.
  • Assigning data points to clusters.
  • Recomputing the centroids.


1. Initializing Centroids

The first step is to initialize the centroids. You must know a priori how many clusters you are looking for (i.e. the k value). Each cluster has a centroid. It is common practice to randomly initialize the centroids. That is, randomly select k data points from your data set and call them the centroids for the k clusters. Alternatively, you could use any prior information that you may have about the locations of the clusters to make a ‘smarter’ initialization which could lead to faster convergence.
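
A minimal sketch of random initialization, assuming the data set S is a NumPy array of shape (n, d) (the helper name is mine, not from any library):

import numpy as np

def init_centroids(S, k, seed=None):
    # choose k distinct data points at random as the initial centroids
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(S), size=k, replace=False)
    return S[idx]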

2. Assigning Data Points to Clusters

In the next step, we have to assign every data point to a cluster. This is done by computing the distance between the data point in question and each centroid, and then assigning that data point to the cluster belonging to the closest centroid. You repeat this procedure for every data point, thus assigning each data point to a cluster.
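
Continuing the sketch from above, the assignment step can be vectorized with NumPy broadcasting (again, the helper name is mine):

import numpy as np

def assign_clusters(S, centroids):
    # pairwise distances between points and centroids, shape (n, k)
    dists = np.linalg.norm(S[:, None, :] - centroids[None, :, :], axis=2)
    # index of the closest centroid for each point
    return dists.argmin(axis=1)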

3. Recomputing the Centroids

Once each data point has been assigned to a cluster, the centroid of each cluster needs to be computed. This can be done using the centroid formula shown above.

In the algorithm, the initialization step is done once, and then the algorithm iterates between steps 2 and 3 until convergence occurs. You know that convergence has occurred when the centroids stop changing.
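
Putting the three steps together with the helpers sketched above gives a toy implementation (it ignores edge cases such as a cluster losing all of its points):

import numpy as np

def recompute_centroids(S, labels, k):
    # new centroid = mean of the points assigned to each cluster
    return np.array([S[labels == i].mean(axis=0) for i in range(k)])

def kmeans(S, k, max_iter=100):
    centroids = init_centroids(S, k)
    for _ in range(max_iter):
        labels = assign_clusters(S, centroids)
        new_centroids = recompute_centroids(S, labels, k)
        # convergence: the centroids stop changing
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, labels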

Example and Code


Consider the following data set, which was generated by sampling from three different normal distributions.

[Figure: scatter plot of the generated sample data]

Instead of coding the k-Means algorithm from scratch, the Python Scikit-learn package was used to implement k-Means clustering. The following plot shows the result.

[Figure: k-Means clustering result]

The above result was obtained with k set to 3. The black circles indicate the centroids. As you can see, the algorithm performs relatively well, with only a few data points placed in the incorrect cluster.

The following Python code was used in this example.

from sklearn.cluster import KMeans
import matplotlib.pyplot as pl
import numpy as np

def generateData():
    # sample from three different 2d normal distributions
    # (covariance matrices must be symmetric positive semi-definite)
    mean1 = [1, 3]
    cov1 = [[1, 0.5], [0.5, 1]]
    x1, y1 = np.random.multivariate_normal(mean1, cov1, 100).T
    mean2 = [7, 4]
    cov2 = [[1, 1], [1, 2]]
    x2, y2 = np.random.multivariate_normal(mean2, cov2, 100).T
    mean3 = [-2, -3]
    cov3 = [[0.2, 0.3], [0.3, 0.9]]
    x3, y3 = np.random.multivariate_normal(mean3, cov3, 100).T
    x = np.hstack((x1, x2, x3))
    y = np.hstack((y1, y2, y3))
    pl.scatter(x, y)
    pl.show()
    return np.vstack((x, y)).T

def main():
    # generate data
    data = generateData()
    # run the k-means algorithm
    kmeans = KMeans(init='random', n_clusters=3, n_init=10)
    kmeans.fit(data)
    # plot the cluster regions and centroids on a mesh grid
    h = .02
    x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
    y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    pl.figure(1)
    pl.clf()
    pl.imshow(Z, interpolation='nearest',
              extent=(xx.min(), xx.max(), yy.min(), yy.max()),
              cmap=pl.cm.Paired,
              aspect='auto', origin='lower')
    pl.plot(data[:, 0], data[:, 1], 'k.', markersize=5)
    centroids = kmeans.cluster_centers_
    pl.scatter(centroids[:, 0], centroids[:, 1],
               marker='x', s=169, linewidths=3,
               color='w', zorder=10)
    pl.xlim(x_min, x_max)
    pl.ylim(y_min, y_max)
    pl.xticks(())
    pl.yticks(())
    pl.show()

if __name__ == '__main__':
    main()

Disadvantages of k-Means Clustering


Although k-Means clustering performed well in the example, it has some major disadvantages.
  • The final clusters are highly dependent upon the choice of the initial centroids. Since the centroids are often randomly initialized, this can lead to different results each time you run the algorithm (see the sketch after this list).
  • The algorithm is quite computationally intensive. During each iteration, distance calculations need to be performed between each data point and each centroid. If you are dealing with a large data set and a high k value, this could be prohibitively expensive computationally.
  • You must know the k value a priori. It is often not obvious how many clusters are in a data set. Plotting your data set can give you some visual clues; however, data sets of more than three dimensions are particularly problematic since they cannot be visualized directly.
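
To illustrate the first disadvantage, this snippet (using scikit-learn's make_blobs for toy data) runs k-Means twice with a single random initialization each; the final inertia (within-cluster sum of squared distances) often differs between the runs:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data: 500 points drawn from 5 blobs
X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

for seed in (1, 2):
    km = KMeans(n_clusters=5, init='random', n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)   # different seeds can converge to different solutions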


Conclusion


Machine learning is one of the most popular technologies today, and there are many algorithms we can use for data analysis and prediction; k-Means is one of them. Its implementation is quite easy, and it is a simple way to understand how clustering works. Thanks for reading. Continue learning and stay tuned for more!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
