# Getting Started with K-Means Clustering Algorithm


## Clustering

The very first clustering algorithm that most people encounter is k-Means clustering, probably because it is so simple to understand. It does, however, have several disadvantages, which I will cover later.

Clustering is generally viewed as an unsupervised method, so it is difficult to establish a good performance metric.

However, a lot of useful information can still be extracted from this algorithm. The problem is how to assign semantics to each cluster and thus measure the “performance” of your algorithm. In many cases, a good way to proceed is to visualize your clusters. Obviously, if your data has high-dimensional features, as it often does, visualization is not that easy. Let me suggest two ways: one using k-Means and one using a different clustering algorithm.

**k-Means**: In this case, you can reduce the dimensionality of your data using, for example, PCA. You can then plot the data in 2D and visualize your clusters. Keep in mind that what you see in such a graph is a 2D projection of your data, so it may not be entirely accurate, but it can still give you an idea of how your clusters are distributed.

**A self-organizing map (SOM)** is a clustering algorithm based on neural networks that creates a discretized representation of the input training sample space, called a map, and is therefore also a dimensionality reduction technique. A very nice Python package called somoclu implements this algorithm and offers an easy way to visualize the result. SOMs cluster very well, but the drawback is that they leave you with little to interpret or analyze.
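As a minimal sketch of the first approach, assuming a hypothetical feature matrix `X` (not from the example below), you could cluster in the full feature space and project to 2D with scikit-learn's PCA purely for plotting:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# hypothetical high-dimensional data: 300 samples, 10 features
X = np.random.rand(300, 10)

# cluster in the original 10-dimensional space
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# project to 2D for visualization only
X_2d = PCA(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
plt.title("k-Means clusters projected onto the first two principal components")
plt.show()
```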

## Implementation

**Let’s begin with the implementation of k-means**

k-Means clustering is based on the idea of a centroid. The centroid of a set of data points $S = \{x_1, x_2, \ldots, x_n\}$ can be calculated as

$$C = \frac{\sum_i x_i}{|S|}$$

where $|S|$ is the number of points in the data set. The centroid is essentially the geometric centre of the data set.
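In NumPy this is just a column-wise mean; a minimal sketch with a made-up set of points:

```python
import numpy as np

# a hypothetical set S of five 2D points
S = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0], [0.0, 0.0]])

# centroid C = sum of the points divided by |S|, i.e. the mean along axis 0
C = S.sum(axis=0) / len(S)   # equivalently: S.mean(axis=0)
print(C)                     # [2. 2.]
```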

```
# initialize the centroids to random data points
for (j in numClust) {
    c_j = random(S)
}

while (!convergence) {
    # clear the cluster assignments from the previous iteration
    for (j in numClust) {
        clust[j] = []
    }
    # assign each point to the cluster with the nearest centroid
    for (x_i in S) {
        minDist = INFINITY
        for (j in numClust) {
            if (dist(x_i, c_j) < minDist) {
                minDist = dist(x_i, c_j)
                nearest = j
            }
        }
        clust[nearest].append(x_i)
    }
    # recompute the centroids
    for (j in numClust) {
        c_j = getCentroid(clust[j])
    }
}
```

So there are three main parts of the k-Means algorithm:

- Initializing centroids.
- Assigning data points to clusters.
- Recomputing the centroids.

### 1. Initializing Centroids

You must know *a priori* how many clusters you are looking for (i.e. the *k* value). Each cluster has a centroid. It is common practice to randomly initialize the centroids: that is, randomly select *k* data points from your data set and call them the centroids for the *k* clusters. Alternatively, you could use any prior information that you may have about the locations of the clusters to make a ‘smarter’ initialization, which could lead to faster convergence.
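A minimal sketch of random initialization in NumPy (the function and variable names here are illustrative, not from any library); scikit-learn's `KMeans` also offers a smarter scheme via `init='k-means++'`:

```python
import numpy as np

def init_centroids(S, k, seed=None):
    # pick k distinct data points from S to serve as the initial centroids
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(S), size=k, replace=False)
    return S[idx].copy()
```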

### 2. Assigning Data Points to a Cluster

Each data point is assigned to the cluster whose centroid is nearest to it, typically measured by Euclidean distance.

### 3. Recomputing the Centroids

Once every point has been assigned, each centroid is recomputed as the mean of the points in its cluster. Steps 2 and 3 repeat until the centroids stop moving, i.e. until convergence.
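Putting the three parts together, here is a minimal from-scratch sketch in NumPy (all names here are my own, not from any library):

```python
import numpy as np

def kmeans(S, k, n_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    # 1. initialize centroids to k random data points
    centroids = S[rng.choice(len(S), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # 2. assign each point to the nearest centroid
        dists = np.linalg.norm(S[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            S[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return centroids, labels
```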

## Example and Code

Consider the following data set, which was generated by sampling from three different normal distributions.

Instead of coding the k-Means algorithm from scratch, the Python Scikit-learn package was used to implement k-Means clustering. The following plot shows the result.

Here *k* was set to 3, and the black circles indicate the centroids. As you can see, the algorithm performs relatively well, with only a few data points placed in the incorrect cluster.

The following Python code was used in this example.

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as pl
import numpy as np

def generateData():
    # sample from three different 2D normal distributions
    mean1 = [1, 3]
    cov1 = [[1, 0.5], [0.5, 1]]
    x1, y1 = np.random.multivariate_normal(mean1, cov1, 100).T
    mean2 = [7, 4]
    # covariances adjusted to be symmetric positive semi-definite
    cov2 = [[1, 1], [1, 2]]
    x2, y2 = np.random.multivariate_normal(mean2, cov2, 100).T
    mean3 = [-2, -3]
    cov3 = [[0.2, 0.3], [0.3, 0.9]]
    x3, y3 = np.random.multivariate_normal(mean3, cov3, 100).T
    x = np.hstack((x1, x2, x3))
    y = np.hstack((y1, y2, y3))
    pl.scatter(x, y)
    pl.show()
    return np.vstack((x, y)).T

def main():
    # generate data
    data = generateData()
    # run the k-means algorithm
    kmeans = KMeans(init='random', n_clusters=3, n_init=10)
    kmeans.fit(data)
    # plot clusters and centroids
    h = .02
    x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
    y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # colour each mesh point by its predicted cluster
    Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    pl.figure(1)
    pl.clf()
    pl.imshow(Z, interpolation='nearest',
              extent=(xx.min(), xx.max(), yy.min(), yy.max()),
              cmap=pl.cm.Paired, aspect='auto', origin='lower')
    pl.plot(data[:, 0], data[:, 1], 'k.', markersize=5)
    centroids = kmeans.cluster_centers_
    pl.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=169,
               linewidths=3, color='w', zorder=10)
    pl.xlim(x_min, x_max)
    pl.ylim(y_min, y_max)
    pl.xticks(())
    pl.yticks(())
    pl.show()

if __name__ == '__main__':
    main()
```
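A note on the plotting trick used here: the code classifies every point of a fine mesh grid with `kmeans.predict` and draws the result with `imshow`, which paints each cluster's decision region (the Voronoi cell of its centroid) as a coloured background behind the data points.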

## Disadvantages of k-Means Clustering

- The final clusters are highly dependent on the initial choice of centroids. Since the centroids are often randomly initialized, the algorithm can produce different results each time it is run.
- The algorithm is quite computationally intensive: during each iteration, a distance must be computed between every data point and every centroid. With a large data set and a high *k* value, this can become prohibitively expensive.
- You must know the *k* value a priori. It is often not obvious how many clusters are in a data set. Plotting your data set can give some visual clues, but data sets of more than 3 dimensions are particularly problematic since they cannot be visualized directly. One common heuristic is the elbow method, sketched below.
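A minimal sketch of the elbow method with scikit-learn (not part of the original example): fit k-Means for a range of *k* values and plot the inertia (the within-cluster sum of squared distances); the “elbow” where the curve flattens suggests a reasonable *k*.

```python
import matplotlib.pyplot as pl
from sklearn.cluster import KMeans

def elbow_plot(data, k_max=10):
    # inertia_ is the within-cluster sum of squared distances to the centroids
    inertias = [KMeans(n_clusters=k, n_init=10).fit(data).inertia_
                for k in range(1, k_max + 1)]
    pl.plot(range(1, k_max + 1), inertias, 'o-')
    pl.xlabel('k')
    pl.ylabel('inertia')
    pl.show()
```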

## Conclusion

There are many algorithms we can use for data analysis and prediction, and k-Means is one of the most popular for clustering. Its implementation is simple, and it is an easy way to understand how clustering works. Thanks for reading. Continue learning and stay tuned for more!