*This article was published as a part of the Data Science Blogathon*

Clustering algorithms come in handy when the dataset in a problem statement is not labelled and therefore cannot be modelled using supervised learning techniques.

Among these unsupervised techniques, the K means algorithm is one of the most important; it clusters the data on the basis of similarity.

Therefore it becomes necessary for every aspiring **Data Scientist** and **Machine Learning Engineer** to have a good knowledge of these techniques.

In this article, we will discuss the most important questions on the K means clustering algorithm, covering everything from the fundamentals to more complex concepts. These questions will give you a clear understanding of the algorithm and are also helpful for **Data Science Interviews**.

**Let’s get started!**

K Means is a centroid-based clustering (unsupervised) technique. It partitions the dataset into k different clusters, where each cluster has a centroid point representing the mean of the data points lying in that cluster. The idea of the K-Means algorithm is to find k centroid points such that every point in the dataset belongs to the cluster whose centroid is at the minimum Euclidean distance from it.

It is an iterative approximation algorithm used to cluster the data points. The steps of this algorithm are as follows:
**Step-1: Initialization**

**Step-2: Assignment**

**Step-3: Updation of Centroid**

**Step-4: Repeat Steps 2 and 3 until convergence**


Randomly initialize k centroids by picking k points from the dataset.

For each observation in the dataset, calculate the Euclidean distance between the point and all centroids. Then, assign the observation to the cluster with the nearest centroid.

Now that the membership of the clusters has changed, update the value of each centroid with the new mean (the average of all observations in that cluster).

Repeat Steps 2 and 3 until the algorithm converges, then break the loop. Convergence refers to the condition where the previous values of the centroids are equal to the updated values after an algorithm run.
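The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's production implementation; the function name `kmeans`, the fixed seed, and the assumption that no cluster ever becomes empty are choices made here for the example:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialization -- pick k distinct data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assignment -- each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update -- each centroid becomes the mean of its cluster
        # (this sketch assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: convergence -- stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On two well-separated groups of points, the function recovers the groups regardless of which points are picked in Step 1.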

**For Example,** suppose we have 2 variables, age and salary, where age is in the range of 20 to 60 and salary is in the range of 100-150K. Since the scales of these variables are different, when they are substituted into the Euclidean distance formula, the variable on the larger scale suppresses the variable on the smaller scale, and the impact of age will not be captured clearly. Hence, you have to scale the variables to the same range using **Standard Scaler, Min-Max Scaler**, etc.
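As an illustration (the age and salary values below are made-up numbers), scikit-learn's `StandardScaler` brings both variables to zero mean and unit variance, so neither dominates the Euclidean distance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data: age (20-60) and salary (100K-150K) on very different scales
X = np.array([[25.0, 100_000.0],
              [40.0, 120_000.0],
              [60.0, 150_000.0]])

X_scaled = StandardScaler().fit_transform(X)
# each column now has mean 0 and standard deviation 1,
# so age and salary contribute comparably to Euclidean distances
```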

Euclidean distance is preferred over Manhattan distance since Manhattan distance measures distance only along axis-aligned (vertical or horizontal) directions, which imposes dimension restrictions. On the contrary, Euclidean distance can be used in any space to calculate the distances between data points. Since in the K means algorithm the data points can be present in any dimension, Euclidean distance is the more suitable option.

Let’s understand this with an example. Say we have **10 different data points** present; now consider the different cases:

**k=10:** For the maximum value of k, every point becomes its own cluster. The within-cluster sum of squares is zero, since each cluster contains only a single data point. So, at the maximum value of k, this quantity tends to zero.

**k=1:** For the minimum value of k, i.e., k=1, all the data points sit in one cluster, and having more points in the same cluster gives more variance, i.e., a larger within-cluster sum of squares.

**Between k=1 and k=10:** As you increase the value of k from 1 to 10, points move out into other clusters, and hence the total within-cluster sum of squares (inertia) comes down. This is why the plot mostly forms an elbow curve instead of other, more complex curves.

Hence, we can conclude that there does not exist any other possibility for the plot.
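This behaviour can be checked directly with scikit-learn's `inertia_` attribute (the within-cluster sum of squares); the 10 random points below are just an illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))  # 10 data points, as in the example

# inertia_ is the within-cluster sum of squares after fitting
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in (1, 5, 10)}
# inertia is largest at k=1, decreases as k grows,
# and drops to exactly 0 at k=10 (one point per cluster)
```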

There is no single correct answer to this question: since K means is an unsupervised learning technique, there is no ground-truth output column to compare against. As a result, one cannot get an accuracy number directly from the algorithm.

Centroid point is the point that acts as a representative of a particular cluster and is the average of all the data points in the cluster, which changes in each step (until convergence). The centroid can be calculated using the given formula:

**C_{i} = (1 / ||S_{i}||) * Σ x_{j}**, where the sum runs over all x_{j} in S_{i}

**where,**

**C_{i}:** centroid of the i-th cluster

**S_{i}:** set of all points belonging to cluster i with centroid C_{i}

**x_{j}:** j-th data point belonging to S_{i}

**||S_{i}||:** number of points in set-i
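In code, the centroid is simply the per-coordinate mean of the cluster's points (the three points below are arbitrary example values):

```python
import numpy as np

# points currently assigned to cluster i (arbitrary example values)
S_i = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])

# C_i = (1 / |S_i|) * sum of x_j over all x_j in S_i, i.e. the mean
C_i = S_i.sum(axis=0) / len(S_i)
print(C_i)  # [3. 4.]
```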

Yes, the final results of the K means algorithm depend on the centroid initialization, as poor initialization can cause the algorithm to get stuck in an inferior local minimum.

The objective of the K-Means algorithm is to find the k centroids (k = number of clusters) C_{1}, C_{2}, …, C_{k} which minimize the within-cluster sum of squares, i.e., the total sum, over each cluster, of the squared distances between its points and its centroid.

Minimizing this cost exactly is an **NP-hard problem** with **exponential time complexity**, so we settle for an approximation using Lloyd’s algorithm.

- Easy to understand and implement.
- Computationally efficient for both training and prediction.
- Guaranteed convergence.

- We need to provide the number of clusters as an input variable to the algorithm.
- It is very sensitive to the initialization process.
- Good at clustering when we are dealing with spherical cluster shapes, but it will perform poorly when dealing with more complicated shapes.
- Due to the leveraging of the euclidean distance function, it is sensitive to outliers.

The major challenge associated with K means clustering is its **initialization sensitivity**. While finding the initial centroids for K-Means clustering using Lloyd’s algorithm, we use randomization: the initial k centroids are picked randomly from the data points.

This Randomization in picking the k-centroids creates the problem of initialization sensitivity which tends to affect the final formed clusters. As a result, the final formed clusters depend on how initial centroids were picked.

There are two ways to avoid the problem of initialization sensitivity:

- **Repeat K means:** Run the algorithm again and again with different centroid initializations, then pick the run that results in the smallest intra-cluster distance and the largest inter-cluster distance.
- **K Means++:** A smart centroid initialization technique.

Amongst the above two techniques, K-Means++ is the best approach.

In k-means, we randomly initialize all k centroids, while in the k-means++ algorithm we initialize only the first centroid randomly and ensure that each subsequent centroid is far from the ones already chosen, which lowers the possibility of the centroids being poorly initialized. As a result, convergence is faster with K means++ clustering. Moreover, to use k-means++ initialization with the **Scikit-learn** library, we set the parameter **init='k-means++'** instead of **'random'**.
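A minimal scikit-learn sketch of this parameter (the two blobs below are synthetic data made up for the illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two made-up blobs centred near (0, 0) and (10, 10)
X = np.vstack([rng.normal(loc=0, size=(20, 2)),
               rng.normal(loc=10, size=(20, 2))])

# init='k-means++' is scikit-learn's smart seeding (also the default)
km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # roughly [0, 0] and [10, 10]
```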

K Means++ is a smart technique for centroid initialization that initializes one centroid at random and ensures the others are far away from the ones already chosen, resulting in faster convergence. The steps for centroid initialization are:

**Step-1:** Pick the first centroid point randomly.

**Step-2:** Compute the distance of every point in the dataset from the already selected centroids. The distance of point x_{i} from its nearest chosen centroid can be calculated as:

**d_{i} = min( ||x_{i} − c_{j}|| ) for j = 1, …, m**

**where,**

**d_{i}:** distance of x_{i} from its nearest already-chosen centroid

**m:** number of centroids already picked

**Step-3:** Choose the next centroid by sampling a point x_{i} with probability proportional to d_{i}^{2}, so far-away points are the most likely to be picked.

**Step-4:** Repeat Steps 2 and 3 until k centroids have been picked.
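The steps above can be sketched as follows. This is a minimal illustration of the seeding idea only (the function name `kmeans_pp_init` is chosen here); in practice scikit-learn's built-in `init='k-means++'` should be preferred:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: each next centroid is sampled with probability
    proportional to the squared distance to its nearest chosen centroid."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # Step 1: first centroid at random
    while len(centroids) < k:
        # Step 2: d_i = distance of each point to its nearest chosen centroid
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        # Step 3: sample the next centroid with probability proportional to d_i^2
        probs = d**2 / (d**2).sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)  # Step 4: repeat until k centroids are picked
```

With two far-apart groups of points, the second centroid almost surely lands in the group not containing the first one, which is exactly the behaviour the steps describe.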

Most people answer this question with just "the Elbow Method", but that explanation is only partially correct. To find the optimal value of k, we also need to observe the business problem carefully and analyze the business inputs, together with the people who work on that data, so that a decent idea of the optimal number of clusters can be extracted.

**For Example,** consider a shopkeeper selling a product who observes that some people buy it in summer, some in winter, and some in between. So the shopkeeper divides the customers into three categories. Therefore, K=3.

In cases where we do not get inference from the data directly we often use the following mentioned techniques:

- **Elbow Method –** This method plots the percentage of variance explained against the number of clusters K and finds the point of inflection (the elbow point) on that graph.
- **Silhouette Method –** This method calculates, for each data point, a similarity/dissimilarity score between its assigned cluster and the next best (i.e., nearest) cluster.

Moreover, there are also other techniques beyond the above-mentioned ones to find the optimal value of k.
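A short sketch of the Silhouette Method with scikit-learn, on synthetic data made up for the illustration (three well-separated groups, so the best k should come out as 3):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# three well-separated synthetic groups, so the "true" k is 3
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0, 5, 10)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # average silhouette score: higher means better-separated clusters
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the k with the highest silhouette score
```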

**Training complexity in terms of Big-O notation:**

**“K*I*N*M”**

where,

**K:** It represents the number of clusters

**I:** It represents the number of iterations

**N:** It represents the sample size

**M:** It represents the number of variables

**Conclusion:** Capping the number of iterations has a significant impact on the training time.

**Predicting complexity in terms of Big-O notation:**

**“K*N*M”**

For each record, prediction computes the distance to each cluster centroid and assigns the record to the nearest one.

When the K-Means algorithm has reached the local or global minima, it will not change the assignment of data points to clusters for two successive iterations during the algorithm run.

The K means clustering algorithm fails to give good results in the below-mentioned cases:

- When the dataset contains **outliers**
- When the **density spread** of data points across the data space is different
- When the data points follow a **non-convex shape**

The idea here is **mini-batch k means**, an alternative to the traditional k means clustering algorithm that provides better performance when training on larger datasets. It leverages mini-batches of data, taken at random, to update the cluster means with a decreasing learning rate. For each batch, the points are first assigned to a cluster and then the means are re-calculated; the cluster centres are then further refined using **gradient descent**. This algorithm converges faster than typical k-means, but with slightly different cluster output.
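A short sketch with scikit-learn's `MiniBatchKMeans` (the two-blob data below is synthetic, made up for the illustration):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# 1,000 synthetic points in two blobs centred near (0, 0) and (10, 10)
X = np.vstack([rng.normal(loc=0, size=(500, 2)),
               rng.normal(loc=10, size=(500, 2))])

# each update step uses a random mini-batch of 100 points
mbk = MiniBatchKMeans(n_clusters=2, batch_size=100, n_init=3,
                      random_state=0).fit(X)
print(mbk.cluster_centers_)  # roughly [0, 0] and [10, 10]
```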

The following can be used as possible stopping conditions in K-Means clustering:

- **Max number of iterations has been reached:** This condition limits the runtime of the clustering algorithm, but in some cases the quality of the clustering will be poor because of an insufficient number of iterations.
- **When RSS (within-cluster sum of squares) falls below a threshold:** This criterion ensures that the clustering is of the desired quality after termination. In real-life problems, it’s good practice to combine it with a bound on the number of iterations to guarantee convergence.
- **Convergence:** Points stay in the same cluster, i.e., the algorithm has converged at the minima.
- **Stability:** Centroids of new clusters do not change.
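In scikit-learn, the first two stopping conditions correspond to the `max_iter` and `tol` parameters of `KMeans` (the random data here is just for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))

# max_iter caps the number of Lloyd iterations; tol declares convergence
# once the centroids move less than the given threshold between iterations
km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10,
            random_state=0).fit(X)
print(km.n_iter_)  # iterations actually run, at most max_iter
```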

The number of variables going into the K means algorithm has an impact on both the time (during training) and the complexity (upon application), as well as on the behaviour of the algorithm. This is related to the **“Curse of dimensionality”**: as the dimensionality of the dataset increases, more and more examples become nearest neighbours of x_{t}, until the choice of nearest neighbour is effectively random.

A key concern is that the distance-based computations of K means are directly impacted by a large number of dimensions, since the distances between a data point and its nearest and farthest neighbours can become nearly equal in high dimensions, thereby reducing the accuracy of distance-based analysis tools.

Therefore, we have to use the **Dimensionality reduction** techniques such as **Principal component analysis (PCA)**, or** Feature Selection Techniques**.
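One common pattern is to scale, reduce with PCA, and then cluster. The sketch below uses a scikit-learn pipeline on made-up 50-dimensional data; the number of components kept (5) is an arbitrary choice for the example:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # synthetic high-dimensional data

# scale, project onto 5 principal components, then run K means on them
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=5),
                     KMeans(n_clusters=3, n_init=10, random_state=0))
labels = pipe.fit_predict(X)
```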

*Thanks for reading!*

I hope you enjoyed the questions and were able to test your knowledge about K means Clustering Algorithm.

If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the **Link**.

Please feel free to contact me on **LinkedIn** or **Email**.

Something not mentioned or want to share your thoughts? Feel free to comment below And I’ll get back to you.

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering at the **Indian Institute of Technology Jodhpur (IITJ)**. I am very enthusiastic about Machine Learning, Deep Learning, and Artificial Intelligence.

*The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.*
