*This article was published as a part of the Data Science Blogathon.*

K-means clustering is a popular and powerful unsupervised machine learning technique that groups data points based on their similarity or closeness to each other. How exactly do we cluster them? Which methods does K-means use to form the clusters? This article answers these questions. Before we begin, picture a simple clustering example in which the data points are grouped into 3 clusters based on their similarity or closeness. It is easy to interpret, right?

1. Introduction to K-Means

2. K-Means++ Algorithm

3. How to Choose the K Value in K-Means?

4. Practical Considerations in K-Means

5. Cluster Tendency

Let's understand K-means clustering with an everyday example. These days everybody loves to watch web series or movies on Amazon Prime or Netflix. Have you noticed that whenever you open Netflix, the movies are grouped together by genre, such as crime, suspense, and so on? Netflix's genre grouping is an easy way to understand clustering. Now let's look more closely at the K-means clustering algorithm.

**Definition:** K-means groups data points based on their similarity or closeness to each other. In simple terms, the algorithm finds the data points whose values are similar to each other, and these points then belong to the same cluster.

So how does the algorithm measure how similar two points are? It uses a 'distance measure', and in K-means that distance measure is the 'Euclidean distance'.

Observations that are closer or more similar to each other have a low Euclidean distance and are therefore clustered together.
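To make the distance measure concrete, here is a minimal sketch in Python. The Euclidean distance between two points is the square root of the sum of squared coordinate differences. The function name `euclidean_distance` and the sample points are made up for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical 2-D points for illustration
a = (1.0, 2.0)
b = (4.0, 6.0)
close = (1.5, 2.5)

print(euclidean_distance(a, b))      # 5.0
print(euclidean_distance(a, close))  # about 0.707, so `close` is more similar to `a`
```

The point with the smaller distance to `a` is the one that would more likely end up in the same cluster as `a`.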

One more concept you need to know to understand K-means is the 'centroid'. The centroid of a cluster is the mean of all the points in that cluster, and the K-means algorithm uses centroids to create 'k clusters'.
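As a small illustration, the centroid can be computed as the coordinate-wise mean of the points in a cluster. This is only a sketch: the helper name `centroid` and the sample cluster are assumptions made for the example.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    if n == 0:
        raise ValueError("cannot compute the centroid of an empty cluster")
    # zip(*points) groups all x-coordinates together, all y-coordinates together, etc.
    return tuple(sum(coords) / n for coords in zip(*points))

# Hypothetical cluster of three 2-D points
cluster = [(1.0, 1.0), (2.0, 3.0), (3.0, 5.0)]
print(centroid(cluster))  # (2.0, 3.0)
```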

You are now ready to understand the steps in the K-means clustering algorithm.

Step 1: Choose a value for k, for example k = 2.

Step 2: Initialize the centroids randomly.

Step 3: Calculate the Euclidean distance from each data point to each centroid, and assign every point to the cluster of its closest centroid.

Step 4: Find the centroid of each cluster and update the centroids.

Step 5: Repeat steps 3 and 4.

Each time the clusters are formed, the centroids are updated: the updated centroid is the center (mean) of all the points that fall in the cluster. This process continues until the centroids no longer change, i.e. the solution converges.
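The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, and the sample data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means on an (n_samples, n_features) float array."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: distance from every point to every centroid, shape (n, k),
        # then assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (a centroid with no points is left where it is)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two small, well-separated groups of points
X = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
    [8.0, 8.0], [9.0, 9.5], [8.5, 9.0],
])
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```

Which cluster gets label 0 and which gets label 1 depends on the random initialization, so only the grouping itself is meaningful.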

You can play around with the K-means algorithm using the link below. Try it!

https://stanford.edu/class/engr108/visualizations/kmeans/kmeans.html

So what next? How do you choose the initial centroids?

This is where the k-means++ algorithm comes in.

Don't worry, it is very easy to understand. So what is k-means++? Let's say we want to choose two centroids initially (k = 2). You choose the first centroid at random, typically by picking one of the data points at random. Simple, right? Our next task is to choose another centroid. How do you choose it? Any idea?

We choose the next centroid from the data points so that points far away from the existing centroid have a higher chance of being picked. More precisely, in k-means++ each point is picked with probability proportional to its squared distance from the nearest centroid already chosen.
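Here is a rough sketch of that initialization in NumPy, assuming the usual k-means++ rule of sampling each new centroid with probability proportional to the squared distance to the nearest centroid chosen so far. The function name `kmeans_pp_init`, the `seed` parameter, and the sample data are illustrative assumptions, and the sketch assumes the data contains at least k distinct points.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X, favoring far-apart points."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest already-chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Far-away points get a proportionally higher chance of being picked;
        # points that are already centroids have distance 0 and are never re-picked
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

# Hypothetical data: two small groups of points
X = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
    [8.0, 8.0], [9.0, 9.5], [8.5, 9.0],
])
print(kmeans_pp_init(X, k=2))
```

Because the selection is random, a far-away point is likely but not guaranteed to be chosen as the second centroid.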

**How to Choose the K Value in K-Means?**

__1. Elbow method__

__Steps:__

Step 1: Run the clustering algorithm for different values of k, for example k = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].

Step 2: For each k, calculate the within-cluster sum of squares (WCSS).

Step 3: Plot the curve of WCSS against the number of clusters k.

Step 4: The location of the bend (the 'elbow') in the plot is generally considered an indicator of the appropriate number of clusters.
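The elbow steps above can be tried out with scikit-learn, which exposes the WCSS of a fitted `KMeans` model as the `inertia_` attribute. This is only a sketch on synthetic data: the three blob centers, the sample size, and the random seeds are assumptions chosen for illustration, and the plotting step is left as printed values.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (centers chosen for illustration)
X, _ = make_blobs(
    n_samples=300,
    centers=[(0.0, 0.0), (6.0, 6.0), (0.0, 6.0)],
    cluster_std=0.6,
    random_state=0,
)

# Steps 1 and 2: fit K-means for k = 1..10 and record the WCSS for each k
wcss = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = model.inertia_

# Step 3: inspect (or plot) WCSS against k
for k, value in wcss.items():
    print(k, round(value, 1))
```

For data like this, the WCSS should drop sharply up to k = 3 and flatten afterwards, which is the bend that step 4 looks for. Plotting `wcss` with a library such as matplotlib makes the bend easier to see.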

**Practical Considerations in K-Means**

- Choosing the number of clusters (K) in advance.
- Standardization of the data (scaling).
- Categorical data (can be handled with K-Modes).
- Impact of initial centroids and outliers.
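The scaling point matters because K-means relies on Euclidean distance, so a feature measured on a large scale can dominate the distance. A minimal sketch of z-score standardization with NumPy follows; the age and income values are made-up illustration data.

```python
import numpy as np

# Hypothetical data: column 0 is age in years, column 1 is income in dollars
X = np.array([
    [25.0, 30_000.0],
    [32.0, 48_000.0],
    [47.0, 52_000.0],
    [51.0, 95_000.0],
])

# z-score standardization: zero mean and unit standard deviation per column
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

Without scaling, the income column would contribute almost all of the distance between two rows; after scaling, both columns contribute on a comparable footing.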

Before we apply a clustering algorithm to the given data, it is important to check whether the data contains meaningful clusters at all. The process of evaluating whether a dataset is suitable for clustering is known as 'clustering tendency'. So we should not blindly apply a clustering method; we should check the clustering tendency first. How?

We use the 'Hopkins statistic' to decide whether or not to perform clustering on a given dataset. It examines whether the data points differ significantly from uniformly distributed data in the multidimensional space.
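One common way to compute the Hopkins statistic is sketched below. It compares nearest-neighbor distances of real data points with those of points drawn uniformly from the data's bounding box. Conventions differ between sources: this sketch assumes the form in which values near 0.5 suggest uniformly spread data and values close to 1 suggest clustered data, and it uses plain distances (some formulations raise them to the power of the number of dimensions). The function name `hopkins_statistic` and its parameters are illustrative assumptions.

```python
import numpy as np

def hopkins_statistic(X, m=None, seed=0):
    """Hopkins statistic of an (n_samples, n_features) array, using m sample points."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)
    # w: distance from m sampled real points to their nearest *other* real point
    idx = rng.choice(n, size=m, replace=False)
    dist_real = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    dist_real[np.arange(m), idx] = np.inf  # ignore each point's distance to itself
    w = dist_real.min(axis=1)
    # u: distance from m uniform random points (inside the data's bounding box)
    # to their nearest real point
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(uniform[:, None, :] - X[None, :, :], axis=2).min(axis=1)
    return u.sum() / (u.sum() + w.sum())

# Hypothetical data sets: one clearly clustered, one spread uniformly
rng = np.random.default_rng(1)
clustered = np.vstack([
    rng.normal((0.0, 0.0), 0.3, size=(100, 2)),
    rng.normal((5.0, 5.0), 0.3, size=(100, 2)),
])
uniform_data = rng.uniform(0.0, 5.0, size=(200, 2))

print(hopkins_statistic(clustered))     # expected to be well above 0.5
print(hopkins_statistic(uniform_data))  # expected to be roughly around 0.5
```

Under this convention, a value well above 0.5 is a hint that clustering the data is worthwhile, while a value near 0.5 suggests there may be no meaningful clusters to find.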

This concludes our article on the K-means clustering algorithm. In my next article, I will talk about the Python implementation of the K-means clustering algorithm.

Thank you!
