Guide to K-Nearest Neighbors Algorithm in Machine Learning

tavish 28 Aug, 2024

8 min read

Introduction

In the four years of my data science career, I have built more than 80% of classification models and just 15-20% of regression models. These ratios can be more or less generalized throughout the industry. The reason behind this bias towards classification models is that most analytical problems involve making decisions. In this article, we will talk about one such widely used machine learning classification technique called the k-nearest neighbors (KNN) algorithm. Our focus will primarily be on how the algorithm works on new data and how the input parameter affects the output/prediction.

Note: People who prefer to learn through videos can learn the same through our free course – K-Nearest Neighbors (KNN) Algorithm in Python and R. And if you are a complete beginner to Data Science and Machine Learning, check out our Certified BlackBelt program –

Certified AI & ML Blackbelt+ Program

Learning Objectives

Understand the working of KNN and how it operates in python and R.
Get to know how to choose the right value of k for KNN
Understand the difference between training error rate and validation error rate

Introduction
What is KNN (K-Nearest Neighbor) Algorithm in Machine Learning?
When Do We Use the KNN Algorithm?
How Does the KNN Algorithm Work?
How Do We Choose the Factor K?
Breaking It Down – Pseudo Code of KNN
Implementation in Python From Scratch
Comparing Our Model With Scikit-learn
Implementation of KNN in R
Comparing Our KNN Predictor Function With “Class” Library
Conclusion
Frequently Asked Questions

What is KNN (K-Nearest Neighbor) Algorithm in Machine Learning?

The K-Nearest Neighbors (KNN) algorithm is a popular machine learning technique used for classification and regression tasks. It relies on the idea that similar data points tend to have similar labels or values.

During the training phase, the KNN algorithm stores the entire training dataset as a reference. When making predictions, it calculates the distance between the input data point and all the training examples, using a chosen distance metric such as Euclidean distance.

Next, the algorithm identifies the K nearest neighbors to the input data point based on their distances. In the case of classification, the algorithm assigns the most common class label among the K neighbors as the predicted label for the input data point. For regression, it calculates the average or weighted average of the target values of the K neighbors to predict the value for the input data point.

The KNN algorithm is straightforward and easy to understand, making it a popular choice in various domains. However, its performance can be affected by the choice of K and the distance metric, so careful parameter tuning is necessary for optimal results.

When Do We Use the KNN Algorithm?

KNN Algorithm can be used for both classification and regression predictive problems. However, it is more widely used in classification problems in the industry. To evaluate any technique, we generally look at 3 important aspects:

1. Ease of interpreting output

2. Calculation time

3. Predictive Power

Let us take a few examples to place KNN in the scale :

KNN classifier fairs across all parameters of consideration. It is commonly used for its ease of interpretation and low calculation time.

How Does the KNN Algorithm Work?

Let’s take a simple case to understand this algorithm. Following is a spread of red circles (RC) and green squares (GS):

You intend to find out the class of the blue star (BS). BS can either be RC or GS and nothing else. The “K” in KNN algorithm is the nearest neighbor we wish to take the vote from. Let’s say K = 3. Hence, we will now make a circle with BS as the center just as big as to enclose only three data points on the plane. Refer to the following diagram for more details:

The three closest points to BS are all RC. Hence, with a good confidence level, we can say that the BS should belong to the class RC. Here, the choice became obvious as all three votes from the closest neighbor went to RC. The choice of the parameter K is very crucial in this algorithm. Next, we will understand the factors to be considered to conclude the best K.

How Do We Choose the Factor K?

First, let us try to understand the influence of the K-nearest neighbors (KNN) in the algorithm. If we consider the last example, keeping all 6 training observations constant, a given K value allows us to establish boundaries for each class. These decision boundaries effectively segregate, for instance, RC from GS. Similarly, let’s examine the impact of the value “K” on these class boundaries. The following illustrates the distinct boundaries that separate the two classes, each corresponding to different values of K.

If you watch carefully, you can see that the boundary becomes smoother with increasing value of K. With K increasing to infinity it finally becomes all blue or all red depending on the total majority. The training error rate and the validation error rate are two parameters we need to access different K-value. Following is the curve for the training error rate with a varying value of K :

As you can see, the error rate at K=1 is always zero for the training sample. This is because the closest point to any training data point is itself.Hence the prediction is always accurate with K=1. If validation error curve would have been similar, our choice of K would have been 1. Following is the validation error curve with varying value of K:

This makes the story more clear. At K=1, we were overfitting the boundaries. Hence, error rate initially decreases and reaches a minima. After the minima point, it then increase with increasing K. To get the optimal value of K, you can segregate the training and validation from the initial dataset. Now plot the validation error curve to get the optimal value of K. This value of K should be used for all predictions.

The above content can be understood more intuitively using our free course – K-Nearest Neighbors (KNN) Algorithm in Python and R

Breaking It Down – Pseudo Code of KNN

We can implement a KNN model by following the below steps:

Load the data
Initialise the value of k
For getting the predicted class, iterate from 1 to total number of training data points
- Calculate the distance between test data and each row of training dataset. Here we will use Euclidean distance as our distance metric since it’s the most popular method. The other distance function or metrics that can be used are Manhattan distance, Minkowski distance, Chebyshev, cosine, etc. If there are categorical variables, hamming distance can be used.
- Sort the calculated distances in ascending order based on distance values
- Get top k rows from the sorted array
- Get the most frequent class of these rows
- Return the predicted class

Implementation in Python From Scratch

We will be using the popular Iris dataset for building our KNN model. You can download it from here.

import pandas as pd
data = pd.read_csv("iris.csv")
data.head()

Comparing Our Model With Scikit-learn

from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(data.iloc[:,0:4], data['Name'])

# Predicted class
print(neigh.predict(test))

-> ['Iris-virginica']

# 3 nearest neighbors
print(neigh.kneighbors(test)[1])
-> [[141 139 120]]

We can see that both the models predicted the same class (‘Iris-virginica’) and the same nearest neighbors ( [141 139 120] ). Hence we can conclude that our model runs as expected.

Implementation of KNN in R

Step 1: Importing the data
Step 2: Checking the data and calculating the data summary

Output

#Top observations present in the data
SepalLength SepalWidth PetalLength PetalWidth Name
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa

#Check the dimensions of the data
[1] 150 5

#Summarise the data
SepalLength SepalWidth PetalLength PetalWidth Name
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 Iris-setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 Iris-versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 Iris-virginica :50
Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

Step 3: Splitting the Data

Step 4: Calculating the Euclidean Distance

Step 5: Writing the function to predict kNN
Step 6: Calculating the label(Name) for K=1

Output

For K=1
[1] "Iris-virginica"

In the same way, you can compute for other values of K.

Comparing Our KNN Predictor Function With “Class” Library

Output

For K=1
[1] "Iris-virginica"

We can see that both models predicted the same class (‘Iris-virginica’).

Conclusion

The KNN algorithm is one of the simplest classification algorithms. Even with such simplicity, it can give highly competitive results. KNN algorithm can also be used for regression problems. The only difference from the discussed methodology will be using averages of nearest neighbors rather than voting from k-nearest neighbors. KNN can be coded in a single line on R. I am yet to explore how we can use the KNN algorithm on SAS.

Key Takeaways

KNN classifier operates by finding the k nearest neighbors to a given data point, and it takes the majority vote to classify the data point.
The value of k is crucial, and one needs to choose it wisely to prevent overfitting or underfitting the model.
One can use cross-validation to select the optimal value of k for the k-NN algorithm, which helps improve its performance and prevent overfitting or underfitting. Cross-validation is also used to identify the outliers before applying the KNN algorithm.
The above article provides implementations of KNN in Python and R, and it compares the result with scikit-learn and the “Class” library in R.

Frequently Asked Questions

Q1. What is K nearest neighbors algorithm?

A. KNN classifier is a machine learning algorithm used for classification and regression problems. It works by finding the K nearest points in the training dataset and uses their class to predict the class or value of a new data point. It can handle complex data and is also easy to implement, which is why KNN has become a popular tool in the field of artificial intelligence.

Q2. What is KNN algorithm used for?

A. KNN algorithm is most commonly used for:
1. Disease prediction – Predicting the likelihood of diseases based on symptoms.
2. Handwriting recognition – Recognizing handwritten characters.
3. Image classification – Categorizing and recognizing images.

Q3. What is the difference between KNN and Artificial Neural Networks?

A. K-nearest neighbors (KNN) are mainly used for classification and regression problems, while Artificial Neural Networks (ANN) are used for complex function approximation and pattern recognition problems. Moreover, ANN has a higher computational cost than KNN.

tavish 28 Aug, 2024

Algorithm Big data Business Analytics Classification Intermediate

Responses From Readers

Harshal 10 Oct, 2014

Useful article. Can you share similar article for randomforest ? What are limitations with data size for accuracy?

Show 1 reply

Tavish Srivastava 10 Oct, 2014

Harshal, We have already published many articles on random forest. Here is the link of the article on random forest on similar lines http://www.analyticsvidhya.com/blog/2014/06/introduction-random-forest-simplified/. You can also subscribe to analyticsvidhya to get access to weekly updates on such articles.

saurabh 10 Oct, 2014

Good one please share the value of Red circle and green square

Show 1 reply

Tavish Srivastava 10 Oct, 2014

Saurabh, The first graph is for illustrating purposes. You can create a random dataset to check the algorithm.

DEBASHIS ROUT 10 Oct, 2014

I am currently doing part time MS in BI & Data Mining. I found this article is really helpful to understand in more detail and expecting to utilize in my upcoming project work. I need to know do you have any article on importance of Data quality in BI , Classification & Decision Tree.

Show 1 reply

Tavish 10 Oct, 2014

Debashish, We have published many articles on CART models before. Here is a link which will give you a kick start http://www.analyticsvidhya.com/blog/2014/06/comparing-random-forest-simple-cart-model/.

saraswathi 10 Oct, 2014

Hello the article is very clear and precise. I would like some clarification on the single line "To get the optimal value of K, you can segregate the training and validation from the initial dataset.". Do you mean that we segregate those points on the border of the boundaries for validation and keep the remaining for training. This is not very clear to me. Can you please elaborate ? Thanks.

Show 1 reply

Tavish Srivastava 10 Oct, 2014

Saraswathi, Here is what I meant : Take the entire population, and randomly split it into two. Now on the training sample, score each validation observation with different k-values. The error curve will give you the best value of k. Hope it becomes clear now.

Harvey S 10 Oct, 2014

Nice tease: "KNN can be coded in a single line on R. " Can you give an example?

Show 1 reply

Tavish Srivastava 10 Oct, 2014

We will cover this piece in our coming articles. Stay tuned.

Felix 13 Oct, 2014

Hi, great post, thanks. I would like to add, that "low calculation time" is not true for the prediction phase with big, high dimensional datasets. But it's still a good choice in many applications.

Show 1 reply

Tavish 13 Oct, 2014

Felix, You are probably right for cases where the distance between observations comparable in the large dataset. But in general population have natural clusters which makes the calculation faster. Let me know in case you disagree. Tavish

kabir singh 15 Oct, 2014

I am trying to figure out churn analysis, any suggestions where I am start looking? BTW following this website for 6-8 months now, you guys are doing an amazing job

Najma Naaz 10 Jun, 2015

That was very helpful. Thank you! Can you please share a concise article on neural nets and deep learning as well?

Tiago 21 Jun, 2016

Thank you.

Charles 12 Feb, 2017

Very useful. it was very explanatory. Thanks for that. Can you please post about adabooster algorithm?

Show 1 reply

Aishwarya Singh 08 Oct, 2018

You'll find it here : https://www.analyticsvidhya.com/blog/2015/11/quick-introduction-boosting-algorithms-machine-learning/

thue xe du lich gia re 26 Jun, 2017

Quality articles or reviews is the secret to invite the visitors to visit the website, that's what this web site is providing.

Emanuel Fakhar 23 Jul, 2017

The world of DS would be so boring and exaggerated without you guys. Anything I study, I get a better perspective from this site. And you are so generous and grounded compared to idiots here in UK. God bless.

Stonehead Park 24 Sep, 2017

Excellent post, I appreciate your effort. :)

Just81100 28 Sep, 2017

KNN is fast to train but the prediction speed grows exponentially with the data set size and his complexity rather than Random forest...

Tushar Pandit 05 Oct, 2017

Nice article. Do you have anything which can focus more on l0,l1,l2 and l-infinity norms(distances) with respect to KNN?

Soumya Shreya 27 Mar, 2018

It is a really nice and well explained article, I am a beginner in the field of data science and machine learning and I find these articles really helpful to learn and understand the algorithms. Thanks for publishing them. Can you suggest me some datasets where I can experiment and apply KNN.

Show 1 reply

Aishwarya Singh 29 Mar, 2018

Hi Soumya, You can use the Cancer dataset to practice kNN. Refer this link for the same.

Krishna 27 Mar, 2018

How do we handle categorical features with kNN? Do we need to create dummies for them? Do you suggest any other distance method other than euclidean distance if we have more number of features? I feel we have to treat outliers as they may have impact on the distances, similarly missing values.. Please share your opinion.

Show 1 reply

Aishwarya Singh 28 Mar, 2018

Hi, Yes you can create dummies for categorical variables in kNN. Apart from Euclidean distance, there are other methods that can be used to find the distance such as Manhattan or Minkowski. For outliers adn missing value treatment, you can refer this article .

Aanish Singla 28 Mar, 2018

IMO limitation of KNN comes into play when dimensions increase because in higher dimensions, finding neighbors which are quite close to each other in all dimensions might be tough, hence so called neighbors might be really far apart from each other which defeats the purpose of the algorithm. Kindly share your thoughts/experiences.

Show 1 reply

Aishwarya Singh 29 Mar, 2018

Hi Aanish, Thank you for sharing your thoughts.

Amlesh Kanekar 25 Apr, 2018

I found it "inspiring". Have spent last 4 months learning linear algebra, statistics, python. This learning list was culled from analyticsvidhya.com. Now just glimpsing through your article gives me the confidence to code knn from scratch. Thank you!

Show 1 reply

Aishwarya Singh 25 Apr, 2018

Hi Amlesh, Glad you found this useful!

Abdul Rehman 22 Jul, 2018

Hi, Can you guide me how i can design that code for image data? i am waiting for your reply. Thank you

Show 1 reply

Uzair Ghaffar 17 Jul, 2024

Try Chat-Gpt

Pratik 30 Aug, 2018

Hi, Thanks for the clean article really helped in understanding KNN. I am working on a project for predictive web user navigation. Something like system learns the repetitive user interactions(adjustable nGram value) depending upon predicting what user would do next, So that system can assist). Any pointers what graph Algos i should be considering? Thanks

Max 06 Oct, 2018

That was very helpful. Thank you! How to make the same visualization as in the pictures in section "How do we choose the factor K" ?

Show 1 reply

Aishwarya Singh 08 Oct, 2018

Hi Max, For this, you will have to use a for loop. For each value of k, calculate the validation error and store in a separate list. Then plot these validation error values against k values.

Amrut Jadhav 29 Oct, 2018

Nice article. I think it would be better to explain some disadvantages of k-nearest algorithm.

What is a KNN (K-Nearest Neighbors)? | Unite.AI 24 Aug, 2020

[…] learning technique and algorithm that can be used for both regression and classification tasks. K-Nearest Neighbors examines the labels of a chosen number of data points surrounding a target data point, in order to […]

Priscilia 19 Apr, 2022

hello thanks for this article it is helpful. please can you share with the case where we have mutual k nearest neighbor

abdul rehman 16 Nov, 2023

I am continually amazed by the amount of information available on this subject. What you presented was well researched and well worded in order to get your stand on this across to all your readers. 中文補習

abdul rehman 01 Dec, 2023

Our administrations are accessible in all urban communities of Minneapolis and its environment. We are having long periods of experience which empowered our clients and customers to confide in our administrations. 佐敦補習

abdul rehman 01 Dec, 2023

I would state, you do the genuinely amazing.This substance is made to a wonderful degree well. Pigeon Forge Magic Shows

Guide to K-Nearest Neighbors Algorithm in Machine Learning

Introduction

Table of contents

What is KNN (K-Nearest Neighbor) Algorithm in Machine Learning?

When Do We Use the KNN Algorithm?

How Does the KNN Algorithm Work?

How Do We Choose the Factor K?

Breaking It Down – Pseudo Code of KNN

Implementation in Python From Scratch

Comparing Our Model With Scikit-learn

Implementation of KNN in R

Comparing Our KNN Predictor Function With “Class” Library

Conclusion

Frequently Asked Questions

Frequently Asked Questions

Responses From Readers

Write for us

Guide to K-Nearest Neighbors Algorithm in Machine Learning

Introduction

Table of contents

What is KNN (K-Nearest Neighbor) Algorithm in Machine Learning?

When Do We Use the KNN Algorithm?

How Does the KNN Algorithm Work?

How Do We Choose the Factor K?

Breaking It Down – Pseudo Code of KNN

Implementation in Python From Scratch

Comparing Our Model With Scikit-learn

Implementation of KNN in R

Comparing Our KNN Predictor Function With “Class” Library

Conclusion

Frequently Asked Questions

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

NaÃ¯ve Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Frequently Asked Questions

Responses From Readers

Write for us