How to treat outliers in a data set?

snehal_bm Last Updated : 08 Jul, 2021

This article was published as a part of the Data Science Blogathon

Introduction

When we started our data science journey with a first data set such as the iris data set, we did not have to do any data cleaning. Real-world data sets, however, are far from perfect. They have many shortcomings that should be dealt with before fitting any model. If the data is not treated well, the results may be biased and unreliable. This is where exploratory data analysis comes into the picture.

Exploratory data analysis involves multiple steps, such as identifying all the variables and their data types, univariate and bivariate analysis, handling missing values, and dealing with outliers. It is advisable never to skip this step during model building. One of its most important steps is outlier detection. Outliers are extreme values that do not match the rest of the data points. They might have made their way into the dataset through various errors. There are numerous ways to treat outliers, and the best method depends on the dataset.

Let us look at all the steps involved in understanding outliers and dealing with them.

What are outliers?

“A celebrity in the crowd of commoners is an outlier”

Image Source : Google Images https://wallhere.com/en/wallpaper/253405

The above statement might have given a fair clue about what outliers are. Anomalies, or outliers, are data points that lie at a great distance from the rest of the data, such as a sudden increase or decrease by many folds. Put simply, an outlier is a value that lies outside the range of the other values in the dataset. For example, while measuring the body temperature of patients in a hospital, there was an entry of 988 degrees Fahrenheit, which is clearly incorrect. A decimal point might be missing: it should have been 98.8 instead of 988.

Another example: while measuring the weights of high school students, there was an entry with a weight of 1234, which is highly unlikely. It could be a data entry error. An outlier is not necessarily an erroneous entry; in some cases it could be the genuine result of an experiment, and it is up to the data scientist to decide. What counts as an outlier depends on the business problem and can change from case to case. It is always best to discuss with the business stakeholders before terming a data point an outlier. Outliers need special attention so that they don't cause issues in the model results.

How do they affect the calculation? Biases due to outliers

If the outliers are not treated early, during exploratory data analysis, they can lead to biases in the results. Such biases can lead to poor business decisions and ultimately a loss to the business.
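To make the effect concrete, here is a minimal sketch that compares the mean and the median with and without a single extreme entry. The weights are made up for illustration (they are not the article's student table), and 588 stands in for a data entry error.

```python
from statistics import mean, median

# Hypothetical student weights in kg; 588 plays the role of a data entry error
weights = [55, 52, 48, 50, 45, 50, 56, 588, 52, 51]
clean = [w for w in weights if w != 588]

mean_with, mean_without = mean(weights), mean(clean)
median_with, median_without = median(weights), median(clean)

print(f"mean:   {mean_with:.1f} with the outlier, {mean_without:.1f} without")
print(f"median: {median_with:.1f} with the outlier, {median_without:.1f} without")
```

On this sample a single bad entry roughly doubles the mean while the median barely moves, which is why statistics and models built on the mean are sensitive to untreated outliers.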

“Avoiding bias starts by recognizing that data bias exists, both in the data itself and in the people analyzing or using it,” said Hariharan Kolam, CEO and founder of Findem, in a speech. Bias can be introduced not only by the data but also by the person working on it. Such biases may be introduced subconsciously, so before modeling the data we have to make sure that they are dealt with and do not pose any threat to our end results.

Different algorithms to treat outliers

There are numerous methods and machine learning algorithms for detecting outliers, of which the following are the most popular. Let's look at each in detail with examples.

Z score test

The z score test is one of the most commonly used methods to detect outliers. It measures how many standard deviations an observation is away from the mean. A z score of 1.5 indicates that the observation is 1.5 standard deviations above the mean, and -1.5 means that it is 1.5 standard deviations below the mean.

Z score = (x - mean) / standard deviation

where x is the data point.

If the absolute value of an observation's z score is 3 or more, it is generally treated as an anomaly or an outlier.

Let us take the table of student weights (loaded below from student_weight.xlsx) and detect the outliers by finding the z score of each weight.

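import pandas as pd
import scipy.stats as stats

# Load the student weights from the Excel sheet
student_info = pd.read_excel('student_weight.xlsx')
# z score of every weight (scipy uses the population standard deviation by default)
z_score = stats.zscore(student_info['weights(in Kg)'])
print(z_score)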

Output

[-0.30359971 -0.32843404 -0.35326838 -0.34085121 -0.37189413 -0.34085121
 -0.29739113  2.99936649 -0.32843404 -0.33464263]

The entry of 588, the eighth value in the output, has a z score of about 3 while every other score is close to -0.3, so the z score test confirms that it is an outlier.
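The same test can be sketched without the Excel file. The example below uses only the standard library and hypothetical weights (not the article's table); `z_scores` and `flag_outliers` are names made up for this illustration. It uses the population standard deviation, which is also the default in `scipy.stats.zscore`. With that choice, no value in a sample of n points can have an absolute z score above sqrt(n - 1), which is 3 for ten points, so on very small samples a strict cut-off of 3 can miss an obvious outlier.

```python
from statistics import mean, pstdev

def z_scores(values):
    """z score of each value, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def flag_outliers(values, threshold=3.0):
    """Indices of values whose absolute z score is at or above the threshold."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) >= threshold]

# Hypothetical student weights in kg
weights = [55, 52, 48, 50, 45, 50, 56, 588, 52, 51]
scores = z_scores(weights)
strict = flag_outliers(weights, threshold=3.0)  # the usual cut-off
loose = flag_outliers(weights, threshold=2.5)   # a looser cut-off for a small sample

print([round(z, 3) for z in scores])
print(strict, loose)
```

On this sample the extreme entry scores just under 3, so the looser cut-off flags it while the strict one does not. Which threshold is appropriate is a judgment call that depends on the sample size and the problem.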

Box plot

The box plot shows the distribution of the data points by dividing them into quartiles. It marks the minimum, the first quartile, the median, the third quartile, and the maximum of the dataset. The first and third quartiles are also known as the lower and upper quartiles. This is one of the visual methods to detect anomalies. Any data points that lie outside the box and whiskers of the plot can be treated as outliers.

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 7))
# Points drawn beyond the whiskers are potential outliers
plt.boxplot(student_info['weights(in Kg)'])
plt.show()

The resulting box plot of the students' weights shows one observation lying far away from the box and whiskers, which indicates that this data point is an outlier.
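The box plot's visual rule can also be applied numerically. The sketch below assumes the common convention of placing the fences 1.5 times the interquartile range (IQR) beyond the first and third quartiles, which is matplotlib's default whisker setting. The weights are hypothetical, `iqr_fences` is a name made up for this example, and the "inclusive" quartile method is one of several reasonable choices.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Lower and upper fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical student weights in kg
weights = [55, 52, 48, 50, 45, 50, 56, 588, 52, 51]
lower, upper = iqr_fences(weights)
outliers = [w for w in weights if w < lower or w > upper]

print(lower, upper, outliers)
```

Because the quartiles are barely affected by one extreme value, this rule still works on small samples where the z score cut-off becomes unreliable.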

Isolation Forest

The isolation forest algorithm is an easy-to-implement yet powerful choice for outlier detection. It is built on decision trees: it isolates the outliers from the dataset by repeatedly selecting a random feature and a random split value between the maximum and minimum values of that feature. Because outliers are few and far from the other points, they tend to be isolated in fewer splits.

The isolation forest method is preferred over other methods when the data set is huge and has many features, as it uses less memory than other techniques.
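The isolation idea can be illustrated with a toy, one-dimensional sketch in plain Python. This is not scikit-learn's implementation: the weights are hypothetical, and `isolation_depth` and `average_depth` are names made up for this example. It repeatedly picks a random split between the current minimum and maximum, keeps the side that contains the target value, and counts how many splits are needed before the target stands alone.

```python
import random

def isolation_depth(values, target, rng, depth=0):
    """Count the random splits needed until `target` is alone (or only ties remain)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains the target
    same_side = [v for v in values if (v < split) == (target < split)]
    return isolation_depth(same_side, target, rng, depth + 1)

def average_depth(values, target, n_trees=200, seed=0):
    """Average isolation depth of `target` over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(n_trees)) / n_trees

# Hypothetical student weights in kg
weights = [55, 52, 48, 50, 45, 50, 56, 588, 52, 51]
depth_outlier = average_depth(weights, 588)
depth_typical = average_depth(weights, 50)

print(depth_outlier, depth_typical)
```

The extreme value is usually cut off by the very first split, while a typical value needs several, and that difference in average depth is what an isolation forest turns into an anomaly score.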

Below is the code for detecting outliers using an isolation forest:

from sklearn.ensemble import IsolationForest

# contamination is the expected share of outliers in the data
model = IsolationForest(n_estimators=50, max_samples='auto', contamination=0.1, max_features=1.0)
model.fit(student_info[['weights(in Kg)']])
# Anomaly score of each row: the lower the score, the more abnormal the row
student_info['scores'] = model.decision_function(student_info[['weights(in Kg)']])
# predict returns -1 for outliers and 1 for inliers
student_info['anomaly'] = model.predict(student_info[['weights(in Kg)']])
anomaly = student_info.loc[student_info['anomaly'] == -1]
anomaly_index = list(anomaly.index)
print(anomaly)

Output

[Image: rows flagged as anomalies by the isolation forest]

DBSCAN

Density-based spatial clustering of applications with noise, popularly known as DBSCAN, is a clustering algorithm. Like other clustering algorithms, DBSCAN divides the dataset into groups by checking how closely the data points aggregate, and the observations that fail to aggregate with any group are termed outliers.

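from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# eps and min_samples depend on the dataset and usually need tuning
model = DBSCAN(eps=0.8, min_samples=10).fit(student_info[['weights(in Kg)']])
# Cluster label of each row; DBSCAN labels noise points (outliers) as -1
labels = model.labels_
# Students on the x-axis and weights on the y-axis, coloured by cluster label
plt.scatter(student_info['student_name'], student_info['weights(in Kg)'], c=labels, marker='o')
plt.xlabel('Students', fontsize=16)
plt.ylabel('Weights', fontsize=16)
plt.title('Students Vs Weights', fontsize=20)
plt.show()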
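As a rough illustration of what failing to aggregate means, the sketch below re-implements only DBSCAN's noise rule for one-dimensional data in plain Python. It is not scikit-learn's implementation, the weights are hypothetical, and `dbscan_noise` is a name made up for this example. A point is a core point if at least `min_samples` points (itself included) lie within `eps` of it, and a point is noise if it is neither a core point nor within `eps` of one.

```python
def dbscan_noise(values, eps, min_samples):
    """Indices of the points that DBSCAN's rule would label as noise (1-D data)."""
    n = len(values)
    # Neighbourhood of each point, including the point itself
    neighbours = [
        {j for j in range(n) if abs(values[i] - values[j]) <= eps}
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbours[i]) >= min_samples}
    # Noise: not a core point and no core point within eps
    return [i for i in range(n) if i not in core and not (core & neighbours[i])]

# Hypothetical student weights in kg
weights = [55, 52, 48, 50, 45, 50, 56, 588, 52, 51]
noise_tuned = dbscan_noise(weights, eps=3, min_samples=3)
noise_strict = dbscan_noise(weights, eps=0.8, min_samples=10)

print(noise_tuned)
print(noise_strict)
```

With these hypothetical values, `eps=3, min_samples=3` leaves only the 588 entry as noise, whereas `eps=0.8, min_samples=10` marks every row as noise because no point can collect ten neighbours. The parameters have to suit the size and scale of the data.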

How to treat them?

It might be tempting to simply remove the records that contain outliers, but that is not always the best approach. The outlier treatment method can vary from case to case and should be discussed with the business before it is finalized. Approaches include replacing the outlier with the mean or median value or, in some cases, dropping the observation with the suspected outlier so that it does not bias the results. We tend to delete outliers when they are caused by human data entry errors or data processing errors.
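The replacement and removal options above can be sketched as follows. This is a minimal illustration, not a prescribed method: the weights are hypothetical, `treat_outliers` is a name made up for this example, and the outlier index is assumed to come from one of the detection methods. The fill value is computed from the non-flagged values only, so that the outlier itself does not distort the replacement.

```python
from statistics import mean, median

def treat_outliers(values, outlier_indices, strategy="median"):
    """Drop the flagged values or replace them with the mean/median of the rest."""
    flagged = set(outlier_indices)
    kept = [v for i, v in enumerate(values) if i not in flagged]
    if strategy == "drop":
        return kept
    if strategy not in ("mean", "median"):
        raise ValueError(f"unknown strategy: {strategy}")
    fill = mean(kept) if strategy == "mean" else median(kept)
    return [fill if i in flagged else v for i, v in enumerate(values)]

# Hypothetical student weights in kg; index 7 is the suspected outlier
weights = [55, 52, 48, 50, 45, 50, 56, 588, 52, 51]
dropped = treat_outliers(weights, [7], strategy="drop")
imputed = treat_outliers(weights, [7], strategy="median")

print(dropped)
print(imputed)
```

Dropping shortens the data set, while replacing keeps every row at the cost of inventing a value, so the choice should follow the discussion with the business.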

Depending on the size of the data set, it can be advisable to treat the outliers separately during model fitting: build one model that fits the outliers and a separate model for the rest of the dataset. This process, however, can be time-consuming and add to the cost.

The media shown in this article on treat outliers are not owned by Analytics Vidhya and are used at the Author’s discretion.
