Before we start the discussion on Outliers, we should understand exactly what **Feature Improvement means under Feature Engineering**.

- When we have **a LOT OF FEATURES** in a given dataset, feature engineering can become quite a challenging and interesting module.
- The number of features can impact the model considerably, so feature engineering is an important task in the Data Science life cycle.

The Feature Engineering family has many key factors; let's discuss **Outliers** here. This is one of the interesting topics, and it is easy to understand in *layman's terms*.

- An outlier is an observation or data point that lies an abnormal distance from the other values in a given population (the odd man out).
    - Like in the following data points (Age): 18, 22, 45, 67, 89, **125**, 30
- An outlier is an object that deviates significantly from the rest of the object collection.
    - List of cities: New York, Los Angeles, London, **France**, Delhi, Chennai
- It is an abnormal observation spotted during the Data Analysis stage; the data point lies far away from the other values.
    - List of animals: cat, fox, rabbit, **fish**
- An outlier is an observation that diverges from otherwise well-structured data.
- The root cause of an outlier can be a measurement error or a data collection error.

**Quick ways of handling Outliers**

- Outliers can be either a mistake or just variance (as in the examples above).
- If we find that an outlier is due to a mistake, we can ignore it.
- If we find that it is due to variance in the data, we can work on it.

In the picture of the apples, can we find the odd one out? I hope you can!

But scanning a huge list of values in a given feature/column of a .csv file can be really challenging for the naked eye.

First and foremost, the best way to find the **Outliers** in a feature is the visualization method.

**Why do we get outliers? Here are the quick reasons.**

- Incorrect data entry or errors during data processing
- Missing values in a dataset
- Data that did not come from the intended sample
- Errors that occur during experiments
- Not an error at all, just an unusual value relative to the rest of the data
- A more extreme distribution than normal

That's fine, but you might have more questions about **Outliers** if you're a real lover of Data Analytics, Data Mining, and Data Science.

Let’s have a quick discussion on those.

**Outliers tell us** how the data point(s) of a given dataset differ significantly from the overall picture; simply put, the odd one (or many). Often this is an error during data collection.

- Generally, **Outliers** affect statistical results during the EDA process. A quick example: the **MEAN** and **MODE** of a given dataset can be misleading, suggesting that the data values are higher than they really are.
- The **CORRELATION COEFFICIENT** is highly sensitive to outliers, since it measures the strength of a linear relationship between two variables. Correlation is a non-resistant measure, and **r (the correlation coefficient)** is strongly affected by outliers.

- **Positive Relationship**: when the correlation coefficient is **closer to 1**
- **Negative Relationship**: when the correlation coefficient is **closer to -1**
- **Independent**: when **X** and **Y** are independent, the **correlation coefficient** is close to **zero (0)**
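To make both effects concrete, here is a small sketch with made-up numbers showing how a single extreme value inflates the mean and weakens the correlation coefficient:

```python
import numpy as np

# A tight, nearly linear relationship between x and y (made-up numbers)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 14.2, 15.9])

# Inject a single extreme value into y
y_dirty = y.copy()
y_dirty[-1] = 100.0

print('Mean of y without outlier :', y.mean())
print('Mean of y with outlier    :', round(y_dirty.mean(), 2))  # pulled sharply upward

r_clean = np.corrcoef(x, y)[0, 1]
r_dirty = np.corrcoef(x, y_dirty)[0, 1]
print('r without outlier :', round(r_clean, 3))  # very close to 1
print('r with outlier    :', round(r_dirty, 3))  # noticeably weaker
```

One bad reading is enough to drag the mean far from where most of the data sits and to substantially weaken r, which is why both statistics should be read with care during EDA.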

- We can learn about the data collection process from the Outliers and their observations: analyze how they occur, how to minimize them, and build the findings into future **data collection guidelines**.
- Even though Outliers introduce inconsistent results into your dataset during analysis and significantly impact statistical power, there can be **challenges** and **roadblocks** to removing them in a few situations.

**DO or DO NOT (Drop Outliers)**

- Before dropping Outliers, we must analyze the dataset with and without them and better understand their impact on the results.
- If you observe that an outlier is clearly due to an incorrect entry or measurement, you can certainly drop it. No issues in that case.
- If you find that your assumptions are affected, you may drop the outlier straight away, provided it does not change the results.
- If the outlier affects both your assumptions and your results, no questions: simply drop it and proceed with your further steps.

So far we have discussed what Outliers are, how they affect a given dataset, and whether we can drop them or not. Let's now see how to find them in a given dataset. Are you ready?

We will look at the simple methods first: the *Univariate* and *Multivariate* methods.

**Univariate method:** I believe you're familiar with univariate analysis: playing around with one variable/feature from the given dataset. Here, to look at the Outlier, we're going to apply a BOX plot to understand the nature of the Outlier and where exactly it is.

Let's see some sample code. I am taking titanic.csv as a sample for my analysis; here I am considering age.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df_titanic = pd.read_csv('titanic.csv')  # the sample dataset used throughout

plt.figure(figsize=(5, 5))
sns.boxplot(y='age', data=df_titanic)
plt.show()
```

You can see the outliers at the top portion of the box plot, in the form of dots.

**Multivariate method:** Again I am taking titanic.csv as a sample for my analysis; here I am considering age and passenger class.

```python
plt.figure(figsize=(8, 5))
sns.boxplot(x='pclass', y='age', data=df_titanic)
plt.show()
```

We can very well use the Histogram and Scatter Plot visualization techniques to identify outliers too.
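As a quick sketch of that idea (using synthetic numbers rather than the Titanic data, with two extreme values injected deliberately), outliers show up as isolated bars in a histogram and as isolated points in a scatter plot:

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
age = np.append(rng.normal(30, 8, 200), [95, 102])      # two injected extreme ages
fare = np.append(rng.normal(40, 12, 200), [480, 510])   # two injected extreme fares

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(age, bins=30)   # outliers appear as isolated bars on the far right
axes[0].set(title='Histogram of age', xlabel='age')
axes[1].scatter(age, fare)   # outliers sit far away from the main point cloud
axes[1].set(title='age vs fare', xlabel='age', ylabel='fare')
fig.tight_layout()
fig.savefig('outlier_plots.png')
```

The histogram reveals outliers in a single feature, while the scatter plot can expose points that are unusual only in combination with another feature.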

On top of this, we have mathematical ways to find the Outliers: the Z-Score and Inter-Quartile Range (IQR) Score methods.

**Z-Score method:** In which the data distribution is rescaled so that the mean is 0 and the standard deviation (SD) is 1, as in the standard Normal Distribution. The z-score of a value x is z = (x - mean) / std, and values whose z-score exceeds a chosen threshold (commonly 3) are flagged as outliers.

Let's consider the below age group of kids, collected during stage one of the data science life cycle. Before going into further analysis, the Data Scientist wants to remove the outliers. Looking at the code and output, we can understand the essence of finding outliers using the Z-score method.

```python
import numpy as np

kids_age = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]
mean = np.mean(kids_age)
std = np.std(kids_age)
print("Mean of the kids' ages in the given series :", mean)
print("Std deviation of the kids' ages in the given series :", std)

threshold = 3
outlier = []
for i in kids_age:
    z = (i - mean) / std       # z-score of each observation
    if z > threshold:          # flag anything more than 3 SDs above the mean
        outlier.append(i)
print('Outlier in the dataset is (teenager):', outlier)
```

Mean of the kids' ages in the given series: 2.6666666666666665

Std deviation of the kids' ages in the given series: 3.3598941782277745

Outlier in the dataset is (teenager): **[15]**

**IQR Score method:** In which the data is divided into quartiles (Q1, Q2, and Q3). Please refer to the picture *Outliers Scaling* above. The ranges are as below.

- 25th percentile of the data – Q1
- 50th percentile of the data – Q2
- 75th percentile of the data – Q3

Let's take the junior boxing weight-category series from the given dataset and figure out the outliers.

```python
import numpy as np
import seaborn as sns

# Junior boxing weight categories
jr_boxing_weight_categories = [25, 30, 35, 40, 45, 50, 45, 35, 50, 60, 120, 150]

Q1 = np.percentile(jr_boxing_weight_categories, 25, method='midpoint')
Q2 = np.percentile(jr_boxing_weight_categories, 50, method='midpoint')
Q3 = np.percentile(jr_boxing_weight_categories, 75, method='midpoint')

IQR = Q3 - Q1
print('Interquartile range is', IQR)

low_lim = Q1 - 1.5 * IQR
up_lim = Q3 + 1.5 * IQR
print('low_limit is', low_lim)
print('up_limit is', up_lim)

outlier = []
for x in jr_boxing_weight_categories:
    if (x > up_lim) or (x < low_lim):   # outside the 1.5 * IQR fences
        outlier.append(x)
print('Outlier in the dataset is', outlier)
```

Interquartile range is 20.0

low_limit is 5.0

up_limit is 85.0

Outlier in the dataset is **[120, 150]**

```python
sns.boxplot(x=jr_boxing_weight_categories)
```

Looking at the boxplot, we can understand where the outliers sit in the plot.

So far, we have discussed what Outliers are, what they look like, whether they are good or bad for a dataset, and how to find them using matplotlib/seaborn visualizations and statistical methods.

Now we will conclude with correcting or removing the outliers and taking the appropriate decision. We can use the same Z-score and IQR Score conditions to correct or remove the outliers on demand, because, as mentioned earlier, **Outliers are not always errors; they can simply be unusual values in the original data.**
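As a minimal sketch of that idea, reusing the IQR fences from the boxing-weight example above, we can either drop the flagged values or cap them at the limits (the capping choice here is just one common option, often called winsorizing):

```python
import numpy as np

weights = np.array([25, 30, 35, 40, 45, 50, 45, 35, 50, 60, 120, 150])

Q1 = np.percentile(weights, 25, method='midpoint')
Q3 = np.percentile(weights, 75, method='midpoint')
IQR = Q3 - Q1
low_lim, up_lim = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Option 1: drop the outliers entirely
cleaned = weights[(weights >= low_lim) & (weights <= up_lim)]

# Option 2: cap the outliers at the fences instead of dropping them
capped = np.clip(weights, low_lim, up_lim)

print('Dropped :', cleaned)
print('Capped  :', capped)
```

Dropping shrinks the dataset, while capping keeps every row but tames the extreme values; which to use depends on whether the outliers are errors or genuine variance.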

I hope this article helps you understand Outliers, zoomed in from all aspects. Let's come up with another topic shortly; until then, bye for now. Thanks for reading! Cheers!!

*The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.*
