This article was published as a part of the Data Science Blogathon.

Data cleaning is not always just a matter of finding **NaN or zero values** and removing them. In real-world projects, things aren't that simple. We have to analyze the data further, and **outlier detection** is one of the steps we need to revisit in every iteration of that analysis.

In this article, we will do **outlier pruning** on three types of data, i.e., one-dimensional, **two-dimensional**, and **curve** data, using statistical methods such as the **z-score**, a fitted data distribution, and a **polynomial fit**. By the end of this article, you should be able to detect outliers in each of these kinds of data. After working from scratch, we will also use **Python's popular scikit-learn (sk-learn) package** to **automate** the process in a few lines of code. This way, you get both the mathematical idea behind the concept and a simple practical implementation.

**Inference:** In this article, only the NumPy and matplotlib libraries are used when we are not automating things, apart from a couple of helpers from `scipy.stats`. All three datasets are **text-based**, so the **`np.loadtxt()`** method is the natural way to load them. Looking at the shapes of the 1-D and 2-D datasets, we can see that both have **1010 rows** (**10 are outliers** that I deliberately added for this analysis).

We then plot the data points using **a scatter plot**. The output has two plots: the first shows the **1-D (blue)** and **2-D (red)** datasets, and the second shows the **curve** data. Some outliers are visible to the naked eye in the graph, but eyeballing alone is not a systematic way to detect outliers.
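The loading step described above can be sketched as follows. The article's actual data files are not shown here, so this sketch writes synthetic stand-in arrays to temporary text files first; the file names, the random seed, and the distribution parameters are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (NOT the article's data): 1000 well-behaved points
# plus 10 deliberately added outliers, giving 1010 rows each.
d1_demo = np.concatenate([rng.normal(10, 1, size=1000), rng.uniform(16, 20, size=10)])
d2_demo = np.vstack([
    rng.normal(0, 1, size=(1000, 2)),
    rng.uniform(6, 9, size=(10, 2)),
])

with tempfile.TemporaryDirectory() as tmp:
    path_1d = os.path.join(tmp, "outlier_1d.txt")  # hypothetical file name
    path_2d = os.path.join(tmp, "outlier_2d.txt")  # hypothetical file name
    np.savetxt(path_1d, d1_demo)
    np.savetxt(path_2d, d2_demo)

    # np.loadtxt parses whitespace-separated numbers into an ndarray.
    d1_loaded = np.loadtxt(path_1d)
    d2_loaded = np.loadtxt(path_2d)

print(d1_loaded.shape, d2_loaded.shape)
```

A single-column file comes back as a 1-D array and a two-column file as an array with two columns, which is why checking `.shape` right after loading is a cheap sanity check.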

The most basic and most common way of manually doing outlier pruning on data distributions is to:

- Fit a model to the data using statistical measures (for example, a mean and standard deviation, or a polynomial)
- Find all points whose z-score exceeds a chosen threshold
- Remove those outliers
- Refit the model and potentially run again from Step 1 (until no more outliers are removed).
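The steps above can be sketched as a small loop. This is a minimal illustration rather than the article's own code: the function name `zscore_prune`, its defaults, and the synthetic test data are assumptions.

```python
import numpy as np


def zscore_prune(data, threshold=3.0, max_iter=10):
    """Return a boolean mask of the points kept after iterative z-score clipping."""
    data = np.asarray(data, dtype=float)
    good = np.ones(data.shape, dtype=bool)
    for _ in range(max_iter):
        # Step 1: fit the model (mean and std) using only the points kept so far.
        mean, std = data[good].mean(), data[good].std()
        if std == 0:
            break
        # Step 2: flag every point whose absolute z-score reaches the threshold.
        new_good = np.abs((data - mean) / std) < threshold
        # Steps 3-4: drop the flagged points and refit until the mask is stable.
        if np.array_equal(new_good, good):
            break
        good = new_good
    return good


# Synthetic demo data (an assumption): 1000 Gaussian points plus 10 far-away values.
rng = np.random.default_rng(42)
sample = np.concatenate([rng.normal(0, 1, size=1000), rng.uniform(10, 15, size=10)])
good = zscore_prune(sample)
print(f"Rejected {(~good).sum()} of {sample.size} points")
```

Refitting matters because outliers inflate the standard deviation on the first pass. Once they are removed, the spread shrinks and the same threshold becomes stricter, which is also why a few genuine points in the tails can end up rejected.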

```python
from scipy.stats import norm

# z-score of every point relative to the sample mean and standard deviation
mean, std = np.mean(d1), np.std(d1)
z_score = np.abs((d1 - mean) / std)
threshold = 3
good = z_score < threshold

print(f"Rejection {(~good).sum()} points")
print(f"z-score of 3 corresponds to a prob of {100 * 2 * norm.sf(threshold):0.2f}%")

# Random vertical jitter, used only to spread the 1-D points out in the plot
visual_scatter = np.random.normal(size=d1.size)
plt.scatter(d1[good], visual_scatter[good], s=2, label="Good", color="#4CAF50")
plt.scatter(d1[~good], visual_scatter[~good], s=8, label="Bad", color="#F44336")
plt.legend();
```

**Output:**

**Inference:** One of the most widely used methods for detecting outliers is **the z-score**. **What is a z-score?** A z-score is **a standard score** that measures how many standard deviations a data point lies from the mean. **If the absolute z-score of a data point exceeds the chosen threshold (3 here), that data point is treated as an outlier.**

The code above follows this approach: it first computes the z-score with the formula (**(X - mean) / standard deviation**), where X is each data point. Because NumPy applies the operation to the whole array at once, no explicit loop over the points is needed. The plot says it all! According to the threshold value, the red dots are the outliers, or bad data, while the green ones are **good** data.

```python
from scipy.stats import multivariate_normal as mn

mean, cov = np.mean(d2, axis=0), np.cov(d2.T)
# "cov" is the covariance matrix and "pdf" is the probability density function
good = mn(mean, cov).pdf(d2) > 0.01 / 100

plt.scatter(d2[good, 0], d2[good, 1], s=2, label="Good", color="#4CAF50")
plt.scatter(d2[~good, 0], d2[~good, 1], s=8, label="Bad", color="#F44336")
plt.legend();
```

**Output:**
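The cut used above (`pdf > 0.01 / 100`) is a threshold on a density value, which can be hard to interpret. A closely related alternative, swapped in here purely for illustration and not part of the original code, is the squared Mahalanobis distance: for Gaussian data it follows a chi-square distribution, so the threshold can be stated as a probability. The function name `mahalanobis_mask`, the `keep_prob` parameter, and the synthetic data are assumptions.

```python
import numpy as np
from scipy.stats import chi2


def mahalanobis_mask(points, keep_prob=0.9999):
    """Keep points inside the Gaussian contour that contains `keep_prob` of the mass."""
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    cov = np.cov(points.T)
    diff = points - mean
    # Squared Mahalanobis distance, diff^T cov^-1 diff, for every row at once.
    sq_dist = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    # For Gaussian data, sq_dist follows a chi-square distribution with
    # as many degrees of freedom as there are dimensions.
    return sq_dist < chi2.ppf(keep_prob, df=points.shape[1])


# Synthetic demo data (an assumption): a 2-D Gaussian blob plus 10 distant points.
rng = np.random.default_rng(7)
demo_2d = np.vstack([
    rng.normal(0, 1, size=(1000, 2)),
    rng.uniform(6, 9, size=(10, 2)),
])
keep = mahalanobis_mask(demo_2d)
print(f"Rejected {(~keep).sum()} of {len(demo_2d)} points")
```

This is the multi-dimensional analogue of the z-score: in one dimension, the Mahalanobis distance reduces to the absolute z-score.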

So, how do we pick what our threshold should be? Visual inspection is hard to beat. You can argue for relating the threshold to the number of samples you have, or to how much of the data you are willing to cut, but be warned that too much rejection will eat away at your actual data sample and bias your results.
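One way to relate the threshold to the sample size is to estimate how many genuine points a given cut would throw away if the data were perfectly Gaussian. The sketch below does that for the 1-D z-score case; the helper name `expected_false_rejections` and the sample size of 1000 are assumptions for illustration.

```python
from scipy.stats import norm


def expected_false_rejections(n_samples, threshold):
    """Expected number of genuine Gaussian points with |z| >= threshold."""
    # norm.sf is the upper-tail probability; doubling it covers both tails.
    return n_samples * 2 * norm.sf(threshold)


n_samples = 1000  # assumed sample size, roughly matching the datasets used here
for threshold in (2, 3, 4, 5):
    lost = expected_false_rejections(n_samples, threshold)
    print(f"threshold={threshold}: ~{lost:.2f} genuine points rejected on average")
```

At a threshold of 3, about 0.27% of Gaussian data lies outside the cut, so roughly 2.7 genuine points per 1000 are lost. At a threshold of 2, the figure is about 4.55%, which already costs a noticeable part of the sample.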

If you don't have a distribution but instead have data with uncertainties, you can do something similar. To take a real-world example, in one of the papers, we had values of xs, ys, and error (wavelength, flux, and flux error) and wanted to subtract the smooth background. We wanted to do this with a simple **polynomial fit**. Unfortunately, the data had several emission lines and cosmic ray impacts in it (visible as spikes), which biased our polynomial fit, so we had to remove them.

We fit a polynomial to it, remove all points more than three standard deviations from the polynomial from consideration, and loop until all points are within three standard deviations. In the example below, for simplicity, the data is normalized so all errors are one.

```python
xs, ys = d3.T
p = np.polyfit(xs, ys, deg=5)
ps = np.polyval(p, xs)

plt.plot(xs, ys, ".", label="Data", ms=1)
plt.plot(xs, ps, label="Bad poly fit")
plt.legend();
```

**Output:**

```python
x, y = xs.copy(), ys.copy()
for i in range(5):
    p = np.polyfit(x, y, deg=5)
    ps = np.polyval(p, x)
    good = y - ps < 3  # Here we will only remove positive outliers

    x_bad, y_bad = x[~good], y[~good]
    x, y = x[good], y[good]

    plt.plot(x, y, ".", label="Used Data", ms=1)
    plt.plot(x, np.polyval(p, x), label=f"Poly fit {i}")
    plt.plot(x_bad, y_bad, ".", label="Not used Data", ms=5, c="r")
    plt.legend()
    plt.show()

    # Stop early once an iteration rejects nothing
    if (~good).sum() == 0:
        break
```

**Output:**

**Inference:** Before discussing the plots, let's first look at the **statistical measures** used in the code. We fit the data points with a **5th-degree polynomial** for up to 5 iterations; the `break` at the end of the loop stops it early if all the outliers have been removed before the 5th iteration. In each iteration, any point lying more than 3 above the fit (three standard deviations, since the errors are normalized to one) is rejected, so only positive outliers are removed.

- **1st plot:** In this plot, we have **good data, bad data, and the polynomial fit**, where we can see that due to the presence of outliers, the fit line does not follow the data well.
- **2nd plot:** In this plot, we applied the **5th-degree polynomial fit** again, which detects the outliers and removes some of them, with a few left as **residual**.
- **3rd plot:** Here, in another iteration, we can see that **all the outliers are removed**, with no red dots in the graph.
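The loop above assumes every error is one. A hedged generalization for data with per-point uncertainties is sketched below; it is not the article's code. The function name `clip_polyfit`, its parameters, and the synthetic spectrum are assumptions. It relies on the `w` argument of `np.polyfit`, which for Gaussian uncertainties expects weights of `1 / sigma`.

```python
import numpy as np


def clip_polyfit(x, y, err, deg=5, n_sigma=3.0, max_iter=5):
    """Iteratively fit a polynomial, rejecting points more than n_sigma above it.

    Returns the coefficients of the last fit and a boolean mask over the
    original arrays marking the points that were kept.
    """
    x, y, err = (np.asarray(a, dtype=float) for a in (x, y, err))
    good = np.ones(x.shape, dtype=bool)
    coeffs = None
    for _ in range(max_iter):
        # Weighted least squares on the points that are still considered good.
        coeffs = np.polyfit(x[good], y[good], deg=deg, w=1.0 / err[good])
        # Residuals in units of each point's own uncertainty.
        resid = (y - np.polyval(coeffs, x)) / err
        # One-sided cut, as in the loop above: only positive spikes are rejected.
        new_good = good & (resid < n_sigma)
        if new_good.sum() == good.sum():
            break  # nothing new was rejected, so the fit has settled
        good = new_good
    return coeffs, good


# Synthetic demo (an assumption): a smooth cubic background with three spikes.
rng = np.random.default_rng(3)
x_demo = np.linspace(-1, 1, 500)
err_demo = np.ones_like(x_demo)
y_demo = 5 * x_demo**3 - 2 * x_demo + 1 + rng.normal(0, 1, size=x_demo.size)
spike_idx = [50, 200, 350]
y_demo[spike_idx] += 30

coeffs, kept = clip_polyfit(x_demo, y_demo, err_demo, deg=3)
print(f"Rejected {(~kept).sum()} of {x_demo.size} points")
```

Keeping a mask over the original arrays, instead of shrinking `x` and `y` in place, makes it easy to see afterwards exactly which points were rejected.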

Blessed `sk-learn` to the rescue. Check out the main page, which lists many ways you can do outlier detection. I think **LOF (Local Outlier Factor)** is great – it uses the distance from one point to its closest twenty neighbors to figure out point density and flags points whose local density is much lower than that of their neighbors.

```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.005)
# fit_predict returns 1 for inliers and -1 for outliers
good = lof.fit_predict(d2) == 1

plt.scatter(d2[good, 0], d2[good, 1], s=2, label="Good", color="#4CAF50")
plt.scatter(d2[~good, 0], d2[~good, 1], s=8, label="Bad", color="#F44336")
plt.legend();
```

**Output:**

**Inference:** As mentioned, to automate the above process we use **LOF** from sk-learn's **`neighbors`** module. We create a `LocalOutlierFactor` instance, passing in the **number of neighbors** and the **contamination rate** (the expected fraction of outliers, which also sets the threshold), and then call the **fit_predict** method on the whole dataset, and boom! We get the same kind of plot, with red dots as bad data and green dots as good data.
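To see how the contamination rate turns into a threshold, the sketch below inspects the scores that LOF computes. It uses synthetic data (an assumption, as are the seed and the contamination value of 0.01) together with the fitted attributes `negative_outlier_factor_` and `offset_` of `LocalOutlierFactor`.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic demo data (an assumption): a Gaussian blob plus 10 points on a distant ring.
rng = np.random.default_rng(1)
blob = rng.normal(0, 1, size=(1000, 2))
angles = rng.uniform(0, 2 * np.pi, size=10)
radii = rng.uniform(8, 10, size=10)
ring = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
points = np.vstack([blob, ring])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(points)  # 1 = inlier, -1 = outlier

# Scores are negated LOF values: the more negative, the more anomalous.
scores = lof.negative_outlier_factor_
# offset_ is the score cut derived from the contamination rate.
flagged = scores < lof.offset_

print(f"Flagged {flagged.sum()} of {len(points)} points (offset_ = {lof.offset_:.3f})")
```

Because the cut comes from a percentile of the scores, LOF flags roughly the requested fraction of points whether or not that many real outliers exist, so the contamination rate deserves the same care as the z-score threshold.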

So far, we have detected outliers using both **statistical methods** (from scratch) and **the sk-learn library**. It's been good learning, as we have covered the main ideas behind outlier detection and removal. In this section, we will recap everything in a nutshell.

- Firstly, we **loaded** all the data needed for the analysis using **NumPy**. Then we looked at the steps required for outlier pruning/detection.
- Then we went for the practical implementation, where we detected the outliers and removed them in both **N-dimensional** and uncertain **curve data** using various **statistical measures**.
- Finally, we covered the **automating section**, where we learned how Python's sk-learn library can execute the above long process in a few lines of code.

Here's the repo link to this article. I hope you liked my article on **Outliers pruning using Python**. If you have any opinions or questions, comment below.

Connect with me on LinkedIn for further discussion on Python for Data Science or otherwise.

**The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.**
