Hello all! In this tutorial, we will discuss some advanced statistical terms and methods that play a crucial role in feature engineering and data preprocessing, and that are also commonly asked about in interviews. If you haven't read our previous two articles on statistics, I would recommend going through them first to build a strong foundation: have a look at Basic statistics for data science, and practitioners can also look at Intermediate statistics for data science.

- Chebyshev's Inequality
- Quantile-Quantile (Q-Q) Plot
- Bernoulli Distribution
- Log-Normal Distribution
- Power Law Distribution
- Box-Cox Transform
- End Notes

Consider a random variable X that follows the Gaussian (normal) distribution. According to the empirical rule, we can then tell what percentage of data points lie within any given number of standard deviations of the mean. But if some random variable, say Y, does not follow the Gaussian distribution and we want to find what percentage of its data points lie within a given number of standard deviations of the mean, we use Chebyshev's Inequality.

According to Chebyshev’s Inequality,

**Pr(μ − kσ < X < μ + kσ) ≥ 1 − (1/k²)**

Here, k specifies the number of standard deviations from the mean within which we want to find the percentage of data points.

For example, let k = 2 for a random variable Y that does not follow the Gaussian distribution. Then 1 − (1/2²) = 0.75, so at least 75 per cent of the data points belonging to Y will fall within two standard deviations of the mean.
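Chebyshev's bound can be checked numerically. The sketch below is illustrative, using a simulated exponential sample as a stand-in for a non-Gaussian variable, and compares the observed fraction of points within two standard deviations against the guaranteed 75 per cent:

```python
import numpy as np

# Simulated non-Gaussian data: an exponential sample is clearly skewed,
# yet Chebyshev's inequality still guarantees that at least 1 - 1/k^2
# of its values fall within k standard deviations of the mean.
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=100_000)

mu, sigma = y.mean(), y.std()
k = 2
within = np.mean(np.abs(y - mu) < k * sigma)  # observed fraction
bound = 1 - 1 / k**2                          # Chebyshev lower bound: 0.75

print(f"observed: {within:.3f}, Chebyshev bound: {bound:.2f}")
```

Note that the bound is loose: the observed fraction is usually well above 75 per cent.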

Quantile-Quantile (Q-Q) plots play a vital role in graphically analyzing and comparing two probability distributions by plotting their quantiles against each other. They are also used in feature transformation to check whether a particular feature is normally distributed. If it is perfectly normally distributed, all the points lie exactly on a straight line (X == Y).

- Sort the feature values and compute percentiles from 1 to 100.
- Take any normally distributed feature, or generate normally distributed random values.
- Plot the percentiles against each other. If all the points lie on the line, the feature is normally distributed.

If the points drift above the line at the top (end) of the plot, the data is right-skewed. If the points fall below the line at the start of the plot, the data is left-skewed.

We can plot a Q-Q plot using the SciPy library, which provides a probability-plotting function. We will use the popular Titanic dataset and plot the Q-Q plot along with a histogram of the Age column.

import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stat  # probability plot
import pylab

data = pd.read_csv('titanic_train.csv', usecols=['Age', 'Fare', 'Survived'])

def plot_data(df, feature):
    plt.figure(figsize=(10, 6))
    plt.subplot(1, 2, 1)  # 1st plot: histogram
    df[feature].hist()
    plt.subplot(1, 2, 2)  # 2nd plot: Q-Q plot
    stat.probplot(df[feature], dist='norm', plot=pylab)
    plt.show()

plot_data(data, "Age")

If the number of samples is small, Q-Q plots are hard to interpret: for a very small dataset, a Q-Q plot will not tell you much.

The Bernoulli distribution is a distribution with exactly two possible outcomes. For example, if I toss a coin, the output can be heads or tails. Suppose the random variable takes tails as success, with value 1, and heads as failure, with value 0.

This is described by the **probability mass function** (PMF). There is one more function, the probability density function (PDF), which we studied in Basic Statistics. The difference is that a PDF describes a continuous random variable, while a PMF describes a discrete one (a fixed set of values).

- The mean of this kind of distribution is always equal to p.
- The median is 0 if q > p, 0.5 if p = q, and 1 if q < p.
- The mode is 1 (success) if p > q, else 0 (failure).
- The variance is the product of the success and failure probabilities (pq).
- The standard deviation is the square root of the variance.

We saw that failure is assigned 0 and success is assigned 1. For example, if we want to find how many students in a class like attending presentations, "like" is a success and "dislike" is a failure. So we can compute the mean as the probability-weighted sum of the values in our Bernoulli distribution.

μ = (percentage of failures)(0) + (percentage of successes)(1)

μ = (0.25)(0) + (0.75)(1)

μ = 0 + 0.75 = 0.75

We have already seen the general formula for the mean. Using the same example, the variance is the probability of success times the probability of failure: p(1 − p) = (0.75)(0.25) = 0.1875. The same weighting technique can be used to find the variance of any Bernoulli random variable.

Taking the square root of the variance gives the standard deviation: √0.1875 ≈ 0.4330.
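These values can be verified with scipy.stats.bernoulli, continuing the worked example with p = 0.75:

```python
from scipy.stats import bernoulli

# Worked example: success probability p = 0.75, failure probability q = 0.25.
p = 0.75
mean = bernoulli.mean(p)  # mean of a Bernoulli variable is p
var = bernoulli.var(p)    # variance is p * (1 - p) = 0.1875
std = bernoulli.std(p)    # standard deviation is sqrt(0.1875) ~ 0.4330

print(mean, var, std)
```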

A random variable X belongs to the log-normal distribution if log(X) is normally distributed. Data often follows the Gaussian distribution, but some data also follows the log-normal distribution, the binomial distribution, and so on, so it is worth studying all these kinds of distributions. The log-normal distribution is right-skewed: at the start it looks like a normal distribution, but it has a long right tail.

Examples include income distributions, and data in sentiment analysis often follows a log-normal distribution. One common use of the log-normal distribution is in analyzing stock prices.

A log-normal distribution can be converted to a normal distribution by taking the natural log of the data points. Some problems with the normal distribution can be avoided by using the log-normal distribution: the normal distribution can take negative values, while the log-normal distribution includes only positive values.
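A minimal sketch of this relationship, using data simulated with NumPy rather than a real dataset: the raw log-normal sample is strictly positive and heavily right-skewed, while its logarithm is roughly symmetric.

```python
import numpy as np
from scipy import stats

# If X is log-normal, then log(X) is normally distributed.
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)  # right-skewed, all positive

log_x = np.log(x)  # approximately N(0, 1)

print(x.min() > 0)                         # log-normal values are strictly positive
print(round(float(stats.skew(x)), 2))      # strongly right-skewed
print(round(float(stats.skew(log_x)), 2))  # close to 0 after taking the log
```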

A power-law distribution describes a relationship between two quantities in which a relative change in one quantity results in a proportional relative change in the other.

It basically reflects the 80-20 rule. Let us understand what that means with an example.

- 80 per cent of sales come from 20 per cent of the overall products.
- 80 per cent of Windows crashes are caused by only 20 per cent of the overall bugs.

One example of a power-law distribution is the Pareto distribution. The height of its starting point is defined by the alpha parameter: as alpha increases, the height increases. There is only a slight difference between the Pareto distribution and the log-normal distribution.

This situation occurs commonly in datasets. Most of the time we simply treat such data as right-skewed and forget about the properties of the power-law distribution, so it is worth checking for it to get better results from such distributions.
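The 80-20 behaviour can be simulated. The sketch below is an illustration with made-up "sales" values, and an alpha chosen so that the theoretical top-20-per-cent share is roughly 80 per cent:

```python
import numpy as np

# Sample from a Pareto distribution and measure how much of the total
# is concentrated in the top 20% of values (the "80-20" pattern).
rng = np.random.default_rng(1)
alpha = 1.16  # tail index; ~1.16 gives roughly an 80-20 split in theory
sales = rng.pareto(alpha, size=200_000) + 1  # numpy's pareto starts at 0; shift to 1

sales_sorted = np.sort(sales)[::-1]          # largest first
top_20_share = sales_sorted[: len(sales) // 5].sum() / sales.sum()
print(f"share held by the top 20%: {top_20_share:.2f}")
```

Because the tail is heavy, the sample estimate fluctuates, but the bulk of the total consistently sits in a small fraction of the values.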

The Box-Cox transform is a technique to transform a skewed random variable, such as one following a Pareto (power-law) distribution, towards a Gaussian distribution. Box and Cox are the two statisticians who collaborated and published this technique. When a feature is right-skewed, practitioners often use this technique to bring the variable closer to a normal distribution for easier analysis. Hence it is a feature transformation technique.

Now let us understand how it works. The transform is y(λ) = (y^λ − 1) / λ for λ ≠ 0, and y(λ) = ln(y) for λ = 0.

The exponent is a variable called lambda (λ) that varies over the range −5 to 5. During the search we examine all values of lambda and choose the optimal one, the value whose result best approximates a normal distribution for your variable. There are two techniques to find it: maximum likelihood and Bayesian statistics.

Remember that the Box-Cox transform is strictly applicable to numbers greater than zero: negative numbers and zero are not allowed.
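To make the formula concrete, here is a small sketch on simulated positive data: scipy.stats.boxcox searches for lambda by maximum likelihood, while scipy.special.boxcox applies the transform (y^λ − 1)/λ for a fixed lambda, so the two should agree.

```python
import numpy as np
from scipy import stats, special

# Box-Cox: y(lambda) = (y**lambda - 1) / lambda for lambda != 0, ln(y) for lambda == 0.
rng = np.random.default_rng(7)
y = rng.lognormal(size=10_000)  # right-skewed and strictly positive

y_transformed, lam = stats.boxcox(y)  # lambda found by maximum likelihood

manual = special.boxcox(y, lam)       # same formula applied with the fixed lambda
print(np.allclose(y_transformed, manual))  # True
```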

SciPy provides a function to apply the Box-Cox transformation. We will apply it to the same Age variable whose distribution we examined above using a Q-Q plot.

data['Age_boxcox'], parameters = stat.boxcox(data['Age'])
print(parameters)  # 0.7964531473656952
plot_data(data, 'Age_boxcox')

These are some important statistics terms placed under advanced concepts because understanding them can be a little tricky. But I hope that after reading this article you have some idea of how to use them in analysis and how to identify the kind of distribution during exploratory data analysis. If you have any doubts, please feel free to drop them in the comment box.

**Raghav Agrawal**

I am pursuing my bachelor's in computer science. I am very fond of data science and big data. I love to work with data and learn new technologies. Please feel free to connect with me on LinkedIn.

If you like my article, please share it with others too.


