This article was published as a part of the Data Science Blogathon

Hello all, In this tutorial, we will cover some Intermediate statistics terms which are very helpful in exploratory data analysis, feature engineering tasks. If you are a beginner, I would like to request you to please look at our previous article Basic Statistics concepts for Machine Learning, which will familiarize you with statistics, its importance, and some basic terms required to understand these intermediate terms.

- Z-Score in statistics
- Confidence Interval
- Hypothesis Testing
- Type-I and Type-II Error
- Different test for Hypothesis

- Covariance
- Correlation
- Pearson Correlation Coefficient
- Spearman rank correlation coefficient

- Conclusion

Z-score is a measure that describes a relationship of a particular value with the mean of a group of values. It is measured in terms of standard deviation from the mean. it is computed using the below formula.

** z = (x – μ) / σ**

- Z-score will help us to know how much standard deviation far a value is from the mean.
- Z-score is used in Standardization – we can scale down the values in feature towards mean using z-score.
- Compare scores between different distributions – we can also use a z-score to compare two distributions and tell that which one is better. for example, suppose Indian-England test series data of past 2 years. we are given an average score, maximum score, and standard deviation, and based on this we want to find in which year India played more powerful. we can use a z-score to solve this type of problem. have a look at the below figure and you will see how we use a z-score to do so.

**Python Code:**

Confidence Interval is a probability that a population parameter will fall between a certain range for a certain proportion of times. In simple words, confidence interval tells the percentage confidence of certain events happening in a particular range. It is one of the important measures in data analysis for proving our assumptions true

CI = point estimate ± Margin of an error

where the margin of error is basically a standard deviation and point estimate is mean. for calculating confidence interval we calculate point estimate, for example, we need to find 95 per cent confidence so we will assume point estimate as 95 and try to find the quantity of data lie between which range.

How to Compute Confidence Interval using Python

```
import scipy.stats as stat
np.random.seed(10)
data = np.random.randint(10, 30, 50)
#create 95% confidence interval for population mean weight
conf_interval = stat.norm.interval(alpha=0.95, loc=np.mean(data), scale=stat.sem(data))
print(conf_interval)
```

The 95 per cent Confidence interval for the true population mean is **(18.93, 22.10) **

A hypothesis in simple words is an assumption or a guess about something in a world around you. So results can be 2 things like either your guess is correct or incorrect. In data science terms we refer to hypothesis testing as where We try to evaluate 2 mutually exclusive statements on a population using a sample of data.

- make initial assumptions – The initial assumption you make is known as the
with is denoted with H0, which is always assumed as true before the experiment. And in opposite to it, we have an**Null Hypothesis**denoted by H1.**Alternate Hypothesis** - Collect data – To prove your assumptions correct we collect some data related to it or we can say as we collect evidence to prove our statement correct. while working with the Machine learning problem statements we are having data, we try to find some patterns from it as evidence.

When we know the actual outcome that the Null Hypothesis is True but due to lack of evidence we failed to prove it and we have to reject it and select an Alternate hypothesis is known as Type-1 error. And in the opposite of it, the same applies to Type-II error, when we have cannot reject the Null hypothesis, there Type-II error is achieved. You can understand it in a better way in form of Confusion Matrix.

P-value is the probability of obtaining results at least as extreme as observed results for the Hypothesis test assuming the Null hypothesis as correct and is performed by knowing the distribution of data. The P-value is also known as significance level and also denoted as alpha. the default value assumed is 5 per cent or 0.05. when P-value is less than 5 per cent, it means we do not have enough evidence to prove the NULL hypothesis as correct and have to reject it. P-value is usually found using a __P-value table__ also known as a * z-table*.

If we have 2 categorical variables then we use the chi-square test.

Chi-square is a very good way to show a relationship between 2 categorical features. Chi-square is a measure that basically tells a difference that exists between your observed counts and the count you would expect if there would no relationship between 2 variables in the population.

Compute P-value for Chi-Square test using Python

stat.chi2.pdf(3.84, 1)

we apply the chi-square transformation and calculated the probability density function which in turn gives P-value.

When we assume continuous feature for Hypothesis testing then the type of test we use is T-test. T-test tells the significant difference between the mean of two groups which may or may not be related to a label. In simple words, the t-test helps us in comparing an average of 2 groups and determine that if they came from the same population or not.

For calculating T-value we require 3 data values. It includes differences between mean values, standard deviation, and several observations.

If we want to perform a test on a more continuous feature then we go with Correlation which we will study in the further part of the article.

Covariance is one of the very important topics when we consider data preprocessing in order. Quantifying the relationship between two random variables is known as Covariance. It is similar to variance, where variance tells how a single variable varies from the mean, covariance helps to know how two variables vary together. Covariance does not represent strength between two variable and only indicate the direction of the linear relationship between them.

** Cov(x,y) = SUM [(x _{i} – x_{m}) * (y_{i} – y_{m})] / (n – 1)**

- x
_{i}is a given x value in the data set - x
_{m }is the mean, or average, of the x values - Yi is the y value in the data set that corresponds with x
_{i} - y
_{m}is the mean, or average, of the y values - n is the number of data points

Covariance is an important term that will help you in the data analysis step and also it is used by many machine learning algorithms like Linear Regression.

Compute covariance using Python.

arr = np.array([[2,6,8],[1,5,7],[3,6,9]]) print("covariance: ", np.cov(arr))

Correlation is a measure used to represent how strongly 2 variables relate to each other. Correlation is the scaled form of covariance. correlation ranges between -1 to +1. If the value of correlation is near +1, it means two variables are highly positively correlated. And in the opposite, its value is near -1 which means two variables are negatively correlated. It basically measures the strength and direction of a linear relationship between two variables.

- Strength – if I have 2 variables as X and Y, then if X increases then do Y increase or decrease, this is only a strength that correlation tells us.
- The direction of relationship – It means whether the relationship is in a positive or negative direction.

We also use correlation in __feature selection__ and to avoid

It is the most used technique to find correlation coefficients. Pearson Correlation coefficient is the covariance of two variables divided by-product of their standard deviation. Its range is between -1 to +1 and It is represented by *ρ* (rho).

- When there is the perfect linear relationship then the value of the Pearson correlation coefficient will be +1(When x increases, Y also increases).
- When X(independent variable) is increasing, Y(dependent variable) is decreasing then the value will be -1.
- When there is a non-linear relationship or a constant line at 0 then the value is 0.

we can directly use the ** corr** method of pandas dataframe to find the Pearson correlation coefficient.

df.corr()

It is a little bit different in both methods. In Spearman rank correlation we trying to find the Pearson correlation of rank of x and rank of y. now, what is the rank of X and Y?

Steps to compute Spearman Correlation coefficient is,

- Sort the data by the first column(Xi), and create a new column and assign it ranked values from 1,2,3,…n.
- Now sort the data by second column(Yi). Create another column and rank it.
- Create a new column difference(Di) that holds a difference between two rank columns.
- Finally, create a new column that holds a squared value of the difference column.

Now you have all the values, substitute values to the equation and you will get a correlation coefficient.

We have covered some of the important statistical concepts which are used in feature engineering, data analysis. I hope that it was easy for you to cope up with every concept. If you have any doubt then please drop it in the comment box below. At this point, our basic and intermediate statistics are completed, and Now in an upcoming article, we will discuss some advanced statistics terms which are mostly used and asked in interviews.

*keep learning, happy learning*

**Raghav Agrawal**

I am pursuing my bachelor’s in computer science. I am very fond of Data science and big data. I love to work with data and learn new technologies. Please feel free to connect with me on Linkedin.

If you like my article, please give it a read to others too. link

Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Become a full stack data scientist
##

##

##

##

##

##

##

##

##

##

##

##

Understanding Cost Function
Understanding Gradient Descent
Math Behind Gradient Descent
Assumptions of Linear Regression
Implement Linear Regression from Scratch
Train Linear Regression in Python
Implementing Linear Regression in R
Diagnosing Residual Plots in Linear Regression Models
Generalized Linear Models
Introduction to Logistic Regression
Odds Ratio
Implementing Logistic Regression from Scratch
Introduction to Scikit-learn in Python
Train Logistic Regression in python
Multiclass using Logistic Regression
How to use Multinomial and Ordinal Logistic Regression in R ?
Challenges with Linear Regression
Introduction to Regularisation
Implementing Regularisation
Ridge Regression
Lasso Regression
##

##

##

##

##

##

##

##

##

Introduction to Stacking
Implementing Stacking
Variants of Stacking
Implementing Variants of Stacking
Introduction to Blending
Bootstrap Sampling
Introduction to Random Sampling
Hyper-parameters of Random Forest
Implementing Random Forest
Out-of-Bag (OOB) Score in the Random Forest
IPL Team Win Prediction Project Using Machine Learning
Introduction to Boosting
Gradient Boosting Algorithm
Math behind GBM
Implementing GBM in python
Regularized Greedy Forests
Extreme Gradient Boosting
Implementing XGBM in python
Tuning Hyperparameters of XGBoost in Python
Implement XGBM in R/H2O
Adaptive Boosting
Implementing Adaptive Boosing
LightGBM
Implementing LightGBM in Python
Catboost
Implementing Catboost in Python
##

##

##

##

Introduction to Clustering
Applications of Clustering
Evaluation Metrics for Clustering
Understanding K-Means
Implementation of K-Means in Python
Implementation of K-Means in R
Choosing Right Value for K
Profiling Market Segments using K-Means Clustering
Hierarchical Clustering
Implementation of Hierarchial Clustering
DBSCAN
Defining Similarity between clusters
Build Better and Accurate Clusters with Gaussian Mixture Models
##

Introduction to Machine Learning Interpretability
Framework and Interpretable Models
model Agnostic Methods for Interpretability
Implementing Interpretable Model
Understanding SHAP
Out-of-Core ML
Introduction to Interpretable Machine Learning Models
Model Agnostic Methods for Interpretability
Game Theory & Shapley Values
##

Deploying Machine Learning Model using Streamlit
Deploying ML Models in Docker
Deploy Using Streamlit
Deploy on Heroku
Deploy Using Netlify
Introduction to Amazon Sagemaker
Setting up Amazon SageMaker
Using SageMaker Endpoint to Generate Inference
Deploy on Microsoft Azure Cloud
Introduction to Flask for Model
Deploying ML model using Flask