Instead of starting with the definition of statistics, I wish to start with the quote of Karl Pearson’s definition on Statistics, **“Statistics is the
grammar of science”.**

Recently everyone is talking about Data. After hearing the word** “Data” **the** **basic questions that arise in our minds are,

What is Data?

How Data is collected?

How Data can be analyzed?

How Data is interpreted?

To answer all these questions, the term **“Statistics”** is used. Statistics is the basic and important tool to deal with the data. Now coming to the definition of statistics, it involves the collection, descriptive, analysis and concludes the data.

There are two types of Statistics, **Descriptive and Inferential Statistics. **

In **Descriptive Statistics**, from the given observation the data is summarized. The summarization takes place by considering the sample from the population using the mean or standard deviation.

There are four different categories in Descriptive Statistics. They are,

- Measure of frequency
- Measure of dispersion
- Measure of central tendency
- Measure of position.

Based on the number of times a particular data has occurred defines the measure of frequency. The Measure of dispersion can be defined based on the Range, Variance, Standard Deviation, etc., The mean, median, mode, skewness of the respective data comes under the measure of central tendency. Finally, based on the percentile and quartile the position is measured.

Then looking into **Inferential Statistics**, once the data is collected, tabulated, and analyzed the summary or the inference is derived by using inferential statistics. The inferences are drawn based upon sampling variation and observational error.

Based on the information and conclusion derived from the sample the inferential statistics help us to predict and estimate results for the population.

Statistics is used in a variety of sectors in our day-to-day life for analyzing the right data. Based on the interpretation the development steps are taken in both private and public sectors.

Before getting into Analyzing the data, there are few things to remember.

Define your question, Collect the right data, Understand the data, Cleaning the data, Analyze the data and finally interpret the results for the questions.

**What is defining the question? **For an organization, the betterment steps are taken from the past data analysis. For better steps, there will be a few objectives to be answered perfectly to give a good interpretation. The question should give the potential solution for the problem. For that framing, a relevant question is more important. Based on the questions only, the data will be collected. So defining the question is plays a major role.

For example, In a company, if the employee attrition is high. The solution for reducing the employee moving from the company should be drawn for that the basic variables like the employee experience, their satisfactory level, their promotion, duration of working hours, etc., to be determined such that the problem can be resolved to give a potential solution.

**How to collect the right data? **The data collection has two classifications. One is primary data and another is secondary data. In primary data, the data will be collected through questionnaires, by sending emails or approaching each person. For example census. Whereas, in secondary data, it’s the data that is already available in the secondary source like agency or Database.

Now before collecting the new data, identify the existing data that is available from the database. Apart from that collect the relevant data to satisfy the objective. Then organize the existing data with the new data to proceed with the analysis. For example: Taking the same case of employee attrition, the data to be collected are experience in the company, working hours, educational qualification, distance from home, traveling hours, promotion, age of the employee, increment or hike, etc., these data are important to be collected to find the reason for employee attrition. There may be few variables that already available in the database and any new variables that are needed can be added.

**Why do we need to Understanding the data? **Once the data is collected there may be many variables that are related directly or indirectly to the objective. For that, we first need to study about all the variables whether it is nominal or ordinal. Preparing the data for analysis is done after understanding the data. While understanding we get to know about the data types, rows, and columns, missing in the data, finding the independent and dependent variables, etc.,

There may be few variables that may not be related to the question that the organization has and those variables can be used in future for future analysis. For finding those kinds of variables understanding the data is more important. Taking the same employee attrition example there may be data related to the family such as family members, years of experience in a previous company, social status, etc., and every variable is to be understood such that to split the data in such a way for answering the question.

**How Cleaning the data is done? **Data cleaning is the process of modifying the data, removing the duplicate variables, creating dummy variables if needed. Deleting the unwanted columns that are not related to the question. If the data cleaning is not proper it may lead to a lower accuracy of the model and may tend to misleading conclusions.

Once the data cleaning is done, the right data to answer the question is ready. Data manipulation is done in many ways like plotting the data, creating pivot tables for the variables, correlation, regression, detecting outliers process may take place. During the stage of manipulation, it may be needed to proceed with an existing dataset or remove some dataset or there may be a need to add few more data to answer the question. After all these stages, the required data will be ready for analysis.

**How to analyze the data? **When starting to talk about analysis the main thing is that model selection. Selecting the model plays an important role to analyze the data and answering the objective. Defining the dependent and independent variables is the important stage when analyzing the data. Currently, machine learning techniques are used for data analysis such that predictions and interpretations can be done easily. But still, some objectives can be answered directly while doing data visualization and basic statistical analysis. The tools used for analyzing the data are, Python, Excel, R programming, SPSS, STATA, etc.,

* Correlation* is used to find the relationship or association between two or more variables. Correlation lies between values -1 to +1. The interpretation is that, if the correlation is +1 then it is strongly positively correlated, -1 then it is strongly negatively correlated and 0 implies no correlation exists. Correlation works in both the case of quantitative and qualitative data.

Coming to * regression*, this analysis is used when we need to find the dependencies of one variable on the other. The regression value lies between 0 and 1. If the regression value is 1 then it is a perfect fit and 0 then it is not a good fit. The predictive model can be done by using regression analysis. This also uses both quantitative and qualitative data. There are two types of regression analysis. Linear Regression and Multiple Linear Regression.

In * Linear Regression*, it has one dependent variable and one independent variable. For example, if the price is low the sales will be high. In the case of

In the Case of * Survival analysis*, if the data is concerning the time of occurrence of an event, then survival analysis can be applied. The event will have the outcome as 0 or 1. For example, the patient survival from a heart attack can be denoted the 0 or 1. 0 denotes the person not survived and 1 denotes he/she survived. This can be predicted having the variables such as age, smoker or non-smoker, urban or rural living person, having blood pressure or not. Based on all the factors considering the person surviving status can be estimated. Currently, Survival Analysis can be applied in the case of COVID patients.

Finally coming to the part of * Machine Learning techniques*, such as the Random Forest, Decision Tree, KNN, etc., can be applied in the case of prediction and classification technique. In the example of employee attrition, taking the objective as either the male or female employee who can leave the company can be determined by using classification technique. Several models can be developed and based on the accuracy of the models, which model can predict the future employee attrition can be determined. If higher the accuracy then that particular model can be used to predict the future data.

**Interpreting**** the result: **After analyzing the data, it’s time to interpret the result. While interpreting the result check whether the analysis answered all the questions that were framed, does the data collected helped in the analysis, and from the interpretation is there a positive result for the betterment of the objective. By considering our example of employee attrition the analysis part should suggest some better steps or improvements to reduce the employee attrition from the company.

These are the basic important thing to be done and noticed while doing statistical data analysis.

Finally, I wish to quote the words of Seth Godin – “**Data is not useful until it becomes information”**

Hope you all found some basic ideas about statistics and data analysis using statistics.

Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Become a full stack data scientist
##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

Understanding Cost Function
Understanding Gradient Descent
Math Behind Gradient Descent
Assumptions of Linear Regression
Implement Linear Regression from Scratch
Train Linear Regression in Python
Implementing Linear Regression in R
Diagnosing Residual Plots in Linear Regression Models
Generalized Linear Models
Introduction to Logistic Regression
Odds Ratio
Implementing Logistic Regression from Scratch
Introduction to Scikit-learn in Python
Train Logistic Regression in python
Multiclass using Logistic Regression
How to use Multinomial and Ordinal Logistic Regression in R ?
Challenges with Linear Regression
Introduction to Regularisation
Implementing Regularisation
Ridge Regression
Lasso Regression

Introduction to Stacking
Implementing Stacking
Variants of Stacking
Implementing Variants of Stacking
Introduction to Blending
Bootstrap Sampling
Introduction to Random Sampling
Hyper-parameters of Random Forest
Implementing Random Forest
Out-of-Bag (OOB) Score in the Random Forest
IPL Team Win Prediction Project Using Machine Learning
Introduction to Boosting
Gradient Boosting Algorithm
Math behind GBM
Implementing GBM in python
Regularized Greedy Forests
Extreme Gradient Boosting
Implementing XGBM in python
Tuning Hyperparameters of XGBoost in Python
Implement XGBM in R/H2O
Adaptive Boosting
Implementing Adaptive Boosing
LightGBM
Implementing LightGBM in Python
Catboost
Implementing Catboost in Python

Introduction to Clustering
Applications of Clustering
Evaluation Metrics for Clustering
Understanding K-Means
Implementation of K-Means in Python
Implementation of K-Means in R
Choosing Right Value for K
Profiling Market Segments using K-Means Clustering
Hierarchical Clustering
Implementation of Hierarchial Clustering
DBSCAN
Defining Similarity between clusters
Build Better and Accurate Clusters with Gaussian Mixture Models

Introduction to Machine Learning Interpretability
Framework and Interpretable Models
model Agnostic Methods for Interpretability
Implementing Interpretable Model
Understanding SHAP
Out-of-Core ML
Introduction to Interpretable Machine Learning Models
Model Agnostic Methods for Interpretability
Game Theory & Shapley Values

Deploying Machine Learning Model using Streamlit
Deploying ML Models in Docker
Deploy Using Streamlit
Deploy on Heroku
Deploy Using Netlify
Introduction to Amazon Sagemaker
Setting up Amazon SageMaker
Using SageMaker Endpoint to Generate Inference
Deploy on Microsoft Azure Cloud
Introduction to Flask for Model
Deploying ML model using Flask

So nice to know .I dont know this .Thank you for sharing to ...non computer IT savvy like me.Rosita