This article was published as a part of the Data Science Blogathon

**Data Science** is an interdisciplinary field that uses algorithms and statistical techniques to extract information from data. It cannot be learned overnight; the learning curve is gradual. A data scientist needs many skills, and most importantly you need to be good at **statistics and probability**.

You may be wondering:

__Where are these statistics used in data science and in real life?__

In this blog, we will look at some basic statistical concepts and where they are used in data science.

“Statistical methods can help you make the ‘best educated guess.’”

Here is a quick overview of this blog:

→ **Population and sample**

- __Where are population and sample used in data science?__

→ **Variable**

- Numerical variable
- Categorical variable

→ **Random Variable**

- Discrete random variable
  - Where do we use a discrete random variable in data science?
- Continuous random variable
  - Where do we use continuous random variables in data science?

→ Advantages of collecting data in continuous form

→ Disadvantage of collecting data in discrete form

→ **Data**

- Quantitative data
- Qualitative data

→ **Percentile**

→ **Quartile**

In statistics, the first concepts to understand are population and sample.

**The population** is the complete set of people, things, or objects that we want to analyze. It refers to the whole group of interest, so it is usually a very large amount of data.

**Difficulties in studying the whole population:**

→ Collecting data on the entire population takes a lot of time

→ The money required to collect that data is very high

So collecting data on the entire population is practically difficult. This is where the **“sample”** comes in.

The core idea of **sampling** is to select a portion, or subset, of the whole population and study that portion to learn about the population as a whole.

In simple terms, the population is very large, so we take a part of it (the sample), analyze it, and draw a conclusion, on the assumption that the result reflects the whole population.

https://pixabay.com/illustrations/human-banner-header-humanity-1375492/

Now you may have a question: where are population and sample used in data science?

Let us see an example. We all know how an election works. The people of the country elect a candidate by casting their votes on paper ballots, EVMs (electronic voting machines), and so on. In the main election, everyone's vote is counted. This full set of voters is the population.

Usually, before and after the election, news channels conduct opinion polls, that is, entry polls (before the election) and exit polls (after the election). For these polls, a sample of 5,000–10,000 people is surveyed. This sample is taken to represent the views of all the people of the country.

The quality of the result depends on how well the sample represents the population. The sample must share the characteristics of the population.
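The opinion poll idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the population of 200,000 voters, the two candidates "A" and "B", and the 55% support for "A" are all made-up values, not real election data.

```python
import random

random.seed(7)  # fixed seed so the sketch gives the same result every run

# Hypothetical population: 200,000 voters, each voting "A" with probability 0.55.
population = ["A" if random.random() < 0.55 else "B" for _ in range(200_000)]

# Opinion poll: ask only 5,000 randomly chosen voters instead of everyone.
sample = random.sample(population, 5_000)


def share_for_a(votes):
    """Return the fraction of votes that went to candidate A."""
    return votes.count("A") / len(votes)


population_share = share_for_a(population)
sample_share = share_for_a(sample)

print(f"Share for A in the population: {population_share:.3f}")
print(f"Share for A in the sample:     {sample_share:.3f}")
```

Because the sample is drawn at random, its share for "A" should land close to the population's share, which is why a few thousand answers can stand in for a whole electorate.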

https://pixabay.com/illustrations/rare-disease-population-2888820/

A variable is any characteristic, quantity, or number that can be measured or counted, such as weight, height, or age.

Variables can be numerical or categorical.

**Numerical variable:**

A numerical variable takes values that are numbers, often with units.

Example: the weight, height, or age of the students in a class.

**Categorical Variable:**

A categorical variable takes values that are labels or categories describing a person, thing, or characteristic.

Example: the hair color or the blood group of the students in a class.

https://pixabay.com/illustrations/blood-hepatitis-scientist-diabetes-4039751/

A variable is something that varies. A random variable is a variable whose value is not fixed in advance: it depends on the outcome of a random process. Its value is uncertain, and we measure that uncertainty using probability.

Example: The height of a student picked at random from a classroom is a random variable. It is not one definite value, because it depends on which student happens to be picked.

Random variables can be of two types:

- Discrete random variable
- Continuous random variable

Let us discuss each in detail.

Any random variable whose values can be counted is called a discrete random variable. There are no in-between values.

**Where do we use a discrete random variable in data science?**

Example: the number of people in a stadium on each day of a week.

Suppose you are analyzing a cricket stadium dataset and you count the number of people in the stadium on “day 1”, finding 12,000 people. This can be expressed as a discrete random variable. The value is a whole count: it cannot be 11999.50 or 12000.50. Because it is countable, it is a discrete random variable.

Any random variable that is measured and can vary continuously is called a continuous random variable. It can have in-between values.

**Where do we use continuous random variables in data science?**

If you are analyzing the weight of the students in a class, the weight can be expressed as a continuous random variable. “Student A” may weigh 49 kg and “Student B” may weigh 55.3 kg. The values differ from student to student and can fall anywhere in between, so weight is a continuous random variable.
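The difference between the two types shows up directly in how we would simulate them. The sketch below uses Python's standard `random` module; the attendance range (8,000 to 15,000) and the weight range (40 to 70 kg) are assumptions chosen only for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Discrete random variable: daily stadium attendance is a whole-number count.
attendance = [random.randint(8_000, 15_000) for _ in range(7)]

# Continuous random variable: a weight can be any value in a range.
# Rounding to one decimal only mimics the precision of a weighing scale.
weights_kg = [round(random.uniform(40.0, 70.0), 1) for _ in range(5)]

print("Attendance for 7 days:", attendance)
print("Weights of 5 students (kg):", weights_kg)
```

Every attendance value is an integer, while the weights can carry fractional parts such as 55.3.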

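→ Data collected in continuous form can always be converted to discrete form later (for example, by grouping exact weights into ranges), so nothing is lost by collecting it that way.

→ Data collected in discrete form cannot be converted back to continuous form. The exact values were never recorded, so that detail is lost.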
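A small sketch makes this concrete. The weights and the 10 kg bands below are hypothetical; the helper `weight_band` is an illustrative name, not something from a library.

```python
# Hypothetical exact weights (continuous form), in kg.
weights_kg = [49.0, 55.3, 61.8, 47.2, 58.9]


def weight_band(weight):
    """Map an exact weight to a 10 kg band, e.g. 55.3 -> '50-60 kg'."""
    lower = int(weight // 10) * 10
    return f"{lower}-{lower + 10} kg"


# Continuous -> discrete is always possible when exact values were collected.
bands = [weight_band(w) for w in weights_kg]
print(bands)

# Discrete -> continuous is not: 49.0 and 47.2 both become '40-50 kg',
# so the exact weights cannot be recovered from the bands alone.
```

If only the bands had been collected, there would be no way to get back to 49.0 or 47.2.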

Let us look at some important terminology.

Data are pieces of information collected from a population or a sample. Data can be of two types: quantitative and qualitative.

Quantitative data always represents numbers, i.e. numerical data. This can be age, height, weight, etc. The mean or average of quantitative data is highly useful.

Example: the average weight of the students in a class.

Qualitative data always represents categories, i.e. categorical data. This can be a person's blood group, address, vehicle, etc. It is mostly in the form of words or letters. The mean or average of qualitative data doesn’t make sense.

Example: an average blood group or an average vehicle name doesn’t make sense.
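Here is a minimal sketch of that difference using Python's standard `statistics` module. The weights and blood groups are made-up values for a class of five students. For qualitative data, the most frequent value (the mode) is used here as a summary that does make sense; that choice is an addition for illustration.

```python
import statistics

# Hypothetical data for five students.
weights_kg = [49.0, 55.3, 61.8, 47.2, 58.9]      # quantitative
blood_groups = ["O+", "A+", "O+", "B+", "O+"]     # qualitative

# The mean is meaningful for quantitative data.
average_weight = statistics.fmean(weights_kg)

# There is no "average blood group", but the most common one is well defined.
most_common_group = statistics.mode(blood_groups)

print(f"Average weight: {average_weight:.2f} kg")
print(f"Most common blood group: {most_common_group}")
```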

A percentile is a value below which a certain percentage of the scores fall. It is a relative measure, identified by ranking the data.

Let us see an example with common percentiles.

We are analyzing the sales made by each sales representative in a textile shop in a month


Now we can calculate the percentiles. First, we sort the table for better understanding.

Excel provides predefined functions to calculate percentiles.

Let us see what the percentiles mean. For this example, we use the inclusive percentile.

25th percentile: 25% of the sales representatives made sales of less than 5500

50th percentile: 50% of the sales representatives made sales of less than 8500

75th percentile: 75% of the sales representatives made sales of less than 10750
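The same kind of calculation can be sketched in Python. The sales figures below are hypothetical (the original table is not reproduced here), so the results will not match the numbers above. The function `percentile_inclusive` is an illustrative helper that uses linear interpolation between ranked values, which, to my understanding, is the approach behind inclusive percentile functions such as Excel's `PERCENTILE.INC`.

```python
def percentile_inclusive(values, p):
    """Return the p-th percentile (0 <= p <= 100) using linear interpolation."""
    ordered = sorted(values)                   # percentiles are based on ranking
    rank = (p / 100) * (len(ordered) - 1)      # fractional position in the sorted list
    lower = int(rank)                          # index just below that position
    upper = min(lower + 1, len(ordered) - 1)   # index just above, capped at the end
    fraction = rank - lower                    # how far we are between the two
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical monthly sales for five sales representatives.
sales = [10000, 2000, 8000, 4000, 6000]

p25 = percentile_inclusive(sales, 25)
p50 = percentile_inclusive(sales, 50)
p75 = percentile_inclusive(sales, 75)

print("25th percentile:", p25)
print("50th percentile:", p50)
print("75th percentile:", p75)
```

For this made-up data the 50th percentile is simply the middle value of the sorted list, which is also the median.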

**Percentile**

If we divide the whole sales data into 100 equal parts, the dividing values are percentiles.

**Decile**

If we divide the whole sales data into 10 equal parts, the dividing values are deciles.

**Quartile**

If we divide the whole sales data into 4 equal parts, the dividing values are quartiles.
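Python's standard library can compute all three with one function, `statistics.quantiles`, by changing only the number of parts `n`. The sales values below are hypothetical, and `method="inclusive"` is chosen here to match the inclusive percentile used earlier.

```python
import statistics

# Hypothetical sales figures: 1000, 2000, ..., 11000.
sales = list(range(1000, 12000, 1000))

# n is the number of equal parts; the function returns the n - 1 cut points.
quartiles = statistics.quantiles(sales, n=4, method="inclusive")
deciles = statistics.quantiles(sales, n=10, method="inclusive")
percentiles = statistics.quantiles(sales, n=100, method="inclusive")

print("Quartiles:", quartiles)
print("Number of decile cut points:", len(deciles))
print("Number of percentile cut points:", len(percentiles))

# The 2nd quartile, the 5th decile, and the 50th percentile are all the median.
print(quartiles[1], deciles[4], percentiles[49])
```

Splitting into 4 parts needs 3 cut points, 10 parts need 9, and 100 parts need 99, and the middle cut point is the same value in every case.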

*We have seen some basic statistical concepts and where they are used in practice with datasets. Thanks for reading!*

I hope you enjoyed the article and increased your knowledge about Statistics.

Please feel free to contact me at **[email protected]**

Want to share your thoughts? Feel free to comment below

**About the author**

Currently, I am pursuing my Bachelor of Engineering (B.E) in Computer Science from the **Government College of Engineering, Srirangam, Tamil Nadu**. I am very enthusiastic about Statistics, Machine Learning, and Data Science.

**Connect with me on Linkedin Mohamed Illiyas**

*The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.*
