Why should we care about random variables and their distributions? Let’s explore this important concept in statistics. We’ll break down the basics of probability, understand what random variables are, and see why their distributions matter. Whether you’re new to this or want a refresher, this guide is here to help you grasp these essential ideas and feel more comfortable with statistics.

*This article was published as a part of the Data Science Blogathon.*

A random variable, also known as a stochastic variable, is a real-valued function defined over the entire sample space of an experiment. Imagine the domain as the pool of all possible values applicable to the function. Just as a function processes its domain, a random variable operates on its domain (the experiment’s sample space) and assigns a real value to each event or outcome.

The collection of these real values, derived from the random variable, forms its range. In statistical notation, a random variable is typically denoted by a capital letter, while its observed values go by small letters.

Consider the experiment of tossing two coins. We can define X to be a random variable that measures the number of heads observed in the experiment. The sample space here is: S = {HH, HT, TH, TT}

There are 4 possible outcomes for the experiment, and this set is the domain of X. The random variable X takes these 4 outcomes/events and processes them to give different real values. For each outcome, the associated value is: X(HH) = 2, X(HT) = 1, X(TH) = 1, X(TT) = 0

Thus, we can represent X as a function from {HH, HT, TH, TT} to the set {0, 1, 2}.
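As a minimal Python sketch (not from the original article), this mapping from outcomes to values can be written as a dictionary:

```python
from itertools import product

# Sample space of tossing two coins (H = heads, T = tails)
sample_space = ["".join(p) for p in product("HT", repeat=2)]  # ['HH', 'HT', 'TH', 'TT']

# The random variable X maps each outcome to the number of heads it contains
X = {outcome: outcome.count("H") for outcome in sample_space}

print(X)  # {'HH': 2, 'HT': 1, 'TH': 1, 'TT': 0}
```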


There are three types of random variables: discrete random variables, continuous random variables, and mixed random variables.

Discrete random variables are random variables whose range is a countable set. A countable set can be either a finite set or a countably infinite set. For instance, in the above example, X is a discrete random variable as its range is a finite set ({0, 1, 2}).

Continuous random variables, on the contrary, have a range in the form of some interval, bounded or unbounded, of the real line. E.g., let Y be a random variable equal to the height of different people in a given population set. Since people can have different measures of height (not limited to just natural numbers or any countable set), Y is a continuous variable (in fact, the distribution of Y follows a normal/Gaussian distribution on most occasions).

Lastly, mixed random variables are ones that are a mixture of both continuous and discrete variables. These variables are more complicated than the other two. Hence, they are explained at the end of this article.

When we express the likelihood of values in a random variable’s range, we essentially discuss the probability distribution of that random variable. This means calculating the probability for each value within the variable’s range. Probability distribution descriptions differ slightly for discrete and continuous random variables.

For discrete variables, the term ‘Probability mass function (PMF)’ is used to describe their distributions. Using the example of coin tosses, as discussed above, we calculate the probability of X taking the values 0, 1 and 2 as follows:

We use the notation P_{X}(x) to refer to the PMF of the random variable X. The distribution is shown as follows:

| x | 0 | 1 | 2 |
|---|---|---|---|
| P_{X}(x) | 1/4 | 1/2 | 1/4 |

The table can also be graphically demonstrated:
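To make the distribution concrete, here is a small Python sketch that derives the PMF directly from the sample space (the values 1/4, 1/2, 1/4 match the two-coin example above):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the sample space and count how many outcomes map to each value of X
outcomes = ["".join(p) for p in product("HT", repeat=2)]
counts = Counter(o.count("H") for o in outcomes)

# P_X(x) = (number of outcomes mapped to x) / (total number of outcomes)
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}

print(pmf)  # {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
```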

In general, if a random variable X has a countable range given by R_{x} = {x_1, x_2, x_3, …}, then we define the probability mass function as: P_{X}(x_k) = P(X = x_k), for k = 1, 2, 3, …

This also leads us to the general description of the distribution in tabular format:

Properties of probability mass function:

1) PMF can never be more than 1 or negative, i.e., 0 ≤ P_{X}(x) ≤ 1 for all x.

2) PMF must sum to one over the entire range set of a random variable, i.e., Σ_{x ∈ R_{x}} P_{X}(x) = 1.

3) For A, a subset of R_{x}, P(X ∈ A) = Σ_{x ∈ A} P_{X}(x).
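These three properties can be checked numerically; the sketch below uses the two-coin PMF from above:

```python
pmf = {0: 0.25, 1: 0.5, 2: 0.25}  # PMF from the two-coin example

# 1) each probability lies in [0, 1]
assert all(0 <= p <= 1 for p in pmf.values())

# 2) probabilities sum to one over the whole range
assert abs(sum(pmf.values()) - 1) < 1e-12

# 3) P(X in A) is the sum of the PMF over the subset A
A = {1, 2}                    # event: "at least one head"
p_A = sum(pmf[x] for x in A)
print(p_A)  # 0.75
```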

For continuous variables, the term ‘Probability density function (PDF)’ is used to describe their distributions. We’ll consider the example of the distribution of heights. Suppose we survey a group of 1000 people and measure the height of each person very precisely. The distribution of the heights can be shown by a density histogram as follows:

We have grouped the different heights in certain intervals. But let’s see what happens when we try to reduce the size of the histogram bins. In other words, we make the grouping intervals smaller and smaller.

Going further, we reduce the bin size to such an extent that every observation tends to have its own bin. We are essentially constructing extremely tiny rectangles that, connected together by a smooth curve, give us the following distribution:

And that’s it! We have got the probability distribution of heights for our sample population set. But how is probability related to all of this? Observe the y-axis. It shows density, which indicates the proportion of the population having a particular range of height. The probability that a randomly chosen person from the population has a height within a given interval corresponds to this proportion. That sounds more probabilistic!
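A rough Python sketch of this idea, using a hypothetical sample of 1000 heights (mean 170 cm, standard deviation 10 cm, chosen purely for illustration), shows that the density bars always enclose a total area of 1, just like a PDF:

```python
import random

random.seed(0)
# hypothetical height sample, for illustration only
heights = [random.gauss(170, 10) for _ in range(1000)]

lo, hi, bins = 130, 210, 16
width = (hi - lo) / bins
counts = [0] * bins
for h in heights:
    i = min(max(int((h - lo) / width), 0), bins - 1)  # clamp into the bin range
    counts[i] += 1

# density = proportion per unit height, so the bar areas sum to 1 like a PDF
density = [c / (len(heights) * width) for c in counts]
total_area = sum(d * width for d in density)
print(round(total_area, 6))  # 1.0
```

Shrinking `width` toward zero is exactly the limiting process described above: the histogram converges to the smooth density curve.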

We use the notation f_{X}(x) to refer to the PDF of random variable X. The PMF and PDF are analogous; we just replace summation with integration to account for the continuous behaviour.

Properties of probability density function:

1) PDF can never be negative, i.e., f_{X}(x) ≥ 0 for all x.

2) PDF must integrate to one over the entire range of a random variable, i.e., ∫_{R_{x}} f_{X}(x) dx = 1.

3) For A, a subset of R_{x}, P(X ∈ A) = ∫_{A} f_{X}(x) dx.

More specifically, if A = [a, b], then P(a ≤ X ≤ b) = ∫_{a}^{b} f_{X}(x) dx.

Graphically, the probability that a continuous random variable X takes a value within a given interval is the area below the PDF of X over that interval. For instance, in the above example, if we wish to determine the probability that a randomly selected person from the population has a height between 65 cm and 75 cm, we calculate the purple area (using definite integration):
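Assuming, purely for illustration, that the heights follow a normal distribution with mean 70 and standard deviation 5 (hypothetical parameters, not from the article), this probability can be computed from the normal CDF via the error function:

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, expressed through the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# hypothetical parameters chosen only for illustration
mu, sigma = 70, 5
p = normal_cdf(75, mu, sigma) - normal_cdf(65, mu, sigma)
print(round(p, 4))  # 0.6827 -- the familiar one-sigma interval
```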

Note: Unlike PMF, PDF can take a value greater than 1. This is because of a difference in their interpretation. In the case of PMF, the value of the function for a particular x has the same interpretation as probability, making its value restricted to the [0, 1] interval. However, in PDF, the value does not translate to probability. In fact, P(X = x) = 0, if X is a continuous variable (it’s like calculating area under the PDF curve, just below a point).
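A quick sketch makes this concrete: the Uniform(0, 0.5) distribution has a density of 2 everywhere on its support, yet its total area is still 1:

```python
def uniform_pdf(x, a=0.0, b=0.5):
    """PDF of Uniform(a, b): constant 1/(b-a) inside the interval, 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

# the density is 2.0 on [0, 0.5] -- greater than 1 ...
print(uniform_pdf(0.25))  # 2.0

# ... yet the area under the curve is still 1 (midpoint Riemann sum)
n = 100_000
area = sum(uniform_pdf((i + 0.5) / n) * (1 / n) for i in range(n))
print(round(area, 6))  # 1.0
```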

Sometimes, it’s easier to have the distribution of a random variable expressed in an alternative way. Cumulative distribution functions (CDF) are one such way. Cumulate means to gather or sum up, and CDFs do the same. A useful property of CDFs is that they are defined in the same way for both discrete and continuous variables. A CDF shows the probability that a random variable X takes a value less than or equal to x. Mathematically, a CDF is defined as follows: F_{X}(x) = P(X ≤ x)

Let’s consider both the discrete and the continuous case.

The CDF of discrete random variables resembles a staircase, a graph with many jumps. We’ll again use the coin toss example. The following PMF was obtained:

We’ll now calculate the CDF of X for different values of x:

Hence, we can define F_{X}(x) as follows:

The CDF can also be shown graphically as follows:

For discrete variables, we can define the following relation between PMF and CDF: F_{X}(x) = Σ_{x_k ≤ x} P_{X}(x_k)
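A small sketch of this relation, again with the two-coin PMF: summing the PMF over all values k ≤ x reproduces the staircase:

```python
pmf = {0: 0.25, 1: 0.5, 2: 0.25}  # PMF from the two-coin example

def cdf(x):
    # F_X(x) = sum of P_X(k) over all k <= x
    return sum(p for k, p in pmf.items() if k <= x)

# the staircase: flat between the jumps, with a jump of size P_X(k) at each k
print([cdf(v) for v in (-1, 0, 0.5, 1, 2, 3)])
# [0, 0.25, 0.25, 0.75, 1.0, 1.0]
```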

The CDF of continuous random variables is a smooth curve, unlike the staircase-like CDF of discrete variables. Just as previously done, we sum up (or rather, integrate) the PDF to get the CDF. For the example of the height, the following CDF has been made:

For continuous variables, we can define the following relation between PDF and CDF: F_{X}(x) = ∫_{-∞}^{x} f_{X}(t) dt, or equivalently, f_{X}(x) = dF_{X}(x)/dx

Properties of cumulative distribution function:

1) CDF is always a non-decreasing function.

2) CDF always lies between 0 and 1.

3) For all a < b, P(a < X ≤ b) = F_{X}(b) − F_{X}(a).
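These properties can be verified numerically; the sketch below uses the standard normal CDF (computed via the error function) as an example:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

xs = [i / 10 for i in range(-50, 51)]
Fs = [normal_cdf(x) for x in xs]

assert all(a <= b for a, b in zip(Fs, Fs[1:]))  # 1) non-decreasing
assert all(0 <= f <= 1 for f in Fs)             # 2) bounded in [0, 1]

a, b = -1.0, 1.0
p_interval = normal_cdf(b) - normal_cdf(a)      # 3) P(a < X <= b) = F(b) - F(a)
print(round(p_interval, 4))  # 0.6827
```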

Many times, it’s handy to use just a few numbers to express the distribution of a random variable. There are many such numbers, the most common of which are expectation and variance.

The expectation (also known as mean) of a random variable X is its weighted average. For discrete random variables, the expectation is calculated using the following equation: E[X] = Σ_{x ∈ R_{x}} x P_{X}(x)

For continuous variables, once again, we replace summation with integration to arrive at the following equation: E[X] = ∫_{R_{x}} x f_{X}(x) dx
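Both forms can be sketched in a few lines, using the two-coin PMF for the discrete case and a midpoint Riemann sum for a Uniform(0, 1) variable in the continuous case:

```python
# Discrete case: E[X] = sum of x * P_X(x), with the two-coin PMF
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
E_X = sum(x * p for x, p in pmf.items())
print(E_X)  # 1.0

# Continuous case: E[U] = integral of x * f(x) dx, approximated by a
# midpoint Riemann sum for U ~ Uniform(0, 1), where f(x) = 1 on [0, 1]
n = 10_000
E_U = sum(((i + 0.5) / n) * 1.0 * (1 / n) for i in range(n))
print(round(E_U, 6))  # 0.5
```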

Basic properties of expectation of random variables:

1) The expectation of a constant is the constant itself.

2) The expectation of the sum of two random variables is equal to the sum of their expectations.

3) If Y = aX + b, then the expectation of Y is calculated as: E[Y] = aE[X] + b
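A quick numerical check of this linearity property, with a hypothetical Y = 3X + 2 built on the two-coin PMF:

```python
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
E_X = sum(x * p for x, p in pmf.items())            # E[X] = 1.0

a, b = 3, 2                                          # hypothetical Y = 3X + 2
E_Y = sum((a * x + b) * p for x, p in pmf.items())

print(E_Y, a * E_X + b)  # both 5.0: E[aX + b] = a*E[X] + b
```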

The variance of a random variable X is the expected value of the square of the deviation of different values of X from the expectation of X. It shows how spread out the distribution of a random variable is. It is generally represented as: Var(X) = E[(X − E[X])^2]

However, a more useful expression for variance is obtained after simplifying the above equation: Var(X) = E[X^2] − (E[X])^2, where E[X^2] is the expectation of the square of X, computed with the same summation (or integration) as before.
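The two forms of the variance can be checked against each other with the two-coin PMF:

```python
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
E_X = sum(x * p for x, p in pmf.items())

# definitional form: Var(X) = E[(X - E[X])^2]
var_def = sum((x - E_X) ** 2 * p for x, p in pmf.items())

# simplified form: Var(X) = E[X^2] - (E[X])^2
E_X2 = sum(x ** 2 * p for x, p in pmf.items())
var_simpl = E_X2 - E_X ** 2

print(var_def, var_simpl)  # both 0.5
```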

Basic properties of variance of random variables:

1) The variance of a constant is zero.

2) For two random variables X & Y, the variance of their sum is expressed as follows: Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)

Cov(X, Y) is called the covariance of X & Y. Covariance describes the relationship between two variables. It can be defined by the following equation: Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

3) If Y = aX + b, then the variance of Y is defined as: Var(Y) = a^2 Var(X)
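A numerical check of this scaling property, again with a hypothetical Y = 3X + 2 on the two-coin PMF:

```python
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
E = lambda g: sum(g(x) * p for x, p in pmf.items())   # expectation of g(X)

var_X = E(lambda x: x ** 2) - E(lambda x: x) ** 2     # 0.5
a, b = 3, 2                                            # hypothetical Y = 3X + 2
var_Y = E(lambda x: (a * x + b) ** 2) - E(lambda x: a * x + b) ** 2

# the shift b drops out; only the scale factor a matters
print(var_Y, a ** 2 * var_X)  # both 4.5: Var(aX + b) = a^2 * Var(X)
```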

Mixed random variables have a discrete part (where the range of the variable is a countable set), and a continuous part (where the range of the variable takes the form of an interval of the real line). A mixed random variable Z can be shown as follows:

The CDF of the mixed random variable Z can be found by calculating the weighted average of its components:

The expectation of Z can also be calculated by using the above methodology:
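As a sketch, consider a hypothetical mixed variable Z (not from the article) that equals 0 with probability 1/2 (discrete part) and is otherwise drawn from Uniform(0, 1) (continuous part); its CDF and expectation are the weighted averages of the components:

```python
w = 0.5  # weight of the discrete component

def F_discrete(z):          # CDF of the point mass at 0
    return 1.0 if z >= 0 else 0.0

def F_uniform(z):           # CDF of Uniform(0, 1)
    return min(max(z, 0.0), 1.0)

def F_Z(z):                 # weighted average of the component CDFs
    return w * F_discrete(z) + (1 - w) * F_uniform(z)

# the same weighting gives the expectation: E[Z] = w*0 + (1-w)*0.5
E_Z = w * 0 + (1 - w) * 0.5
print(F_Z(-1), F_Z(0), F_Z(1), E_Z)  # 0.0 0.5 1.0 0.25
```

Note the jump of size w = 1/2 in F_Z at z = 0: that is the discrete part showing through the otherwise continuous CDF.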

In conclusion, understanding random variables and their distributions is fundamental to unraveling the mysteries of probability and statistics. We’ve taken a journey through the basics, delving into the significance of these concepts. Whether navigating data analysis, research, or simply enhancing your statistical literacy, a solid grasp of random variables and distributions empowers you to interpret, analyze, and draw meaningful insights from a world of uncertainty. Embrace the power of probability, and let these concepts guide you in making sense of the randomness surrounding us.

Q1. What is a random variable?

A. A random variable is a numerical outcome of a random phenomenon, representing different values based on chance, like the result of a coin flip.

Q2. Are there different types of random variables?

A. Yes, random variables can be categorized into three types: discrete, continuous, and mixed, depending on their possible values.

Q3. What is the difference between a variable and a random variable?

A. A variable is any quantity that can vary, while a random variable specifically represents uncertain outcomes in probability experiments.

Q4. How do you identify a random variable?

A. Identify characteristics with varying outcomes in a probability scenario; if results are uncertain and can be expressed numerically, it’s a random variable.

*The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.*
