# All about Statistical Modeling

*This article was published as a part of the Data Science Blogathon.*

**What is a Statistical Model?**

“Modeling is an art as well as a science and is directed toward finding a good approximating model … as the basis for statistical inference” – Burnham & Anderson

A statistical model is a **type of mathematical model** that comprises the **assumptions** made to describe the data generation process.

Let us focus on the two highlighted terms above:

- Type of mathematical model? A statistical model is non-deterministic, unlike other mathematical models in which variables have specific values. Variables in statistical models are stochastic, i.e. they have probability distributions.
- Assumptions? How do those assumptions help us understand the properties or characteristics of the true data? Simply put, these assumptions make it possible to calculate the probability of an event.

Here is an example to better understand the role of statistical assumptions in data modeling:

**Assumption 1:** We have 2 fair dice, and each face has an equal probability of showing up, i.e. 1/6. We can then calculate the probability of both dice showing 5 as 1/6 × 1/6 = 1/36. Because we can calculate the probability of every event, this assumption constitutes a statistical model.

**Assumption 2:** The dice are weighted, and all we know is that the probability of face 5 is 1/8, so the probability of both dice showing 5 is 1/8 × 1/8 = 1/64. But we do not know the probabilities of the other faces, so we cannot calculate the probability of every event. Hence this assumption does not constitute a statistical model.
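The fair-dice case can be sketched in a few lines of Python. This is only an illustration: the helper names `joint_distribution` and `probability` are made up for this example, and the two rolls are assumed to be independent.

```python
from fractions import Fraction
from itertools import product

# Assumption 1: each face of a fair die has probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

def joint_distribution(die):
    """Probability of every ordered outcome of two independent rolls of `die`."""
    return {(a, b): die[a] * die[b] for a, b in product(die, repeat=2)}

def probability(event, joint):
    """Sum the probabilities of all outcomes for which `event` is true."""
    return sum(p for outcome, p in joint.items() if event(outcome))

joint = joint_distribution(fair_die)
p_both_five = probability(lambda o: o == (5, 5), joint)
p_sum_seven = probability(lambda o: sum(o) == 7, joint)
print(p_both_five)  # 1/36
print(p_sum_seven)  # 1/6
```

Because every face has a known probability, any event over the two dice can be evaluated this way. With the weighted dice, the dictionary could not even be filled in, which is why that assumption does not give a statistical model.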

**Why do we need Statistical Modeling?**

The statistical model plays a fundamental role in carrying out statistical inference, which helps in making propositions about the unknown properties and characteristics of the population, as below:

**1) Estimation:**

It is the central idea behind Machine Learning, i.e. finding a number that estimates a parameter of the distribution.

Note that the estimator is a random variable in itself, whereas an estimate is a single number which gives us an idea of the distribution of the data generation process, for example, the mean and sigma of a Gaussian distribution.
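To make the estimator/estimate distinction concrete, here is a minimal sketch using only the Python standard library. The "true" parameters and the function name `estimate` are assumptions chosen for illustration; in practice the true values are unknown.

```python
import random
import statistics

random.seed(0)
TRUE_MU, TRUE_SIGMA = 5.0, 2.0  # hypothetical "true" parameters, unknown in practice

def estimate(sample):
    """The estimator: a rule mapping any sample to (mean, standard deviation)."""
    return statistics.fmean(sample), statistics.stdev(sample)

# Each fresh sample yields a different estimate, because the estimator
# is a function of random data and is therefore random itself.
for _ in range(3):
    sample = [random.gauss(TRUE_MU, TRUE_SIGMA) for _ in range(1000)]
    mu_hat, sigma_hat = estimate(sample)
    print(f"mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}")
```

The function `estimate` plays the role of the estimator, while each printed pair is one estimate.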

**2) Confidence Interval:**

It gives an error bar around the single estimate number, i.e. a range of values that signifies the confidence in an estimate arrived at on the basis of a number of samples. For example, estimate A is calculated from 100 samples and has a wider confidence interval, whereas estimate B is calculated from 10000 samples and thus has a narrower confidence interval.
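The effect of sample size on the interval width can be checked with a small simulation. This sketch assumes a Gaussian data source with made-up parameters and uses the normal approximation (z = 1.96) for a 95% interval; `mean_confidence_interval` is a hypothetical helper, not a library function.

```python
import math
import random
import statistics

random.seed(1)

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * std_err, mean + z * std_err

# Hypothetical data source: Gaussian with mean 10 and standard deviation 3.
sample_a = [random.gauss(10, 3) for _ in range(100)]
sample_b = [random.gauss(10, 3) for _ in range(10_000)]

low_a, high_a = mean_confidence_interval(sample_a)
low_b, high_b = mean_confidence_interval(sample_b)
width_a, width_b = high_a - low_a, high_b - low_b
print(f"A (n=100):   width {width_a:.3f}")
print(f"B (n=10000): width {width_b:.3f}")
```

Since the standard error shrinks with the square root of the sample size, 100 times more samples should give an interval roughly 10 times narrower.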

**3) Hypothesis Testing:**

It is a procedure for deciding whether the data provide enough statistical evidence to support a statement about the population.

Let’s further understand the need to perform statistical modeling with the help of an example below.

The objective is to understand the underlying distribution in order to calculate the probability that a randomly selected researcher would have written, let’s say, 3 research papers.

We have a discrete random variable with 8 (9-1, since the probabilities must sum to 1) parameters to learn, i.e. the probabilities of 0, 1, 2, … research papers. As the number of parameters to be estimated increases, so does the number of observations needed, but this is not the purpose of data modeling.

**So, we can reduce the number of unknowns from 8 parameters to only 1 parameter, lambda, simply by assuming that the data follows a Poisson distribution.**

Our assumption that the data follows a Poisson distribution might be a simplification of the real data generation process, but it is a good approximation.
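As a rough sketch of what the Poisson assumption buys us, the snippet below estimates the single parameter lambda and then computes the probability of 3 papers. The paper counts are invented purely for illustration and are not the data behind this article; the fact used is that the maximum-likelihood estimate of lambda is the sample mean.

```python
import math

# Hypothetical data: papers_written[i] is the paper count of researcher i.
papers_written = [0, 1, 1, 2, 0, 3, 1, 2, 4, 1, 0, 2, 1, 3, 2, 1, 0, 5, 2, 1]

# The maximum-likelihood estimate of lambda for a Poisson model is the sample mean.
lam = sum(papers_written) / len(papers_written)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

p_three = poisson_pmf(3, lam)
print(f"lambda_hat = {lam:.2f}, P(X = 3) = {p_three:.3f}")
```

Once lambda is estimated, the probability of any count follows from the same formula, so there is nothing else to learn from the data.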

**Types of modeling assumptions:**

Now that we understand the significance of statistical modeling, **let’s understand the types of modeling assumptions:**

1) **Parametric:** It assumes a finite set of parameters which captures everything about the data. If we know the parameter θ, which fully embodies the data generation process, then, given θ, predictions (x) are independent of the observed data (D).

2) **Non-parametric:** It assumes that no finite set of parameters can define the data distribution. The complexity of the model is unbounded and grows with the amount of data.

3) **Semi-parametric:** It’s a hybrid model whose assumptions lie between the parametric and non-parametric approaches. It consists of two components: structural (parametric) and random variation (non-parametric). The Cox proportional hazards model is a popular example of semi-parametric assumptions.
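The contrast between the first two types can be illustrated with a small sketch. It assumes a made-up Gaussian sample, uses a fitted normal distribution as the parametric model, and uses the empirical CDF as one simple example of a non-parametric estimate; `empirical_cdf` is a hypothetical helper written for this example.

```python
import bisect
import random
import statistics

random.seed(2)
data = sorted(random.gauss(0, 1) for _ in range(2000))  # hypothetical sample

# Parametric: the fitted model is fully described by two numbers (mean, sigma).
fitted = statistics.NormalDist.from_samples(data)
p_parametric = fitted.cdf(1.0)

# Non-parametric: the empirical CDF keeps every observation, so its
# complexity grows with the size of the data set.
def empirical_cdf(sorted_data, x):
    """Fraction of observations less than or equal to x."""
    return bisect.bisect_right(sorted_data, x) / len(sorted_data)

p_nonparametric = empirical_cdf(data, 1.0)
print(f"P(X <= 1): parametric {p_parametric:.3f}, non-parametric {p_nonparametric:.3f}")
```

After fitting, the parametric model can discard the data and keep only its two parameters, whereas the empirical CDF needs the whole sample to answer any query.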

**Definition of a statistical model: (S,P)**

**S:** Assume that we have a collection of n i.i.d. random variables X1, X2, X3, …, Xn obtained through a statistical experiment (the process of generating or collecting data). All these **random variables take values in some measurable sample space, which is denoted by S**.

**P:** It is the **set of probability distributions on S** that contains the distribution which is an approximate representation of our actual distribution.

Let’s internalize the concept of **sample space** before understanding how a statistical model for these distributions could be represented.

1) Bernoulli : {0,1}

2) Gaussian : (-∞, +∞)

**Now that we have seen the sample spaces of a few distribution families, let’s see how a statistical model is defined:**

1) Bernoulli: ({0,1}, {Ber(p) : p ∈ (0,1)})

2) Gaussian: ((-∞, +∞), {N(𝜇, 0.3) : 𝜇 ∈ R})

**Well-specified and Misspecified models:**

What is model specification? As per the Wikipedia definition:

**Model specification** consists of selecting an appropriate functional form for the model. For example, given “personal income” (y) together with “years of schooling” (s) and “on-the-job experience” (x), we might specify a functional relationship y = f(s, x) as follows:

ln y = ln y0 + ρs + β1x + β2x² + ε

where ε is the unexplained error term.

**Model Misspecification**: Has it ever happened to you that a model converges properly on simulated data, but the moment real data comes in, its robustness degrades and it no longer converges? This typically happens when the model you developed does not match the data, which is generally known as model misspecification. It can occur because the class of distributions assumed for modeling does not contain the unknown probability distribution p from which the sample is drawn, i.e. the true data generation process.
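A quick way to see misspecification in action is to fit a Poisson model to data that was not generated by a Poisson process. In this sketch the "true" process is a made-up discrete distribution chosen for illustration; the check relies on two properties of the Poisson distribution, namely that its variance equals its mean and that P(X = 0) = exp(-lambda).

```python
import math
import random
import statistics

random.seed(3)

# Hypothetical true process: most researchers write no papers, a few write many.
values = [0, 1, 2, 8]
weights = [0.6, 0.2, 0.1, 0.1]
data = random.choices(values, weights=weights, k=5000)

# Fit the (misspecified) Poisson model: lambda_hat is the sample mean.
lam = statistics.fmean(data)
variance = statistics.pvariance(data)

# A Poisson distribution has variance equal to its mean, and P(X = 0) = exp(-lambda).
observed_zero = data.count(0) / len(data)
predicted_zero = math.exp(-lam)
print(f"mean {lam:.2f} vs variance {variance:.2f}")
print(f"P(X = 0): observed {observed_zero:.2f}, Poisson fit {predicted_zero:.2f}")
```

A variance far above the mean, together with a large gap between the observed and predicted share of zeros, suggests that no value of lambda can reproduce this data, i.e. the true distribution lies outside the assumed family.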


I hope this article has given you an understanding of what a statistical model is, why we need such models, what role assumptions play, and how those assumptions can decide the goodness of our model.

*The true or actual distribution / data generation process referred to throughout this article implies that there exists a probability distribution which is induced by the process that generates the observed data.

**References:**

https://mc-stan.org/docs/2_22/stan-users-guide/well-specified-models.html

http://mlss.tuebingen.mpg.de/2015/slides/ghahramani/gp-neural-nets15.pdf

https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/course/
