Anshul Saini — July 19, 2021

This article was published as a part of the Data Science Blogathon

## Introduction

In this article, I will try to give you a basic understanding of the Linear Regression model: how the best fit line is chosen, what the error is and how it can be minimized, how to decide when to use this model, and so on.
Linear Regression, as the word “linear” suggests, is appropriate when there is a linear relationship among the variables. A linear relationship is a straight-line relationship between two variables.

Image Source: http://pubs.sciepub.com/

## Contents

1) What is Linear Regression?

2) Real-life examples of Regression

3) Different names of Linear Regression

4) What kind of relationship can a regression model show?

• Positive Relationship
• Negative Relationship
• No Relationship

5) Covariance and Correlation

6) Least Square Method

• Sum of squared Estimate of Errors
• Sum of squared Residuals
• Sum of squared Total

7) How do you do Linear Regression mathematically?

8) Interpretation of a dataset using Linear Regression

• Anova section
1. R-squared
2. Adjusted R-squared
3. F-statistic
• Variables section
1. Coefficient
2. Standard Error
3. t-statistic
4. p-value

9) Endnotes

## What is Linear Regression?

Before we jump into the details of linear regression, you might be asking yourself why we are looking at this algorithm.

Isn’t it a procedure we use in Statistics?

Yes, you are right.

Linear Regression was developed in the field of statistics, where it is used as a model for understanding the association between independent and dependent variables. It models the relationship between quantitative variables: the predictor variables are called independent variables, and the variable being predicted is called the dependent variable.

Suppose we want to predict the price of a house based on its Area, Garage Area, Land Contour, Utilities, etc. Here “price” is the dependent variable and “Area, Garage Area, Land Contour, Utilities” are the independent variables. It’s that easy.

## Real-life examples of Regression

### Example #1

Businesses frequently use Linear regression to comprehend the connection between advertising spending and revenue.

For example, they might fit a Linear regression model using advertising spend as the independent variable (predictor variable) and revenue as the response variable. The equation would take the following form:

revenue = β0 + β1 (ad. spending)

### Example #2

It can be used in the medical field to understand the relationship between drug dosage and the blood pressure of patients.

Researchers may administer different dosages of a specific medication to patients and observe how their blood pressure responds. They might fit a model using dosage as the independent variable and blood pressure as the dependent variable.

blood pressure = β0 + β1 (dosage)

### Example #3

Agricultural scientists frequently use Linear regression to see the impact of rainfall and fertilizer on the amount of fruits/vegetables yielded.

For example, scientists might apply different amounts of fertilizer to different fields, record the rainfall on each, and see how both affect crop yield. They might fit a multiple linear regression using rainfall and fertilizer as the predictor variables and crop yield as the dependent variable or response variable. The regression model would take the following form:

crop yield = β0 + β1 (rainfall) + β2(fertilizer)

## Different Names of Linear Regression

You may be confused by linear regression, multiple linear regression, polynomial regression, etc. There is a wide range of names ending in the word “regression” because Linear Regression has been around since 1805; it has been studied from every possible angle, and each angle has a different name. The main aim behind every regression model is to predict a value using some features.

When we have a single independent variable we call it Simple Linear Regression, and when there are two or more independent variables we call it Multiple Linear Regression, simple as that.

Linear regression is a linear model, which means it is only applied when we have a linear relationship among the variables. Now you might be wondering how we check the relationship between the variables. For this, we can make a pair plot or scatter plot of the variables, and from the graph we can decide if we can use Linear Regression or not.

## What kind of relationship can a Linear Regression show?

1. Positive Relationship – When the regression line between the two variables has an upward slope, the variables are said to be in a Positive Relationship. It means that if we increase the value of the independent variable (x) then we will see an increase in our dependent variable (y).

2. Negative Relationship – When the regression line between the two variables has a downward slope, the variables are said to be in a Negative Relationship. It means that if we increase the value of the independent variable (x) then we will see a decrease in our dependent variable (y).

3. No Relationship – If the best fit line is flat (not sloped) then we can say that there is no relationship among the variables. It means there will be no change in our dependent variable (y) by increasing or decreasing our independent variable (x) value.

Image Source: math.stackexchange.com

Now how do we know what kind of relationship these variables have? Well by using correlation or covariance we can see what type of relationship is there.

Covariance tells us the direction of the relationship between X and Y but it doesn’t tell us how positive or negative the relationship is. If the covariance value is negative then we can say that if our independent variable (X) increases then our dependent variable (Y) decreases and vice versa.

Correlation is a statistical measure that tells us the direction of the relationship as well as its strength (how strongly positively or negatively the variables are correlated). Correlation ranges from -1 to +1, inclusive. It is called a perfect correlation (exactly -1 or +1) if all the points fall on the best fit line, which is very unlikely in practice.
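To make the difference between the two measures concrete, here is a minimal sketch in plain Python. The `x` and `y` values are hypothetical, chosen only for illustration, and the helper names `mean`, `covariance` and `correlation` are my own. Covariance is computed with the usual sample (n - 1) denominator, and correlation is that covariance rescaled so it lands between -1 and +1.

```python
import math

# Hypothetical data, for illustration only.
x = [34.0, 108.0, 64.0, 88.0, 99.0, 51.0]
y = [5.0, 17.0, 11.0, 8.0, 14.0, 5.0]


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance: summed products of deviations over (n - 1)."""
    a_bar, b_bar = mean(a), mean(b)
    products = ((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return sum(products) / (len(a) - 1)


def correlation(a, b):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))


cov_xy = covariance(x, y)
r_xy = correlation(x, y)

# The sign of both values gives the direction; only r_xy gives the strength.
print(f"covariance  = {cov_xy:.3f}")
print(f"correlation = {r_xy:.3f}")
```

The covariance changes if we rescale `x` or `y` (for example, switching currency), while the correlation does not, which is why correlation is the one we use to talk about strength.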

## Least Square Method

The main idea behind the Linear Regression model is to fit the line that best fits the data, and for this we use a technique called the Least Square Method. In layman’s terms, the Least Square Method finds the best-fitting line or curve for a set of data points by minimizing the sum of the squared distances between the actual values and the predicted values. The distance between an actual value and its predicted value is known as the error or residual.

We all know the equation of a straight line right? It is y = a + bx.

Similarly, the equation of the best fit line for Linear Regression is:

Y = β0 + β1X + ε

And for Multiple Linear Regression, since we have more than one independent variable, the equation becomes:

Y = β0 + β1X1 + β2X2 + … + βnXn + ε

Where β0 is the Y-intercept of the regression line

β1 is the slope of the regression line

X is the explanatory variable

ε is the error term

Now the question that comes into mind is, what error is this? Can we visualize it? How do we find it? In a linear model or any model we don’t have to worry about the mathematical part, everything is done by the model itself.

Let’s interpret the graph above. In linear regression the best fit line will look somewhat like this; the only difference will be the number of data points. To make it easier I have used fewer data points.

Suppose there’s an observed value Yi. The distance between Yi and its predicted value ŷi is the error for that point; squaring these distances and adding them up over all points gives the “SUM OF SQUARED ESTIMATE OF ERRORS” (SSE). This is the unexplained variation, and we have to minimize it to get the best fit.

The squared distances between each predicted value ŷi and the mean ȳ of the dependent variable, added up over all points, give what this article calls the “SUM OF SQUARED RESIDUALS” (SSR). Many textbooks call this the regression (or explained) sum of squares. It is the variation explained by our model, and we want it to be as large as possible.

The total variation in the data is called the “SUM OF SQUARED TOTAL” (SST), and SST = SSR + SSE.
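The three quantities above can be checked numerically. The sketch below uses a small hypothetical dataset (not taken from the article), fits a least-squares line by hand, and then computes SSE, SSR and SST so you can see that the explained and unexplained parts add up to the total.

```python
# Hypothetical data, for illustration only.
x = [34.0, 108.0, 64.0, 88.0, 99.0, 51.0]
y = [5.0, 17.0, 11.0, 8.0, 14.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for a single predictor.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Predicted value for every observation.
y_hat = [intercept + slope * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained part
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained part
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation

print(f"SSE = {sse:.3f}")
print(f"SSR = {ssr:.3f}")
print(f"SST = {sst:.3f}")
print(f"SSR + SSE = {ssr + sse:.3f}")
```

Note that the identity SST = SSR + SSE holds for a least-squares line fitted with an intercept; for an arbitrary line through the data it generally does not.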

## How do you do Linear Regression?

Suppose we want to know to what degree the tip amount can be predicted by the bill amount. The tip is the dependent variable (response variable) and the bill is the independent variable (predictor variable).

To fit the best fit line we need to minimize the sum of squared errors, that is, the sum of the squared distances between the predicted values and the actual values.

### Step 1 – Check if there is a linear relationship between the variables

We know the equation of a line is y = mx + c, or y = x*β1 + β0. Let’s make a scatter plot and see if we can see any relationship between the variables. Remember that the best fit line always passes through the centroid, the point where x̄ and ȳ intersect.

We see that there is a positive relationship: as the bill amount increases, the tip amount increases too. Hence we can use our Linear regression model to predict the response variable.

### Step 2 – Check the correlation of the data

After plotting a scatter plot and seeing what type of relationship the data has, calculate the correlation to know how strong that relationship is. In this case the correlation comes out to be 0.866, which tells us that the relationship is very strong.

### Step 3 – Calculations

Now since we know that the relationship is positive and very strong, we can now start with our calculations.

The equation of best-fit line is : Ŷ = x*β1+β0

where β1 is the regression coefficient, or slope. To predict Ŷ we need to know this coefficient; it also tells us how much the dependent variable changes when we increase the independent variable by 1 unit. The formula for finding it is:

β1 = Σ (x-x̄)(y-ȳ) / Σ (x-x̄)^2

and β0, the constant term, is calculated by β0 = ȳ - β1*x̄

We get x̄ = 74 , ȳ = 10 , Σ (x-x̄)(y-ȳ) = 615 , Σ (x-x̄)^2 = 4206

Putting all the values into the formula for β1 we get β1 = 615/4206 = 0.1462

It means that if we increase the Bill by 1 unit, the tip amount will increase by 0.1462 units.

Similarly β0 = 10 - β1*74 = -0.8203 (using the unrounded slope). The intercept may or may not have any meaning in real life.

Hence the equation of the best fit line is Ŷ = 0.1462x – 0.8203
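The arithmetic in this step is easy to reproduce. The sketch below plugs the summary values quoted above (x̄ = 74, ȳ = 10, 615 and 4206) into the standard least-squares formulas, slope = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)^2 and intercept = ȳ - slope*x̄. The variable names and the `predict_tip` helper are my own, added for illustration.

```python
# Summary values quoted in the worked example.
x_bar = 74.0         # mean bill
y_bar = 10.0         # mean tip
sum_xy_dev = 615.0   # sum of (x - x_bar) * (y - y_bar)
sum_xx_dev = 4206.0  # sum of (x - x_bar) ** 2

# Standard least-squares estimates for one predictor.
beta1 = sum_xy_dev / sum_xx_dev
beta0 = y_bar - beta1 * x_bar


def predict_tip(bill):
    """Predicted tip for a given bill, using the fitted line."""
    return beta0 + beta1 * bill


print(f"beta1 (slope)     = {beta1:.4f}")
print(f"beta0 (intercept) = {beta0:.4f}")
print(f"predicted tip at the mean bill = {predict_tip(x_bar):.4f}")
```

The last line is a quick sanity check: because the best fit line passes through the centroid, the prediction at the mean bill should come back as the mean tip.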

We see that there is so much calculation here, this is why we use Python libraries to make our work easier. We don’t need to worry about the mathematical part but we need to know what’s going on under the hood.

In the next section we’ll interpret the results of a Linear regression on a Medical cost dataset. Here is the full EDA and Code.

## Interpretation of Linear Regression Result

To get this kind of result for your Linear regression model you can do it like this :

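```python
import statsmodels.api as sm

# x: predictor columns, y: target column (both prepared during the EDA)
X = sm.add_constant(x)  # statsmodels does not add an intercept by default

model = sm.OLS(y, X).fit()

print(model.summary())
```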

### Anova section

Coefficient of Determination (R squared) – R squared states the proportion of the variability in the dependent variable that is explained by our model. It is used as a measure of how well the model fits the data. R squared normally ranges from 0 to 1; it can be less than 0 only when the fitted line is worse than a flat line at the mean of the dependent variable. Here we can see that it is 0.751, which means that our model explains 75.1% of the variance in the data.

A good model has a high R squared value. But how high is high? As a rough rule of thumb (acceptable values vary by field):

R square > 80% implies the model is a good fit

60% < R square < 80% model is an okay fit

R square < 60% model needs improvement

If your R-Squared value is low, you may need to check your independent variables again and see if there are any outliers in them.

Adjusted R Squared – Every time we add a new input variable, R squared either increases or stays the same; it never decreases. So R squared is not a good quantity for deciding whether we should add a new input variable or not. Hence one more quantity, known as “Adjusted R squared”, is used.

It is a modified version of R squared that penalizes the number of predictors. It is most useful when we add irrelevant variables to our model: if we add variables that do not affect the target variable, the adjusted R squared value will tend to decrease even though R squared does not. It is never higher than R squared.

Usually, the value of R squared and adjusted R Squared is somewhat the same but if you see a large difference then you need to check out your independent variables again and see if there is any relationship between the target variable and the independent variable.

Adjusted R square = 1 - (1 - R square)(N - 1) / (N - p - 1)

Where R square = Coefficient of determination

N = Total sample size

p = number of predictors (independent variables) in our model
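Here is a small sketch of how adjusted R squared reacts to extra predictors, using the standard definition 1 - (1 - R²)(N - 1)/(N - p - 1). The R squared of 0.751 comes from the summary above, but the sample size of 100 and the predictor counts are hypothetical values picked only to show the effect, and the function name is my own.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R squared: 1 - (1 - R^2) * (N - 1) / (N - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors + 1")
    return 1 - (1 - r_squared) * (n_samples - 1) / dof


r2 = 0.751       # R squared from the summary
n_samples = 100  # hypothetical sample size

# Same R squared, more predictors: the penalty grows.
for p in (3, 6, 12):
    adj = adjusted_r_squared(r2, n_samples, p)
    print(f"p = {p:2d}  adjusted R squared = {adj:.4f}")
```

If a new variable does not raise R squared enough to offset the larger p, the adjusted value goes down, which is the signal that the variable is not pulling its weight.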

F-Statistic – The F-statistic tests whether at least one of the coefficients of our model is different from 0. If the overall F-test is significant, we can conclude that R-squared does not equal zero and that the relationship between the model and the dependent variable is statistically significant.

Now your question might be: when can the value of R square be 0? Only when the slope, that is, the coefficient of every variable, is equal to 0. If R square is 0, there is no benefit gained from doing this regression.

Suppose the only thing we know about a class is that the average weight is 60. What does this mean? It means our best prediction for any student is 60, the average itself. A regression whose fitted line is just this flat average line explains none of the variation, so in this case our R squared value will be 0.

The Null hypothesis here is (H0) : the model with no independent variables fits the data as well as our model

The Alternate hypothesis (H1) : our model fits the data better than the intercept only model

The p-value here helps us check whether there is enough evidence to support our Alternate hypothesis. If we are testing at a 95% confidence level (a 5% level of significance), then:

p-value < level of significance (in this case 5%), we reject H0

p-value > level of significance, we cannot reject H0
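As a rough sketch of where the overall F-statistic comes from, it can be written in terms of R squared as F = (R²/p) / ((1 - R²)/(N - p - 1)) for a model with an intercept. The code below uses the R squared of 0.751 from the summary, but the sample size and number of predictors are hypothetical, and both function names are my own. The p-value itself would come from the F distribution (the summary table reports it), so here it is simply passed in.

```python
def f_statistic(r_squared, n_samples, n_predictors):
    """Overall F-statistic of a regression with an intercept, from R squared."""
    explained = r_squared / n_predictors
    unexplained = (1 - r_squared) / (n_samples - n_predictors - 1)
    return explained / unexplained


def decide(p_value, alpha=0.05):
    """Decision rule from the text: reject H0 only if p-value < alpha."""
    return "reject H0" if p_value < alpha else "cannot reject H0"


# Hypothetical N and p; R squared taken from the summary.
f_value = f_statistic(r_squared=0.751, n_samples=100, n_predictors=3)
print(f"F-statistic = {f_value:.2f}")

# Illustrative p-values only, not taken from the article's table.
print(decide(0.001))
print(decide(0.20))
```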

### Variables Section

Let’s interpret this part and see what we can get from this portion. For a better understanding of Machine Learning, an individual must have a rudimentary knowledge of statistics.

Coefficient: The coefficient of a variable tells us that if we increase our independent variable by 1 unit while holding the other variables of the model constant (remember that Linear Regression assumes there is no multicollinearity in our model, meaning the independent variables have no collinearity between them), then our dependent variable will change by the value of the coefficient.

Here we can see that the coefficient of age is 257.4050, which means that if we increase age by 1 year, our target variable increases by 257.4050. For a negative coefficient, for example the coefficient of “region”, we would say that if we increase the value of the region by 1 unit then there’s a decrease of 353.4491 in our target variable.

Coefficients can also hint at how much a variable matters to our model. If a coefficient is close to zero, the variable has little effect on the target variable, though the size of a coefficient depends on the variable’s units, so significance is judged with the t-statistic and p-value described below.

Standard Error – To understand what standard error is, we first need to know about standard deviation. Standard deviation tells us how far the values vary from the mean, or how spread out the data is. For normally distributed data, about 95% of the values lie within 2 standard deviations of the mean.

In layman’s terms, suppose we have a population and from this population we draw several samples, let’s say 10. If we find the mean of each of these samples and look at how those means are distributed, the standard deviation of these sample means is what we call the standard error. It tells you how far the mean of any given sample is likely to be from the true population mean.

Here, in our regression model, the standard error gives the estimated standard deviation of the distribution of coefficients.

Confusing, right? Don’t worry, we’ll try to break this down.
We’ve previously determined that for every 1 unit increase in age, the target variable will go up by 257.4050. If we fitted the model again and again on new samples, the coefficient would vary somewhat from sample to sample. The standard error of 11.878 for age tells us that the coefficient of age typically varies by about 11.878 across such refits.

Just so you don’t need to scroll back up again, I will post the result again here 🙂

t-stat – The t-statistic, or t-value, whichever name you prefer, is calculated by dividing our coefficient by our standard error.

t-stat = Coefficient/Std.Error

The value of the t-statistic helps us determine whether the coefficient is really different from 0 or whether the value we see just happened by chance. If the magnitude of the t-statistic is greater than the tabulated (critical) value, we reject the null hypothesis, since the value falls in the rejection area.

The null hypothesis here is (H0): each of the coefficients at the population level is 0

The alternate hypothesis (H1): the coefficients are not 0 at population level

The higher the value of t-statistic in magnitude, the more significant the variable is.
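To see the formula in action, here is a minimal sketch using the coefficient and standard error of age quoted earlier (257.4050 and 11.878). The critical value of 1.96 is an assumption on my part: it is the approximate two-sided 5% cut-off for large samples, whereas the exact tabulated value depends on the degrees of freedom.

```python
coefficient = 257.4050  # coefficient of age from the summary
std_error = 11.878      # its standard error from the summary

# t-stat = Coefficient / Std.Error
t_stat = coefficient / std_error

# Assumed large-sample two-sided 5% critical value; the exact
# tabulated value depends on the degrees of freedom.
critical_value = 1.96

print(f"t-stat = {t_stat:.2f}")
if abs(t_stat) > critical_value:
    print("reject H0: the coefficient differs from 0")
else:
    print("cannot reject H0")
```

We compare the absolute value because a large negative t-statistic is just as strong evidence against the null hypothesis as a large positive one.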

P > |t| – The p-value is the probability of seeing a t-statistic at least this large in magnitude if the Null hypothesis were true. Here in this table, the p-value of age is approximately equal to 0. This tells us that it is very unlikely we would see such a result if there were no relationship between the independent variable and the dependent variable, so we conclude that the coefficient is not 0 at the population level.

Typically, a p-value of 5% or less is a good cut-off point. If the p-value is greater than 0.05 (p-value>0.05), we fail to reject the null hypothesis and say that there is no evidence of a relationship between the variable and the target variable. If it is less than 0.05 (p-value<0.05), we reject the null hypothesis and say that the coefficient is not equal to 0.

The larger the t-statistic in magnitude, the lower the p-value, and the stronger the evidence that the coefficient is significant and didn’t happen by chance.

## Endnotes

Today you learned how to interpret your Linear Regression model from scratch, congratulations! It’s always better to know what is happening under the hood: only when you know exactly what these terms mean will you be able to proceed and make good predictions.

To check out the full code please refer to my Github repository.

Thank you and have a nice day, Cheers!!

Hello, I am Anshul Saini from Uttar Pradesh. I am an undergraduate student currently in my last year majoring in Statistics (Bachelors of Statistics) and have a strong interest in the field of data science, machine learning, and artificial intelligence. I enjoy diving into data to discover trends and other valuable insights about the data. I am constantly learning and motivated to try new things.

I am open to collaboration and work.

For any doubt and queries, feel free to contact me on Email 