1) What is Linear Regression?
2) Real-life examples of Regression
3) Different names of Linear Regression
4) What kind of relationship can a regression model show?
- Positive Relationship
- Negative Relationship
- No Relationship
5) Covariance and Correlation
6) Least Square Method
- Sum of squared Estimate of Errors
- Sum of squared Residuals
- Sum of squared Total
7) How do you do Linear Regression mathematically?
8) Interpretation of a dataset using Linear Regression
- Anova section
- Adj. R-squared
- Variables section
- Standard Error
What is Linear Regression?
Before we jump into the details of linear regression, you might be asking yourself why we are looking at this algorithm.
Isn’t it a procedure we use in Statistics?
Yes, you are right.
Linear Regression originated in the field of statistics, where it is used as a model for understanding the association between independent and dependent variables. It describes the relationship between quantitative variables: the variables used to make the prediction are called independent (predictor) variables, and the variable being predicted is called the dependent variable.
Suppose we want to predict the price of a house based on its Area, Garage Area, Land Contour, Utilities, etc. Here “price” is the dependent variable and “Area, Garage Area, Land Contour, Utilities” are the independent variables. It’s that easy.
Real-life examples of Regression
revenue = β0 + β1 (ad. spending)
blood pressure = β0 + β1 (dosage)
crop yield = β0 + β1 (rainfall) + β2(fertilizer)
Different Names of Linear Regression
You may be confused by linear regression, multiple linear regression, polynomial regression, and so on. There is a wide range of names ending in the word “regression” because Linear Regression has been around since 1805; it has been studied from every possible angle, and each angle has its own name. The main aim of every regression model is to predict a value using some features.
When we have a single independent variable we call it Simple Linear Regression, and when there are two or more independent variables we call it Multiple Linear Regression. Simple as that.
Linear regression is a linear model, which means it is appropriate only when the relationship between the independent variables and the dependent variable is approximately linear. Now you might be wondering how we check that relationship. We can make a pair plot or scatter plot of the variables and decide from the graph whether Linear Regression is suitable.
What kind of relationship can a Linear Regression show?
1. Positive Relationship – When the regression line between the two variables has an upward slope, the variables are said to be in a Positive Relationship. It means that if we increase the value of the independent variable (x), we will see an increase in our dependent variable (y).
2. Negative Relationship – When the regression line between the two variables has a downward slope, the variables are said to be in a Negative Relationship. It means that if we increase the value of the independent variable (x), we will see a decrease in our dependent variable (y).
3. No Relationship – If the best fit line is flat (not sloped), we can say that there is no linear relationship between the variables. It means that increasing or decreasing the independent variable (x) produces no systematic change in the dependent variable (y).
Now how do we know what kind of relationship these variables have? By using correlation or covariance we can see what type of relationship there is.
Covariance tells us the direction of the relationship between X and Y, but it doesn’t tell us how strong that relationship is, because its size depends on the units of X and Y. If the covariance is negative, then as our independent variable (X) increases our dependent variable (Y) tends to decrease, and vice versa. If it is positive, the two tend to move together.
Correlation is a statistical measure that tells us the direction of the relationship as well as its strength (how strongly positive or how strongly negative the relationship is). Correlation ranges from -1 to +1, inclusive (-1 ≤ correlation ≤ +1). A correlation of exactly -1 or +1 is called perfect correlation; it occurs only if all the points fall on the best fit line (which is very unlikely).
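To make the difference between covariance and correlation concrete, here is a minimal sketch in plain Python. The helper names `covariance` and `correlation` and the small data lists are made up for illustration; they are not from the article's dataset.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]            # rises with x
y_down = [10, 8, 6, 4, 2]          # falls as x rises
y_scaled = [100 * v for v in y_up] # same pattern as y_up, different units

print("up:    ", covariance(x, y_up), correlation(x, y_up))
print("down:  ", covariance(x, y_down), correlation(x, y_down))
print("scaled:", covariance(x, y_scaled), correlation(x, y_scaled))
```

The sign of the covariance flips between `y_up` and `y_down`, while rescaling y changes the covariance but leaves the correlation unchanged, which is why correlation is the one to read for strength.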
Least Square Method
The main idea behind the Linear Regression model is to fit the line that best fits the data, and for this we use a technique called the Least Square Method. In layman’s terms, the Least Square Method is the process of fitting the best curve to a set of data points by minimizing the sum of the squared distances between the actual values and the predicted values. The distance between an actual value and its predicted value is known as the Error or Residual.
We all know the equation of a straight line right? It is y = a + bx.
Similarly, the equation of the best fit line for Linear Regression is:
Ŷi = β0 + β1Xi
And for Multiple Linear Regression, since we have more than one independent variable, the equation becomes:
Ŷi = β0 + β1X1i + β2X2i + … + βpXpi
Where β0 is the Y-intercept of the regression line
β1 is the slope of the regression line (β1 to βp are the slopes for X1 to Xp in the multiple case)
Xi is the explanatory variable
Now the question that comes to mind is: what is this error? Can we visualize it? How do we find it? In practice we don’t have to do the mathematics by hand, because the fitting routine does it for us, but it helps to see what is being computed.
Let’s interpret the graph above. In linear regression the best fit line will look somewhat like this; the only difference will be the number of data points. To make it easier I have used fewer data points.
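Suppose there’s an observed value Yi. The distance between this Yi and its predicted value ŷi is the error for that point. Squaring these distances and adding them up over all points gives the “SUM OF SQUARED ESTIMATE OF ERRORS” (SSE) = Σ (Yi - ŷi)^2. This is the unexplained variation, and the best fit line is the one that minimizes it.
The distance between the predicted value ŷi and the mean of the dependent variable ȳ, squared and summed over all points, is what this article calls the “SUM OF SQUARED RESIDUALS” (SSR) = Σ (ŷi - ȳ)^2. Many textbooks call it the regression sum of squares and reserve the word “residual” for the errors in SSE. This is the variation explained by our model, and we want it to be as large a share of the total as possible.
The total variation in the dependent variable is called the “SUM OF SQUARED TOTAL” (SST) = Σ (Yi - ȳ)^2, and SST = SSR + SSE.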
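The three sums of squares can be checked numerically. The sketch below fits a least squares line to a tiny made-up dataset (the numbers and the helper name `fit_line` are hypothetical, chosen only to keep the arithmetic easy) and confirms that SSR + SSE equals SST.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]   # predicted values on the fitted line
y_bar = sum(y) / len(y)

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation

print("SSE =", sse, "SSR =", ssr, "SST =", sst)
print("SSR + SSE =", ssr + sse)
print("R squared = SSR / SST =", ssr / sst)
```

The identity SST = SSR + SSE holds for a least squares line fitted with an intercept; it is not guaranteed for an arbitrary line drawn through the data.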
How do you do Linear Regression?
Suppose we want to know how well the tip amount can be predicted from the bill amount. The tip is the dependent variable (response variable) and the bill is the independent variable (predictor variable).
To fit the best fit line we need to minimize the sum of squared errors, that is, the sum of the squared distances between the predicted values and the actual values.
Step 1 – Check if there is a linear relationship between the variables
We know the equation of a line is y = mx + c, or y = x*β1 + β0. Let’s make a scatter plot and see whether there is any relationship between the variables. Always remember that the best fit line passes through the centroid, which is the point (x̄, ȳ).
We see that there is a positive relationship: as the bill amount increases, the tip amount increases too. Hence we can use our Linear Regression model to predict the response variable.
Step 2 – Check the correlation of the data
After plotting a scatter plot and seeing what type of relationship it shows, calculate the correlation to measure how strong that relationship is. In this case the correlation comes out to be 0.866, which tells us that the relationship is strong.
Step 3 – Calculations
Since we know that the relationship is positive and strong, we can start with our calculations.
The equation of the best-fit line is: Ŷ = x*β1 + β0
where β1 is the coefficient of regression, or slope. To predict Ŷ we need to know this coefficient, which also tells us how much the dependent variable changes when we increase the independent variable by 1 unit. The formula for finding it is:
β1 = Σ (x-x̄)(y-ȳ) / Σ (x-x̄)^2
and the constant term β0 is calculated by β0 = ȳ - x̄*β1
We get x̄ = 74, ȳ = 10, Σ (x-x̄)(y-ȳ) = 615, Σ (x-x̄)^2 = 4206
Putting these values into the formula for β1 we get β1 = 615 / 4206 ≈ 0.1462
It means that if we increase the bill by 1 unit, the predicted tip increases by 0.1462 units
Similarly, β0 = ȳ - x̄*β1 = 10 - 74 × (615 / 4206) ≈ -0.8203. The intercept may or may not have any meaning in real life
Hence the equation of the best fit line is Ŷ = 0.1462x – 0.8203
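The arithmetic above can be reproduced in a few lines of Python. This sketch uses only the summary values quoted in the article (x̄, ȳ and the two sums); the function name `predict_tip` is a hypothetical helper added for illustration.

```python
# Summary values quoted in the article
x_bar = 74      # mean bill
y_bar = 10      # mean tip
s_xy = 615      # sum of (x - x_bar) * (y - y_bar)
s_xx = 4206     # sum of (x - x_bar) ** 2

beta1 = s_xy / s_xx              # slope
beta0 = y_bar - x_bar * beta1    # intercept, uses the unrounded slope

def predict_tip(bill):
    """Predicted tip for a given bill amount using the fitted line."""
    return beta0 + beta1 * bill

print("slope     =", round(beta1, 4))
print("intercept =", round(beta0, 4))
print("prediction at the mean bill =", predict_tip(x_bar))
```

Predicting at the mean bill returns the mean tip (up to floating point error), which is the "line passes through the centroid" property in action.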
As you can see, there is a lot of calculation here, which is why we use Python libraries to make our work easier. We don’t need to do the mathematics by hand, but we do need to know what’s going on under the hood.
Interpretation of Linear Regression Result
To get this kind of result for your Linear regression model you can do it like this :
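import statsmodels.api as sm

X = sm.add_constant(x)        # add a column of ones so the model has an intercept
model = sm.OLS(y, X).fit()    # ordinary least squares: response first, then predictors
print(model.summary())        # print the regression results table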
Coefficient of Determination (R squared) – R squared states the proportion of the variability in the dependent variable that is explained by our model. It is used as a measure of how well the model fits the data. For a least squares fit with an intercept, R squared ranges from 0 to 1; it can be less than 0 only when the model fits worse than a horizontal line at the mean of the dependent variable (for example, when a model is fitted without an intercept or is evaluated on new data). Here we can see that it is 0.751, which means that our model explains 75.1% of the variance in the data.
A good model has a high R squared value. But how high is high? As a rough rule of thumb (the right threshold depends heavily on the field and the data):
R square > 80% implies the model is a good fit
60% < R square < 80% implies the model is an okay fit
R square < 60% implies the model needs improvement
If your R squared value is low, you may need to check your independent variables again and see whether there are any outliers in them.
Adjusted R Squared – Every time we add a new input variable, R squared either increases or stays the same; it never decreases, even if the new variable is useless. So R squared is not a good quantity for deciding whether we should add a new input variable. Hence one more quantity, known as “Adjusted R squared”, is used:
Adjusted R squared = 1 - (1 - R squared) × (N - 1) / (N - p - 1)
Where R squared = coefficient of determination
N = total sample size
p = number of predictors (independent variables) in our model
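As a quick illustration of how the penalty for extra predictors works, here is a small sketch of the usual adjusted R squared formula, 1 - (1 - R squared)(N - 1)/(N - p - 1). The function name and the sample size and predictor counts below are assumptions for illustration; only the R squared value of 0.751 comes from the article.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

r2 = 0.751   # R squared reported in the article
n = 100      # hypothetical sample size

# The same R squared looks less impressive as the number of predictors grows
for p in (1, 3, 10, 30):
    print(p, "predictors ->", round(adjusted_r_squared(r2, n, p), 4))
```

With R squared held fixed, each additional predictor lowers the adjusted value, so a new variable only raises adjusted R squared if it improves the fit by more than the penalty costs.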
F-Statistic – The F-statistic tests whether at least one of the coefficients of our model is different from 0. If the overall F-test is significant, we can conclude that R squared is not equal to zero and that the relationship between the model and the dependent variable is statistically significant.
Now your question might be: when can the value of R squared be 0? R squared is 0 when the slope, that is, the coefficient of every independent variable, is equal to 0. The fitted line is then just a horizontal line at the mean of the dependent variable, which means there is no benefit gained from doing this regression.
Suppose we have only one student in a class and the average weight in the class is 60. What does this mean? The average weight is simply the weight of that student, so the mean already describes the data completely and a regression has nothing further to explain.
The Null hypothesis here is (H0) : the model with no independent variables fits the data as well as our model
The Alternate hypothesis (H1) : our model fits the data better than the intercept only model
The p-value here helps us check whether there is enough evidence to reject the null hypothesis in favour of our alternate hypothesis. If we are testing at a 95% confidence level (a 5% level of significance), then:
If p-value < level of significance (in this case 5%), we reject H0
If p-value > level of significance, we cannot reject H0
Let’s interpret this part and see what we can get from this portion. For a better understanding of Machine Learning, an individual must have a rudimentary knowledge of statistics.
Coefficient: The coefficient of a variable tells us how much our dependent variable changes if we increase that independent variable by 1 unit while holding the other variables of the model constant. (Remember that Linear Regression assumes there is no multicollinearity in our model, which means the independent variables are not strongly correlated with each other.)
Here we can see that the coefficient of age is 257.4050, which means that if we increase age by 1 year, our target variable increases by 257.4050. If the coefficient is negative, for example the coefficient of “region”, then we would say that increasing the value of region by 1 unit decreases our target variable by 353.4491.
Coefficients can also hint at how important a variable is for our model, but their size depends on the units of the variable. A coefficient close to zero suggests a weak relationship with the target variable only when it is also small relative to its standard error, which is what the t-statistic and p-value measure.
Standard Error – To understand what standard error is, we first need to know about standard deviation. Standard deviation tells us how far the values vary from the mean, or how spread out the data is. For roughly normally distributed data, about 95% of the values lie within 2 standard deviations of the mean.
In layman’s terms: suppose we have a population, and from this population we draw several samples, let’s say 10. If we compute the mean of each sample and plot those means, the standard deviation of these sample means is what we call the standard error. It tells you how close the mean of any given sample is likely to be to the true population mean.
Here, in our regression model, the standard error is the estimated standard deviation of the sampling distribution of each coefficient.
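The "standard deviation of the sample means" idea can be simulated directly. The sketch below is an illustration under assumed settings: a normal population with mean 50 and standard deviation 10, samples of size 25, and 2000 repeated samples. None of these numbers come from the article.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

POP_MEAN, POP_SD = 50.0, 10.0   # hypothetical population
SAMPLE_SIZE = 25                # observations per sample
N_SAMPLES = 2000                # how many samples we draw

# Draw many samples and record the mean of each one
sample_means = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.fmean(sample))

# Standard error seen in the simulation vs. the textbook value sd / sqrt(n)
empirical_se = statistics.stdev(sample_means)
theoretical_se = POP_SD / SAMPLE_SIZE ** 0.5

print("empirical standard error:  ", empirical_se)
print("theoretical standard error:", theoretical_se)
```

The two numbers should land close to each other, and both shrink as `SAMPLE_SIZE` grows. The standard error of a regression coefficient plays the same role for the coefficient estimate that this quantity plays for the sample mean.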
t-stat – The t-statistic, or t-value, whichever you prefer, is calculated by dividing our coefficient by its standard error.
t-stat = Coefficient/Std.Error
If the t-statistic falls in the rejection region, that is, beyond the critical value for the chosen level of significance, we reject the null hypothesis.
The null hypothesis here is (H0): the coefficient is 0 at the population level
The alternate hypothesis (H1): the coefficient is not 0 at the population level
The higher the value of t-statistic in magnitude, the more significant the variable is.
P > |t| – The p-value is the probability of seeing a t-statistic at least as extreme as the one we observed if the null hypothesis were true. It is not the probability that the null hypothesis is true. In this table the p-value of age is approximately 0, which tells us that a coefficient this far from 0 would be very unlikely if there were no relationship between the independent variable and the dependent variable. We therefore conclude that the coefficient is not 0 at the population level.
Typically, a p-value of 5% or less is used as the cut-off point. If the p-value is greater than 0.05 (p-value > 0.05), we fail to reject the null hypothesis, meaning we do not have enough evidence of a relationship between the variable and the target variable. If it is less than 0.05 (p-value < 0.05), we reject the null hypothesis and say that the coefficient is not equal to 0.
The larger the t-statistic is in magnitude, the lower the p-value will be, and the stronger the evidence that the coefficient is significant and did not arise by chance.
Today you learned how to interpret your Linear Regression model from scratch. Congratulations! It’s always better to know what is happening under the hood: only when you know exactly what these terms tell you about the model will you be able to proceed and make good predictions.
To check out the full code please refer to my Github repository.
Thank you and have a nice day, Cheers!!