# 5 Questions which can teach you Multiple Regression (with R and Python)

Sunil Ray · 26 Jun, 2020 · 8 min read

## Introduction

A journey of a thousand miles begins with a single step. In a similar way, the journey of mastering machine learning algorithms ideally begins with regression. It is simple to understand and gets you started with predictive modeling quickly. While this ease is good for a beginner, I always advise them to also understand how regression works before they start using it.

Lately, I have seen a lot of beginners who just focus on learning how to perform regression (in R or Python) but not on the actual science behind it. I am not blaming the beginners alone. Here is a script from a 2-day course on machine learning:

Running regression in Python and R doesn’t take more than 3-4 lines of code. All you need to do is pass the variables, run the script and get the predicted values. And congratulations! You’ve run your first machine learning algorithm.

The course spends literally no time explaining this simple algorithm, yet covers neural networks as part of the syllabus. What a waste of resources!

So, in this article, I’ve explained regression in a very simple manner. I have covered the basics, so that you not only understand what regression is and how it works, but also how to compute the popular R² and the science behind it.

Just a word of caution: you can’t use it in all types of situations. Simple regression has some limitations, which can be overcome by using advanced regression techniques.

## What is Linear Regression?

Linear regression is used for predictive analysis. It is a technique which explains the degree of relationship between two or more variables (multiple regression, in that case) using a best fit line / plane. Simple linear regression is used when we have one independent variable and one dependent variable.

The regression technique tries to fit a single line through a scatter plot (see below). The simplest form of regression, with one dependent and one independent variable, is defined by the formula:

Y = aX + b

Let’s understand this equation using the scatter plot below:

Above, you can see that a black line passes through the data points. If you look carefully, this line passes through the points at coordinates (0,0), (4,8) and (30,60). Here’s a question: can you find the equation that describes this line? Your answer should be:

Y = a * X + b

Now, find the values of a and b.

Without going into the working, the outcome after solving these equations is:

a = 2, b = 0

Hence, our regression equation becomes Y = 2*X + 0, i.e. Y = 2*X.

Here, slope = 8/4 = 2 (or 60/30 = 2) and intercept = 0 (since Y = 0 when X is 0). So, the equation would be:

Y = 2*X + 0
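As a quick sanity check, the same slope and intercept can be recovered numerically. A minimal sketch using NumPy's `polyfit` on the three points from the scatter plot:

```python
import numpy as np

# The points the regression line passes through in the example above
x = np.array([0, 4, 30])
y = np.array([0, 8, 60])

# Fit a degree-1 polynomial: returns [slope, intercept]
a, b = np.polyfit(x, y, 1)
print(a, b)  # slope ≈ 2.0, intercept ≈ 0.0
```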

This equation is known as the linear regression equation, where Y is the target variable and X is the input variable; ‘a’ is known as the slope and ‘b’ as the intercept. It is used to estimate real values (cost of houses, number of calls, total sales etc.) based on input variable(s). Here, we establish the relationship between the independent and dependent variables by fitting a best line. This best fit line is known as the regression line and is represented by the linear equation Y = a*X + b.

Now, you might think that in the above example there could be multiple regression lines that pass through the data points. So, how do we choose the best fit line, or the values of the coefficients a and b?

Let’s look at the methods to find the best fit line.

## How to find the best regression line?

We discussed above that the regression line establishes a relationship between the independent and dependent variable(s). A line which explains this relationship better is said to be the best fit line.

In other words, the best fit line tends to return the most accurate value of Y based on X, i.e. it causes the minimum difference between the actual and predicted values of Y (lower prediction error). Make sure you understand the image below.

Here are some methods to measure this error:

• Sum of all errors (∑error)
• Sum of absolute values of all errors (∑|error|)
• Sum of squares of all errors (∑error²)

Let’s evaluate the performance of the methods discussed above using an example. Below, I have plotted three lines (y=2.3x+4, y=1.8x+3.5 and y=2x+8) to find the relationship between y and x.

The table shown below calculates the error value of each data point and the total error value (E) using the three methods discussed above:

After looking at the table, the following inferences can be drawn:

• Sum of all errors (∑error): Using this method leads to cancellation of positive and negative errors, which certainly isn’t our motive. Hence, it is not the right method.
• The other two methods perform well, but notice that ∑error² penalizes large errors much more than ∑|error| does. You can see that two of the equations have almost similar values for ∑|error|, whereas in the case of ∑error² there is a significant difference.
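The underlying table is an image in the original, so as an illustrative stand-in, here is how the three error totals compare for the three candidate lines, using the example points from the earlier scatter plot:

```python
import numpy as np

# Stand-in data points and the three candidate lines from the example
x = np.array([0.0, 4.0, 30.0])
y = np.array([0.0, 8.0, 60.0])
lines = {"y=2.3x+4": (2.3, 4.0), "y=1.8x+3.5": (1.8, 3.5), "y=2x+8": (2.0, 8.0)}

totals = {}
for name, (a, b) in lines.items():
    err = y - (a * x + b)
    # (sum of errors, sum of |errors|, sum of squared errors)
    totals[name] = (err.sum(), np.abs(err).sum(), (err ** 2).sum())
    print(name, [round(v, 2) for v in totals[name]])
```

Note how the plain sum lets positive and negative errors cancel, while the squared sum separates the lines most sharply.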

Therefore, we can say that the coefficients a and b are derived by minimizing the sum of squared differences between the data points and the regression line.

There are two common algorithms to find the coefficients that minimize the sum of squared errors: the first is Ordinary Least Squares (OLS, used in the Python library scikit-learn) and the other is gradient descent.
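A sketch of both ideas on made-up toy data: the OLS closed form solves for the coefficients directly, while gradient descent iterates toward the same values.

```python
import numpy as np

# Made-up toy data roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# OLS closed form: minimizes the sum of squared errors analytically
a_ols = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b_ols = y.mean() - a_ols * x.mean()

# Gradient descent on the same loss, starting from zero
a, b, lr = 0.0, 0.0, 0.01
for _ in range(20000):
    pred = a * x + b
    a -= lr * (2 * (pred - y) * x).mean()   # d(mean squared error)/da
    b -= lr * (2 * (pred - y)).mean()       # d(mean squared error)/db

print(a_ols, b_ols)              # ≈ 1.95 and ≈ 0.15
print(round(a, 3), round(b, 3))  # gradient descent converges to the same values
```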

## What are the performance evaluation metrics in Regression?

As discussed above, to evaluate the performance of a regression line, we can look at the sum of squared errors (SSE). It works well, but it has one concern!

Let’s understand it using theÂ table shown below:

Above, you can see that we’ve removed 4 data points in the right table, and therefore the SSE has reduced (with the same regression line). Further, if you look at the scatter plot, the removed data points had an almost identical relationship between x and y. This means that SSE is highly sensitive to the number of data points.

Another metric to evaluate the performance of linear regression is R-squared, the most common metric for judging the performance of regression models. R² measures how much of the change in the output variable (y) is explained by the change in the input variable (x).

R-squared is always between 0 and 1:

• 0 indicates that the model explains none of the variability in the response data around its mean.
• 1 indicates that the model explains all of the variability in the response data around its mean.
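In formula form, R² = 1 − ∑(actual − predicted)² / ∑(actual − mean)², where the mean is that of the actual y values. A minimal sketch:

```python
import numpy as np

def r_squared(actual, predicted):
    """R^2 = 1 - SSE/SST, where SST is the variation of actual around its mean."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    sse = ((actual - predicted) ** 2).sum()
    sst = ((actual - actual.mean()) ** 2).sum()
    return 1 - sse / sst

print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))          # 1.0: perfect predictions
print(r_squared([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))  # 0.0: no better than the mean
```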

In general, the higher the R², the more robust the model. However, there are important conditions for this guideline that I’ll talk about in my future posts.

Let’s take the above example again and calculate the value of R-squared.

As you can see, R² has less variation in score compared to SSE.
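To see this difference in sensitivity on made-up data (the article's table is an image): keeping the same regression line and dropping half the points roughly halves SSE, while R² barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 2 * x + rng.normal(0, 1, x.size)   # noisy points around y = 2x
pred = 2 * x                            # the same regression line throughout

def sse(y, p): return ((y - p) ** 2).sum()
def r2(y, p): return 1 - sse(y, p) / ((y - y.mean()) ** 2).sum()

# Keep every other point: SSE roughly halves, R^2 stays almost unchanged
print(round(sse(y, pred), 1), round(r2(y, pred), 3))
print(round(sse(y[::2], pred[::2]), 1), round(r2(y[::2], pred[::2]), 3))
```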

One disadvantage of R-squared is that it can only increase as predictors are added to the regression model. This increase is artificial when the predictors are not actually improving the model’s fit. To cure this, we use “Adjusted R-squared”.

Adjusted R-squared is nothing but a modification of R-squared that adjusts for the number of terms in a model. It calculates the proportion of the variation in the dependent variable accounted for by the explanatory variables, and it incorporates the model’s degrees of freedom. Adjusted R-squared will decrease as predictors are added if the increase in model fit does not make up for the loss of degrees of freedom; likewise, it will increase if the improvement in fit is worthwhile. Adjusted R-squared should always be used for models with more than one predictor variable. It is interpreted as the proportion of total variance that is explained by the model.
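The standard formula is Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of predictors. A small sketch with illustrative numbers (not from the article):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 penalizes R^2 for the number of predictors k, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A new predictor nudges R^2 from 0.370 to 0.372 on n = 30 observations...
print(round(adjusted_r2(0.370, 30, 2), 3))  # 0.323
# ...but adjusted R^2 falls, flagging the extra predictor as not worthwhile
print(round(adjusted_r2(0.372, 30, 3), 3))  # 0.3
```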

## What is Multi-Variate Regression?

Let’s now examine the process for dealing with multiple independent variables related to a dependent variable.

Once you have identified the level of significance between the independent variables (IVs) and the dependent variable (DV), use these significant IVs to make more powerful and accurate predictions. This technique is known as “Multi-variate Regression”.

Let’s take an example here to understand this concept further.

We know that a person’s compensation depends on their age, i.e. the older one gets, the higher he/she earns compared to the previous year. You build a simple regression model to explain this effect of age on a person’s compensation. You obtain an R² of 27%. What does this mean?

Let’s try to think over it graphically.


In this example, an R² of 27% says that only 27% of the variance in compensation is explained by age. In other words, if you know a person’s age, you’ll have 27% of the information needed to make an accurate prediction about their compensation.

Now, let’s add ‘time spent with the company’ as an additional variable for determining the current compensation. With this, the R² value increases to 37%. How do we interpret this value now?

Let’s understand this graphically once again:

Notice that a person’s time with the company accounts for only an additional 10% of the variation in their earnings. In other words, by adding this variable to our study, we improved our understanding of their compensation from 27% to 37%.

Therefore, we learnt that using two variables rather than one improved our ability to make accurate predictions about a person’s compensation.

Things get much more complicated when your multiple independent variables are related to each other. This phenomenon is known as multicollinearity, and it is undesirable. To avoid such a situation, it is advisable to look at the Variance Inflation Factor (VIF). For no multicollinearity, VIF should be low (VIF < 2). In case of a high VIF, look at the correlation table to find highly correlated variables and drop one of the correlated ones.
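A minimal VIF sketch on made-up data: regress each predictor on the others and take VIF = 1/(1 − R²). (statsmodels ships a ready-made `variance_inflation_factor`; the manual version below just shows the idea.)

```python
import numpy as np

def vif(X):
    """VIF per column of X: regress it on the remaining columns, VIF = 1/(1 - R^2)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])  # intercept + others
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)   # nearly a copy of x1 -> very high VIF
x3 = rng.normal(size=100)              # independent -> VIF near 1
vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])
```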

Along with multicollinearity, regression can suffer from autocorrelation and heteroskedasticity.

In a multiple regression model, we try to predict

Y = a + b1*X1 + b2*X2 + b3*X3 + … + bk*Xk

Here, b1, b2, b3 … bk are the slopes for each of the independent variables X1, X2, X3 … Xk, and a is the intercept.

Example: Net worth = a + b1 (Age) + b2 (Time with company)
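Fitting an equation of this shape with scikit-learn, on made-up data (the coefficient values below are illustrative, not from the article):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(22, 60, 200)
tenure = rng.uniform(0, 15, 200)
# Made-up ground truth: net worth grows with both age and tenure, plus noise
net_worth = 5 + 1.5 * age + 2.0 * tenure + rng.normal(0, 3, 200)

model = LinearRegression().fit(np.column_stack([age, tenure]), net_worth)
print(model.intercept_, model.coef_)  # recovers roughly a=5, b1=1.5, b2=2.0
```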

## How to implement regression in Python and R?

Linear regression has well-known implementations in R packages and in Python’s scikit-learn. Let’s look at the code for fitting a linear regression model in R and Python below:

Python Code
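The Python snippet did not survive in this copy of the article; here is a scikit-learn sketch that mirrors the R code below, with tiny placeholder arrays standing in for your own train and test data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data standing in for your own datasets; values must be numeric
x_train = np.array([[1], [2], [3], [4]])
y_train = np.array([2, 4, 6, 8])
x_test = np.array([[5], [6]])

# Train the model using the training set and check the R^2 score
model = LinearRegression().fit(x_train, y_train)
print(model.score(x_train, y_train))  # 1.0 on this exact-fit toy data

# Predict output
predicted = model.predict(x_test)
print(predicted)  # [10. 12.]
```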

R Code

```
#Load train and test datasets
#Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)
#Train the model using the training set and check the fit summary
linear <- lm(y_train ~ ., data = x)
summary(linear)
#Predict output
predicted <- predict(linear, x_test)
```

## End Notes

In this article, we looked at linear regression from the basics, followed by methods to find the best fit line, evaluation metrics, multi-variate regression and ways to implement it in Python and R. If you are new to data science, I’d recommend you master this algorithm before proceeding to more advanced ones.


Sunil Ray 26 Jun 2020

I am a Business Analytics and Intelligence professional with deep experience in the Indian insurance industry. I have worked for various multinational insurance companies over the last 7 years.

Aiswarya 16 Oct, 2015

Well written article. But I couldn't understand the concept of multicollinearity. Do you mean to say that if two variables are highly correlated, instead of taking both the variables into consideration, take only one of them? If we are to drop one of the variables, how do we choose which to drop? And what happens if we take both of them?

hemanth varma 16 Oct, 2015

Perfectly explained, and some of my assumptions and hurdles were clarified with this beautifully tailored article :) Thank you Sunil. If the linear regression summary output were interpreted as well, that would be very helpful to people like me who just got started in data analytics :)

Sunil Ray 16 Oct, 2015

Thanks Hemanth! Feedback taken, will discuss this in future post!

Sagar 16 Oct, 2015

In the formula of R^2, shouldn't it be like this (subtracting your formula from 1)? => r^2 = 1 - (sum(actual - predicted)^2 / sum(actual - mean)^2). Please correct me if I am wrong.

Ramdas 16 Oct, 2015

Excellent article. I have a quick question: how is Ymean calculated in the calculation of R2? Is it the average of just the actual values of y?

Hi Sunil, you've done a great job in breaking down the steps for building the regression. Very helpful article, and thanks for your efforts.

Deeksith 17 Oct, 2015

Really liked this article. I have been following this website for a while, It would really help if there is a series of posts that can help students ramp up on various topics. I am a current student in analytics and would love to see something like that. Appreciate your efforts !

Sunil Ray 17 Oct, 2015

Deeksith, Thanks! We do have the road map for various topics(Python, SAS, R, Weka, Machine Learning, Qlikview and Tableau). You can refer below link for same! http://www.analyticsvidhya.com/learning-paths-data-science-business-analytics-business-intelligence-big-data/ Regards, Sunil

Ankita Singh 20 Oct, 2015

Perfectly Explained!

Vinitha Liyanage 21 Oct, 2015

Your explanation makes easy to understand how each variable contribute to the R2. Thank you very much.

Nipun 08 Nov, 2015

Hi Sunil, where can I find the train and test datasets? Thanks, Nipun

Nipun 09 Nov, 2015

Hi Sunil, I am a lil confused. In the above article you mentioned that if VIF is less than two, then the model doesn't suffer from multicollinearity; however, in your first comment you also mentioned that the VIF should be less than 5. So, if I have a VIF value between 2 and 5, then does my model suffer from multicollinearity? Thanks, Nipun

akash9129 10 Dec, 2015

Hi Sunil, I have a doubt regarding the term 'actual'. Do the actual values refer to the values that we receive in real life? For example, with sales data, the model predicts an amount, but after a few days it turns out to be slightly lower or higher; this lower or higher value is the actual data, if I am not wrong? This is for clarity purposes. Thanks

Thanks a lot Sunil

Srinivas 17 Feb, 2016

Hi Sunil, Thanks for very good article on regression.

Parakram 09 Mar, 2016

Properly structured and to the point explanation of the topic, Thanks

Ramakant sharma 08 Sep, 2016

Please share how to find the right coefficients for the minimum sum of squared errors: 1. OLS 2. Gradient Descent, if possible. BTW, it's a great article.

Jack Ma 20 Dec, 2016

Does this analysis work for logistic regression?

jack 06 Feb, 2017

Could someone tell me why "One disadvantage of R-squared is that it can only increase as predictors are added to the regression model ?" thank you

Hi Sunil. Great article. Can you please suggest how to implement multi-variate regression in Python?

Praveen Kumar Telugu 13 Jul, 2017

Well articulated, thanks for your efforts Sunil.

Venkat 26 Nov, 2017

Excellent article, well written. It answered a lot of questions I had on regression.

RAKESH KUMAR 23 Jan, 2018

Hi Sunil, A wonderful explanation on Regression, Line Fitment & Multicollinearity, but slightly disagree on example of Multi-Variate Regression with "Age". In my opinion Education or Experience along with Tenure with the company will correlate better. Thanks & Regards,