Maverick 01 — December 9, 2020

This article was published as a part of the Data Science Blogathon.

Introduction

Can you predict a company's revenue by analyzing the budget it allocates to its marketing team? Yes, you can, and in this article we will discuss one of the simplest machine learning techniques for doing so: linear regression. Regression is an almost 200-year-old tool that is still effective in predictive analysis, and it is one of the oldest statistical tools still used in machine learning.


Table of contents

  • What is Linear Regression?

  • Significance of linear regression in predictive analysis

  • Practical application of linear regression using R

  • Application on a blood pressure and age dataset

  • Multiple linear regression using R

  • Application on a wine dataset

  • Conclusion

 

What is Linear Regression?

Simple linear regression analysis is a technique to find the association between two variables: a dependent variable, which responds to the change, and an independent variable. Note that we are not calculating the dependency of the dependent variable on the independent variable, just the association.

For example, suppose a firm invests some amount of money in the marketing of a product and has also collected sales data over the years. By analyzing the correlation between the marketing budget and the sales data, we can predict next year's sales if the company allocates a certain amount of money to the marketing department. This idea of prediction sounds magical, but it is pure statistics: linear regression is basically fitting a straight line to our dataset so that we can predict future events.
The best fit line would be of the form:

Y = B0 + B1X

where

Y – Dependent variable

X – Independent variable

B0 and B1 – Regression parameters (intercept and slope)
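For intuition, B0 and B1 are estimated by ordinary least squares, which in the simple case reduces to closed-form expressions. A minimal sketch in R, using small made-up numbers (not the article's dataset):

```r
# Hypothetical toy data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates for simple regression
b1 <- cov(x, y) / var(x)      # slope B1
b0 <- mean(y) - b1 * mean(x)  # intercept B0

# lm() arrives at the same estimates
fit <- lm(y ~ x)
coef(fit)
```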

Predicting Blood pressure using Age by Regression in R

Now we take a dataset of blood pressure and age and, with the help of this data, train a linear regression model in R that will be able to predict blood pressure at ages not present in our dataset.


Equation of the regression line in our dataset.

BP = 98.7147 + 0.9709 Age

 

Importing dataset

We import the Age vs Blood Pressure dataset, which is a CSV file, using the read.csv() function in R and store it in a data frame bp.

bp <- read.csv("bp.csv")

 

Creating data frame for predicting values

We create a data frame that stores the value Age = 53. After fitting the linear regression model, this data frame will be used to predict blood pressure at age 53.

p <- as.data.frame(53)
colnames(p) <- "Age"

Creating a scatter plot using ggplot2 library

With the help of the ggplot2 library in R, we can see that there is a correlation between blood pressure and age: an increase in age is accompanied by an increase in blood pressure.
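A sketch of the ggplot2 call that produces such a plot (hypothetical stand-in values shown here; with the article's dataset, pass the bp data frame loaded above instead):

```r
library(ggplot2)

# Hypothetical stand-in values; replace with the bp data frame from above
bp_demo <- data.frame(Age = c(39, 45, 47, 52, 60, 65),
                      BP  = c(132, 138, 145, 148, 155, 162))

gp <- ggplot(bp_demo, aes(x = Age, y = BP)) +
  geom_point() +                 # the scatter of observations
  geom_smooth(method = "lm")     # overlay the fitted regression line
gp
```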

[Scatter plot of Age vs Blood Pressure with a fitted regression line]

It is quite evident from the graph that the points are scattered in a manner that lets us fit a straight line through them.

 

Calculating the correlation between Age and Blood pressure

We can also verify our analysis that blood pressure and age are correlated with the cor() function in R, which calculates the correlation between two variables.

cor(bp$BP,bp$Age)
[1] 0.6575673

 

Creating a Linear regression model

Now, with the help of the lm() function, we are going to build a linear model. lm() takes two main arguments: first, a formula, here BP ~ Age, because Age is the independent variable and blood pressure is the dependent variable; and second, data, the name of the data frame containing the data, which in this case is bp.

model <- lm(BP ~ Age, data = bp)

Summary of our linear regression model

summary(model)

Output:

##
## Call:
## lm(formula = BP ~ Age, data = bp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.724 -6.994 -0.520 2.931 75.654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.7147 10.0005 9.871 1.28e-10 ***
## Age 0.9709 0.2102 4.618 7.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared: 0.4324, Adjusted R-squared: 0.4121
## F-statistic: 21.33 on 1 and 28 DF, p-value: 7.867e-05


Interpretation of the model

## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.7147 10.0005 9.871 1.28e-10 ***
## Age 0.9709 0.2102 4.618 7.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
B0 = 98.7147 (Y- intercept)
B1 = 0.9709 (Age coefficient)
BP = 98.7147 + 0.9709 Age

It means that a one-unit change in Age brings a 0.9709-unit change in blood pressure.

The standard error is the variability to expect in a coefficient; it captures sampling variability. So the variation in the intercept can be up to 10.0005 and the variation in Age up to 0.2102, but not more than that.

T value: the t value is the coefficient divided by its standard error; it is basically how big the estimate is relative to its error. The bigger the coefficient relative to the standard error, the bigger the t score. Because the t score follows a distribution, it comes with a p-value, which tells how statistically significant the variable is to the model. For a confidence level of 95%, we compare this value with alpha = 0.05. In our case, the p-values of both the intercept and Age are less than alpha, which implies that both are statistically significant to our model.
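These two columns can be recomputed by hand from the values in the summary; a quick check in R, with the estimate and standard error copied from the Age row:

```r
# Recomputing the Age row of the coefficient table from the summary values
t_age <- 0.9709 / 0.2102                             # t value, ~4.62
p_age <- 2 * pt(t_age, df = 28, lower.tail = FALSE)  # two-sided p, ~7.9e-05
```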

## Residual standard error: 17.31 on 28 degrees of freedom

## Multiple R-squared: 0.4324, Adjusted R-squared: 0.4121

## F-statistic: 21.33 on 1 and 28 DF, p-value: 7.867e-05

Residual standard error, or the standard error of the model, is basically the average error of the model, which is 17.31 in our case; it means that our model can be off by 17.31 on average when predicting blood pressure. The smaller the error, the better the model predicts.

Multiple R-squared is 1 - (sum of squared errors / total sum of squares).
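This identity is easy to verify in R; a toy sketch with made-up numbers (not the article's data):

```r
# Hypothetical toy data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)   # sum of squared errors
sst <- sum((y - mean(y))^2)    # total sum of squares
r2  <- 1 - sse / sst           # equals summary(fit)$r.squared
```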

Adjusted R-squared:

If we add variables, whether or not they are significant for prediction, the value of R-squared will increase, which is the reason Adjusted R-squared is used: if an added variable isn't significant for the model's predictions, the value of Adjusted R-squared will decrease. It is one of the most helpful tools for avoiding overfitting.

The F-statistic is the ratio of the mean square of the model to the mean square of the error; in other words, it compares how well the model is doing with what the error is doing. The higher the F value, the better the model is doing compared to the error.

One is the degrees of freedom of the numerator of the F-statistic, and 28 is the degrees of freedom of the errors.
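The F-statistic, R-squared, and these two degrees of freedom are tied together; plugging the values reported in the summary into that relationship reproduces the reported F:

```r
# F = (R^2 / df_model) / ((1 - R^2) / df_residual), values from the summary
r2     <- 0.4324
f_stat <- (r2 / 1) / ((1 - r2) / 28)   # ~21.33, matching the output above
```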

 

Predict the value of blood pressure at Age 53

BP = 98.7147 + 0.9709 Age

The above formula will be used to calculate blood pressure at age 53. We achieve this with the predict() function: we pass the name of the linear regression model, followed by newdata = p, since age 53 was saved earlier in the data frame p.

predict(model, newdata = p)

## 1

## 150.1708

So, the predicted value of blood pressure at age 53 is 150.17.
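As a sanity check, plugging age 53 into the fitted equation by hand gives nearly the same number; the small difference comes from the coefficients being rounded to four decimals:

```r
# Manual prediction from the rounded coefficients
98.7147 + 0.9709 * 53   # 150.1724, vs. predict()'s 150.1708
```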

We have predicted blood pressure from its association with age. There can also be more than one independent variable correlated with the dependent variable, in which case the technique is called multiple regression.

 

Multiple Linear Regression Model

Multiple linear regression analysis is a statistical technique to find the association of multiple independent variables with a dependent variable. For example, the revenue generated by a company depends on various factors including market size, price, promotion, and competitors' prices. Basically, a multiple linear regression model establishes a linear relationship between one dependent variable and multiple independent variables.

Equation of Multiple Linear Regression is as follows:

Y = B0 + B1X1 + B2X2 + … + BkXk + E

where

Y – Dependent variable

X1, …, Xk – Independent variables

B0, B1, …, Bk – Multiple linear regression coefficients

E – Error term

Taking another example, the wine dataset: with the help of AGST and HarvestRain, we are going to predict the price of wine.

 

Importing the dataset

Using the read.csv() function, we import both datasets, wine.csv and wine_test.csv, into the data frames wine and wine_test respectively.

wine <- read.csv("wine.csv")
wine_test <- read.csv("wine_test.csv")


Finding the correlation between different variables

Using the cor() and round() functions, we can compute the correlation between all variables of the wine dataset, rounded off to two decimal places.

round(cor(wine),2)

Output:

## Year Price WinterRain AGST HarvestRain Age FrancePop
## Year 1.00 -0.45 0.02 -0.25 0.03 -1.00 0.99
## Price -0.45 1.00 0.14 0.66 -0.56 0.45 -0.47
## WinterRain 0.02 0.14 1.00 -0.32 -0.28 -0.02 0.00
## AGST -0.25 0.66 -0.32 1.00 -0.06 0.25 -0.26
## HarvestRain 0.03 -0.56 -0.28 -0.06 1.00 -0.03 0.04
## Age -1.00 0.45 -0.02 0.25 -0.03 1.00 -0.99
## FrancePop 0.99 -0.47 0.00 -0.26 0.04 -0.99 1.00

 

Scatter plots

Using the ggplot2 library in R, we create a scatter plot which clearly shows that AGST and the price of wine are highly correlated. Similarly, the scatter plot of HarvestRain against the price of wine shows their (negative) correlation.

ggplot(wine,aes(x = AGST, y = Price)) + geom_point() +geom_smooth(method = "lm")
[Scatter plot of AGST vs Price with a fitted regression line]
ggplot(wine,aes(x = HarvestRain, y = Price)) + geom_point() +geom_smooth(method = "lm")
[Scatter plot of HarvestRain vs Price with a fitted regression line]

 

Creating a multiple linear regression model

model1 <- lm(Price ~ AGST + HarvestRain,data = wine)
summary(model1)

Output:

##
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88321 -0.19600 0.06178 0.15379 0.59722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265 1.85443 -1.188 0.247585
## AGST 0.60262 0.11128 5.415 1.94e-05 ***
## HarvestRain -0.00457 0.00101 -4.525 0.000167 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808
## F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06

Interpretation of the Model

## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265 1.85443 -1.188 0.247585
## AGST 0.60262 0.11128 5.415 1.94e-05 ***
## HarvestRain -0.00457 0.00101 -4.525 0.000167 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
B0 = -2.20265 (Y-intercept)
B1 = 0.60262 (AGST coefficient)
B2 = -0.00457 (HarvestRain coefficient)
Price = -2.20265 + 0.60262 AGST - 0.00457 HarvestRain

It means that a one-unit change in AGST brings a 0.60262-unit change in Price, and a one-unit change in HarvestRain brings a 0.00457-unit decrease in Price.

The standard error is the variability to expect in a coefficient; it captures sampling variability. So the variation in the intercept can be up to 1.85443, the variation in AGST up to 0.11128, and the variation in HarvestRain up to 0.00101, but not more than that.

T value: the t value is the coefficient divided by its standard error; it is basically how big the estimate is relative to its error. The bigger the coefficient relative to the standard error, the bigger the t score. Because the t score follows a distribution, it comes with a p-value, which tells how statistically significant the variable is to the model. For a confidence level of 95%, we compare this value with alpha = 0.05. In our case, the p-values of AGST and HarvestRain are less than alpha, which implies that both predictors are statistically significant to our model; note that the intercept's p-value (0.2476) is greater than alpha, so the intercept is not statistically significant, although it remains in the model.

## Residual standard error: 0.3674 on 22 degrees of freedom

## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808

## F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06

Residual standard error, or the standard error of the model, is basically the average error of the model, which is 0.3674 in our case; it means that our model can be off by 0.3674 on average when predicting the price of wines. The smaller the error, the better the model predicts.

Multiple R-squared is 1 - (sum of squared errors / total sum of squares).

Adjusted R-squared:

If we add variables, whether or not they are significant for prediction, the value of R-squared will increase, which is the reason Adjusted R-squared is used: if an added variable isn't significant for the model's predictions, the value of Adjusted R-squared will decrease. It is one of the most helpful tools for avoiding overfitting.

The F-statistic is the ratio of the mean square of the model to the mean square of the error; in other words, it compares how well the model is doing with what the error is doing. The higher the F value, the better the model is doing compared to the error.

Two is the degrees of freedom of the numerator of the F-statistic, and 22 is the degrees of freedom of the errors.
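As with the simple model, the reported F can be reproduced from R-squared and these two degrees of freedom:

```r
# F = (R^2 / df_model) / ((1 - R^2) / df_residual), values from the summary
r2     <- 0.7074
f_stat <- (r2 / 2) / ((1 - r2) / 22)   # ~26.59, matching the output above
```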

 

Predicting values for our test set

prediction <- predict(model1, newdata = wine_test)

Comparing predicted values with the test dataset

wine_test

## Year Price WinterRain AGST HarvestRain Age FrancePop
## 1 1979 6.9541 717 16.1667 122 4 54835.83
## 2 1980 6.4979 578 16.0000 74 3 55110.24

prediction

## 1 2
## 6.982126 7.101033
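To put a number on the test-set accuracy, we can compare the predicted and actual prices printed above (values copied from the two outputs):

```r
actual    <- c(6.9541, 6.4979)        # Price column of wine_test
predicted <- c(6.982126, 7.101033)    # output of predict() above

errors <- actual - predicted
rmse   <- sqrt(mean(errors^2))        # root mean squared error, ~0.43
```

An RMSE of about 0.43 is on the scale of the Price variable itself, so it gives a direct sense of how far off a typical prediction is.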

Conclusion

As we can see, from the available dataset we can create and train a linear regression model, and if enough data is available we can accurately predict new events, or in other words, future outcomes.
