Maverick 01 — Updated On July 19th, 2023

## Introduction

Can you predict a company’s revenue by analyzing the budget it allocates to its marketing team? Yes, you can. Do you know how to predict using linear regression in R? Not yet? Well, let me show you how. In this article, we will discuss one of the simplest machine-learning techniques, linear regression. Regression is almost a 200-year-old tool that is still effective in data science. It is one of the oldest statistical tools used in machine learning predictive analysis.

#### Learning Objectives

• Understand the definition and significance of linear regression.
• Explore the various applications of linear regression.
• Learn to implement linear regression algorithms through the sample codes in R found in this tutorial.

This article was published as a part of the Data Science Blogathon.

## What Is Linear Regression?

Simple linear regression analysis is a technique for finding the association between two variables: the dependent variable (response variable) and the independent variable (predictor variable), where the response changes as the predictor changes. Note that regression quantifies the association between the two variables; on its own, it does not establish that the independent variable causes the change in the dependent variable.

For example, a firm is investing some amount of money in the marketing of a product, and it has also collected sales data throughout the years. Now, by analyzing the correlation between the marketing budget and the sales data, we can predict next year’s sales if the company allocates a certain amount of money to the marketing department. The above idea of prediction sounds magical, but it’s pure statistics. The linear regression algorithm is basically fitting a straight line to our dataset using the least squares method so that we can predict future events. One limitation of linear regression is that it is sensitive to outliers. The best-fit line would be of the form:

Y = B0 + B1X
Where, Y – Dependent variable
X – Independent variable
B0 – Intercept (the value of Y when X is 0)
B1 – Slope (the change in Y per unit change in X)
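
As a minimal sketch of what the least squares method computes, the snippet below estimates B0 and B1 from the closed-form formulas for a single predictor and compares them with lm( ). The vectors x and y are simulated for illustration only (their values and the coefficients used to generate them are assumptions, not the article's data).

```r
# Simulated data for illustration only (not from the article)
set.seed(42)
x <- runif(30, min = 20, max = 70)       # hypothetical predictor
y <- 100 + 1 * x + rnorm(30, sd = 15)    # hypothetical response with noise

# Closed-form least squares estimates for one predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# lm() should return the same two numbers
fit <- lm(y ~ x)
print(c(B0 = b0, B1 = b1))
print(coef(fit))
```

The slope is the covariance of X and Y divided by the variance of X, and the intercept is chosen so that the line passes through the point of means.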

## Practical Application of Linear Regression Using R

Let’s try to understand the practical application of linear regression in R with another example.

Let’s say we have a dataset of the blood pressure and age of a certain group of people. With the help of this data, we can train a simple linear regression model in R, which will be able to predict blood pressure at ages that are not present in our dataset.

The equation of the regression line fitted to our dataset is:

BP = 98.7147 + 0.9709 Age
where BP is the dependent variable (Y) and Age is the independent variable (X).

Now let’s see how to do this.

### Step 1: Import the Dataset

Import the Age vs. Blood Pressure dataset, a CSV file, using the read.csv( ) function in R, and store it in a data frame named bp.

``bp <- read.csv("bp.csv")``

### Step 2: Create the Data Frame for Predicting Values

Create a data frame that will store Age 53. This data frame will help us predict blood pressure at Age 53 after creating a linear regression model.

``````p <- data.frame(Age = 53)``````

### Step 3: Create a Scatter Plot using the ggplot2 Library

With the ggplot2 library in R, we can plot Blood Pressure against Age. The scatter plot shows a positive correlation: as Age increases, blood pressure tends to increase as well.

We can also use the plot( ) function in base R to draw the scatter plot and the abline( ) function to add a straight line.

It is quite evident from the graph that the distribution on the plot is scattered in a manner that we can fit a straight line through the data points.
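
The base R approach mentioned above could look roughly like the following sketch. Because bp.csv is not reproduced here, the data frame bp_demo is simulated and its name and values are assumptions; with the real data you would use bp instead.

```r
# Simulated Age / BP data for illustration only (stands in for bp.csv)
set.seed(1)
bp_demo <- data.frame(Age = round(runif(30, min = 20, max = 70)))
bp_demo$BP <- 100 + 1 * bp_demo$Age + rnorm(30, sd = 15)

# Scatter plot of the raw data
plot(bp_demo$Age, bp_demo$BP,
     xlab = "Age", ylab = "Blood pressure",
     main = "Blood pressure vs. age (simulated data)")

# Overlay the least squares line from a fitted model
fit_demo <- lm(BP ~ Age, data = bp_demo)
abline(fit_demo, col = "red", lwd = 2)
```

Passing the fitted model to abline( ) draws the line using the model's intercept and slope, so the plot and the model cannot get out of sync.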

### Step 4: Calculate the Correlation Between Age and Blood Pressure

We can also verify the above observation that Blood Pressure and Age are correlated by using the cor( ) function in R, which calculates the correlation between two variables.

``cor(bp$BP, bp$Age)``
[1] 0.6575673

### Step 5: Create a Linear Regression Model

Now, with the help of the lm( ) function, we are going to fit a linear model. Here we pass lm( ) two arguments. The first is a formula, “BP ~ Age”, because Blood Pressure is the dependent variable and Age is the independent variable. The second is data, the name of the data frame containing the data, which in this case is bp. The model is fitted as follows:

``model <- lm(BP ~ Age, data = bp)``

### Summary of Our Linear Regression Model

``summary(model)``

Output:

``````##
## Call:
## lm(formula = BP ~ Age, data = bp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.724  -6.994  -0.520   2.931  75.654
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  98.7147    10.0005   9.871 1.28e-10 ***
## Age           0.9709     0.2102   4.618 7.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared: 0.4324, Adjusted R-squared: 0.4121
## F-statistic: 21.33 on 1 and 28 DF, p-value: 7.867e-05``````

#### Interpretation of the Model

``````## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  98.7147    10.0005   9.871 1.28e-10 ***
## Age           0.9709     0.2102   4.618 7.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
B0 = 98.7147 (Y- intercept)
B1 = 0.9709 (Age coefficient)
BP = 98.7147 + 0.9709 Age``````

It means that a one-unit increase in Age is associated with an increase of 0.9709 units in predicted Blood Pressure.
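
To make this interpretation concrete, here is a small sketch that evaluates the fitted line at two ages one year apart. The helper name bp_hat is hypothetical; the coefficients are the ones printed in the summary output.

```r
# Fitted line from the summary output (bp_hat is a hypothetical helper name)
bp_hat <- function(age) 98.7147 + 0.9709 * age

# Predicted blood pressure at two ages one year apart
diff_one_year <- bp_hat(54) - bp_hat(53)
print(diff_one_year)   # equals the Age coefficient, up to floating-point error
```

The same difference results for any pair of ages one year apart, which is what it means for the relationship to be linear.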

#### Standard Error

The standard error estimates how much a coefficient would vary from sample to sample, that is, its sampling variability. Here the standard error is 10.0005 for the intercept and 0.2102 for Age. These are typical sizes of the estimation error, not upper bounds on it.

#### T value

The t value is the coefficient divided by its standard error, so it measures how large the estimate is relative to its uncertainty. The larger the coefficient relative to its standard error, the larger the t value. Each t value comes with a p-value, computed from the t distribution, which indicates how statistically significant the variable is to the model. We compare this p-value with a significance level alpha of 0.05, which corresponds to a 95% confidence level. In our case, the p-values of both the intercept and Age are less than alpha, which implies that both are statistically significant in our model.
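
As a quick check of this relationship, the sketch below recomputes the t values and two-sided p-values from the estimates and standard errors printed in the coefficient table. Because the printed numbers are rounded, the results may differ from the summary output in the last digit.

```r
# Values copied from the coefficient table in the summary output
est <- c(Intercept = 98.7147, Age = 0.9709)
se  <- c(Intercept = 10.0005, Age = 0.2102)
df  <- 28   # residual degrees of freedom

t_val <- est / se                       # t value = estimate / standard error
p_val <- 2 * pt(-abs(t_val), df = df)   # two-sided p-value from the t distribution

print(round(t_val, 3))
print(signif(p_val, 3))
print(p_val < 0.05)   # compare each p-value against alpha = 0.05
```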

We can calculate the confidence intervals of the coefficients using the confint(model, level = .95) function.
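
The following sketch shows confint( ) next to the same intervals computed by hand as estimate plus or minus a t quantile times the standard error. It runs on simulated data (bp_demo and fit_demo are assumed names, and the numbers will not match the article's model).

```r
# Simulated Age / BP data for illustration only (stands in for bp.csv)
set.seed(7)
bp_demo <- data.frame(Age = round(runif(30, min = 20, max = 70)))
bp_demo$BP <- 100 + 1 * bp_demo$Age + rnorm(30, sd = 15)
fit_demo <- lm(BP ~ Age, data = bp_demo)

# Built-in 95% confidence intervals for the coefficients
ci <- confint(fit_demo, level = 0.95)
print(ci)

# The same intervals by hand: estimate +/- t quantile * standard error
coefs  <- summary(fit_demo)$coefficients
t_crit <- qt(0.975, df = df.residual(fit_demo))
ci_manual <- cbind(lower = coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
                   upper = coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"])
print(ci_manual)
```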

``## Residual standard error: 17.31 on 28 degrees of freedom``
``## Multiple R-squared: 0.4324, Adjusted R-squared: 0.4121``
``## F-statistic: 21.33 on 1 and 28 DF, p-value: 7.867e-05``

#### Residual Standard Error

The residual standard error, or the standard error of the model, is roughly the typical size of the model's prediction error. It is 17.31 in our case, which means that our predictions of blood pressure are off by about 17.31 units on average. The smaller the error, the better the model predicts.
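
For readers who want to see where this number comes from, the sketch below computes the residual standard error by hand as the square root of the sum of squared residuals divided by the residual degrees of freedom. It uses simulated data (bp_demo and fit_demo are assumed names), so the value will differ from 17.31.

```r
# Simulated Age / BP data for illustration only (stands in for bp.csv)
set.seed(3)
bp_demo <- data.frame(Age = round(runif(30, min = 20, max = 70)))
bp_demo$BP <- 100 + 1 * bp_demo$Age + rnorm(30, sd = 15)
fit_demo <- lm(BP ~ Age, data = bp_demo)

# Residual standard error by hand: sqrt(SSE / residual degrees of freedom)
res <- residuals(fit_demo)
rse_manual <- sqrt(sum(res^2) / df.residual(fit_demo))

print(rse_manual)
print(summary(fit_demo)$sigma)   # the value summary() reports
```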

#### Multiple R-squared

Multiple R-squared is the proportion of the variance in the dependent variable that the model explains: R-squared = 1 - (sum of squared errors / total sum of squares). Here it is 0.4324, so Age explains about 43% of the variation in blood pressure.

R-squared never decreases when we add a variable, whether or not that variable is useful for prediction. That is why adjusted R-squared is also reported: it penalizes the number of predictors, so it drops if an added variable does not improve the model enough. It is a helpful guard against overfitting the model.
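
Two relationships can be checked directly from the numbers already printed in this article. In simple linear regression, R-squared is the square of the correlation from Step 4, and adjusted R-squared follows from R-squared, the sample size, and the number of predictors. The sample size of 30 is inferred here from the 28 residual degrees of freedom plus the 2 estimated coefficients.

```r
r  <- 0.6575673   # correlation between BP and Age from Step 4
r2 <- 0.4324      # Multiple R-squared from the summary output
n  <- 30          # 28 residual degrees of freedom + 2 estimated coefficients
k  <- 1           # number of predictors (Age)

# In simple linear regression, R-squared equals the squared correlation
print(round(r^2, 4))

# Adjusted R-squared penalizes the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))
```

Both results should agree with the Multiple R-squared and Adjusted R-squared values in the summary output, up to rounding.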

#### F-statistic

The F-statistic is the ratio of the mean square of the model to the mean square of the error. In other words, it compares the variation the model explains with the variation it leaves unexplained; the higher the F value, the better the model is doing relative to the error.

In the output, 1 is the numerator degrees of freedom of the F-statistic (the number of predictors), and 28 is the degrees of freedom of the errors.
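
As a rough check, the F-statistic can be rebuilt from R-squared and the two degrees of freedom, and its p-value obtained from the F distribution with pf( ). With a single predictor, F is also the square of that predictor's t value. All inputs below are copied from the summary output, so small rounding differences are expected.

```r
r2    <- 0.4324   # Multiple R-squared
df1   <- 1        # numerator degrees of freedom (number of predictors)
df2   <- 28       # residual degrees of freedom
t_age <- 4.618    # t value of Age

# F = (explained variation per model df) / (unexplained variation per residual df)
f_stat <- (r2 / df1) / ((1 - r2) / df2)
p_val  <- pf(f_stat, df1, df2, lower.tail = FALSE)

print(round(f_stat, 2))
print(round(t_age^2, 2))   # with one predictor, F is the squared t value
print(signif(p_val, 3))
```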

### Step 6: Run a Sample Test

Now, let’s try using our model to predict the value of blood pressure for someone at age 53.

BP = 98.7147 + 0.9709 Age

We can evaluate the above formula at age 53 with the predict( ) function. Its first argument is the linear regression model, and its newdata argument is the data frame holding the new values, here p, in which we saved Age 53 earlier.

`predict(model, newdata = p)`

Output:

``##        1``
``## 150.1708``

So, the predicted value of blood pressure is 150.17 at age 53.
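
The same prediction can be reproduced by plugging Age 53 into the fitted equation by hand, as in the sketch below. It uses the rounded coefficients printed in the summary, so the result differs slightly from what predict( ) returns with the full-precision coefficients.

```r
# Coefficients as printed in the summary output (rounded)
b0  <- 98.7147   # intercept
b1  <- 0.9709    # Age coefficient
age <- 53

bp_pred <- b0 + b1 * age
print(bp_pred)   # close to the 150.1708 returned by predict()
```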

So far, we have predicted Blood Pressure from a single independent variable, Age. When more than one independent variable is associated with the dependent variable, the technique is called multiple regression.

## Multiple Linear Regression Model

Multiple linear regression analysis is a statistical technique for finding the association of multiple independent variables with a dependent variable. For example, the revenue generated by a company depends on various factors, including market size, price, promotion, and competitors’ prices. In short, a multiple linear regression model establishes a linear relationship between a dependent variable and multiple independent variables.

The equation of multiple linear regression is as follows:
Y = B0 + B1X1 + B2X2 + ... + BkXk + E
Where
Y – Dependent variable
X1, X2, ..., Xk – Independent variables
B0, B1, B2, ..., Bk – Multiple linear regression coefficients
E – Error
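
One way to see how the coefficients are estimated is to solve the least squares normal equations directly and compare the result with lm( ). The sketch below does this on simulated data; the variable names x1, x2, and y and the coefficients used to generate them are assumptions for illustration, not the wine data.

```r
# Simulated data with two predictors, for illustration only
set.seed(123)
n  <- 25
x1 <- rnorm(n, mean = 16, sd = 1)      # hypothetical predictor 1
x2 <- rnorm(n, mean = 150, sd = 70)    # hypothetical predictor 2
y  <- -2 + 0.6 * x1 - 0.005 * x2 + rnorm(n, sd = 0.4)

# Design matrix with a leading column of 1s for the intercept B0
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Solve the normal equations (X'X) B = X'y for the coefficient vector B
b_manual <- solve(crossprod(X), crossprod(X, y))

fit <- lm(y ~ x1 + x2)
print(drop(b_manual))
print(coef(fit))
```

lm( ) itself uses a numerically more stable QR decomposition rather than the normal equations, but both approaches minimize the same sum of squared errors.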

As another example, we will use the wine dataset and predict the price of wine from two of its variables, AGST and HarvestRain. Here AGST and HarvestRain are the independent variables, and Price is the dependent variable.

Here’s how we can build a multiple linear regression model.

### Step 1: Import the Dataset

Using the function read.csv( ), import both data sets, wine.csv and wine_test.csv, into the data frames wine and wine_test, respectively.

``````wine <- read.csv("wine.csv")
wine_test <- read.csv("wine_test.csv")``````

### Step 2: Find the Correlation Between Different Variables

Using the cor( ) and round( ) functions, we can compute the correlation between every pair of variables in the wine dataset, rounded to two decimal places.

``round(cor(wine),2)``

#### Output:

``````##              Year Price WinterRain  AGST HarvestRain   Age FrancePop
## Year         1.00 -0.45       0.02 -0.25        0.03 -1.00      0.99
## Price       -0.45  1.00       0.14  0.66       -0.56  0.45     -0.47
## WinterRain   0.02  0.14       1.00 -0.32       -0.28 -0.02      0.00
## AGST        -0.25  0.66      -0.32  1.00       -0.06  0.25     -0.26
## HarvestRain  0.03 -0.56      -0.28 -0.06        1.00 -0.03      0.04
## Age         -1.00  0.45      -0.02  0.25       -0.03  1.00     -0.99
## FrancePop    0.99 -0.47       0.00 -0.26        0.04 -0.99      1.00``````

### Step 3: Create Scatter Plots Using ggplot2 Library

Create scatter plots using the ggplot2 library in R. The first plot shows that AGST and the Price of the wine are positively correlated (0.66 in the correlation matrix). Similarly, the scatter plot of HarvestRain against Price shows a negative correlation (-0.56).

``library(ggplot2)``
``ggplot(wine, aes(x = AGST, y = Price)) + geom_point() + geom_smooth(method = "lm")``
``ggplot(wine, aes(x = HarvestRain, y = Price)) + geom_point() + geom_smooth(method = "lm")``

### Step 4: Create a Multilinear Regression Model

``````model1 <- lm(Price ~ AGST + HarvestRain,data = wine)
summary(model1)``````

Output:

``````##
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.88321 -0.19600  0.06178  0.15379  0.59722
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265    1.85443  -1.188 0.247585
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808
## F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06``````

#### Interpretation of the Model

``````## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265    1.85443  -1.188 0.247585
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
B0 = -2.20265 (Y-intercept)
B1 = 0.60262 (AGST coefficient)
B2 = -0.00457 (HarvestRain coefficient)
Price = -2.20265 + 0.60262 AGST - 0.00457 HarvestRain``````

It means that, holding HarvestRain constant, a one-unit increase in AGST is associated with a 0.60262-unit increase in Price, and, holding AGST constant, a one-unit increase in HarvestRain is associated with a 0.00457-unit decrease in Price.
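
A short sketch can make the idea of changing one variable at a time concrete. The helper price_hat is a hypothetical name, and the starting values of 16 for AGST and 100 for HarvestRain are arbitrary assumptions; the coefficients are the ones printed in the summary output.

```r
# Fitted equation from the summary output (price_hat is a hypothetical helper name)
price_hat <- function(agst, harvest_rain) {
  -2.20265 + 0.60262 * agst - 0.00457 * harvest_rain
}

base <- price_hat(16, 100)   # arbitrary starting point

# Raise AGST by one unit while HarvestRain stays fixed
effect_agst <- price_hat(17, 100) - base
# Raise HarvestRain by one unit while AGST stays fixed
effect_rain <- price_hat(16, 101) - base

print(effect_agst)   # the AGST coefficient
print(effect_rain)   # the HarvestRain coefficient (negative)
```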

#### Standard Error

The standard error estimates the sampling variability of each coefficient: 1.85443 for the intercept, 0.11128 for AGST, and 0.00101 for HarvestRain. These are typical sizes of the estimation error, not upper bounds on it.

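In this case, the p-values of AGST and HarvestRain are less than alpha (alpha = 0.05), which implies that both are statistically significant in our model. The p-value of the intercept (0.247585) is greater than alpha, so the intercept is not significantly different from zero.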

``## Residual standard error: 0.3674 on 22 degrees of freedom``
``## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808``
``## F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06``

#### Residual Standard Error

The residual standard error, or the standard error of the model, is 0.3674 in our case, which means that our predictions of wine Price are off by about 0.3674 on average. The smaller the error, the better the model predicts. We should also look at the residuals, which need to be approximately normally distributed for the t and F tests above to be reliable.
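
One common way to inspect the residuals is a normal Q-Q plot together with a Shapiro-Wilk test, as in the sketch below. This is only one possible check and is not necessarily what was done for the wine model; the data frame demo and its columns are simulated assumptions.

```r
# Simulated data with two predictors, for illustration only
set.seed(11)
n <- 25
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$y <- 1 + 2 * demo$x1 - demo$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = demo)
res <- residuals(fit)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value would be evidence against normality
print(shapiro.test(res))
```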

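Multiple R-squared is 1 - (sum of squared errors / total sum of squares). Here it is 0.7074, so AGST and HarvestRain together explain about 71% of the variation in Price.

In the F-statistic line, 2 is the numerator degrees of freedom (the number of predictors), and 22 is the degrees of freedom of the errors.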

### Step 5: Predict the Values for Our Test Set

`prediction <- predict(model1, newdata = wine_test)`

The test data set and the predicted values are shown below.

`wine_test`

``````##   Year  Price WinterRain    AGST HarvestRain Age FrancePop
## 1 1979 6.9541        717 16.1667         122   4  54835.83
## 2 1980 6.4979        578 16.0000          74   3  55110.24``````

`prediction`

```##        1        2
## 6.982126 7.101033```
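
To judge how good these predictions are, we can compare them with the actual prices in wine_test. The sketch below reproduces the predictions by hand from the rounded coefficients and computes the root mean squared error (RMSE) on the two test rows. All numbers are copied from the outputs above; RMSE is one reasonable error measure among several, not one the analysis above used.

```r
# Values copied from the wine_test table and the prediction output
actual       <- c(6.9541, 6.4979)
predicted    <- c(6.982126, 7.101033)
agst         <- c(16.1667, 16.0000)
harvest_rain <- c(122, 74)

# Predictions by hand from the rounded coefficients in the summary output
manual <- -2.20265 + 0.60262 * agst - 0.00457 * harvest_rain
print(round(manual, 4))

# Prediction errors and root mean squared error on the test set
errors <- actual - predicted
rmse   <- sqrt(mean(errors^2))
print(round(errors, 4))
print(round(rmse, 4))
```

With only two test rows, the RMSE is a very rough indication of out-of-sample accuracy.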

## Conclusion

Linear regression is a versatile model which is suitable for many situations. As we can see from the available datasets, we can create a simple linear regression model or multiple linear regression model and train that model to accurately predict new events or future outcomes if enough data is available.

#### Key Takeaways

• Simple linear regression analysis is a statistical technique to find the association between an independent and a dependent variable.
• Multiple linear regression analysis is a technique to find the association of multiple independent variables with a single dependent variable.
• Both of these methods are widely used to design ML models in R for various applications.

Q1. What does lm( ) do in R?

A. The lm( ) function fits a linear regression model to data in R.

Q2. How do you find the correlation coefficient in R?

A. You can find the correlation coefficient in R by using the cor( ) function.

Q3. What are the slope and the intercept in linear regression?

A. The slope indicates the rate of change in the dependent variable per unit change in the independent variable. The y-intercept indicates the value of the dependent variable when the independent variable is 0.