Can you predict a company’s revenue by analyzing the budget it allocates to its marketing team? Yes, you can. Do you know how to predict using linear regression in R? Not yet? Well, let me show you how. In this article, we will discuss one of the simplest machine-learning techniques, linear regression. Regression is almost a 200-year-old tool that is still effective in data science. It is one of the oldest statistical tools used in machine learning predictive analysis.

- Understand the definition and significance of Linear regression.
- Explore the various applications of linear regression.
- Learn to implement linear regression algorithms through the sample codes in R found in this tutorial.

This article was published as a part of the Data Science Blogathon.

Simple linear regression analysis in R is a technique for measuring the association between two variables. The dependent variable (response variable) is modeled as a function of the independent variable (predictor variable). Note that the model quantifies association, not causation: it describes how the response tends to change as the predictor changes, without proving that one causes the other. This makes linear regression in R a useful tool for understanding and interpreting relationships in your data.

For example, a firm is investing some amount of money in the marketing of a product, and it has also collected sales data throughout the years. Now, by analyzing the correlation between the marketing budget and the sales data, we can predict next year’s sales if the company allocates a certain amount of money to the marketing department. The above idea of prediction sounds magical, but it’s pure statistics. The linear regression algorithm is basically fitting a straight line to our dataset using the least squares method so that we can predict future events. One limitation of linear regression is that it is sensitive to outliers. The best-fit line would be of the form:

*Y = B0 + B1X*

Where:

- Y – Dependent variable
- X – Independent variable
- B0 and B1 – Regression parameters
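To make the least squares idea concrete, here is a minimal sketch that computes B0 and B1 from the closed-form least-squares formulas and checks them against lm(). The x and y values are made-up illustration data, not one of the article's datasets.

```r
# Hypothetical illustration data (not from the article's datasets)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates for Y = B0 + B1 * X
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() fits the same line
fit <- lm(y ~ x)

print(c(B0 = b0, B1 = b1))
print(coef(fit))
```

Both print statements should show the same intercept and slope, since lm() minimizes the same sum of squared residuals.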

Let’s try to understand the practical application of linear regression in R with another example.

Let’s say we have a dataset of the blood pressure and age of a certain group of people. With the help of this data, we can train a simple linear regression model in R, which will be able to predict blood pressure at ages that are not present in our dataset.


The equation of the regression line for our dataset is:

*BP = 98.7147 + 0.9709 Age*

where BP is the dependent variable (Y) and Age is the independent variable (X).

Now let’s see how to arrive at it.

Import the Age vs. Blood Pressure dataset, a CSV file, using the read.csv() function in R, and store it in a data frame named bp.

`bp <- read.csv("bp.csv")`

Create a data frame that will store Age 53. This data frame will help us predict blood pressure at Age 53 after creating a linear regression model.

```
# One-row data frame holding the age we want a prediction for
p <- data.frame(Age = 53)
```

Using the ggplot2 library in R, a scatter plot shows a positive correlation between blood pressure and age: as age increases, blood pressure tends to increase.

We can also use the plot() function in R to draw the scatter plot and the abline() function to add a straight line.

The graph shows that the data points are scattered in a way that lets us fit a straight line through them.
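The base-R approach mentioned above could look roughly like the following sketch. Because bp.csv is not included here, the data frame bp_demo and the model fit_demo are hypothetical stand-ins with invented values; with the real data you would use bp instead.

```r
# Hypothetical stand-in for bp.csv (values invented for illustration)
bp_demo <- data.frame(
  Age = c(25, 32, 41, 48, 55, 63, 70),
  BP  = c(118, 122, 135, 141, 150, 158, 171)
)

# Scatter plot of blood pressure against age
plot(bp_demo$Age, bp_demo$BP,
     xlab = "Age", ylab = "Blood pressure",
     main = "Blood pressure vs. age")

# Overlay the least-squares line from a fitted model
fit_demo <- lm(BP ~ Age, data = bp_demo)
abline(fit_demo, col = "blue")
```

Assuming ggplot2 is installed, a comparable plot can be drawn with `ggplot(bp, aes(x = Age, y = BP)) + geom_point() + geom_smooth(method = "lm")`, the same pattern this article uses later for the wine data.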

We can verify that there is a correlation between blood pressure and age with the cor() function in R, which calculates the correlation between two variables.

`cor(bp$BP,bp$Age)`

`[1] 0.6575673`

Now let’s build a linear model with the lm() function in R. The formula ‘BP ~ Age’ specifies Blood Pressure as the dependent variable and Age as the independent variable, and the data argument points to our data frame ‘bp’.

`model <- lm(BP ~ Age, data = bp)`

`summary(model)`

**Output:**

```
##
## Call:
## lm(formula = BP ~ Age, data = bp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.724  -6.994  -0.520   2.931  75.654
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  98.7147    10.0005   9.871 1.28e-10 ***
## Age           0.9709     0.2102   4.618 7.87e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared:  0.4324, Adjusted R-squared:  0.4121
## F-statistic: 21.33 on 1 and 28 DF,  p-value: 7.867e-05
```

```
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  98.7147    10.0005   9.871 1.28e-10 ***
## Age           0.9709     0.2102   4.618 7.87e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

- B0 = 98.7147 (Y-intercept)
- B1 = 0.9709 (Age coefficient)

*BP = 98.7147 + 0.9709 Age*

This means that a one-unit increase in Age is associated with an average increase of 0.9709 units in blood pressure.

The standard error estimates how much a coefficient estimate would vary from sample to sample. Here it is 10.0005 for the intercept and 0.2102 for Age. It describes the typical size of this sampling variation, not an upper limit on it.

The t value is the coefficient divided by its standard error, so it measures how large the estimate is relative to its uncertainty. The bigger the coefficient relative to the standard error, the bigger the t value. Each t value comes with a p-value, derived from the t distribution, which indicates how statistically significant the variable is to the model. We compare the p-value with alpha, here 0.05 (a 95% confidence level). In our case, the p-values of both the intercept and Age are less than 0.05, which implies that both are statistically significant to our model.
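As a quick check of this arithmetic, the sketch below recomputes the t values and two-sided p-values from the estimates and standard errors printed in the coefficient table. The variable names are illustrative; only the numbers come from the output above, so small rounding differences are expected.

```r
# Values copied from the coefficient table of the blood pressure model
estimate  <- c(Intercept = 98.7147, Age = 0.9709)
std_error <- c(Intercept = 10.0005, Age = 0.2102)
df_resid  <- 28  # residual degrees of freedom reported by summary()

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(t_value, 3))
print(signif(p_value, 3))
print(p_value < 0.05)  # compare against alpha = 0.05
```

The results should be close to the t values (9.871 and 4.618) and p-values (1.28e-10 and 7.87e-05) shown by summary(model).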

We can calculate the confidence intervals of the coefficients using the confint(model, level = 0.95) function.
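For a linear model, the interval returned by confint() is the estimate plus or minus a t critical value times the standard error. The sketch below reproduces that calculation by hand for the Age coefficient, using the rounded values from the summary output; the variable names are illustrative.

```r
# Estimate and standard error of the Age coefficient, from the summary output
estimate  <- 0.9709
std_error <- 0.2102
df_resid  <- 28

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci_age <- c(lower = estimate - t_crit * std_error,
            upper = estimate + t_crit * std_error)

print(round(ci_age, 3))
```

The interval does not contain zero, which agrees with the Age coefficient being statistically significant at the 5% level.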

`## Residual standard error: 17.31 on 28 degrees of freedom`

`## Multiple R-squared: 0.4324, Adjusted R-squared: 0.4121`

`## F-statistic: 21.33 on 1 and 28 DF, p-value: 7.867e-05`

The residual standard error, or the standard error of the model, is roughly the typical size of a prediction error. It is 17.31 in our case, which means that our model’s blood pressure predictions are off by about 17.31 units on average. The smaller this error, the better the model predicts.

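Multiple R-squared is 1 - (sum of squared errors / total sum of squares), that is, the proportion of the variance in the dependent variable that the model explains. Here it is 0.4324, so Age explains about 43% of the variance in blood pressure.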
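The definition can be checked directly in R. The following sketch uses made-up x and y values (not the bp dataset) to compute R-squared from the sums of squares and compare it with what summary() reports.

```r
# Hypothetical illustration data (not the article's bp dataset)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(3.2, 4.1, 6.8, 7.9, 9.1, 12.4)

fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)      # sum of squared errors
sst <- sum((y - mean(y))^2)       # total sum of squares
r_squared <- 1 - sse / sst

print(r_squared)
print(summary(fit)$r.squared)     # same value, as reported by summary()
print(cor(x, y)^2)                # with one predictor, R-squared = correlation^2
```

The last line explains a detail of the blood pressure model: the correlation 0.6575673 squared is about 0.4324, which matches its Multiple R-squared.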

If we add variables to the model, R-squared will increase (or at least never decrease), whether or not they are significant for prediction. This is why adjusted R-squared is used: it penalizes extra variables, so if an added variable does not contribute enough to the model, the adjusted R-squared goes down. It is a helpful tool for avoiding overfitting.
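The penalty can be written as a short formula. The sketch below defines a helper, adjusted_r2 (a name chosen here for illustration), and applies it to the numbers reported for the blood pressure model. The sample size n = 30 is inferred from the 28 residual degrees of freedom plus the two estimated coefficients.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Blood pressure model: R-squared 0.4324, one predictor,
# n = 30 (28 residual degrees of freedom + 2 estimated coefficients)
adj_bp <- adjusted_r2(0.4324, n = 30, p = 1)
print(round(adj_bp, 4))
```

The result should agree with the Adjusted R-squared of 0.4121 in the summary output, up to rounding.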

The F-statistic is the ratio of the mean square of the model to the mean square of the error. In other words, it compares how much variation the model explains with how much it leaves unexplained, and the higher the F value, the better the model is doing compared to the error.

One is the degrees of freedom of the numerator of the F-statistic (the number of predictors), and 28 is the degrees of freedom of the errors.
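The F-statistic can also be recovered from R-squared and the two degrees of freedom. This sketch uses the rounded values printed for the blood pressure model, so the result may differ slightly in the last digit; the variable names are illustrative.

```r
# Values from the summary output of the blood pressure model
r2  <- 0.4324
df1 <- 1    # numerator df: number of predictors
df2 <- 28   # denominator df: residual degrees of freedom

# F = (explained variation per model df) / (unexplained variation per residual df)
f_stat <- (r2 / df1) / ((1 - r2) / df2)

# Upper-tail probability of the F distribution gives the overall p-value
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

print(round(f_stat, 2))
print(signif(p_value, 4))
```

With a single predictor, the F-statistic is the square of that predictor's t value (4.618 squared is roughly 21.33), which is why the overall p-value matches the p-value of Age.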

Now, let’s try using our model to predict the value of blood pressure for someone at age 53.

*BP = 98.7147 + 0.9709 Age*

We use the above formula to calculate blood pressure at age 53 by calling the predict() function. The first argument is the linear regression model, and the newdata argument is the data frame p, in which we saved Age 53 earlier.

`predict(model, newdata = p)`

Output:

`##        1`

`## 150.1708`

So, the predicted value of blood pressure is 150.17 at age 53.
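To see what predict() does, the sketch below first applies the printed equation by hand, then repeats the idea on a fitted model. The hand calculation gives about 150.1724, slightly different from 150.1708, because the coefficients printed by summary() are rounded. Since bp.csv is not included here, bp_demo and fit_demo are hypothetical stand-ins with invented values.

```r
# Manual calculation with the rounded coefficients printed by summary()
manual_bp <- 98.7147 + 0.9709 * 53
print(manual_bp)

# Hypothetical stand-in for bp.csv (values invented for illustration)
bp_demo <- data.frame(
  Age = c(25, 32, 41, 48, 55, 63, 70),
  BP  = c(118, 122, 135, 141, 150, 158, 171)
)
fit_demo <- lm(BP ~ Age, data = bp_demo)
new_age  <- data.frame(Age = 53)

# predict() applies intercept + slope * Age using the unrounded coefficients
point_pred <- predict(fit_demo, newdata = new_age)
by_hand    <- coef(fit_demo)[["(Intercept)"]] + coef(fit_demo)[["Age"]] * 53

# interval = "prediction" adds a 95% range for a single new observation
pred_range <- predict(fit_demo, newdata = new_age, interval = "prediction")
print(point_pred)
print(pred_range)
```

The prediction interval is worth reporting alongside the point estimate, because a single new observation can fall well away from the fitted line.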

So far we have predicted blood pressure from a single variable, Age. Often more than one independent variable is correlated with the dependent variable. Modeling them together is called multiple regression.

Multiple linear regression analysis is a statistical technique to find the association of multiple independent variables with the dependent variable. For example, the revenue generated by a company depends on various factors, including market size, price, promotion, competitor’s price, etc. In short, a multiple linear regression model establishes a linear relationship between a dependent variable and multiple independent variables.

The equation of multiple linear regression is as follows:

*Y = B0 + B1X1 + B2X2 + ... + BnXn + E*

Where:

- Y – Dependent variable
- X1, X2, ..., Xn – Independent variables
- B0, B1, B2, ..., Bn – Multiple linear regression coefficients
- E – Error
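As a minimal sketch of how these coefficients can be estimated, the code below solves the least-squares normal equations on made-up data with two predictors and compares the result with lm(). The data values and variable names are hypothetical. This shows the underlying idea rather than how lm() is implemented, since lm() uses a QR decomposition internally.

```r
# Hypothetical illustration data with two predictors
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- c(2, 1, 4, 3, 6, 5)
y  <- 1 + 2 * x1 - 3 * x2 + c(0.1, -0.1, 0.2, -0.2, 0.1, -0.1)

# Design matrix: a column of ones for B0, then one column per predictor
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Least-squares solution of the normal equations (X'X) B = X'y
beta <- solve(t(X) %*% X, t(X) %*% y)

# lm() arrives at the same coefficients
fit <- lm(y ~ x1 + x2)

print(drop(beta))
print(coef(fit))
```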

As another example, we take the Wine dataset and use AGST (average growing season temperature) and HarvestRain to predict the price of wine. Here AGST and HarvestRain are the independent variables, and Price is the dependent variable.

Here’s how we can build a multiple linear regression model.

Using the function read.csv(), import both datasets, wine.csv and wine_test.csv, into the data frames wine and wine_test, respectively.

```
wine <- read.csv("wine.csv")
wine_test <- read.csv("wine_test.csv")
```


Using the cor( ) function and round( ) function, we can round off the correlation between all variables of the dataset wine to two decimal places.

`round(cor(wine),2)`

```
##              Year Price WinterRain  AGST HarvestRain   Age FrancePop
## Year         1.00 -0.45       0.02 -0.25        0.03 -1.00      0.99
## Price       -0.45  1.00       0.14  0.66       -0.56  0.45     -0.47
## WinterRain   0.02  0.14       1.00 -0.32       -0.28 -0.02      0.00
## AGST        -0.25  0.66      -0.32  1.00       -0.06  0.25     -0.26
## HarvestRain  0.03 -0.56      -0.28 -0.06        1.00 -0.03      0.04
## Age         -1.00  0.45      -0.02  0.25       -0.03  1.00     -0.99
## FrancePop    0.99 -0.47       0.00 -0.26        0.04 -0.99      1.00
```

Create scatter plots using the ggplot2 library in R. The first shows that AGST and the Price of the wine are positively correlated (0.66 in the matrix above). Similarly, the scatter plot of HarvestRain against Price shows a negative correlation (-0.56).

`ggplot(wine,aes(x = AGST, y = Price)) + geom_point() +geom_smooth(method = "lm")`

`ggplot(wine,aes(x = HarvestRain, y = Price)) + geom_point() +geom_smooth(method = "lm")`

```
model1 <- lm(Price ~ AGST + HarvestRain,data = wine)
summary(model1)
```

**Output:**

```
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.88321 -0.19600  0.06178  0.15379  0.59722
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265    1.85443  -1.188 0.247585
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.6808
## F-statistic: 26.59 on 2 and 22 DF,  p-value: 1.347e-06
```

```
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265    1.85443  -1.188 0.247585
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

- B0 = -2.20265 (Y-intercept)
- B1 = 0.60262 (AGST coefficient)
- B2 = -0.00457 (HarvestRain coefficient)

*Price = -2.20265 + 0.60262 AGST - 0.00457 HarvestRain*

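This means that, holding HarvestRain constant, a one-unit increase in AGST is associated with an average increase of 0.60262 units in Price, and, holding AGST constant, a one-unit increase in HarvestRain is associated with an average decrease of 0.00457 units in Price.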

**The standard error** estimates how much each coefficient estimate would vary from sample to sample. Here it is 1.85443 for the intercept, 0.11128 for AGST, and 0.00101 for HarvestRain.

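In this case, the p-values of AGST and HarvestRain are less than alpha (alpha = 0.05), which implies that both are statistically significant to our model. The p-value of the intercept (0.247585) is greater than alpha, so the intercept is not statistically significant.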

`## Residual standard error: 0.3674 on 22 degrees of freedom`

`## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808`

`## F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06`

**The residual standard error**, or the standard error of the model, is 0.3674 in our case, which means that our model’s predictions of the Price of wines are off by about 0.3674 on average. The smaller the error, the better the model predicts. The residuals should also be approximately normally distributed; the residual quantiles at the top of the output give a first look at this.

**Multiple R-squared** is 1 - (sum of squared errors / total sum of squares). Here it is 0.7074, so the model explains about 71% of the variance in Price.

Two is the degrees of freedom of the numerator of the F-statistic (the number of predictors), and 22 is the degrees of freedom of the errors.

Predict the prices for the test dataset:

`prediction <- predict(model1, newdata = wine_test)`

The test dataset:

`wine_test`

```
##   Year  Price WinterRain    AGST HarvestRain Age FrancePop
## 1 1979 6.9541        717 16.1667         122   4  54835.83
## 2 1980 6.4979        578 16.0000          74   3  55110.24
```

The predicted values:

`prediction`

`##        1        2`

`## 6.982126 7.101033`
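One way to judge these predictions is to compare them with the actual Price column of wine_test. The sketch below does this with the numbers printed above and computes the root mean squared error (RMSE). The variable names are illustrative, and with only two test rows this is a rough check rather than a reliable estimate of model accuracy.

```r
# Actual and predicted prices for the two test vintages shown above
actual    <- c(6.9541, 6.4979)
predicted <- c(6.982126, 7.101033)

# Prediction errors and root mean squared error on the test set
errors <- actual - predicted
rmse   <- sqrt(mean(errors^2))

print(round(errors, 4))
print(round(rmse, 4))
```

The first vintage (1979) is predicted closely, while the second (1980) is overestimated by about 0.6, which dominates the RMSE.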

Linear regression in R is a versatile model suitable for many situations. Depending on the available data, we can build a simple linear regression model or a multiple linear regression model. Trained on enough representative data, these models can give useful predictions for new events or future outcomes.

**Key Takeaways**

- Simple linear regression analysis in R is a statistical technique to find the association between an independent and a dependent variable.
- Multiple linear regression analysis is a technique to find the association of multiple independent variables with a single dependent variable.
- Both of these methods are widely used to design ML models in R for various applications.

A. The lm() function is used to fit the linear regression model to the data in R language.

A. You can find the correlation coefficient in R by using the cor( ) function.

A. The slope indicates the rate of change in the dependent variable per unit change in the independent variable. The y-intercept indicates the dependent variable when the independent variable is 0.
