Interested in predictive analytics? Then research artificial intelligence, machine learning, and deep learning.

Let’s start with a brief introduction to linear regression in sklearn. **Regression** is a statistical method used to determine the strength and nature of the relationship between independent and dependent variables. Independent variables are the inputs whose values are used to obtain the output, and dependent variables are those whose values depend on the independent variables. Several regression algorithms are commonly used in Python to train machine learning models, such as simple linear regression, lasso, and ridge.

In this tutorial, we will discuss the multiple linear regression (MLR) model, also called multilinear regression, and see how simple linear regression differs from MLR in Python.

**Learning objectives**

- Understand the difference between simple linear regression and multiple linear regression in Python’s Scikit-learn library.
- Learn how to read datasets and handle categorical variables for MLR using Scikit-learn.
- Apply Scikit-learn’s linear regression algorithm to train a model for MLR.

*This article was published as a part of the Data Science Blogathon.*

Multiple linear regression (MLR) means that we have several input features, such as **f1**, **f2**, **f3**, and **f4**, and an output feature **f5**. Taking a house-price example, suppose:

**f1** is the size of the house,

**f2** is the number of bedrooms in the house,

**f3** is the locality of the house,

**f4** is the condition of the house, and

**f5** is our output feature, which is the price of the house.

Now you can see that multiple independent features affect the price of the house, so the price can vary with each of them. In multiple linear regression, the simple linear regression equation **y = A + Bx** becomes:

equation: y = A + B1x1 + B2x2 + B3x3 + B4x4

“If we have one dependent feature and multiple independent features, then we call it multiple linear regression.”

Our aim in multiple linear regression is to compute **A**, the intercept, and the key parameters **B1, B2, B3, and B4**, which are the slopes or coefficients of the corresponding independent features. If we increase the value of **x1** by 1 unit while keeping the other features fixed, then **B1** tells us how much the price of the house changes. **B2, B3, and B4** work the same way.

So, this is a short theoretical description of multiple linear regression. Now we will use scikit-learn’s linear regression to solve a multiple linear regression problem.
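To make the role of the intercept and the coefficients concrete, here is a minimal sketch that evaluates the equation by hand. The values of A, B1 to B4, and the house features are invented for illustration only (locality and condition are assumed to be numeric scores), so the resulting price means nothing beyond the arithmetic.

```python
import numpy as np

# Hypothetical intercept and coefficients, chosen only for illustration
A = 50_000.0
B = np.array([120.0, 8_000.0, 5_000.0, 3_000.0])  # B1, B2, B3, B4

# One made-up house: size, bedrooms, locality score, condition score
x = np.array([1_500.0, 3.0, 7.0, 4.0])

# y = A + B1*x1 + B2*x2 + B3*x3 + B4*x4
price = A + np.dot(B, x)
print(price)

# Increase x1 by one unit and keep everything else fixed
x_bigger = x.copy()
x_bigger[0] += 1
price_bigger = A + np.dot(B, x_bigger)

# The prediction changes by exactly B1
print(price_bigger - price)
```

The difference printed at the end equals B1, which is what "the effect of a one-unit increase in x1" means in this model.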

Multiple linear regression is a statistical technique used to analyze the relationship between two or more independent variables and a dependent variable. It’s an extension of simple linear regression, which deals with only one independent variable. Here’s an example of how to use multiple linear regression in Python with the popular library, scikit-learn:

```
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# Generate a sample dataset (you would typically load your own dataset)
data = {'X1': [1, 2, 3, 4, 5],
'X2': [2, 3, 4, 5, 6],
'Y': [3, 5, 7, 9, 11]}
df = pd.DataFrame(data)
# Split the data into independent variables (X) and the dependent variable (Y)
X = df[['X1', 'X2']]
Y = df['Y']
# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# Create a linear regression model
model = LinearRegression()
# Fit the model to the training data
model.fit(X_train, Y_train)
# Make predictions on the test set
Y_pred = model.predict(X_test)
# Evaluate the model
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_pred))
# RMSE as the square root of MSE (squared=False is deprecated in recent scikit-learn releases)
print('Root Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_pred) ** 0.5)
# Print the coefficients and intercept
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
```

In this example:

- We create a sample dataset with two independent variables (X1 and X2) and one dependent variable (Y).
- The data is split into training and testing sets using the train_test_split function.
- A linear regression model is created and fitted to the training data.
- Predictions are made on the test set.
- The model is evaluated using metrics like Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error.
- Finally, the coefficients and intercept of the regression equation are printed.

If you are on the path of learning data science, then you probably already have an understanding of what machine learning is. It is one of the most widely discussed technologies today, and much of its progress is driven by data scientists like you and me.

Now, for those of you who don’t know what machine learning is, here’s a brief introduction:

**Machine learning** is the study of computer algorithms that improve automatically through experience and by the use of data. An algorithm builds a model based on the data we provide during training. This is the simple definition; going deeper, we find that a huge number of algorithms are used in model building. The most commonly used machine learning algorithms are grouped by the type of problem, such as **regression** and **classification**. But today, we will only talk about sklearn linear regression algorithms.

Let’s first consider simple linear regression using an example.

Suppose we take a scenario of house prices where the x-axis is the size of the house and the y-axis is the price of the house. In this example, we have two features, **f1** and **f2**, where

**f1** refers to the size of the house and

**f2** refers to the price of the house.

Here **f1** is the independent feature and **f2** is the dependent feature: we usually expect that whenever the size of the house increases, the price also increases. Suppose we draw the data as a scatter plot. Through these scattered points, we try to find the best-fit line, which is given by the equation:

equation: y = A + Bx

Suppose **y** is the price of the house and **x** is the size of the house; then the equation becomes:

equation: price = A + B(size)

where A is the intercept and B is the slope.

In this equation, the intercept indicates what the base price of the house would be when the size of the house is 0. Meanwhile, the slope or coef (coefficient) indicates the increase in price for each unit increase in size.
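As a small illustration of the intercept and slope, the sketch below fits scikit-learn’s LinearRegression to made-up size and price values that lie exactly on the line price = 20000 + 150 × size. The numbers are assumptions chosen so the recovered A and B are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line price = 20000 + 150 * size
size = np.array([[500.0], [750.0], [1000.0], [1250.0], [1500.0]])
price = 20_000 + 150 * size.ravel()

# Fit simple linear regression: one independent feature
model = LinearRegression().fit(size, price)

A = model.intercept_   # base price when size is 0
B = model.coef_[0]     # change in price per unit increase in size
print('A (intercept):', A)
print('B (slope):', B)
```

With real data the points will not fall exactly on a line, so A and B will be the values that minimise the squared error rather than exact matches.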

Now, how is it different when compared to multiple linear regression?

There are 4 steps to follow to train a machine-learning model to do multiple linear regression. Let’s look into each of these steps in detail while applying multiple linear regression on the **50_Startups** dataset. The dataset can be downloaded as a CSV file.

Most datasets come in CSV format; to read such a file, we use the pandas library:

```
# importing pandas and reading the CSV file into a DataFrame
import pandas as pd
df = pd.read_csv('50_Startups.csv')
df
```

Here you can see that there are 5 columns in the dataset, where the **State** column stores categorical data points and the rest are numerical features.

Now, we have to classify independent and dependent features.

**Independent and Dependent Variables**

There are a total of 5 columns in the dataset, of which Profit is our dependent feature, and the rest are our independent features.

**Python Code:**
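A minimal sketch of this separation is shown below. It assumes the usual column names of 50_Startups.csv (R&D Spend, Administration, Marketing Spend, State, Profit) and uses a tiny stand-in DataFrame with invented rows so that it runs without the file. Note that the target must be selected from the DataFrame with `df['Profit']`; a plain list such as `['Profit']` has only one element and leads to a sample-count mismatch later.

```python
import pandas as pd

# Tiny stand-in for 50_Startups.csv; the rows are invented for illustration
df = pd.DataFrame({
    'R&D Spend': [165000.0, 150000.0, 130000.0, 90000.0],
    'Administration': [136000.0, 151000.0, 101000.0, 118000.0],
    'Marketing Spend': [471000.0, 443000.0, 407000.0, 383000.0],
    'State': ['New York', 'California', 'Florida', 'New York'],
    'Profit': [192000.0, 191000.0, 166000.0, 140000.0],
})

# Dependent variable: select the column from the DataFrame
y = df['Profit']

# Independent variables: every column except the target
x = df.drop('Profit', axis=1)

print(x.columns.tolist())
print(y.shape)
```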

In our dataset, there is one categorical column, **State**. We must handle the categorical values inside this column as part of data preprocessing. For that, we will use pandas’ **get_dummies()** function:

```
# handle categorical variable: encode only the State column
states = pd.get_dummies(x['State'], drop_first=True)
# dropping the original categorical column
x = x.drop('State', axis=1)
# concatenation of independent variables and new categorical variables
x = pd.concat([x, states], axis=1)
x
```
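To see what the encoding step produces, here is a minimal sketch on a made-up three-row frame (the values are assumptions, not rows from the dataset). With `drop_first=True`, the first category in sorted order is dropped and is represented by all remaining dummy columns being zero.

```python
import pandas as pd

# Made-up frame with one numeric and one categorical column
x = pd.DataFrame({
    'R&D Spend': [165000.0, 150000.0, 130000.0],
    'State': ['New York', 'California', 'Florida'],
})

# Encode only the categorical column; categories are sorted,
# so 'California' is the first one and gets dropped
states = pd.get_dummies(x['State'], drop_first=True)
print(states.columns.tolist())

# Replace the original column with the dummy columns
x_encoded = pd.concat([x.drop('State', axis=1), states], axis=1)
print(x_encoded.columns.tolist())
```

Dropping one category avoids a redundant column: if a row is neither Florida nor New York, it must be California.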

Now, we have to split the data into training and test sets using the scikit-learn **train_test_split()** function.

```
# importing train_test_split from sklearn
from sklearn.model_selection import train_test_split
# splitting the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)
```
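As a quick check of what the split returns, the sketch below uses placeholder arrays with 50 rows (an assumption standing in for the real features and target). With `test_size = 0.2`, 40 rows go to training and 10 to testing, and each feature row stays paired with its target value. Both inputs must have the same number of rows, otherwise scikit-learn raises an "inconsistent numbers of samples" error.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 rows, 2 features; row i is [2*i, 2*i + 1] and its target is i
x = np.arange(100).reshape(50, 2)
y = np.arange(50)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)

# Rows and targets are shuffled together, so the pairing is preserved
print(np.array_equal(x_train[:, 0], 2 * y_train))
```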

Now, we apply the linear regression model to our training data. First of all, we have to import LinearRegression from the scikit-learn library. There is no separate class for MLR: the same LinearRegression estimator handles both simple and multiple linear regression.

```
# importing module
from sklearn.linear_model import LinearRegression
# creating an object of LinearRegression class
LR = LinearRegression()
# fitting the training data
LR.fit(x_train,y_train)
```

Finally, once we execute this, our model is ready. Now we have x_test data, which we will use for the prediction of **profit**.

```
y_prediction = LR.predict(x_test)
y_prediction
```

Now, we have to compare the y_prediction values with the original values to measure how well our model performs. For this we use a metric called **r2_score**. Let’s briefly discuss r2_score:

**r2_score:**

It is a function inside the sklearn.metrics module. The value of **r2_score** is at most **1.0** (100 percent) and is typically between 0 and 1 for a reasonable model. It is closely related to MSE.

r2 is calculated by the formula given below:

formula: r2 = 1 – (SSres / SSmean)

Here, **SSres** is the sum of squared residuals, and **SSmean** is the sum of squared deviations from the mean:

SSres = Σ (y – y^)²

SSmean = Σ (y – ȳ)²

where,

**y** = original values,

**y^** = predicted values, and

**ȳ** = mean of the original values.

From this equation, we infer that when the sum of squared residuals is smaller than the sum of squared deviations from the mean, the model predicts better than simply using the mean, and r2 lies between 0.0 and 1. The closer it is to 1, the better the model is for predictions.

“The proportion of the variance in the dependent variable or target variable that is predictable from the independent variable(s) or predictor.”

The best possible score is 1.0, and the score can be negative because the model can be arbitrarily worse than a constant prediction. A constant model that always predicts the expected value of y, disregarding the input features, would get an R2 score of 0.0.
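The formula can be checked against scikit-learn’s `r2_score` with a few lines of NumPy. The sketch below uses small made-up arrays (an assumption for illustration) and also shows the two special cases mentioned above: a constant prediction equal to the mean, and a prediction that is worse than the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up original and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# r2 = 1 - SSres / SSmean, computed by hand
ss_res = np.sum((y_true - y_pred) ** 2)
ss_mean = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_mean

print('manual r2 :', r2_manual)
print('sklearn r2:', r2_score(y_true, y_pred))

# A constant model that always predicts the mean scores 0.0
constant_pred = np.full_like(y_true, y_true.mean())
r2_constant = r2_score(y_true, constant_pred)
print('constant model:', r2_constant)

# Predictions worse than the mean give a negative score
bad_pred = y_true[::-1].copy()
r2_bad = r2_score(y_true, bad_pred)
print('bad model:', r2_bad)
```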

```
# importing numpy and the metric functions
import numpy as np
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
# computing the r2 score on the test set
score = r2_score(y_test, y_prediction)
print('r2 score is', score)
print('mean squared error is', mean_squared_error(y_test, y_prediction))
print('root mean squared error is', np.sqrt(mean_squared_error(y_test, y_prediction)))
```

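You can see that the r2 score is greater than 0.8, which means the model explains most of the variance in profit and can be used for this multiple linear regression problem. Note that the mean squared error and root mean squared error are expressed in the (squared) units of profit, so they are not limited to the range 0.0 to 1 and should be judged relative to the scale of the target.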
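The difference in scale between these metrics can be illustrated with a short sketch. The arrays below are made up; the same predictions are evaluated once as given and once with every value multiplied by 1000 (for example, switching from thousands of dollars to dollars). RMSE grows with the units, while the r2 score stays the same.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

rmse_units = np.sqrt(mean_squared_error(y_true, y_pred))
r2_units = r2_score(y_true, y_pred)

# The same data expressed in units 1000 times smaller (values 1000 times larger)
scale = 1000.0
rmse_scaled = np.sqrt(mean_squared_error(y_true * scale, y_pred * scale))
r2_scaled = r2_score(y_true * scale, y_pred * scale)

print('RMSE:', rmse_units, '->', rmse_scaled)
print('r2  :', r2_units, '->', r2_scaled)
```

This is why a large MSE on a target measured in large currency amounts does not by itself indicate a poor model.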

Multiple Linear Regression is a statistical method used to study the linear relationship between a dependent variable and multiple independent variables. In the article above, we learned step-by-step how to implement MLR in Python using the Scikit-learn library. We used a simple example of predicting house prices to explain how simple linear regression works and then extended the example to MLR, which involves more than one independent variable. I hope now you have a better understanding of the topic.

**Key Takeaways**

- Multiple linear regression is an extension of simple linear regression, where multiple independent variables are used to predict the dependent variable.
- Scikit-learn, a machine learning library in Python, can be used to implement multiple linear regression models and to preprocess and split data, while pandas is used to read the dataset.
- Categorical variables can be handled in multiple linear regression using one-hot encoding or label encoding.

A. Some of the commonly used visualization libraries for Multiple Linear Regression in Python are Matplotlib, Seaborn, Plotly, and ggplot. These libraries can be used to create a range of plots (like the scatter plot) and charts, to better understand relationships between variables, detect patterns and trends, and communicate results to stakeholders.

A. Simple linear regression is a statistical method used to analyze the relationship between one independent variable and one dependent variable. On the other hand, multiple regression is a statistical method used to analyze the relationship between one dependent variable and two or more independent variables.

A. Follow the steps below to use scikit-learn’s linear regression in Python:

1. First, import the LinearRegression class from scikit-learn’s linear_model module.

2. Then, create an instance of the LinearRegression object and fit your data to the model using the fit() method.

3. Once the model is trained, you can make predictions on new data using the predict() method.

4. Finally, you can evaluate the performance of the model using various metrics, such as R-squared, mean squared error, or mean absolute error.
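The four steps above can be sketched end to end as follows. The data is synthetic, generated from assumed coefficients (2.0, -1.0, 0.5) and an assumed intercept of 4.0 plus a little noise, so that the fitted values can be compared with the ones used to create it.

```python
import numpy as np
# Step 1: import the estimator (plus helpers for splitting and evaluation)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y = 4 + 2*x1 - 1*x2 + 0.5*x3 + small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 3))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: create the model and fit it to the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Step 3: predict on unseen data
y_pred = model.predict(X_test)

# Step 4: evaluate with several metrics
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print('coefficients:', model.coef_)
print('intercept:', model.intercept_)
print('r2:', r2, 'mse:', mse, 'mae:', mae)
```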


A little paragraph to express the final regression line would make it complete! And also a little idea about final visualization complexities would provide beginners such as me a better terrain grasp. But Great Work!!

Great introduction. The code under separating dependent and independent variables should be corrected. y = ['profit'] gives errors. Also, you should allocate the dependent variable 'y' before dropping it from the dataframe. Something like this: #separate the other attributes from the predicting attribute y = df['Profit'] x = df.drop('Profit',axis=1) Thanks for the good work.

while splitting the data I have getting value error ValueError: Found input variables with inconsistent numbers of samples: [50, 1] can you tell me why this is happening in my case? please

Hi Team, Thank you for such an informative blog Just a correction, #separte the predicting attribute into Y for model training i guess it should be y=df['Profit'] instead of y = ['profit'] Thanks Ankita

Why are you getting MSE and RMSE in such a large value? I thought they should be within the range of 0.0 to 1?

In the line "x= x.drop(‘State’,axis=1)", where exactly is "x" initially defined? The article mentions nothing about it before this line.