This article was published as a part of the Data Science Blogathon.

In this article, we work through multiple linear regression using a dataset that contains information about 50 startups. The columns are R&D Spend, Administration, Marketing Spend, State, and Profit. The goal is to build a machine learning model that predicts a startup's profit from the other four columns.

Let’s get started.

Multiple linear regression is a machine learning algorithm that models a single dependent variable using several independent variables. Simple linear regression, by contrast, uses only one independent variable.

Let’s start by importing some libraries.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

train_test_split splits the dataset into training and testing sets. LinearRegression is the model we will work with; it comes from the scikit-learn library. r2_score computes the R² score (coefficient of determination), which we use to evaluate the model. Matplotlib and seaborn are used for visualizations. Finally, warnings.filterwarnings("ignore") suppresses the warnings that would otherwise be shown while the code runs.

Here is the link for the dataset. Download it and import it by passing the path of the dataset file into read_csv().

Let us view our data frame.

**Python Code:**
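As a minimal sketch of the loading step, the snippet below reads CSV text with read_csv and shows the first rows. The three rows are made-up placeholders that only mimic the column layout described above, not values from the real dataset, and io.StringIO stands in for the path of the downloaded file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file: three made-up rows that only
# mimic the column layout described in the article (not the real data).
csv_text = """R&D Spend,Administration,Marketing Spend,State,Profit
1000.0,500.0,2000.0,New York,3000.0
800.0,450.0,1500.0,California,2500.0
600.0,400.0,1000.0,Florida,2000.0
"""

# read_csv accepts a file path or a file-like object; with the real dataset
# you would pass the path of the downloaded file instead of io.StringIO(...).
startup_df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first rows of the data frame (five by default).
print(startup_df.head())
```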

View the shape of the data frame.

shape=startup_df.shape
print("Dataset contains {} rows and {} columns".format(shape[0],shape[1]))

The dataset contains 50 rows and 5 columns.

View all the columns in the data frame.

startup_df.columns

Data frame contains R&D Spend, Administration, Marketing Spend, State, and Profit.

View the statistical description of the dataset. For each numeric column it includes the count of values, the mean, the standard deviation, the minimum and maximum, and the 25th, 50th, and 75th percentiles.

#Statistical Details of the dataset
startup_df.describe()

Next, we separate the independent and dependent variables.

We have to define x and y for the model: x holds the input features (the independent variables) and y holds the output (the dependent variable). Given the values in x, the model will predict y.

x=startup_df.iloc[:,:4]
y=startup_df.iloc[:,4]
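The positional split relies on Profit being the fifth column. As a small sketch on made-up rows (toy_df and its values are illustrative assumptions, not the real data), the same split can also be written by column name, which does not depend on column order:

```python
import pandas as pd

# Made-up rows that only mirror the column order described in the article.
toy_df = pd.DataFrame({
    "R&D Spend": [1000.0, 800.0],
    "Administration": [500.0, 450.0],
    "Marketing Spend": [2000.0, 1500.0],
    "State": ["New York", "Florida"],
    "Profit": [3000.0, 2500.0],
})

# Positional selection, as in the article: first four columns, then the fifth.
x_by_position = toy_df.iloc[:, :4]
y_by_position = toy_df.iloc[:, 4]

# Selection by name, which does not depend on the column order.
x_by_name = toy_df.drop(columns="Profit")
y_by_name = toy_df["Profit"]

# Both approaches select the same data here.
print(x_by_position.equals(x_by_name), y_by_position.equals(y_by_name))
```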

We use one-hot encoding when the dataset contains categorical values. Here the State column is categorical, so we have to one-hot encode it.

So, import OneHotEncoder from the scikit-learn library.

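from sklearn.preprocessing import OneHotEncoder
#dense output; on scikit-learn older than 1.2, pass sparse=False instead
ohe=OneHotEncoder(sparse_output=False)
#store the result in its own variable so that x (the feature data frame) is not overwritten
state_encoded=ohe.fit_transform(startup_df[['State']])

View the encoded array.

state_encoded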

array([[0., 0., 1.],

[1., 0., 0.],

[0., 1., 0.],

[0., 0., 1.],

[0., 1., 0.],

[0., 0., 1.],

[1., 0., 0.],

[0., 1., 0.],

[0., 0., 1.],

[1., 0., 0.],

[0., 1., 0.],

[1., 0., 0.],

[0., 1., 0.],

[1., 0., 0.],

[0., 1., 0.],

[0., 0., 1.],

[1., 0., 0.],

[0., 0., 1.],

[0., 1., 0.],

[0., 0., 1.],

[1., 0., 0.],

[0., 0., 1.],

[0., 1., 0.],

[0., 1., 0.],

[0., 0., 1.],

[1., 0., 0.],

[0., 1., 0.],

[0., 0., 1.],

[0., 1., 0.],

[0., 0., 1.],

[0., 1., 0.],

[0., 0., 1.],

[1., 0., 0.],

[0., 1., 0.],

[1., 0., 0.],

[0., 0., 1.],

[0., 1., 0.],

[1., 0., 0.],

[0., 0., 1.],

[1., 0., 0.],

[1., 0., 0.],

[0., 1., 0.],

[1., 0., 0.],

[0., 0., 1.],

[1., 0., 0.],

[0., 0., 1.],

[0., 1., 0.],

[1., 0., 0.],

[0., 0., 1.],

[1., 0., 0.]])

The result is an array like the one above. Let us see which three categories its columns stand for.

ohe.categories_

[array(['California', 'Florida', 'New York'], dtype=object)]

Here [0., 0., 1.] indicates New York, [0., 1., 0.] indicates Florida and [1., 0., 0.] indicates California.
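The mapping between categories and columns can be checked on a tiny example. This is a sketch on four made-up state values; the names states, encoder, and encoded are illustrative. It uses the default (sparse) encoder output and converts it with toarray(), which sidesteps the sparse / sparse_output argument whose name depends on the installed scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up values; only the three state names come from the article.
states = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

# The default encoder returns a sparse matrix; toarray() makes it dense.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(states).toarray()

# Categories are sorted, so column 0 is California, 1 is Florida, 2 is New York.
print(encoder.categories_[0])
print(encoded)
```

Each row has a single 1 in the column of its state, so the first and last rows (New York) come out as [0., 0., 1.].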

To encode only the State column and leave the numeric columns unchanged, import make_column_transformer from the scikit-learn library and pass it the column that we want to transform.

from sklearn.compose import make_column_transformer

col_trans=make_column_transformer( (OneHotEncoder(handle_unknown='ignore'),['State']), remainder='passthrough')

x=col_trans.fit_transform(x)

Now view x.

It will look like this.
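A minimal sketch of what the column transformer produces, using three made-up rows (toy_x and its values are assumptions for illustration). The one-hot columns for State come first, followed by the three numeric columns that remainder='passthrough' lets through unchanged, giving six columns in total.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows with the four feature columns described in the article.
toy_x = pd.DataFrame({
    "R&D Spend": [1000.0, 800.0, 600.0],
    "Administration": [500.0, 450.0, 400.0],
    "Marketing Spend": [2000.0, 1500.0, 1000.0],
    "State": ["New York", "California", "Florida"],
})

col_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["State"]),
    remainder="passthrough",
)
transformed = col_trans.fit_transform(toy_x)

# Depending on how sparse the result is, scikit-learn may return a sparse
# matrix; convert it so the rest of the sketch always sees a dense array.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

print(transformed.shape)
print(transformed[0])  # New York row: one-hot part first, then the numbers
```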

Now, split the dataset into two parts: 80% of the rows go to the training set and 20% go to the testing set. You can choose a different ratio by changing the value of test_size.

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

View the shapes of the split data.

#shapes of split data
print("X_train:",x_train.shape)
print("X_test:",x_test.shape)
print("Y_train:",y_train.shape)
print("Y_test:",y_test.shape)

X_train: (40, 6)

X_test: (10, 6)

Y_train: (40,)

Y_test: (10,)
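These shapes follow directly from test_size=0.2 on 50 rows. The sketch below reproduces them on placeholder arrays of the same shape (the values are arbitrary and the names toy_x, toy_y are illustrative), and shows that a fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the encoded data:
# 50 rows and 6 feature columns. The values themselves are arbitrary.
toy_x = np.arange(300, dtype=float).reshape(50, 6)
toy_y = np.arange(50, dtype=float)

x_tr, x_te, y_tr, y_te = train_test_split(toy_x, toy_y, test_size=0.2, random_state=0)
print(x_tr.shape, x_te.shape, y_tr.shape, y_te.shape)

# Splitting again with the same random_state selects the same rows.
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(toy_x, toy_y, test_size=0.2, random_state=0)
print(np.array_equal(x_te, x_te2))
```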

To train the model, we use the LinearRegression class that we imported at the beginning. Create an instance, then call its fit method with the training sets.

linreg=LinearRegression()
linreg.fit(x_train,y_train)

Use the predict method to predict the results: pass the independent variables of the test set into it and view the output. It returns an array with one predicted value per test row.

y_pred=linreg.predict(x_test)
y_pred

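Several metrics can be used to evaluate a regression model. Here we use r2_score, which returns the R² score (coefficient of determination): the share of the variance in y that the model explains. It is not a classification accuracy, although it is often reported as a percentage.

r2=r2_score(y_test,y_pred)*100
print("R2 score of the model is %.2f" %r2)

The R² score of the model is 93.47.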
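To make the metric concrete, here is a small sketch that computes R² by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with r2_score. The numbers are arbitrary illustrative values, not taken from the startup dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Arbitrary illustrative numbers, not values from the article's dataset.
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# The hand-computed value matches scikit-learn's r2_score.
print(r2_manual, r2_score(actual, predicted))
```

Here the residual sum of squares is 0.75 and the total sum of squares is 20, so R² is 1 - 0.75/20 = 0.9625.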

We will plot the scatter plot between actual values and predicted values. Use xlabel to label the x-axis and use ylabel to label the y-axis.

plt.scatter(y_test,y_pred); plt.xlabel('Actual'); plt.ylabel('Predicted');

Regression plot of our model.

A regression plot helps to show the linear relationship between two variables. It draws a scatter plot of the data points and fits a regression line through them.

sns.regplot(x=y_test,y=y_pred,ci=None,color ='red');

Let us create a new data frame that contains the actual values, the predicted values, and the differences between them, so that we can see how close the predictions are to the actual values.

pred_df=pd.DataFrame({'Actual Value':y_test,'Predicted Value':y_pred,'Difference':y_test-y_pred})

View the data frame.

pred_df

Here we can see that the differences between the actual and predicted values are not very large. When the values are in the range of lakhs (hundreds of thousands), a difference of a few thousand is small.

We have already seen that the R² score of this model is about 93 percent.

A. The formula for the slope coefficients (β) in multiple linear regression is:

β = (X’X)^(-1) X’Y

where X is the design matrix (containing the independent variables), Y is the vector of the dependent variable, and “^(-1)” denotes the inverse of a matrix.
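The formula can be tried out with NumPy. This is a sketch on toy data generated exactly from y = 1 + 2·x1 + 3·x2, so the recovered coefficients are known in advance; the names features, target, X, and beta are illustrative. It solves the equivalent system (X'X)β = X'Y with np.linalg.solve instead of computing the inverse explicitly, which gives the same result when X'X is invertible.

```python
import numpy as np

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
features = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
target = 1.0 + 2.0 * features[:, 0] + 3.0 * features[:, 1]

# Design matrix X: a leading column of ones accounts for the intercept.
X = np.column_stack([np.ones(len(features)), features])

# Solve (X'X) beta = X'Y rather than forming the inverse explicitly.
beta = np.linalg.solve(X.T @ X, X.T @ target)

# beta holds the intercept followed by the two slope coefficients.
print(beta)
```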

A. The equation for multiple regression is:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where Y is the dependent variable, X1, X2, …, Xk are the independent variables, β0 is the intercept, β1, β2, …, βk are the coefficients of the independent variables, and ε is the error term.
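In scikit-learn, β0 corresponds to the fitted model's intercept_ attribute and β1 to βk to its coef_ attribute. The sketch below, on noise-free toy data built from y = 4 + 0.5·x1 - 2·x2 (an assumption made so the expected coefficients are known), checks that evaluating the equation by hand gives the same values as predict.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated exactly from y = 4 + 0.5*x1 - 2*x2 (no error term).
features = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
target = 4.0 + 0.5 * features[:, 0] - 2.0 * features[:, 1]

model = LinearRegression().fit(features, target)

# Evaluate beta0 + beta1*X1 + beta2*X2 by hand from the fitted parameters.
manual_prediction = model.intercept_ + features @ model.coef_

print(model.intercept_, model.coef_)
print(np.allclose(manual_prediction, model.predict(features)))
```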

In this article, we built a linear regression model and learned how and where to apply one-hot encoding. We used a column transformer, trained the model, predicted the results, evaluated the model using the r2_score metric, and plotted the results.

Hope you guys found it useful.


Connect with me on LinkedIn: https://www.linkedin.com/in/amrutha-k-6335231a6vl/

**The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.**
