**Introduction**

In a general **Machine Learning Project**, the most important thing is to have a clear idea of the pipeline, which mainly includes the following components:

- Feature Selection,
- Exploratory Data Analysis,
- Feature Engineering,
- Model Building and Evaluation,
- Save the Model and use it, etc.

Therefore, as a beginner, it is very important to understand the Machine Learning pipeline before taking on any general Data Science project.

Let’s understand the motivation behind all of this:

**Why are we doing these things?**

After working through these steps, we can implement any Machine Learning project in a stepwise manner. This gives us better clarity about the project and lets us explain it to anyone, so that it does not look like a **“Black-box”**.

So, in this article, we will discuss the complete Machine Learning pipeline with the help of a machine learning project and go through all the steps in detail.

**Table of Contents**

**1.** Import Necessary Dependencies

**2.** Get to Know the Data

**3.** Read and Load the Dataset

**4.** Exploratory Data Analysis (EDA)

**5.** Splitting of Data into Training and Testing Subsets

**6.** Training the Model using the Linear Regression Algorithm

**7.** Predictions on Test Data

**8.** Evaluating the Model

**9.** Explore the Residuals

**10.** Conclusion

**Pre-requisites:**

A basic understanding of the Linear Regression algorithm. If you are not familiar with the algorithm, please refer to the **link** before moving on to the later parts of the article, so that you have a basic understanding of all the concepts we will cover.

**Let’s get started,**

__Step-1: Import Necessary Dependencies__

In this step, we will import the necessary libraries:

- **For Linear Algebra:** NumPy
- **For Data Preprocessing and CSV File I/O:** Pandas
- **For Model Building and Evaluation:** Scikit-Learn
- **For Data Visualization:** Matplotlib and Seaborn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

__Step-2: Get to Know the Data__

Here we will work on the **E-commerce Customers dataset (CSV file)**. It contains customer information such as Email, Address, and the color of their Avatar. It also has the following numerical columns:

- **Avg. Session Length:** Average length of in-store style advice sessions.
- **Time on App:** Average time spent by the customer on the App, in minutes.
- **Time on Website:** Average time spent by the customer on the Website, in minutes.
- **Length of Membership:** Number of years the customer has been a member.

__Step-3: Read and Load the Dataset__

In this step, we will read and load the dataset using some basic pandas functions:

- **To load the CSV file:** pd.read_csv()
- **To print some initial rows of the dataset:** df.head()
- **Statistical details for numerical columns:** df.describe()
- **Basic information about the dataset:** df.info()

**3.1: Load the Dataset**

df = pd.read_csv('Ecommerce Customers.csv')

**3.2: Print some initial rows of the dataset**

df.head()

__Output:__

**3.3: Statistical Details for Numerical Columns**

df.describe()

__Output:__

**3.4: Basic Information about the dataset**

df.info()

__Output:__
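Before moving on to the plots, it can help to confirm that the columns contain no missing values and to see which columns are numeric. The snippet below is a minimal sketch that runs on a hypothetical stand-in frame (`demo_df`, filled with synthetic values and using the same numeric column names), since the real CSV is not bundled here. With the actual data you would apply the same two calls to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV: same numeric column names, synthetic values.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "Avg. Session Length": rng.normal(33, 1, 200),
    "Time on App": rng.normal(12, 1, 200),
    "Time on Website": rng.normal(37, 1, 200),
    "Length of Membership": rng.normal(3.5, 1, 200),
})
demo_df["Yearly Amount Spent"] = rng.normal(500, 80, 200)

# Count missing values per column; all zeros means no imputation is needed.
missing_per_column = demo_df.isnull().sum()
print(missing_per_column)

# List the numeric columns that can be used directly as model inputs or target.
numeric_columns = demo_df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```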

__Step-4: Exploratory Data Analysis(EDA)__

In this step, we will explore the data and look for insights by visualizing it with the following **Seaborn** library functions:

**Joint plot:**

- Time on Website vs Yearly Amount Spent
- Time on App vs Yearly Amount Spent
- Time on App vs Length of membership

**Pair plot:** for the complete dataset

**lmplot:** Length of Membership vs Yearly Amount Spent

**4.1: Use seaborn to create a joint plot to compare the Time on Website and Yearly Amount Spent columns.**

sns.jointplot(x='Time on Website',y='Yearly Amount Spent',data=df)

__Output:__

**4.2: Do the same but with the Time on App column instead.**

sns.jointplot(x='Time on App',y='Yearly Amount Spent',data=df)

__Output:__

**4.3: Use joint plot to create a 2D hex bin plot comparing Time on App and Length of Membership.**

sns.jointplot(x='Time on App',y='Length of Membership',kind="hex",data=df)

__Output:__

**4.4: Let’s explore these types of relationships across the entire data set. Use Pair plot to recreate the plot below**

sns.pairplot(df)

**4.5: Based on this plot what looks to be the most correlated feature with the Yearly Amount Spent?**

`Length of Membership`
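The visual impression from the pair plot can also be checked numerically by ranking the features by their correlation with the target. The sketch below uses a hypothetical synthetic frame (`demo_df`) whose target is deliberately built so that Length of Membership dominates. The weights and noise level are assumptions for illustration only, so the printed numbers demonstrate the method and are not results from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data; the weights below are assumptions, not fitted values.
rng = np.random.default_rng(42)
n = 500
demo_df = pd.DataFrame({
    "Avg. Session Length": rng.normal(33, 1, n),
    "Time on App": rng.normal(12, 1, n),
    "Time on Website": rng.normal(37, 1, n),
    "Length of Membership": rng.normal(3.5, 1, n),
})
demo_df["Yearly Amount Spent"] = (
    26.0 * demo_df["Avg. Session Length"]
    + 38.6 * demo_df["Time on App"]
    + 0.2 * demo_df["Time on Website"]
    + 61.3 * demo_df["Length of Membership"]
    - 1050.0
    + rng.normal(0, 10, n)
)

# Correlation of every numeric feature with the target, strongest first.
corr_with_target = (
    demo_df.select_dtypes(include="number")
    .corr()["Yearly Amount Spent"]
    .drop("Yearly Amount Spent")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the real data, replacing `demo_df` with `df` gives the same ranking table for the actual columns.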

**4.6: Create a linear model plot (using seaborn’s lmplot) of Yearly Amount Spent vs. Length of Membership.**

sns.lmplot(x='Length of Membership',y='Yearly Amount Spent',data=df)

__Output:__

__Step-5: Splitting of data into Training and Testing Data__

Now that we have explored the data a bit, it’s time to split our initial data into training and testing subsets. Here we set a variable X (the independent columns) to the numerical features of the customers, and a variable y (the dependent column) to the “Yearly Amount Spent” column.

**5.1: Separate Dependent and Independent Variable**

X = df[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]
y = df['Yearly Amount Spent']

**5.2: Use model_selection.train_test_split from sklearn to split the data into training and testing sets. Set test_size=0.20 and random_state=105**

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=105)
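To see what the split does, it is worth checking the sizes of the resulting subsets. The sketch below assumes a hypothetical table of 500 synthetic rows (`X_demo`, `y_demo`); the row count is an assumption chosen for illustration. With `test_size=0.20`, one fifth of the rows go to the test set and no row appears in both subsets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in features and target (500 synthetic rows).
rng = np.random.default_rng(0)
feature_names = ["Avg. Session Length", "Time on App",
                 "Time on Website", "Length of Membership"]
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=feature_names)
y_demo = pd.Series(rng.normal(size=500), name="Yearly Amount Spent")

# Same call as in the article: 20% test data and a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=105
)

print(X_train.shape, X_test.shape)  # (400, 4) (100, 4)

# The two subsets never share a row.
no_overlap = set(X_train.index).isdisjoint(X_test.index)
print(no_overlap)  # True
```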

__Step-6: Training the Model using Linear Regression__

At this step, we are ready to train our model on the training data using Linear Regression.

**6.1: Import LinearRegression from sklearn.linear_model**

from sklearn.linear_model import LinearRegression

**6.2: Create an instance of a LinearRegression() model named lr_model.**

lr_model = LinearRegression()

**6.3: Train/fit lr_model on the training data.**

lr_model.fit(X_train,y_train)

__Output:__

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

**6.4: Print out the coefficients of the model**

lr_model.coef_

__Output:__

array([25.98154972, 38.59015875, 0.19040528, 61.27909654])
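The coefficients are only part of the fitted model; the intercept is stored separately in `intercept_`. As a minimal sketch on hypothetical synthetic data (the weights and the offset of 100 are assumptions for illustration), the snippet below shows that a prediction is simply the weighted sum of the features plus the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: four features and a noisy linear target.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
assumed_weights = np.array([26.0, 38.6, 0.2, 61.3])  # illustrative, not fitted values
y_demo = X_demo @ assumed_weights + 100.0 + rng.normal(0.0, 1.0, 200)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions by hand: y_hat = X @ coef_ + intercept_
manual_predictions = X_demo @ demo_model.coef_ + demo_model.intercept_
matches = np.allclose(manual_predictions, demo_model.predict(X_demo))

print(demo_model.coef_, demo_model.intercept_)
print(matches)  # True
```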

__Step-7: Predictions on Test Data__

Now that we have trained our model, let’s evaluate its performance by making predictions on the unseen data.

**7.1: Use lr_model.predict() to predict on the X_test set of the data.**

predictions = lr_model.predict(X_test)

**7.2: Create a scatterplot of the real test values versus the predicted values.**

plt.scatter(y_test, predictions)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')

__Output:__

__Step-8: Evaluating the Model__

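To evaluate our model’s performance, we will calculate error metrics on the test set. The coefficient of determination (R^{2}) is another common measure of fit, although the code in this step reports only the error metrics.

**Determine the metrics such as the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the Root Mean Squared Error (RMSE).**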

from sklearn import metrics
print('MAE :', " ", metrics.mean_absolute_error(y_test, predictions))
print('MSE :', " ", metrics.mean_squared_error(y_test, predictions))
print('RMSE :', " ", np.sqrt(metrics.mean_squared_error(y_test, predictions)))

__Output:__

MAE : 7.2281486534308295
MSE : 79.8130516509743
RMSE : 8.933815066978626
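The error metrics above are expressed in dollars, while R^{2} is a unitless score that is often reported alongside them. The following is a minimal sketch of how it could be computed with `metrics.r2_score`, run on hypothetical synthetic data (the weights, offset, and noise level are assumptions), so the printed value is not the score of the article's model.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the customer features and spend.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([26.0, 38.6, 0.2, 61.3]) + 500.0 + rng.normal(0.0, 10.0, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=105
)
demo_model = LinearRegression().fit(X_train, y_train)
predictions = demo_model.predict(X_test)

# R^2: share of the variance in y_test that the predictions account for.
r2 = metrics.r2_score(y_test, predictions)

# LinearRegression.score() returns the same R^2 value.
r2_from_score = demo_model.score(X_test, y_test)

print("R2 :", r2)
```

On the real data, the same call would be `metrics.r2_score(y_test, predictions)` with the article's own `y_test` and `predictions`.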

**Step-9: Explore the Residuals**

Judging by the metrics calculated in the previous step, we appear to have a good model with a good fit. Now, let’s quickly explore the residuals to make sure that everything is okay with our data before finalizing the model.

**To check this, plot a histogram of the residuals and make sure it looks approximately normally distributed. Use either seaborn's histplot (distplot is deprecated in recent seaborn releases) or just plt.hist().**

sns.histplot(y_test - predictions, bins=50, kde=True)

__Output:__
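A histogram is a visual check; a few summary numbers can complement it. The sketch below, again on hypothetical synthetic data (all weights and noise levels are assumptions), computes the mean of the training residuals, which is zero up to floating-point error for least squares with an intercept, and the share of test residuals within two standard deviations, which is roughly 95% when the residuals are close to normal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with normally distributed noise.
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([26.0, 38.6, 0.2, 61.3]) + 500.0 + rng.normal(0.0, 10.0, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=105
)
demo_model = LinearRegression().fit(X_train, y_train)

train_residuals = y_train - demo_model.predict(X_train)
test_residuals = y_test - demo_model.predict(X_test)

# With an intercept, least-squares training residuals average to zero.
print("train residual mean:", train_residuals.mean())

# Share of test residuals within two standard deviations of their mean.
centered = test_residuals - test_residuals.mean()
within_2_std = np.mean(np.abs(centered) <= 2 * test_residuals.std())
print("share within 2 std:", within_2_std)
```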

**Step-10: Conclusion**

Now it’s time to draw conclusions from our model. Let’s interpret the coefficients of the model to get a better idea of what it has learned.

**10.1: Recreate the dataframe below**

coefficients = pd.DataFrame(lr_model.coef_, X.columns)
coefficients.columns = ['Coefficient']
coefficients

__Output:__

**10.2: How can you interpret these coefficients?**

- Keeping all other features constant, a one-unit increase in **Avg. Session Length** is associated with an **increase of 25.98 total dollars spent**.
- Keeping all other features constant, a one-unit increase in **Time on App** is associated with an **increase of 38.59 total dollars spent**.
- Keeping all other features constant, a one-unit increase in **Time on Website** is associated with an **increase of 0.19 total dollars spent**.
- Keeping all other features constant, a one-unit increase in **Length of Membership** is associated with an **increase of 61.28 total dollars spent**.
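The "keeping all other features constant" reading can be verified directly: increase one feature by one unit for a single customer and compare the two predictions. The snippet below is a sketch on hypothetical synthetic data, and the customer values in `base_customer` are made up for illustration. The difference between the two predictions equals the coefficient of the feature that was changed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["Avg. Session Length", "Time on App",
            "Time on Website", "Length of Membership"]

# Hypothetical synthetic training data (weights and noise are assumptions).
rng = np.random.default_rng(3)
X_demo = pd.DataFrame(
    rng.normal([33.0, 12.0, 37.0, 3.5], 1.0, size=(300, 4)), columns=features
)
y_demo = (X_demo.to_numpy() @ np.array([26.0, 38.6, 0.2, 61.3])
          - 1050.0 + rng.normal(0.0, 10.0, 300))
demo_model = LinearRegression().fit(X_demo, y_demo)

# One made-up customer, and the same customer with one more year of membership.
base_customer = pd.DataFrame([[33.0, 12.0, 37.0, 3.5]], columns=features)
longer_member = base_customer.copy()
longer_member["Length of Membership"] += 1.0

# The change in predicted spend equals the Length of Membership coefficient.
change = demo_model.predict(longer_member)[0] - demo_model.predict(base_customer)[0]
membership_coef = demo_model.coef_[features.index("Length of Membership")]
print(change, membership_coef)
```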

**This completes our discussion!**

**Endnotes**

*Thanks for reading!*

I hope you enjoyed the article and learned more about how to do an end-to-end Machine Learning project in Python.

Please feel free to contact me by **Email**.

Is something not mentioned, or do you want to share your thoughts? Feel free to comment below and I’ll get back to you.

For the remaining articles, refer to the **link**.

__About the Author__

**Aashi Goyal**

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Electronics and Communication Engineering from **Guru Jambheshwar University (GJU), Hisar**. I am very enthusiastic about Statistics and Data Science.

*The media shown in this article on Sign Language Recognition are not owned by Analytics Vidhya and are used at the Author’s discretion.*