This article was published as a part of the Data Science Blogathon

In any **Machine Learning Project**, the most important thing is the pipeline, which mainly includes the following components:

- Data Preprocessing,
- Exploratory Data Analysis,
- Feature Engineering,
- Model Building and Evaluation, etc.

Therefore, for aspiring **Machine Learning Engineers** and **Data Scientists**, it is very important to understand the machine learning pipeline.

Let's understand the motivation behind these concepts:

With a clear understanding of the pipeline, we can implement any machine learning project with better clarity about each of its stages.

So, in this article, we will discuss the complete machine learning pipeline with the help of a machine learning project on a medical dataset.

**Image Source: Google Images**

- We will build a linear regression model for the **Medical Cost dataset**.
- The dataset contains age, sex, BMI (body mass index), children, smoker, and region as independent variables, and charges as the dependent variable.
- We will predict the individual medical costs billed by health insurance.

- **Linear Regression** is a **supervised learning algorithm** used when the target/dependent variable is continuous (real-valued).
- It finds a relationship between the **dependent variable y** and one or more **independent variables x** using the best-fit line.
- It works on the principle of **Ordinary Least Squares (OLS)** or **Mean Squared Error (MSE)**.
- In statistics, OLS is a method to estimate the unknown parameters of the linear regression function. Its goal is to minimize the sum of squared differences between the observed values of the dependent variable in the given data set and the values predicted by the linear regression function.
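To make the OLS idea concrete, here is a minimal sketch on made-up data (not the insurance dataset). It assumes a single feature and uses NumPy's `np.linalg.lstsq` to find the parameters that minimize the sum of squared residuals; the variable names are chosen only for illustration.

```python
import numpy as np

# Toy data (invented for illustration): y = 2 + 3x plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=x.size)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# OLS estimate: the theta that minimizes the sum of squared residuals
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ theta
mse = np.mean(residuals ** 2)
print("intercept, slope:", theta)
print("MSE:", mse)
```

The fitted intercept and slope should land close to the values used to generate the data, and the MSE should be close to the noise variance.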

In this step, we will import the necessary dependencies of Python such as:

- **Matrix Manipulation:** NumPy
- **Data Manipulation:** Pandas
- **Data Visualization:** Matplotlib
- **Advanced Data Visualization:** Seaborn

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
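# Global plot defaults
plt.rcParams['figure.figsize'] = [8, 5]
plt.rcParams['font.size'] = 14
plt.rcParams['font.weight'] = 'bold'
# Note: on Matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')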
```

Now, we will read and load the dataset using Pandas.

**2.1: Load the Dataset**

```
df = pd.read_csv('insurance.csv')
```

**2.2: Number of rows and columns in the dataset**

```
print('\nNumber of rows and columns in the data set: ', {'Rows': df.shape[0], 'columns': df.shape[1]})
```

__Output:__

Number of rows and columns in the data set: {'Rows': 1338, 'columns': 7}

**2.3: Print the first five rows of the dataset**

```
df.head()
```

__Output:__

In this step, we will explore the data and try to find some insights by visualizing it with **Pandas** and **Seaborn** library functions.

**3.1: Check for duplicated data**

```
duplicate = df.duplicated()
print(duplicate.sum())
```

__Output:__

1

**3.2: Remove the duplicated records**

```
df.drop_duplicates(inplace=True)
```

**3.3: Now verify if there is any duplicated record left or not**

```
dp1 = df.duplicated()
print(dp1.sum())
```

__Output:__

0

**3.4: Draw boxplot for Outlier Analysis**

```
df.boxplot();
```

__Output:__

**3.5: Size of the DataFrame**

```
print("No of elements in the dataframe is", df.size)
```

__Output:__

No of elements in the dataframe is 9359

**3.6: Print data Types of all columns**

```
print(df.dtypes)
```

__Output:__

**3.7: Draw the pairplot for the complete dataset**

```
sns.pairplot(df);
```

**3.8: Visualize the distribution of data for every feature (plotting histograms)**

```
df.hist(bins=50, figsize=(20, 15));
```

**Output:**

**Conclusion:** Here, after plotting the histogram for the numerical columns, we observe that

normally distributed whereas

**3.9: Memory usage by each of the columns**

```
df.memory_usage()
```

__Output:__

```
Index       10696
age         10696
sex         10696
bmi         10696
children    10696
smoker      10696
region      10696
charges     10696
dtype: int64
```

**3.10: Print Index of the DataFrame**

```
df.index
```

**Output:**

Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... 1328, 1329, 1330, 1331, 1332, 1333, 1334, 1335, 1336, 1337], dtype='int64', length=1337)

**3.11: Print number of unique values per columns**

```
df.nunique()
```

**Output:**

```
age           47
sex            2
bmi          548
children       6
smoker         2
region         4
charges     1337
dtype: int64
```

**3.12: Brief information about the dataset (concise summary of the DataFrame)**

```
df.info()
```

__Output:__

**3.13: Statistical measure of all the numerical columns**

```
df.describe()
```

**Output:**

**3.14: Print the names of all columns present in the dataset**

```
df.columns
```

**Output:**

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

**3.15: Name for all numerical columns**

```
# Columns whose dtype is not object are treated as numerical
num_cols = [col for col in df.columns if df[col].dtypes != 'O']
num_cols
```

**Output:**

['age', 'bmi', 'children', 'charges']

**3.16: Name for all categorical columns**

```
# Columns with object dtype are treated as categorical
cat_cols = [col for col in df.columns if df[col].dtypes == 'O']
cat_cols
```

**Output:**

['sex', 'smoker', 'region']

**3.17: Print unique values for categorical columns**

```
print(df['sex'].unique())
print(df['smoker'].unique())
print(df['region'].unique())
```

**Output:**

```
['female' 'male']
['yes' 'no']
['southwest' 'southeast' 'northwest' 'northeast']
```

**3.18: Finding the sum of missing values per column if present**

```
df.isnull().sum()
```

**Output:**

```
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
```

**3.19: Plotting of heatmap to visualize missing values**

```
plt.figure(figsize=(12, 4))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Missing value in the dataset');
```

**Output:**

**Conclusion:** There are no missing values in the dataset.

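**3.20: Correlation values between numerical columns**

```
# Restrict to numerical columns; recent pandas versions raise an error on object columns
corr_mat = df[num_cols].corr()
corr_mat
```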

**Output:**

**3.21: Correlation of dependent column wrt independent columns**

```
corr_mat['charges'].sort_values(ascending=False)
```

**Output:**

```
charges     1.000000
age         0.298308
bmi         0.198401
children    0.067389
Name: charges, dtype: float64
```

**3.22: Correlation plot**

```
# Use only the numerical columns for the correlation heatmap
sns.heatmap(df[num_cols].corr(), annot=True);
```

**Output:**

**Conclusion:** There is not much correlation between the independent features, so here we do not have the problem of multicollinearity.

**3.23: Plot the distribution of the dependent variable**

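```
import warnings
# distplot is deprecated in seaborn >= 0.11 (histplot(..., kde=True) is its replacement),
# so the deprecation warnings are silenced here
warnings.filterwarnings('ignore')

f = plt.figure(figsize=(12, 4))

# Left panel: raw charges
ax = f.add_subplot(121)
sns.distplot(df['charges'], bins=50, color='y', ax=ax)
ax.set_title('Distribution of insurance charges')

# Right panel: log-transformed charges
ax = f.add_subplot(122)
sns.distplot(np.log10(df['charges']), bins=40, color='b', ax=ax)
ax.set_title('Distribution of insurance charges in $log$ scale')
ax.set_xscale('log');
```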

__Output:__

**Conclusion:** In the first plot, the charges vary from about 1120 to 63500 and the distribution is right-skewed. In the second plot, after applying a log transform, the distribution is approximately normal. For further analysis, we will apply the natural log to the target variable charges.
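The effect of the log transform on skewness can also be checked numerically. The following is a small sketch on synthetic log-normal data standing in for the charges column (the real dataset is not used here); `sample_skewness` is a helper written for this illustration, not a library function.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment: about 0 for symmetric data, > 0 for a long right tail."""
    values = np.asarray(values, dtype=float)
    centred = values - values.mean()
    return np.mean(centred ** 3) / np.std(values) ** 3

# Synthetic right-skewed data (log-normal), an assumption made only for this demo
rng = np.random.default_rng(42)
synthetic = rng.lognormal(mean=9.0, sigma=1.0, size=5000)

skew_raw = sample_skewness(synthetic)
skew_log = sample_skewness(np.log(synthetic))
print("skewness before log:", skew_raw)
print("skewness after log: ", skew_log)
```

A clearly positive skewness before the transform and a value near zero after it matches what the two histograms show visually.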

Machine learning algorithms are not able to work directly with categorical data, so we have to convert categorical data into numbers. Three concepts are relevant here:

- **Label Encoding:** Label encoding refers to transforming the word labels into numerical form so that the algorithms can understand how to operate on them.
- **One-hot encoding:** It represents the categorical variables in the form of binary vectors, which allows the representation of categorical data to be more expressive. First, the categorical values are mapped to integer values, which is known as **label encoding**. Then, each integer value is represented as a binary vector that is all zeros except for the index of the integer, which is marked with a 1.
- **Dummy variable trap:** This is a scenario in which the independent variables are collinear with each other.

In this problem, we need to avoid the dummy variable trap. With the pandas **get_dummies** function we can do all of the above in a single line of code. We will use this function to get dummy variables for the sex, smoker, and region features. Setting **drop_first=True** avoids the dummy variable trap by dropping one dummy variable per feature, along with the original column.
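As a small illustration of what `drop_first=True` changes, the sketch below encodes a hypothetical four-row frame (the values are invented, only the column names mirror the dataset) with and without dropping the first category.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration
toy = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Full one-hot encoding: one column per category.
# The dummy columns always sum to 1, so they are collinear with an intercept.
full = pd.get_dummies(toy, columns=["region"], dtype="int8")

# drop_first=True removes the first category, which becomes the baseline
reduced = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype="int8")

print(full.columns.tolist())
print(reduced.columns.tolist())
```

In the reduced frame, a row with zeros in all three remaining region columns stands for the dropped baseline category, so no information is lost.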

**4.1: Apply the pd.get_dummies() function**

```
df_encode = pd.get_dummies(data=df, prefix='OHE', prefix_sep='_',
                           columns=cat_cols, drop_first=True, dtype='int8')
```

**4.2: Let's verify the dummy variable process**

```
print('Columns in original data frame:\n', df.columns.values)
print('\nNumber of rows and columns in the dataset:', df.shape)
print('\nColumns in data frame after encoding dummy variable:\n', df_encode.columns.values)
print('\nNumber of rows and columns in the dataset:', df_encode.shape)
```

**Output:**

```
Columns in original data frame:
 ['age' 'sex' 'bmi' 'children' 'smoker' 'region' 'charges']

Number of rows and columns in the dataset: (1337, 7)

Columns in data frame after encoding dummy variable:
 ['age' 'bmi' 'children' 'charges' 'OHE_male' 'OHE_yes' 'OHE_northwest' 'OHE_southeast' 'OHE_southwest']

Number of rows and columns in the dataset: (1337, 9)
```

__Box-Cox transformation:__

- It is a technique to transform a non-normal dependent variable so that it approximately follows a normal distribution.
- Normality is an important assumption for many statistical techniques; if your data is not normal, applying a Box-Cox transformation means that you can run a broader number of tests.
- All that we need to perform this transformation is to find the lambda value and apply the Box-Cox rule to the variable. The trick of the Box-Cox transformation is to find the lambda value; in practice, this is inexpensive to estimate.
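For reference, the standard Box-Cox rule maps a positive value y to (y^lambda - 1) / lambda when lambda is not 0, and to log(y) when lambda is 0. The sketch below implements that rule directly in NumPy; `box_cox` is a helper written for this illustration (it is not the SciPy function), and the sample values are made up.

```python
import numpy as np

def box_cox(y, lam):
    """Apply the Box-Cox rule to strictly positive values for a given lambda."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)              # limiting case of the rule
    return (y ** lam - 1.0) / lam     # general case

values = np.array([1.0, 10.0, 100.0, 1000.0])  # made-up positive values

print(box_cox(values, 1.0))    # lambda = 1 only shifts the data by -1
print(box_cox(values, 0.0))    # lambda = 0 is the natural log
print(box_cox(values, 1e-6))   # a tiny lambda is already very close to the log
```

This is why a lambda estimate close to 0 is usually read as a recommendation to take the log of the variable.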

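**4.3: Box-Cox transform of the dependent variable**

```
from scipy.stats import boxcox
# Returns the transformed data, the estimated lambda, and its 95% confidence interval
y_bc, lam, ci = boxcox(df_encode['charges'], alpha=0.05)
ci, lam
```

__Output:__

((-0.011576269777122257, 0.09872104960017168), 0.043516942579678274)

The confidence interval for lambda contains 0, and a Box-Cox transformation with lambda = 0 is the log transform, so we apply a log transform to charges in the next step.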

**4.4: Log transform**

```
df_encode['charges'] = np.log(df_encode['charges'])
```

Here we use the train_test_split() function from the model_selection module, passing the independent and dependent variables with test_size=0.3.

```
from sklearn.model_selection import train_test_split

# Independent variables (predictors)
X = df_encode.drop('charges', axis=1)
# Dependent variable (response)
y = df_encode['charges']
# Now, split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)
```
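To show what a shuffled 70/30 split does conceptually, here is a NumPy-only sketch. `simple_train_test_split` is a hypothetical helper written for illustration; it is not scikit-learn's implementation and will not select the same rows as `train_test_split` for a given seed.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.3, seed=23):
    """Shuffle the row indices once, then hold out a test_size fraction of rows."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up data: 20 rows, 2 features; y is tied to the first feature
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo, test_size=0.25)
print(X_tr.shape, X_te.shape)
```

The important properties are that every row ends up in exactly one of the two sets and that each feature row stays paired with its own target value.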

**6.1: Add x_0 = 1 to the dataset**

```
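# Step 1: prepend a column of ones (x_0 = 1) so that theta_0 acts as the intercept
X_train_0 = np.c_[np.ones((X_train.shape[0],1)),X_train]
X_test_0 = np.c_[np.ones((X_test.shape[0],1)),X_test]
# Step 2: build the model with the normal equation, theta = (X^T X)^-1 X^T y
theta = np.matmul(np.linalg.inv( np.matmul(X_train_0.T,X_train_0) ), np.matmul(X_train_0.T,y_train))
# The parameters for the linear regression model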
parameter = ['theta_'+str(i) for i in range(X_train_0.shape[1])]
columns = ['intersect:x_0=1'] + list(X.columns.values)
parameter_df = pd.DataFrame({'Parameter':parameter,'Columns':columns,'theta':theta})
```

**6.2: Scikit-learn module (note: there is no need to add x_0 = 1; sklearn takes care of it)**

```
from sklearn.linear_model import LinearRegression

# The `normalize` argument was removed in scikit-learn 1.2; False was its default
lin_reg = LinearRegression(fit_intercept=True)
lin_reg.fit(X_train, y_train)

# Parameters: intercept first, then one coefficient per feature
sk_theta = [lin_reg.intercept_] + list(lin_reg.coef_)
parameter_df = parameter_df.join(pd.Series(sk_theta, name='Sklearn_theta'))
parameter_df
```

__Output:__

__Conclusion:__ The parameters obtained from both models are the same. So we successfully built our model using the normal equation and verified it using the sklearn linear regression module.

**7.1: Normal equation**

```
y_pred_norm = np.matmul(X_test_0, theta)

# Evaluation: MSE
J_mse = np.sum((y_pred_norm - y_test)**2) / X_test_0.shape[0]

# R_square calculation
sse = np.sum((y_pred_norm - y_test)**2)
sst = np.sum((y_test - y_test.mean())**2)
R_square = 1 - (sse/sst)
```

**7.2: sklearn regression module**

```
y_pred_sk = lin_reg.predict(X_test)

# Evaluation: MSE (arguments in the documented order: y_true, y_pred)
from sklearn.metrics import mean_squared_error
J_mse_sk = mean_squared_error(y_test, y_pred_sk)

# R_square
R_square_sk = lin_reg.score(X_test, y_test)
```
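The two metrics are easy to verify by hand on a tiny example. The sketch below uses four made-up actual and predicted values (not taken from the insurance data) and the same formulas as above.

```python
import numpy as np

# Made-up values, small enough to check by hand
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

sse = np.sum((y_true - y_hat) ** 2)          # 0.25 + 0.25 + 0 + 1 = 1.5
mse = sse / len(y_true)                      # 1.5 / 4 = 0.375
sst = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20
r_square = 1 - sse / sst                     # 1 - 0.075 = 0.925

print(mse, r_square)
```

R² compares the model's squared error with that of always predicting the mean, which is why a value near 1 indicates a good fit.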

**8.1: Prediction of test data using the normal equation**

```
print(y_pred_norm)
```

**8.2: Prediction of test data using the sklearn model**

```
print(y_pred_sk)
```

**9.1: Mean Squared Error for Model using Normal Equation**

```
print('The Mean Square Error(MSE) or J(theta) is: ', J_mse)
```

**Output:**

The Mean Square Error(MSE) or J(theta) is: 0.19026739560428377

**9.2: R-Squared for Model using the Normal Equation**

```
print('The R_2 score by using the normal equation is: ', R_square)
```

**Output:**

The R_2 score by using the normal equation is: 0.785908962562808

**9.3: Mean Squared Error for Model using Sklearn Library**

```
print('The Mean Square Error(MSE) or J(theta) is: ', J_mse_sk)
```

**Output:**

The Mean Square Error(MSE) or J(theta) is: 0.19026739560428194

**9.4: R-Squared for Model using Sklearn Library**

```
print('The R_2 score by using the sklearn library is: ', R_square_sk)
```

**Output:**

The R_2 score by using the sklearn library is: 0.78590896256281

**Conclusion:** Since our sklearn model and the normal equation give almost the same R² and mean squared error, the two models are essentially equivalent, and their test predictions are very close to each other.

To validate the model, we need to check a few assumptions of the linear regression model. The common assumptions for the linear regression model are as follows:

- **Linear Relationship:** In linear regression, the relationship between the dependent and independent variables is assumed to be linear. This can be checked with a scatter plot of actual vs. predicted values.
- The residual errors should be **normally distributed**.
- The mean of the residual errors should be 0, or as close to 0 as possible.
- Linear regression requires all variables to be multivariate normal. This assumption can best be checked with a **Q-Q plot**.
- Linear regression assumes that there is little or no **multicollinearity** in the data. **Multicollinearity** happens when the independent variables are correlated with each other. To identify the correlation between independent variables and the strength of that correlation, we use the **Variance Inflation Factor (VIF)**: **VIF = 1/(1 - R²)**. If 1 < VIF < 5, there is moderate correlation; VIF > 5 indicates a critical level of multicollinearity.
- **Homoscedasticity:** The data are homoscedastic, meaning the residuals have equal variance across the regression line. We can look at the residual vs. fitted value scatter plot; a heteroscedastic plot would exhibit a funnel-shaped pattern.

**10.1: Check for Linearity**

```
f = plt.figure(figsize=(15, 5))
ax = f.add_subplot(121)
# Recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x=y_test, y=y_pred_sk, ax=ax, color='r')
ax.set_title('Check for Linearity:\n Actual Vs Predicted value')
```

__Output:__

**10.2: Check for Residual normality & mean**

```
ax = f.add_subplot(122)
# Distribution of the residuals, with their mean marked by a dashed line
sns.distplot((y_test - y_pred_sk), ax=ax, color='b')
ax.axvline((y_test - y_pred_sk).mean(), color='k', linestyle='--')
ax.set_title('Check for Residual normality & mean: \n Residual error');
```

**10.3: Check for Multivariate Normality**

```
# Quantile-Quantile plot
from scipy import stats

f, ax = plt.subplots(1, 2, figsize=(14, 6))
_, (_, _, r) = stats.probplot((y_test - y_pred_sk), fit=True, plot=ax[0])
ax[0].set_title('Check for Multivariate Normality: \nQ-Q Plot')

# Check for Homoscedasticity
sns.scatterplot(y=(y_test - y_pred_sk), x=y_pred_sk, ax=ax[1], color='r')
ax[1].set_title('Check for Homoscedasticity: \nResidual Vs Predicted');
```

__Output:__

**10.4: Check for Multicollinearity**

```
# Variance Inflation Factor
VIF = 1/(1 - R_square_sk)
VIF
```

**Output:**

4.670910150983689
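The value above is computed from the R² of the fitted model. VIF is more commonly reported per feature, where each feature is regressed on the remaining features and its own R² is plugged into the same formula. The sketch below shows that variant on synthetic data; `vif_per_feature` is a helper written for this illustration, and the three columns are invented so that two of them are strongly correlated.

```python
import numpy as np

def vif_per_feature(X):
    """Return one VIF per column of X by regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        sse = np.sum(resid ** 2)
        sst = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - sse / sst
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic example: x3 is nearly a copy of x1, while x2 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + rng.normal(scale=0.1, size=500)

vifs = vif_per_feature(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this setup the first and third columns should get large VIF values, while the independent second column should stay close to 1.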

The checks of the linear regression model assumptions show the following:

- In our model, the actual vs. predicted plot is curved, so the linearity assumption fails.
- The residual mean is zero, and the residual error plot is right-skewed.
- The Q-Q plot shows that log values greater than 1.5 tend to increase.
- The residual plot exhibits heteroscedasticity: the error increases after a certain point.
- The variance inflation factor is less than 5, so there is no **multicollinearity**.

I hope you enjoyed the article.

If you want to connect with me, please feel free to contact me by **Email**.

Your suggestions and questions are welcome in the comment section. Thank you for reading my article!

*The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.*
