Machine learning is one of the fastest-advancing technologies in computer science today. Many researchers, academics, and industry practitioners are investing their efforts to innovate in this field. If you find the process of training machines to make decisions on their own fascinating, this machine learning pipeline tutorial will be helpful to you. It demonstrates how, by providing some data, we can train a model to learn something and then reuse it over and over for decision making.

This process might seem interesting at first, but when we actually do it step by step, it takes up a lot of time. Tinkering with the code, changing models, and tracking down errors is time-consuming and requires a lot of effort. One effective approach that practitioners follow is therefore to create a machine learning pipeline.

This article was published as a part of the Data Science Blogathon.

A machine learning pipeline refers to the creation of independent and reusable modules in such a manner that they can be chained together to create an entire workflow. Setting this formal definition aside, it simply means that we divide our work into smaller parts and automate them so that the entire task runs as a series of small subtasks.

Let’s understand the concept with the help of an example from day-to-day life:

Think of it as similar to washing clothes. The procedure is to segregate the clothes into different piles, wash each pile, dry them, and hang them. Each step can be done independently, and organizing the work this way helps us finish the task faster.

One of the major benefits of a pipeline is the flexibility to use these components independently and iteratively.

For example, if you have a dataset and would like to validate your model on two different criteria, you don’t need to run the training component again; you can run the validation component separately and create two pipelines. In this manner we can arrive at more accurate models with less effort.

Because the components are defined separately, machine learning models can be built with more ease and efficiency: each subpart can be implemented step by step.

Pipelining automates the entire workflow, from collating and feeding data into the model to analysing the results. Such well-managed implementations are therefore more robust, scalable and customizable.
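To make the idea of chained, reusable components concrete, here is a minimal sketch that assumes scikit-learn's `Pipeline` class, which the walkthrough in this article does not itself use. The step names `"scale"` and `"model"`, the scaler, and the `random_state` values are illustrative choices, not part of the original tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the Iris features and the (already numeric) labels bundled with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Each named step is an independent component; the pipeline chains them in order.
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),                                     # preprocessing
    ("model", DecisionTreeClassifier(max_depth=2, random_state=0)),  # estimator
])

# One call fits the scaler and the classifier on the training data.
pipe.fit(X_train, y_train)

# score() applies the fitted scaler to the test data and then evaluates the model.
test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Swapping the classifier or adding another preprocessing step only means changing one entry in `steps`, which is the kind of flexibility described above.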

It also helps in extending a model: we don’t need to copy and paste all the previous code, because we can reuse previous components in the pipeline and build upon them. Copying previous code is cumbersome and is considered bad practice.

**Automatic Updates**

When we tinker with a specified configuration, we don’t need to update the various scripts manually; they are updated automatically. This eliminates the scope for errors that manual updates might introduce.

Now that we know the advantages of using a machine learning pipeline, we come to the most interesting part of this tutorial, where we apply the concept to a problem statement.

So, let’s walk through the various steps involved in its creation. I will use the popular Iris dataset for demonstration purposes, so the input to our pipeline is the Iris dataset.

The Iris dataset contains 3 classes, each corresponding to a type of Iris plant: Iris Setosa, Iris Versicolour and Iris Virginica. There are 50 instances of each class. One class (Setosa) is linearly separable from the other two, while the other two are not linearly separable from each other. The features in the dataset are characteristics of the data and can be numerical or categorical. There are 4 independent variables, which we call features in machine learning: sepal length, sepal width, petal length and petal width. All of these features are numerical.
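The article does not show how the data is loaded. One possible way, sketched here under the assumption that pandas is installed and that the copy of Iris bundled with scikit-learn is acceptable, is `load_iris(as_frame=True)`. The bundled labels are integers, so the sketch maps them back to species names to mimic a categorical Species column; the variable names `x` and `y` are chosen to match the later snippets.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays (requires pandas).
iris = load_iris(as_frame=True)

# x: DataFrame with the 4 numeric features.
x = iris.data

# y: species names, rebuilt from the integer codes shipped with scikit-learn.
y = iris.target.map(lambda code: str(iris.target_names[code]))

print(x.shape)           # (150, 4)
print(list(x.columns))   # the four sepal/petal measurements
print(y.value_counts())  # 50 rows for each of the three species
```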

Our input dataset might have a lot of issues which negatively impact the performance of the machine learning model. Hence we need to deal with missing values and noisy or duplicate data, and sometimes we need to rescale data to a certain range so as to improve the accuracy of our model. Thus we need to preprocess our dataset before training the model.

For missing values, we could delete the entire row or fill in the missing values with the mean value of the feature. In a similar manner we could rescale, binarise, normalise or encode the data. One thing to note is that the choice of techniques depends on the dataset as well as the problem we are trying to address.
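As a small illustration of mean imputation and rescaling, the sketch below uses scikit-learn's `SimpleImputer` and `MinMaxScaler` on a made-up 3x2 array. The toy data is an assumption for demonstration only; the Iris dataset itself has no missing values.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix with one missing value (np.nan) in the second column.
data = np.array([[1.0, 2.0],
                 [3.0, np.nan],
                 [5.0, 6.0]])

# Replace each missing value with the mean of its column: (2 + 6) / 2 = 4.
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(data)
print(filled)

# Rescale every column to the [0, 1] range.
scaled = MinMaxScaler().fit_transform(filled)
print(scaled)
```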

Another important step in data preprocessing is splitting the dataset into training and testing sets.

For Iris dataset classification, the following preprocessing techniques are used.

**Encoding**

The target variable, i.e. the Species column, is categorical and hence needs to be encoded as numerical data. We will use LabelEncoder from the sklearn.preprocessing module. It will encode Iris-setosa to 0, Iris-versicolor to 1 and Iris-virginica to 2.

```
from sklearn.preprocessing import LabelEncoder

# y is assumed to hold the categorical Species column
encoder = LabelEncoder()
y = encoder.fit_transform(y)  # species names -> integer codes 0, 1, 2
```
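To see the mapping that `LabelEncoder` produces, a short self-contained sketch with a hypothetical list of labels follows. The encoder sorts the class names, which is why the codes follow alphabetical order, and `inverse_transform` recovers the original names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of species labels, deliberately out of order.
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

print(codes)                                    # [2 0 1 0]
print(list(encoder.classes_))                   # class names in sorted order
print(list(encoder.inverse_transform(codes)))   # back to the original labels
```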

**Train-test split**

We will now split our data in an 80:20 ratio, with 80 percent of the original dataset used for training.

```
from sklearn.model_selection import train_test_split
xtrain , xtest, ytrain, ytest = train_test_split(x,y,test_size=0.2, shuffle=True)
xtest.head()
```

Before applying any model, understanding the dataset is a critical step. Exploring the data and finding its characteristics is essentially what we call Exploratory Data Analysis. We can visualise the data using various libraries.

For the given Iris dataset, we can visually see the count of each species.

We can also visualise the distribution of each feature.
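The numbers behind such plots can be computed directly. The sketch below assumes pandas and the Iris copy bundled with scikit-learn; the `species` column is added here for readability and is not part of the bundled frame.

```python
from sklearn.datasets import load_iris

# iris.frame holds the four features plus an integer "target" column.
iris = load_iris(as_frame=True)
df = iris.frame.copy()

# Add a readable species column derived from the integer codes.
df["species"] = df["target"].map(lambda code: str(iris.target_names[code]))

# Count of each species (what a count plot would display).
species_counts = df["species"].value_counts()
print(species_counts)

# Summary statistics of each feature (what histograms would summarise visually).
summary = df[iris.feature_names].describe()
print(summary.loc[["mean", "std", "min", "max"]])
```

If matplotlib is available, calling `df[iris.feature_names].hist()` is one way to turn the same data into per-feature histograms.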

Identifying the relevant features and selecting them appropriately to design the model is one of the most important tasks of building the pipeline.

A major problem faced while designing a machine learning model is the increase in complexity due to a large number of independent variables or features, and decreased accuracy due to the presence of irrelevant features. Thus reduced training time, a better fit to the data and better accuracy are the main objectives of feature selection.

Information gain, the chi-square test and the correlation matrix are some popular feature selection techniques. Regularisation techniques such as L1 regularisation can also be employed, since they push the weights of less useful features towards zero.

Since the Iris dataset has only 4 features, which is relatively few, there is no need to select features here.
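Purely for illustration, the chi-square technique mentioned above can still be tried on Iris. This sketch assumes scikit-learn's `SelectKBest` with the `chi2` scoring function, and the choice of `k=2` is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

# Keep the 2 features with the highest chi-square score against the target.
# chi2 requires non-negative feature values, which holds for these measurements.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

# Print the score of every feature, then the names of the ones that were kept.
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: {score:.2f}")

selected = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]
print("Selected:", selected)
```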

Once we have the cleaned data with the selected features, it’s time to address the elephant in the room. In order to carry out the intended task, we need to train the selected model on our prepared dataset.

A machine learning model is simply a program that runs on a provided dataset, performs a task such as classification or prediction, and produces outputs. What makes it unique is its ability to learn throughout this process. Our machine learning model can fall into one of the following categories:

- **Supervised learning:** When input data along with the correct output is supplied to the model, the learning is known as supervised learning.
- **Unsupervised learning:** When unlabelled data is provided and the aim of the algorithm is to find patterns in the data, cluster it or find associations, this sort of learning is known as unsupervised learning.
- **Reinforcement learning:** Its basic aim is to learn to take suitable actions in a particular environment so as to maximise the reward.

Now we can use any of the existing techniques such as linear regression, KNN, Naive Bayes, logistic regression, SVMs, decision trees, PCA, random forests etc. We can even try various models on our dataset to see which one works better.
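One way to try various models is to score each with the same cross validation setup. The sketch below is an assumption about how that comparison could look; the four candidate classifiers and their settings are illustrative choices, and no ranking of them is implied.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate classifiers, keyed by a readable name.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(max_depth=2, random_state=0),
}

# Score every candidate with the same 5-fold cross validation.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```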

For our dataset we can use the decision tree classifier, imported from the sklearn.tree module. In simple terms, a decision tree classifier splits the dataset at each node based on the value of a feature. Here we pass max_depth as a hyperparameter, which sets the maximum depth of the tree.

```
from sklearn.tree import DecisionTreeClassifier
baseest = DecisionTreeClassifier(max_depth=2)
baseest.fit(xtrain,ytrain)
```

Before we evaluate our model on the test dataset, it is a good idea to validate it on a validation set. This is because, once we have trained our model, we can’t say for sure that it works well on unseen data, i.e. performs with the required accuracy.

Thus the purpose of validation is to gain confidence that our model can give the desired results on unseen data, or in other words, assurance that the relations we have assumed between the data and the output are indeed correct.

In case we have large datasets, we can split the data into 3 sets, i.e. train, validation and test datasets. However, when we have a limited dataset, cross validation is one of the most popular and effective techniques for model validation. Examples include k-fold cross validation and stratified k-fold, which inherently use sampling techniques. If the validation results are not satisfactory, we can tune our hyperparameters and retrain the model, or choose a new model.

Another interesting way of tuning hyperparameters is GridSearchCV, which performs a cross-validated grid search over a supplied range of parameters to find the best parameters for an estimator.
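A minimal sketch of such a grid search for the decision tree is given below. The parameter grid (`max_depth` from 1 to 5, two split criteria) and the `random_state` values are assumptions made for illustration, not settings taken from this article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Every combination in this grid is evaluated with 5-fold cross validation.
param_grid = {
    "max_depth": [1, 2, 3, 4, 5],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` is a tree refitted on the whole training set with the best parameters, so it can be evaluated on the test set directly.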

Thus for Iris dataset we can use stratified 5 fold cross validation.

The reason for using stratified k-fold cross validation is the limited number of instances in the dataset. In stratified k-fold validation, stratified sampling is done on the dataset: rows are picked so that the ratio of the various classes in each sample remains similar to that of the training dataset. Here 5-fold means that the data is divided into 5 parts; in each round, 4 of them are used for training while 1 is held out for testing, and this is repeated so that every part is held out once.

```
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
skf=StratifiedKFold(n_splits=5, shuffle=True)
cvscore = cross_val_score(baseest,xtrain,ytrain,cv = skf)
print(cvscore)
print("Mean is :",cvscore.mean())
print("Standard deviation is : ",cvscore.std())
```

This prints the cross validation score of each fold, along with their mean and standard deviation.

Output:

`[1. 1. 0.91666667 0.95833333 1. ]`

Mean is : 0.975

Standard deviation is : 0.03333333333333334
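To check what stratification does, the following sketch counts the classes in each held-out fold. It runs on the full bundled Iris data rather than the article's training split, which is an assumption made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# For every fold, count how many samples of each class land in the held-out part.
fold_class_counts = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    counts = np.bincount(y[val_idx]).tolist()
    fold_class_counts.append(counts)
    print(f"Fold {fold}: train={len(train_idx)}, held out={len(val_idx)}, class counts={counts}")
```

With 50 samples per class and 5 folds, each held-out part contains 10 samples of every class, so the class ratio of the full data is preserved in every fold.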

Evaluation is the last step of the pipeline, where we check our model’s performance on the test dataset by predicting the target variable.

The evaluation metric to be used depends on the problem solved by our model.

We can have four kinds of outcomes for classification problems:

- True Positive: When a positive outcome is correctly classified.
- False Positive: When a negative outcome is classified as positive.
- True Negative: When a negative outcome is correctly classified.
- False Negative: When a positive outcome is classified as negative.

Three main metrics that evaluate the performance of a classification model are:

- Accuracy: How many of the total predictions turned out to be correct?

Accuracy = correct predictions / total predictions

- Precision: How many of the predicted positive outcomes were in fact positive, i.e. how precise is the model?

Precision = true positives / (true positives + false positives)

- Recall: How many instances of the positive class were correctly identified, or recalled?

Recall = true positives / (true positives + false negatives)
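The formulas above can be checked on a small hand-made example. The two label lists below are hypothetical binary labels, not output from the Iris model, and the scikit-learn metric functions are used alongside the manual counts for comparison.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions for a binary problem (1 = positive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")   # TP=3, FP=1, TN=4, FN=2

accuracy = accuracy_score(y_true, y_pred)      # (3 + 4) / 10 = 0.7
precision = precision_score(y_true, y_pred)    # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)          # 3 / (3 + 2) = 0.6
print(accuracy, precision, recall)
```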

The trade-off between the true positive rate (recall) and the false positive rate at different classification thresholds can be visualised with an ROC curve.

Can you think of some evaluation metrics used for regression problems?

For regression models we can use measures such as mean squared error, explained variance or the R-squared score.
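As a quick illustration of these regression measures, the sketch below applies scikit-learn's metric functions to four made-up target values and predictions. The numbers are hypothetical and chosen so the results are easy to verify by hand.

```python
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

# Hypothetical true values and predictions of a regression model.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

# Squared errors are 0.25, 0, 0.25, 0, so the mean is 0.125.
mse = mean_squared_error(y_true, y_pred)

# Total sum of squares around the mean (6.0) is 20 and the residual sum is 0.5,
# so R-squared is 1 - 0.5 / 20 = 0.975.
r2 = r2_score(y_true, y_pred)

# The residuals average to zero here, so explained variance equals R-squared.
explained = explained_variance_score(y_true, y_pred)

print(mse, r2, explained)
```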

```
from sklearn.metrics import accuracy_score

# Predict on the held-out test set and report accuracy as a percentage
ypred = baseest.predict(xtest)
print(accuracy_score(ytest,ypred)*100,end='')
print('%')
```

**Output**

90.0%

Here the accuracy turns out to be fairly low for this dataset. One likely reason is that a single decision tree limited to a depth of 2 is a weak learner. We can use ensemble learning techniques such as boosting or bagging to build classifiers with better performance from decision trees.

Now you can go ahead and build your own machine learning pipelines, try other classifiers, or tinker with these hyperparameters to achieve better performance.
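As a starting point for experimenting with ensembles, the sketch below cross-validates the single shallow tree next to a random forest (a bagging-style ensemble of trees) and gradient boosting. The choice of these two ensemble classes and their default settings is an assumption for illustration; whether they improve on the single tree depends on the data and the split.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "single tree (max_depth=2)": DecisionTreeClassifier(max_depth=2, random_state=0),
    "random forest (bagging of trees)": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Compare mean 5-fold cross validation accuracy of each model.
ensemble_scores = {}
for name, model in models.items():
    ensemble_scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {ensemble_scores[name]:.3f}")
```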


A. A machine learning pipeline is a systematic sequence of tasks that preprocesses data, builds models, and evaluates their performance to automate the end-to-end machine learning process.

A. The steps in a machine learning pipeline include data collection, preprocessing, feature engineering, model selection, training, evaluation, and deployment.

A. In machine learning, pipelines can be categorized into data preprocessing pipelines and model training pipelines, both serving to streamline and automate complex workflows.

A. Machine learning pipelines enhance efficiency by automating data preparation, model building, and evaluation. They ensure consistency, reproducibility, and rapid iteration, expediting the development of effective models.

**The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.**
