A Guide to Understanding the Machine Learning Pipeline with a Case Study
This article was published as a part of the Data Science Blogathon.
Machine learning is one of the fastest-advancing technologies in computer science today. Researchers, academicians, and industry practitioners are investing considerable effort to innovate in this field. If you find the process of training machines to make decisions on their own fascinating, this tutorial should be helpful. It demonstrates how, given some data, we can train a model to learn something and then reuse it over and over for decision making.
This process might seem interesting at first, but doing it step by step takes a lot of time. Tinkering with the code, changing models, and tracking down errors is time-consuming and requires a lot of effort. One effective approach is therefore to create a machine learning pipeline.
Table of Contents
1. What is the Machine Learning Pipeline?
2. Advantages of using Machine Learning Pipeline in a Production Environment
3. Case Study to learn more about Machine Learning Pipeline
- Data Preprocessing
- Exploratory Data Analysis
- Feature Selection
- Model Training
What is the Machine Learning Pipeline?
A machine learning pipeline refers to the creation of independent and reusable modules that can be chained together to form an entire workflow. Put simply, we divide our work into smaller parts and automate them so that the entire task runs as a sequence of small subtasks.
Let’s understand the concept with a real-life example from day-to-day life:
Think of it as similar to washing clothes. The procedure is to segregate the clothes into different piles, wash each pile, dry it, and hang it up. Each step can be done independently, and organising the work this way helps us finish it faster.
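As a minimal sketch of this idea in code, scikit-learn's Pipeline class chains independent steps into a single object. The step names (scale, tree) and the choice of a scaler are illustrative assumptions and are not part of the case study that follows.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the Iris features and labels bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Each step is an independent, reusable component; the pipeline runs them in order.
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),  # preprocessing step (illustrative choice)
    ("tree", DecisionTreeClassifier(max_depth=2, random_state=0)),  # model step
])

pipe.fit(X, y)  # fits the scaler first, then the tree on the scaled data
train_accuracy = pipe.score(X, y)
print(train_accuracy)
```

Swapping one step for another (for example a different classifier) leaves the rest of the pipeline untouched, which is the property the washing analogy describes.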
Advantages of using Machine Learning Pipeline in a Production Environment
Reusability of components
One of the major benefits of a pipeline is the flexibility to use its components independently and iteratively.
For example, if you have a dataset and would like to validate your model on two different criteria, you don’t need to run the training component again; you can run the validation component separately and create two pipelines. In this manner we can develop more accurate models with less effort.
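The following is a rough sketch of that reuse, assuming two hypothetical helper functions (train_component and validate_component) that stand in for pipeline components. The model is trained once and the validation component is then run twice with different criteria.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

def train_component(X_train, y_train):
    """Hypothetical training component: runs once and returns a fitted model."""
    return DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

def validate_component(model, X_val, y_val, metric):
    """Hypothetical validation component: scores an already fitted model."""
    return metric(y_val, model.predict(X_val))

def macro_f1_metric(y_true, y_pred):
    """Second criterion: F1 score averaged over the three classes."""
    return f1_score(y_true, y_pred, average="macro")

model = train_component(X_train, y_train)  # training happens only here

# The same fitted model is validated against two different criteria.
acc = validate_component(model, X_val, y_val, accuracy_score)
macro_f1 = validate_component(model, X_val, y_val, macro_f1_metric)
print(acc, macro_f1)
```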
Ease of implementation
Machine learning models can be created with more ease and efficiency once these components are defined, because each subpart can be implemented step by step.
Scalability and Customization
Pipelining automates the entire workflow, from collating and feeding data into the model to analysing the results. Such well-managed implementations are therefore more robust, scalable and customisable.
It also helps in extending a model: instead of copying and pasting all the previous code, we can reuse earlier components in the pipeline and build upon them. Copying previous code is cumbersome and is considered bad practice.
When we change a configuration setting, there is no need to update the various scripts manually; the change propagates automatically. This eliminates the errors that manual updates can introduce.
Now that we know the advantages of a machine learning pipeline, we come to the most interesting part of this tutorial, where we apply the concept to a concrete problem statement.
So, let’s walk through the various steps involved in creating one. I will use the popular Iris dataset for demonstration purposes, so the input to our pipeline is the Iris dataset.
Case Study to learn more about Machine Learning Pipeline
The Iris dataset contains 3 classes, each corresponding to a type of Iris plant: Iris Setosa, Iris Versicolor and Iris Virginica. There are 50 instances of each class. One class (Iris Setosa) is linearly separable from the other two, while the latter two are not linearly separable from each other. The features of a dataset are the characteristics of the data and can be numerical or categorical. Here there are 4 independent variables, or features as we call them in machine learning: sepal length, sepal width, petal length and petal width. All of these features are numerical.
Data Preprocessing
Our input dataset might have many issues that negatively affect the performance of the machine learning model. We need to deal with missing values and with noisy or duplicate data, and sometimes we need to rescale the data to a certain range to improve the accuracy of our model. Thus we need to preprocess our dataset before training the model.
For missing values, we could delete the entire row or fill in the missing values with the mean value of the feature. In a similar manner we could rescale, binarise, normalise or encode the data. Note that which techniques we use depends on the dataset as well as on the problem we are trying to address.
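A small sketch of these options on a made-up feature matrix follows (the numbers are invented for illustration and are not Iris data). It uses scikit-learn's SimpleImputer for mean imputation and MinMaxScaler for rescaling to the range 0 to 1.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# A tiny made-up feature matrix with one missing value (np.nan).
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
])

# Option 1: delete every row that contains a missing value.
X_dropped = X[~np.isnan(X).any(axis=1)]

# Option 2: replace each missing value with the mean of its column.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Rescale every feature to the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X_imputed)

print(X_dropped)
print(X_imputed)
print(X_scaled)
```

Dropping rows keeps only observed values but shrinks the dataset, while imputation keeps every row at the cost of introducing an estimated value.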
Another important step in data preprocessing is splitting the dataset into training and testing sets.
For Iris dataset classification, the following preprocessing techniques are used.
1. Encoding: The target variable, i.e. the Species column, is categorical and hence needs to be encoded as numerical data. We will use LabelEncoder from the sklearn.preprocessing module. It will encode Iris-setosa as 0, Iris-versicolor as 1 and Iris-virginica as 2.
from sklearn.preprocessing import LabelEncoder

# y holds the Species column; labels are encoded in sorted order
encoder = LabelEncoder()
y = encoder.fit_transform(y)
2. Train-test split: We will now split our data in an 80:20 ratio, with 80 percent of the original dataset used for training.
from sklearn.model_selection import train_test_split

# x holds the four feature columns, y the encoded target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, shuffle=True)
xtest.head()
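The snippets above assume that x and y already exist. One possible way to obtain them, sketched here under the assumption that the copy of Iris bundled with scikit-learn is used (the article does not say where its data comes from), is shown below. The random_state and stratify arguments are additions for reproducibility and balanced classes, not part of the original code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

iris = load_iris(as_frame=True)
x = iris.data  # DataFrame with the 4 numeric features

# Rebuild a string-valued Species column so that there is something to encode.
species = iris.target.map(lambda i: "Iris-" + str(iris.target_names[i]))

encoder = LabelEncoder()
y = encoder.fit_transform(species)

# classes_ is sorted, so the position of a label is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# 80:20 split; stratify=y keeps the class ratio equal in both parts.
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=42, stratify=y
)
print(xtrain.shape, xtest.shape)
```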
Exploratory Data Analysis
Before applying any model, understanding the dataset is a critical step. Exploring the data and finding its characteristics is what we call Exploratory Data Analysis. We can visualise the data using various libraries.
For the given Iris dataset, we can visually see the count of each species.
[Figure: count of each species in the Iris dataset. Image source: Author]
We can also visualise the distribution of each feature.
[Figure: distribution of each feature in the Iris dataset. Image source: Author]
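The numbers behind such plots can be computed with pandas. This is a sketch that assumes the scikit-learn copy of the dataset; the column name species is introduced here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()

# Replace the integer target with readable species names.
names = {i: str(name) for i, name in enumerate(iris.target_names)}
df["species"] = df["target"].map(names)
df = df.drop(columns="target")

# Count of each species (the data behind a count plot).
species_counts = df["species"].value_counts()
print(species_counts)

# Summary statistics of each feature (the data behind distribution plots).
summary = df.describe()
print(summary)

# Mean of every feature per species, a quick check of how the classes differ.
per_species_mean = df.groupby("species").mean()
print(per_species_mean)
```

The same DataFrame can be handed to a plotting library of your choice to draw the count and distribution charts.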
Feature Selection
Identifying the relevant features and selecting them appropriately to design the model is one of the most important tasks of building the pipeline.
A major problem faced while designing a machine learning model is the increase in complexity due to a large number of independent variables or features, and decreased accuracy due to the presence of irrelevant features. Thus reduced training time, a better fit to the data and better accuracy are the main objectives of feature selection.
Information gain, the chi-square test and the correlation matrix are some popular feature selection techniques. Regularisation techniques such as L1 regularisation can also be employed, since L1 drives the coefficients of uninformative features to zero.
Since the Iris dataset has only 4 features, which is relatively few, there is no need to select features here.
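Although selection is not needed for Iris, the techniques can still be sketched on it. The example below computes a correlation matrix with pandas and applies the chi-square test through scikit-learn's SelectKBest; keeping k=2 features is an arbitrary choice made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Pairwise correlation between the four features.
corr = X.corr()
print(corr.round(2))

# The chi-square test needs non-negative features, which holds for lengths in cm.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
selected = list(X.columns[selector.get_support()])
print(selected)
```

On this data the two petal measurements receive the highest chi-square scores, so they are the ones kept.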
Model Training
Once we have the cleaned data with the selected features, it’s time for the central step of the pipeline. In order to carry out the intended task, we need to train the selected model on our prepared dataset.
A machine learning model is simply a program that is fitted to a provided dataset to perform some task, such as classification or prediction, and produces outputs. What makes it unique is its ability to learn from the data during this process. Our machine learning model can fall into one of the following categories:
– Supervised learning: When input data along with the correct output is supplied to the model, the learning is known as supervised learning.
– Unsupervised learning: When unlabelled data is provided and the aim of the algorithm is to find patterns in the data, cluster it or find associations, the learning is known as unsupervised learning.
– Reinforcement learning: Its basic aim is to learn to take suitable actions in a particular environment so as to maximise the reward.
Now we can use any of the existing techniques such as linear regression, KNN, the Naive Bayes algorithm, logistic regression, SVMs, decision trees, random forests, or PCA (for dimensionality reduction), choosing one that suits the task. We can even try various models on our dataset to see which one works better.
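Trying several models can be sketched as a loop over candidate estimators, each scored with 5-fold cross validation. The three candidates and their settings below are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate classifiers; the hyperparameters are illustrative, not tuned.
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=2, random_state=0),
}

# Mean accuracy over 5 folds for every candidate.
mean_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```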
For our dataset we can use the decision tree classifier, imported from the sklearn.tree module. In simple terms, the decision tree classifier splits the dataset at each node by comparing one feature against a threshold. Here we can set max_depth as a hyperparameter, which is the maximum depth of the tree.
from sklearn.tree import DecisionTreeClassifier

# A shallow tree: at most two levels of splits
baseest = DecisionTreeClassifier(max_depth=2)
baseest.fit(xtrain, ytrain)
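To see the splits that such a tree learns, scikit-learn's export_text prints the fitted rules as text. This sketch fits on the full bundled dataset rather than on xtrain, purely to stay self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Same hyperparameter as in the article; random_state added for reproducibility.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each line of the output is one split of the form "feature <= threshold".
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```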
Before we evaluate our model on the test dataset, it is a good idea to validate it on a validation set. This is because, having trained our model, we can’t say for sure that it works well on unseen data, i.e. that it performs with the required accuracy.
The purpose of validation is thus to gain confidence that our model can give the desired results on unseen data, or to assure us that the relationships we have assumed between the data and the output are indeed correct.
Why is a validation process required for some datasets?
If we have a large dataset, we can split the data into 3 sets, i.e. train, validation and test sets. When the dataset is limited, however, cross validation is one of the most popular and effective techniques for model validation. Examples include k-fold cross validation and stratified k-fold, both of which repeatedly resample the data into training and validation folds. We can use these validation techniques and, if the output is not satisfactory, tune our hyperparameters and retrain our model, or choose a new model accordingly.
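A three-way split can be sketched with two calls to train_test_split. The 60/20/20 proportions are an assumption for illustration, and Iris is used only as a stand-in for a larger dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Then split the remainder 75/25, giving 60% train and 20% validation overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```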
Another interesting way of tuning hyperparameters is grid search with cross validation (GridSearchCV in scikit-learn), which finds the best parameters for an estimator within a supplied grid of parameter values, scoring each combination by cross validation.
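A minimal sketch of such a search over max_depth for a decision tree might look as follows. The grid values and the use of the bundled Iris data are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate values to try for the hyperparameter.
param_grid = {"max_depth": [1, 2, 3, 4, 5]}

# Every candidate is scored with stratified 5-fold cross validation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=cv)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, search.best_estimator_ holds a model refitted on all the supplied data with the best parameters found.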
Thus for the Iris dataset we can use stratified 5-fold cross validation.
Why Stratified K-fold Cross Validation?
The reason for using stratified k-fold cross validation is the limited number of instances in the dataset. In stratified k-fold validation, stratified sampling is done on the dataset. Stratified sampling means picking rows such that the ratio of the various classes in each sample remains similar to that in the training dataset, so that each sample accurately represents our training data. Here 5-fold means that the training data is divided into 5 parts; in each round 4 of them are used for training while the remaining 1 is held out for testing, and every part is held out exactly once.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

# 5 stratified folds; baseest is the decision tree defined above
skf = StratifiedKFold(n_splits=5, shuffle=True)
cvscore = cross_val_score(baseest, xtrain, ytrain, cv=skf)
print(cvscore)
print("Mean is :", cvscore.mean())
print("Standard deviation is : ", cvscore.std())
Here we get the cross validation score of each fold, along with their mean and standard deviation.
[1.         1.         0.91666667 0.95833333 1.        ]
Mean is : 0.975
Standard deviation is : 0.03333333333333334
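To check that stratification really preserves the class ratio, the sketch below counts the classes in the held-out part of every fold. It runs on the full bundled dataset (150 rows) rather than on xtrain, which is an assumption made to keep it self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_class_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Number of samples of class 0, 1 and 2 in the held-out fold.
    fold_class_counts.append(np.bincount(y[test_idx]).tolist())

print(fold_class_counts)
```

With 50 samples per class and 5 folds, every held-out fold contains 10 samples of each class.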
Evaluation is the last step of the pipeline, where we actually check our model’s performance on the test dataset and predict the target variable.
The evaluation metric to be used depends on the problem solved by our model.
For a binary classification problem, each prediction has one of four kinds of outcomes:
– True Positive: When a positive outcome is correctly classified.
– False Positive: When a negative outcome is classified as positive.
– True Negative: When a negative outcome is correctly classified.
– False Negative: When a positive outcome is classified as negative.
Three main metrics for evaluating the performance of a classification model are:
– Accuracy: How many of the total predictions turned out to be correct?
Accuracy = correct predictions / total predictions
– Precision: How many of the predicted positive outcomes in fact turned out to be positive, i.e. how precise is the model?
Precision = true positives / (true positives + false positives)
– Recall: How many instances of the positive class were correctly identified, or as we may say, recalled?
Recall = true positives / (true positives + false negatives)
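The four outcomes and the three metrics can be worked through on a small made-up example. The labels below are invented for illustration: 4 actual positives and 6 actual negatives, with one mistake of each kind.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up binary labels: 1 is the positive class, 0 the negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 5 1 1 3

accuracy = accuracy_score(y_true, y_pred)    # (3 + 5) / 10 = 0.8
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
print(accuracy, precision, recall)
```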
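The trade-off between the true positive rate (recall) and the false positive rate across decision thresholds can be visualised with an ROC curve.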
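As a sketch, the points of an ROC curve and the area under it can be computed with scikit-learn from true labels and predicted scores. The four labels and scores below are made up, and the example is binary because an ROC curve is defined for one positive class at a time.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (false positive rate, true positive rate) point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)
print(tpr)

# Area under the ROC curve: 1.0 is a perfect ranking, 0.5 is random guessing.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

Plotting tpr against fpr gives the curve itself.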
Can you think of some evaluation metrics used for regression problems?
For regression models we can use metrics such as mean squared error, explained variance or the R-squared score.
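These regression metrics are available in sklearn.metrics. The target and prediction values in this sketch are made up for illustration.

```python
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

# Made-up true values and predictions of a regression model.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Mean of the squared errors: (0.25 + 0.25 + 0 + 1) / 4 = 0.375
mse = mean_squared_error(y_true, y_pred)

# Share of the variance in y_true explained by the predictions (1.0 is best).
r2 = r2_score(y_true, y_pred)
explained_variance = explained_variance_score(y_true, y_pred)

print(mse, r2, explained_variance)
```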
So for our model we can calculate accuracy as follows:
from sklearn.metrics import accuracy_score

# Predict on the held-out test set and report accuracy as a percentage
ypred = baseest.predict(xtest)
print(accuracy_score(ytest, ypred) * 100, end='')
print('%')
Here the accuracy turns out to be pretty low. This is because a single shallow decision tree (max_depth=2 here) is a weak learner. We can use ensemble learning techniques such as boosting or bagging to build classifiers with very good performance out of decision trees.
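A rough sketch of both ensemble ideas, built from depth-2 trees and scored with 5-fold cross validation, is given below. The number of estimators and the choice of gradient boosting as the boosting method are assumptions; on a dataset as small as Iris the differences between the three may be slight.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "single_tree": DecisionTreeClassifier(max_depth=2, random_state=0),
    # Bagging: many shallow trees, each fitted on a bootstrap sample, vote together.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(max_depth=2), n_estimators=50, random_state=0
    ),
    # Boosting: shallow trees are added in sequence, each correcting earlier errors.
    "boosting": GradientBoostingClassifier(
        max_depth=2, n_estimators=50, random_state=0
    ),
}

# Mean accuracy over 5 folds for every model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```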
Now you can go ahead and build your own machine learning pipelines, try other classifiers, or tinker with these hyperparameters to achieve better performance.
Thanks for reading!
If you have any doubts or suggestions related to the article or in general, feel free to reach out to me.