Over the last 12 months, I have participated in a number of machine learning hackathons on Analytics Vidhya and in Kaggle competitions. After each competition, I always make sure to go through the winner’s solution. The winners’ solutions usually give me critical insights, which have helped me immensely in later competitions.
Most of the winners rely on an ensemble of well-tuned individual models along with feature engineering. If you are starting out with machine learning, I would advise you to emphasize these two areas, as I have found them equally important for doing well in a machine learning competition.
Most of the time, I was able to crack the feature engineering part but probably didn’t use an ensemble of multiple models. If you are a beginner, it’s even better to get familiar with ensembling as early as possible. Chances are that you are already applying it without knowing!
In this article, I’ll take you through the basics of ensemble modeling. Then I will walk you through the advantages of ensembling. Finally, to give you hands-on experience with ensemble modeling, we will apply ensembling to a hackathon problem using R.
P.S. For this article, we will assume that you can build individual models in R / Python. If not, you can start your journey with our learning path.
In general, ensembling is a technique for combining two or more algorithms of similar or dissimilar types, called base learners. This is done to build a more robust system that incorporates the predictions from all the base learners. It can be understood as a conference room meeting between multiple traders who must decide whether the price of a stock will go up or not.
Each trader has a different understanding of the stock market, and thus a different mapping function from the problem statement to the desired outcome. Therefore, they are expected to make varied predictions about the stock price based on their own understanding of the market.
Now we can take all of these predictions into account while making the final decision. This makes our final decision more robust, more accurate and less likely to be biased. The final decision might have been the opposite if one of these traders had made it alone.
You can consider another example: a candidate going through multiple rounds of job interviews. The final decision about the candidate’s ability is generally based on the feedback of all the interviewers. A single interviewer might not be able to test the candidate on every required skill and trait, but the combined feedback of multiple interviewers usually leads to a better assessment of the candidate.
Some of the basic concepts which you should be aware of before we go into further detail are:
Practically speaking, there can be a countless number of ways in which you can ensemble different models. But these are some techniques that are mostly used:
Suppose our data has three rows: Row 1, Row 2 and Row 3. For the bootstrapped sample, we choose one of these three at random. Say we chose Row 2.
You can see that even though Row 2 has been copied from the data into the bootstrapped sample, it is still present in the data. Now, each of the three rows has the same probability of being selected again. Let’s say we choose Row 1 this time.
Again, each row in the data has the same probability of being chosen for the bootstrapped sample. Let’s say we randomly choose Row 1 again.
Thus, we can have multiple bootstrapped samples from the same data. Once we have these multiple bootstrapped samples, we can grow trees for each of these bootstrapped samples and use the majority vote or averaging concepts to get the final prediction. This is how bagging works.
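The sampling described above can be sketched in a few lines of base R. This is only an illustration on a made-up three-row data frame, not the loan dataset; the names `toy`, `draw_bootstrap` and `bagged_estimate` are hypothetical, and the "model" being bagged is just the sample mean so that the example stays short.

```r
# Minimal sketch of bootstrap sampling and bagging (toy data, base R only)
set.seed(1)

# A toy dataset with three rows, mirroring the Row 1 / Row 2 / Row 3 example
toy <- data.frame(row = c("Row 1", "Row 2", "Row 3"), x = c(10, 20, 30))

# Draw one bootstrapped sample: same size as the data, sampled WITH replacement,
# so the same row can appear more than once
draw_bootstrap <- function(df) {
  idx <- sample(nrow(df), size = nrow(df), replace = TRUE)
  df[idx, , drop = FALSE]
}

# Several bootstrapped samples from the same data
boot_samples <- lapply(1:5, function(i) draw_bootstrap(toy))

# "Train" a trivial model on each sample (here: the mean of x)
# and average the per-sample estimates to get the bagged prediction
bagged_estimate <- mean(sapply(boot_samples, function(s) mean(s$x)))

print(boot_samples[[1]])
print(bagged_estimate)
```

With real models, the trivial mean would be replaced by a tree grown on each bootstrapped sample, and the last step by a majority vote or an average of the trees' predictions.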
One important thing to note here is that bagging is done mainly to reduce variance. Random forest uses this concept but goes a step further: to reduce the variance even more, it also randomly chooses a subset of features to consider for the splits while training on each bootstrapped sample.
Boosting relies on creating a series of weak learners, each of which might not be good for the entire dataset but is good for some part of it. Thus, each model actually boosts the performance of the ensemble.
It’s really important to note that boosting is focused on reducing the bias. This makes the boosting algorithms prone to overfitting. Thus, parameter tuning becomes a crucial part of boosting algorithms to make them avoid overfitting.
Some examples of boosting are XGBoost, GBM, AdaBoost, etc.
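To make the idea of a series of weak learners concrete, here is a minimal sketch of one flavour of boosting: gradient boosting with squared loss, where each weak learner is a one-split stump fitted to the current residuals. It is not the algorithm used inside XGBoost, GBM or AdaBoost; the toy data and the names `fit_stump`, `predict_stump` and `learning_rate` are assumptions made for illustration.

```r
# Minimal sketch of boosting with decision stumps (toy regression data, base R only)
set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.1)

# Weak learner: a single split on x, predicting the mean residual on each side
fit_stump <- function(x, r) {
  splits <- (head(x, -1) + tail(x, -1)) / 2   # midpoints between sorted x values
  best <- NULL
  best_sse <- Inf
  for (s in splits) {
    left <- mean(r[x <= s])
    right <- mean(r[x > s])
    sse <- sum((r - ifelse(x <= s, left, right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, left = left, right = right)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

learning_rate <- 0.3
pred <- rep(mean(y), length(y))   # start from a constant prediction
mse <- numeric(0)

for (m in 1:50) {
  stump <- fit_stump(x, y - pred)                          # fit the residuals
  pred <- pred + learning_rate * predict_stump(stump, x)   # add a damped correction
  mse <- c(mse, mean((y - pred)^2))
}

print(round(mse[c(1, 10, 50)], 4))
```

The training error keeps shrinking as more stumps are added, which is the bias reduction mentioned above. It is also why the number of rounds and the learning rate need tuning: the training error alone will not tell you when the model starts to overfit.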
Let’s understand stacking with an example:
Here, we have two layers of machine learning models:
Here, we have used only two layers but it can be any number of layers and any number of models in each layer. Two of the key principles for selecting the models:
One thing that you might have realized is that we have used the top layer model which takes as input the predictions of the bottom layer models. This top layer model can also be replaced by many other simpler formulas like:
I believe you have a good grasp of ensembling concepts by now. Enough theory: let’s get down to implementing ensembling and see whether it can help us improve our accuracy on a real machine learning challenge. If you wish to read more about the basics of ensembling, then you can refer to this resource.
To implement ensembling, I have chosen the Loan Prediction problem. We have to predict, based on the applicant’s profile, whether the bank should approve the loan. It’s a binary classification problem. You can read more about the problem here.
I’ll be using the caret package in R for training the individual models. It’s the go-to package for modeling in R. Don’t worry if you are not familiar with caret; you can go through this article for a comprehensive introduction to the package. Let’s start by loading and cleaning the data.
#Loading the required libraries
library('caret')

#Setting the random seed
set.seed(1)

#Loading the hackathon dataset
#(stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0, where the default changed to FALSE)
data<-read.csv(url('https://datahack-prod.s3.ap-south-1.amazonaws.com/train_file/train_u6lujuX_CVtuZ9i.csv'), stringsAsFactors = TRUE)

#Let's see the structure of the dataset
str(data)
'data.frame': 614 obs. of 13 variables:
 $ Loan_ID : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
 $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
 $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
 $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
 $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
 $ ApplicantIncome : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
 $ CoapplicantIncome: num 0 1508 0 2358 0 ...
 $ LoanAmount : int NA 128 66 120 141 267 95 158 168 349 ...
 $ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
 $ Credit_History : int 1 1 1 1 1 1 1 0 1 1 ...
 $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
 $ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...

#Does the data contain missing values
sum(is.na(data))
[1] 86

#Imputing missing values using median
preProcValues <- preProcess(data, method = c("medianImpute","center","scale"))
library('RANN')
data_processed <- predict(preProcValues, data)
sum(is.na(data_processed))
[1] 0
#Splitting training set into two parts based on outcome: 75% and 25%
index <- createDataPartition(data_processed$Loan_Status, p=0.75, list=FALSE)
trainSet <- data_processed[ index,]
testSet <- data_processed[-index,]
I have divided the data into two parts which I’ll be using to simulate the training and testing operations. We now define the training controls and the predictor and outcome variables:
#Defining the training controls for multiple models
fitControl <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = 'final',
  classProbs = T)

#Defining the predictors and outcome
predictors<-c("Credit_History", "LoanAmount", "Loan_Amount_Term", "ApplicantIncome", "CoapplicantIncome")
outcomeName<-'Loan_Status'
Now let’s get started with training a random forest and test its accuracy on the test set that we have created:
#Training the random forest model
model_rf<-train(trainSet[,predictors],trainSet[,outcomeName],method='rf',trControl=fitControl,tuneLength=3)

#Predicting using random forest model
testSet$pred_rf<-predict(object = model_rf,testSet[,predictors])

#Checking the accuracy of the random forest model
#(confusionMatrix() takes the predictions first and the reference second; the call below passes them in the opposite order, so the Prediction and Reference labels in the output are swapped. Accuracy is unaffected.)
confusionMatrix(testSet$Loan_Status,testSet$pred_rf)
Confusion Matrix and Statistics

          Reference
Prediction  N  Y
         N 28 20
         Y  9 96

Accuracy : 0.8105
95% CI : (0.7393, 0.8692)
No Information Rate : 0.7582
P-Value [Acc > NIR] : 0.07566

Kappa : 0.5306
Mcnemar's Test P-Value : 0.06332

Sensitivity : 0.7568
Specificity : 0.8276
Pos Pred Value : 0.5833
Neg Pred Value : 0.9143
Prevalence : 0.2418
Detection Rate : 0.1830
Detection Prevalence : 0.3137
Balanced Accuracy : 0.7922

'Positive' Class : N
Well, as you can see, we got 0.81 accuracy with the individual random forest model. Let’s see the performance of KNN:
#Training the knn model
model_knn<-train(trainSet[,predictors],trainSet[,outcomeName],method='knn',trControl=fitControl,tuneLength=3)

#Predicting using knn model
testSet$pred_knn<-predict(object = model_knn,testSet[,predictors])

#Checking the accuracy of the knn model
confusionMatrix(testSet$Loan_Status,testSet$pred_knn)
Confusion Matrix and Statistics

          Reference
Prediction   N   Y
         N  29  19
         Y   2 103

Accuracy : 0.8627
95% CI : (0.7979, 0.913)
No Information Rate : 0.7974
P-Value [Acc > NIR] : 0.0241694

Kappa : 0.6473
Mcnemar's Test P-Value : 0.0004803

Sensitivity : 0.9355
Specificity : 0.8443
Pos Pred Value : 0.6042
Neg Pred Value : 0.9810
Prevalence : 0.2026
Detection Rate : 0.1895
Detection Prevalence : 0.3137
Balanced Accuracy : 0.8899

'Positive' Class : N
It’s great that we are able to get 0.86 accuracy with the individual KNN model. Let’s see the performance of logistic regression as well before we go on to create an ensemble of these three.
#Training the logistic regression model
model_lr<-train(trainSet[,predictors],trainSet[,outcomeName],method='glm',trControl=fitControl,tuneLength=3)

#Predicting using logistic regression model
testSet$pred_lr<-predict(object = model_lr,testSet[,predictors])

#Checking the accuracy of the logistic regression model
confusionMatrix(testSet$Loan_Status,testSet$pred_lr)
Confusion Matrix and Statistics

          Reference
Prediction   N   Y
         N  29  19
         Y   2 103

Accuracy : 0.8627
95% CI : (0.7979, 0.913)
No Information Rate : 0.7974
P-Value [Acc > NIR] : 0.0241694

Kappa : 0.6473
Mcnemar's Test P-Value : 0.0004803

Sensitivity : 0.9355
Specificity : 0.8443
Pos Pred Value : 0.6042
Neg Pred Value : 0.9810
Prevalence : 0.2026
Detection Rate : 0.1895
Detection Prevalence : 0.3137
Balanced Accuracy : 0.8899

'Positive' Class : N
And logistic regression also gives us an accuracy of 0.86.
Now, let’s try out different ways of forming an ensemble with these models as we have discussed:
#Predicting the probabilities
testSet$pred_rf_prob<-predict(object = model_rf,testSet[,predictors],type='prob')
testSet$pred_knn_prob<-predict(object = model_knn,testSet[,predictors],type='prob')
testSet$pred_lr_prob<-predict(object = model_lr,testSet[,predictors],type='prob')

#Taking average of predictions
testSet$pred_avg<-(testSet$pred_rf_prob$Y+testSet$pred_knn_prob$Y+testSet$pred_lr_prob$Y)/3

#Splitting into binary classes at 0.5
testSet$pred_avg<-as.factor(ifelse(testSet$pred_avg>0.5,'Y','N'))
#The majority vote: predict 'Y' when at least two of the three models predict 'Y'
testSet$pred_majority<-as.factor(
  ifelse(testSet$pred_rf=='Y' & testSet$pred_knn=='Y','Y',
  ifelse(testSet$pred_rf=='Y' & testSet$pred_lr=='Y','Y',
  ifelse(testSet$pred_knn=='Y' & testSet$pred_lr=='Y','Y','N'))))
#Taking weighted average of predictions
testSet$pred_weighted_avg<-(testSet$pred_rf_prob$Y*0.25)+(testSet$pred_knn_prob$Y*0.25)+(testSet$pred_lr_prob$Y*0.5)

#Splitting into binary classes at 0.5
testSet$pred_weighted_avg<-as.factor(ifelse(testSet$pred_weighted_avg>0.5,'Y','N'))
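To see how the three combining rules can disagree, and how their accuracy could be compared, here is a small self-contained sketch on made-up probabilities. Nothing in it comes from the loan data: the vectors `p_rf`, `p_knn`, `p_lr` and `truth`, and the helpers `to_class` and `accuracy`, are hypothetical and only mimic the shape of the objects built above.

```r
# Toy class-"Y" probabilities from three hypothetical models for five applicants
p_rf  <- c(0.9, 0.4, 0.6, 0.2, 0.8)
p_knn <- c(0.8, 0.6, 0.4, 0.3, 0.4)
p_lr  <- c(0.7, 0.3, 0.7, 0.1, 0.45)
truth <- factor(c("Y", "N", "Y", "N", "Y"), levels = c("N", "Y"))

# Split probabilities into binary classes at 0.5
to_class <- function(p) factor(ifelse(p > 0.5, "Y", "N"), levels = c("N", "Y"))

# Averaging: mean of the three probabilities
pred_avg <- to_class((p_rf + p_knn + p_lr) / 3)

# Majority vote: "Y" when at least two of the three models say "Y"
votes <- (p_rf > 0.5) + (p_knn > 0.5) + (p_lr > 0.5)
pred_majority <- factor(ifelse(votes >= 2, "Y", "N"), levels = c("N", "Y"))

# Weighted average: same weights as in the code above (0.25, 0.25, 0.5)
pred_weighted <- to_class(0.25 * p_rf + 0.25 * p_knn + 0.5 * p_lr)

# Share of correct predictions
accuracy <- function(pred, actual) mean(pred == actual)

print(c(average  = accuracy(pred_avg, truth),
        majority = accuracy(pred_majority, truth),
        weighted = accuracy(pred_weighted, truth)))
```

On the last toy applicant only one model votes "Y", so the majority vote says "N", while both probability-based rules still land above 0.5. The same `accuracy` helper could be applied to `testSet$pred_avg`, `testSet$pred_majority` and `testSet$pred_weighted_avg` against `testSet$Loan_Status`, provided the factor levels match.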
Before proceeding further, I would like you to recall the two important criteria that we discussed earlier, on individual model accuracy and inter-model prediction correlation, both of which must be fulfilled. In the above ensembles, I have skipped checking the correlation between the predictions of the three models; I picked these three models arbitrarily to demonstrate the concepts. If their predictions are highly correlated, then combining them might not give better results than the individual models. But you get the point, right?
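One simple way to check the correlation criterion is to compute the pairwise correlation of the models' predicted probabilities. The sketch below does this on simulated predictions, so the numbers mean nothing for the loan problem; the data frame `preds` and the cut-off of 0.75 are assumptions chosen for illustration, not a recommended threshold.

```r
# Simulated class probabilities: two models that track the same signal
# and a third that is unrelated to them
set.seed(1)
n <- 200
signal <- runif(n)
clip <- function(p) pmin(pmax(p, 0), 1)   # keep values inside [0, 1]

preds <- data.frame(
  rf  = clip(signal + rnorm(n, sd = 0.05)),
  knn = clip(signal + rnorm(n, sd = 0.05)),
  lr  = runif(n)
)

# Pairwise Pearson correlation between the models' predictions
cor_matrix <- cor(preds)
print(round(cor_matrix, 2))

# Flag model pairs whose correlation exceeds the (hypothetical) cut-off
high <- which(abs(cor_matrix) > 0.75 & upper.tri(cor_matrix), arr.ind = TRUE)
print(high)
```

For the models in this article, the same call could be applied to a data frame built from `testSet$pred_rf_prob$Y`, `testSet$pred_knn_prob$Y` and `testSet$pred_lr_prob$Y`. Highly correlated pairs add little to an ensemble, because the models make the same mistakes.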
So far, we have used simple formulas at the top layer. Instead, we can use another machine learning model, which is essentially what stacking is. For a regression problem, we can use linear regression to learn a linear mapping from the bottom layer models’ predictions to the outcome; for a classification problem, we can use logistic regression in the same way.
Moreover, we don’t need to restrict ourselves to linear models. We can also use more complex models, such as GBM or neural nets, to learn a non-linear mapping from the predictions of the bottom layer models to the outcome.
On the same example, let’s try applying logistic regression and GBM as top layer models. Remember the steps that we’ll take:
One extremely important thing to note in step 2 is that you should always use out-of-fold predictions for the training data, that is, predictions made for each row by a model that did not see that row during training. Otherwise, the importance the top layer assigns to each base layer model will only reflect how well that model can recall the training data.
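The idea can be sketched by hand in base R, without caret. The example below is an illustration on simulated data with a plain logistic regression as the base model; the names `toy`, `fold` and `oof` are hypothetical. Each row's prediction comes from a model fitted on the other folds only.

```r
# Minimal sketch of out-of-fold predictions with 5-fold cross-validation
set.seed(1)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x))   # simulated binary outcome

k <- 5
fold <- sample(rep(1:k, length.out = n))     # random fold label for every row
oof <- rep(NA_real_, n)                      # one out-of-fold prediction per row

for (f in 1:k) {
  held_out <- fold == f
  # Fit on every fold except f ...
  fit <- glm(y ~ x, data = toy[!held_out, ], family = binomial)
  # ... and predict only the rows the model has not seen
  oof[held_out] <- predict(fit, newdata = toy[held_out, ], type = "response")
}

summary(oof)
```

The vector `oof` is what a top layer model would be trained on. With caret, `savePredictions = 'final'` in `trainControl` stores these held-out predictions in the `pred` element of the fitted model, so the loop does not have to be written by hand.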
Although most of these steps have already been done above, I’ll walk you through them one by one again.
#Defining the training control
fitControl <- trainControl(
  method = "cv",
  number = 10,
  savePredictions = 'final', # To save out of fold predictions for best parameter combinations
  classProbs = T # To save the class probabilities of the out of fold predictions
)

#Defining the predictors and outcome
predictors<-c("Credit_History", "LoanAmount", "Loan_Amount_Term", "ApplicantIncome", "CoapplicantIncome")
outcomeName<-'Loan_Status'

#Training the random forest model
model_rf<-train(trainSet[,predictors],trainSet[,outcomeName],method='rf',trControl=fitControl,tuneLength=3)

#Training the knn model
model_knn<-train(trainSet[,predictors],trainSet[,outcomeName],method='knn',trControl=fitControl,tuneLength=3)

#Training the logistic regression model
model_lr<-train(trainSet[,predictors],trainSet[,outcomeName],method='glm',trControl=fitControl,tuneLength=3)
#Predicting the out of fold prediction probabilities for training data
trainSet$OOF_pred_rf<-model_rf$pred$Y[order(model_rf$pred$rowIndex)]
trainSet$OOF_pred_knn<-model_knn$pred$Y[order(model_knn$pred$rowIndex)]
trainSet$OOF_pred_lr<-model_lr$pred$Y[order(model_lr$pred$rowIndex)]

#Predicting probabilities for the test data
testSet$OOF_pred_rf<-predict(model_rf,testSet[predictors],type='prob')$Y
testSet$OOF_pred_knn<-predict(model_knn,testSet[predictors],type='prob')$Y
testSet$OOF_pred_lr<-predict(model_lr,testSet[predictors],type='prob')$Y
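The `order(...$rowIndex)` step is easy to miss. As far as I understand caret's output, the saved predictions are grouped by resampling fold rather than stored in the original row order, and `rowIndex` records which training row each prediction belongs to. The toy data frame below only mimics that shape (the values are made up) to show what the reordering does.

```r
# Mimic the shape of model$pred: one row per held-out prediction,
# stored fold by fold rather than in the original row order
pred <- data.frame(
  rowIndex = c(3, 5, 1, 4, 2),
  Y        = c(0.30, 0.50, 0.10, 0.40, 0.20)
)

# order() returns the positions that sort rowIndex ascending, so indexing
# with it puts the prediction for training row 1 first, row 2 second, etc.
aligned <- pred$Y[order(pred$rowIndex)]
print(aligned)
# [1] 0.1 0.2 0.3 0.4 0.5
```

Without this step, the out-of-fold probabilities would be attached to the wrong rows of `trainSet`, and the top layer model would be trained on mismatched predictions and outcomes.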
First, let’s start with the GBM model as the top layer model.
#Predictors for top layer models
predictors_top<-c('OOF_pred_rf','OOF_pred_knn','OOF_pred_lr')

#GBM as top layer model
model_gbm<-train(trainSet[,predictors_top],trainSet[,outcomeName],method='gbm',trControl=fitControl,tuneLength=3)
Similarly, we can create an ensemble with logistic regression as the top layer model as well.
#Logistic regression as top layer model
model_glm<-train(trainSet[,predictors_top],trainSet[,outcomeName],method='glm',trControl=fitControl,tuneLength=3)
#predict using GBM top layer model
testSet$gbm_stacked<-predict(model_gbm,testSet[,predictors_top])

#predict using logistic regression top layer model
testSet$glm_stacked<-predict(model_glm,testSet[,predictors_top])
Great! You made your first ensemble.
Note that it’s really important to choose the models for the ensemble wisely to get the best out of it. The two rules of thumb that we discussed will greatly help you with that.
By now, you might have developed an in-depth conceptual as well as practical knowledge of ensembling. I would like to encourage you to practice this on machine learning hackathons on Analytics Vidhya, which you can find here.
You’ll probably find this article on top five questions related to ensembling helpful.
Also, if you missed out on the skilltest on ensembling, you can check your understanding of ensembling concepts here.
Ensembling is a popular and effective technique that data scientists frequently use to beat the accuracy benchmark of even the best individual algorithms. More often than not, it’s the winning recipe in hackathons. The more you use ensembling, the more you’ll admire its beauty.
Did you enjoy reading this article? Do share your views in the comment section below. If you have any doubts / questions feel free to drop them in the comments below.
Awesome tutorial. very nicely explained!
Thanks!
Nicely written!
Thanks, Glad you liked it!
Thank you for the great article! Although you could have shared the result of your ensembles...
Hi Albert, I'm glad you found it helpful. Yes, I didn't shared the results of the ensemble as I wanted to encourage readers to try it out and find it themselves. Whether it gave better performance than any individual model. If it did, great. But if it didn't, then you'll need to think about why it didn't and what could be done to overcome it by thinking on the lines of the important criterias for ensembling that I have mentioned. I think this curiosity will make you try it as well! Best, Saurav.
Simply written and easy to understand,.
Hi Kumar, Glad you liked it!
Kudos, Very well written.
Hi Ravi, Thanks. I'm glad you found this helpful.
thank you. Very good. However I would like to check the accuracy of the ensembled model and you did not show it.
Hi Gonzalo, I’m glad you found it helpful. Yes, I didn’t shared the results of the ensemble as I wanted to encourage readers to try it out and find it themselves. Whether it gave better performance than any individual model. If it did, great. But if it didn’t, then you’ll need to think about why it didn’t and what could be done to overcome it by thinking on the lines of the important criterias for ensembling that I have mentioned. I think this curiosity will make you try it as well! Best, Saurav.
I really wanted to compliment in being so transparent in sharing this knowledge and learning platform. You are strengthening the often overlooked notion when in reality sharing knowledge makes us better. I am an avid reader of your posts as it helps me understand the changing world of advanced analytics and have informed discussions with data scientists. Thanks again Yogi
Hi Yogesh, We at Analytics Vidhya are really happy to help you. Best, Saurav
Simple & Fantastic!
Great article!!It would be great if you could also explain about Monte-Carlo simulation for finding best ensemble weights.
Thanks Saurav. This article came at the right time. I was looking for stacking and blending example and read many article on the subject but didn't find complete hands on example on R. So struggling thru but this article initialized me with the basic.That's why I love this site!!!!.
Thanks Saurav, I have been searching for something like this to learn ensemble modeling and glad i found it. Kudos to your work!
Very well explained .
Hi Saurav, You are nicely presented about ensemble modelling. I really learned lot. I have clarification here.If i look at the disadvantages of ensemble modelling,It giving the sense like no use of ensemble modelling in the real time application an other than data hackthon. Correct me if i am wrong. Regards Arun
Hi Arun, See these are the practical challenges that you're bound to face when you use ensembling. It is definitely more time consuming than using a single model. They way I see it is as a trade-off between accuracy and the training time. You can optimise the parameter search, use higher computation power, etc to reduce the time if ensembling is giving you really good results than any single model. This always boils down to what is the threshold time you want to generate the response within for your real time application. Hope you found it helpful. Best, Saurav.
what have we done without you!! I liked this article so much ..could easily understand.. Thanks
Hi Saurav, I completely agree with you. There are more possibility to improve the accuracy using Ensemble modelling. We are building model only to draw crucial business insights. But as you mentioned in the disadvantages of ensemble modelling "It very difficult to draw any crucial business insights at the end." then how ensemble modelling will benefits us,even have higher accuracy. Thank you.
This is Just awesome explanation.Liked Analytics Vidhya a lot.
Thanks Saurav for keeping it simple. Well, I find that KNN, in terms of accuracy, is the best among others including the stacked models. And glm_stacked is exactly same as the logistic regression.
Hi , Thanks for the article :) When i am trying to use : testSet$pred_rf_prob<-predict(object = model_rf,testSet[,predictors],type='prob') I am getting zero probabilities. Why it is giving zero probabilities and what it actually means?
For the part : model_rf<-train(trainset[, predictors],trainset[,outcomeName],method='rf',trControl=fitControl,tuneLength=3) I am getting error msg as : Error in trainset[, predictors] : incorrect number of dimensions Why?
I am starting to get a Machine Learning overdose here..! As a newbie to Data Science I initially thought that a simple regression model, a decision tree or a knn setting would do the job. And I just found out the reason that I cannot reach the highest positions in the leaderboard by using only them! Now I have to study some more I guess....Anyway, thank you very much for your wonderful presentation and explanation of the topic, you have pointed me the way ahead!
Very Nicely explained
Great article! Want to clarify something though regarding the stacking procedure: There are others that does this: they train the base model i.e RF, KNN and logistic with the testset (the 25% of the trainset) and later, fit the predicted values as features for the next layer model i.e GBM. With this model, we fit the actual test data (that is not read in this tutorial) to get the predicted submission. Is this correct?
Fantastic tutorial. Explained with such ace. Thank you very much. Things are starting to make sense now. Thank you again. I really liked the way you distinguished the Base models and Top model in the diagram and accordingly did the code. Cheers!
Hi Saurav, Thank you so much for sharing you knowledge here. I have learned a bunch from your post. I just wanted to point out that your method for extracting predictions for one particular level won't work out, if we have more than 2 levels. I think the better approach for such cases or overall (even for the 2 level cases), would be to just do the predict(), without declaring the type 'prob'. Below is the sections, I'm referring to (using $Y): #Predicting the out of fold prediction probabilities for training data trainSet$OOF_pred_rf<-model_rf$pred$Y[order(model_rf$pred$rowIndex)] #Predicting probabilities for the test data testSet$OOF_pred_rf<-predict(model_rf,testSet[predictors],type='prob')$Y Please, correct me if I may be missing something here and you may have better overall approach for such cases (mostly with more than 2 levels). Thank you :)
Is there the same article in python? If not, can you add it please! Thank You.
Saurav..i would like to thank you for your post..its very simple to understand and equally simple to implement. However, i faced some minor issues: When we predict the probabilities (for taking the average), we actually get 2 columns.. for e.g. CVSet$pred_rf_prob = predict(object = model_rf,CVSet[,input],type='prob') will actually create 2 columns in the data frame. CVSet$pred_rf_prob.N and CVSet$pred_rf_prob.Y which will contain the probability of getting a N or a Y We can then take the average of the 3 models..but now come 2 problems 1. the command CVSet$pred_avg = ifelse(CVSet$pred_avg>0.5,'Y','N') actually does the opposite...it gives N when prob.Y>0.5 and Y when prob.Y<0.5, which leads me to understand that the conputation is happening on the N column and not on the Y column 2. when converting to Y or N, the value is stored in the prob.N column, leaving prob.Y column as empty.....problem now comes when converting to factor CVSet$pred_avg = factor(CVSet$pred_avg) gives an error "replacement has 306 rows, data has 153" Also, can somebody please check: whether as.factor() or factor() should be used..
How do you check for the correlation between the predictions of the three models ?? any useful links?
In step 2 why did we use order?? Please help #Predicting the out of fold prediction probabilities for training data trainSet$OOF_pred_rf<-model_rf$pred$Y[order(model_rf$pred$rowIndex)] trainSet$OOF_pred_knn<-model_knn$pred$Y[order(model_knn$pred$rowIndex)] trainSet$OOF_pred_lr<-model_lr$pred$Y[order(model_lr$pred$rowIndex)]
Hi Saurav, Thanks for such valuable post, any idea how to do it using java? Regards
Hi, In step 3 #Predictors for top layer models predictors_top<-c('OOF_pred_rf','OOF_pred_knn','OOF_pred_lr') #GBM as top layer model model_gbm<- train(trainSet[,predictors_top],trainSet[,outcomeName],method='gbm',trControl=fitControl,tuneLength=3) when i try to use the code on different data set , i get a error Error in `[.data.frame`(trainSet, , predictors_top) : undefined columns selected Have you defined ('OOF_pred_rf','OOF_pred_knn','OOF_pred_lr') ? how can we use them in model_gbm. They dont consist of any data. Please clarify. Thanks
Hi Saurav, Thanks for the wonderful article, little correction (if I understood ensemble correctly) in the last step--> Step 4: Finally, predict using the top layer model with the predictions of bottom layer models that has been made for testing data I think the last line should be using "gbm_stacked" data for glm prediction As follows: #predict using logictic regression top layer model testSet$glm_stacked<-predict(model_glm,testSet[,gbm_stacked]) let me know if I am being wrong here. Thanks Shubham
Nice explaination!
Excellent!! Very well written!!
Excellent!! Very well explained. Thank tou!!! I have the following error. Please review. Please Help me. ... > confusionMatrix(testSet$Loan_Status,testSet$pred_rf) Error: `data` and `reference` should be factors with the same levels.