This article was published as a part of the Data Science Blogathon.

In this article, we will first discuss some of the common Ensemble methods and their disadvantages. Then we will discuss how those disadvantages can be addressed by another Ensemble technique known as Stacking and Blending, and how to build it in Python. Finally, we will wrap everything up and create an easy-to-use function.

Ensemble methods are machine learning techniques that combine different models to build a more optimal model. Different machine learning models extract patterns in different ways, so using all of them together can produce a better model. Two common approaches are:

- Building different models and taking the mean or a majority vote of them

Though this method is very simple to use and often gives a better result, the problem is that weaker models get the same priority as stronger ones, which can sometimes bring the score down.

- Building different models and taking a weighted average of them

In this method, we can assign higher weights to the stronger models, but the weights have to be guessed by hand, which is difficult and not very desirable.
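As a quick illustration of both ideas in a regression setting, here is a minimal sketch. The base models and the names `x_tr`, `y_tr`, `x_te` are hypothetical placeholders for already-prepared data, not the exact setup used later in this article:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# hypothetical base models; x_tr, y_tr, x_te are assumed to be already-prepared arrays
models = [LinearRegression(), DecisionTreeRegressor(), KNeighborsRegressor()]
preds = np.column_stack([m.fit(x_tr, y_tr).predict(x_te) for m in models])

mean_pred = preds.mean(axis=1)                   # simple average: every model gets equal say, weak or strong
weights = np.array([0.5, 0.2, 0.3])              # hand-picked weights: hard to choose well
weighted_pred = preds @ weights / weights.sum()  # weighted average of the base predictions
```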

**Stacking and Blending:** To get rid of the above-mentioned problems, stacking and blending can be used for ensemble modelling. The steps are:

- Divide the training data into N equal parts.
- Keeping one part aside at a time, build N training DataFrames from the remaining parts.
- Fit a model on each of these training DataFrames and predict on its left-out part.
- Fit the same model on the entire training data and predict on the testing data.
- Now we have predictions for every part of the training data as well as for the testing data.
- Repeat the same process with the other models.
- Use these new predictions as features to build a new set of training and testing data.
- Make the final prediction on the new testing data with the help of the new training data.

import pandas as pd
import numpy as np

# raw strings (r"...") avoid invalid escape sequences in the Windows paths
x_train = pd.read_csv(r"C:\Users\chakr\Desktop\Clean_data\X_train_reg.csv")
y_train = pd.read_csv(r"C:\Users\chakr\Desktop\Clean_data\y_train_reg.csv")
x_train.head()

from sklearn.model_selection import train_test_split

# hold out 25% of the data to evaluate the final stacked model later
x_train1, x_train2, y_train1, y_train2 = train_test_split(
    x_train, y_train, test_size=0.25, random_state=42)

__Step 1: Divide the dataset into N parts (here we use 20 parts)__

def get_dataset(x_train, y_train, N=5):
    # shuffle the data, then rebuild x and y so the rows stay aligned
    merge = pd.concat([x_train, y_train], axis=1)
    merge = merge.sample(frac=1, random_state=1).reset_index(drop=True)
    y_train = merge.iloc[:, (merge.shape[1]-1):(merge.shape[1])]
    x_train = merge.iloc[:, 0:(merge.shape[1]-1)]

    # start/stop row indices of the N chunks
    z = int(len(x_train)/N)
    start = [0]
    stop = []
    for i in range(1, N):
        start.append(z*i)
        stop.append(z*i)
    stop.append(len(x_train))

    c = list()
    train_data = list()
    test_data = list()
    y_data = list()
    for i in range(0, N):
        c = list(range(start[i], stop[i]))
        # training chunk: every row except the current part; test chunk: the left-out part
        train_data.append(x_train.iloc[[k for k in range(0, len(x_train)) if k not in c], :])
        y_data.append(y_train.iloc[[k for k in range(0, len(y_train)) if k not in c], :])
        test_data.append(x_train.iloc[c, :])
    return (train_data, y_data, test_data, y_train)

datasets = get_dataset(x_train1, y_train1, 20)
train_data = datasets[0]
y_data = datasets[1]
test_data = datasets[2]
final_y = datasets[3]

Now we have the following datasets.

- train_data: 20 sets of training data leaving each part out one at a time.
- y_data: Target column of each of the sets of training data.
- test_data: The left-out part corresponding to each of the training datasets.
- final_y: Target column of the entire training data.
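As a quick sanity check on these objects (a minimal sketch, assuming the variables created in Step 1 above), each left-out chunk and its paired training chunk should together cover the full training set:

```python
# every (training chunk, left-out chunk) pair should add up to the whole of x_train1
for tr, te in zip(train_data, test_data):
    assert len(tr) + len(te) == len(x_train1)

print(len(train_data), len(test_data[0]))  # expect 20 chunks, each left-out part roughly len(x_train1)/20 rows
```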

__Step 2: Choose the base models__

Here we are using **LinearRegression, DecisionTreeRegressor, KNeighborsRegressor** from scikit-learn and **CatBoostRegressor** from the catboost library. We can also specify hyperparameters inside the models if we want.

from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from catboost import CatBoostRegressor, Pool

models = [LinearRegression(), DecisionTreeRegressor(), KNeighborsRegressor(),
          CatBoostRegressor(logging_level='Silent')]
code = ['lin_reg', 'dtree_reg', 'Knn_reg', 'cat_reg']

__Step 3: Prediction function for all the models together__

def stack(x_train, y_train, x_test, models, code):
    # helper to flatten nested prediction lists into a single flat list
    def flatten_list(_2d_list):
        flat_list = []
        for element in _2d_list:
            if type(element) is list:
                for item in element:
                    flat_list.append(item)
            else:
                flat_list.append(element)
        return flat_list

    # fit every base model on the training chunk and predict on the given test data
    result = list()
    for i in list(range(len(models))):
        reg = models[i]
        reg.fit(x_train, y_train)
        test_pred = flatten_list(reg.predict(x_test).tolist())
        result.append(test_pred)

    # one column of predictions per base model
    result_df = pd.DataFrame()
    for i in list(range(len(code))):
        result_df[code[i]] = result[i]
    return result_df

__Step 4: Predict for each of the chunks to get the final Data Frame__

# out-of-fold predictions become the features of the new (second-layer) training data
final_df = pd.DataFrame(columns=code)
for i in range(0, len(train_data)):
    current_df = stack(train_data[i], y_data[i], test_data[i], models, code)
    final_df = pd.concat([final_df, current_df])

# predictions on the holdout set become the new testing data
final_test = stack(x_train1, y_train1, x_train2, models, code)
final_df.head()

__Step 5: Build the second-layer model__

reg2 = CatBoostRegressor(logging_level='Silent')
reg2.fit(final_df, final_y)
test_pred = reg2.predict(final_test)
mean_squared_error(test_pred, y_train2)**0.5  # RMSE of the stacked model on the held-out data
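To judge whether the ensemble is actually helping, it can be useful to compare this RMSE against each base model fitted directly on the same split. A minimal sketch, reusing the `models`, `code`, and data splits defined above:

```python
# baseline RMSE of every base model trained on x_train1 and scored on the holdout x_train2
for name, model in zip(code, models):
    model.fit(x_train1, y_train1)
    pred = model.predict(x_train2)
    print(name, mean_squared_error(pred, y_train2) ** 0.5)
```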

In the above sections, we saw how stacking and blending work to help us build an ensemble model. In this section, we will wrap everything up into a single function that returns predictions directly.

def stackblend_reg(x_train, y_train, x_test, models, code, N=20, final_layer=LinearRegression()):

    def get_dataset(x_train, y_train, N=5):
        merge = pd.concat([x_train, y_train], axis=1)
        merge = merge.sample(frac=1, random_state=1).reset_index(drop=True)
        y_train = merge.iloc[:, (merge.shape[1]-1):(merge.shape[1])]
        x_train = merge.iloc[:, 0:(merge.shape[1]-1)]
        z = int(len(x_train)/N)
        start = [0]
        stop = []
        for i in range(1, N):
            start.append(z*i)
            stop.append(z*i)
        stop.append(len(x_train))
        c = list()
        train_data = list()
        test_data = list()
        y_data = list()
        for i in range(0, N):
            c = list(range(start[i], stop[i]))
            train_data.append(x_train.iloc[[k for k in range(0, len(x_train)) if k not in c], :])
            y_data.append(y_train.iloc[[k for k in range(0, len(y_train)) if k not in c], :])
            test_data.append(x_train.iloc[c, :])
        return (train_data, y_data, test_data, y_train)

    datasets = get_dataset(x_train, y_train, N)
    train_data = datasets[0]
    y_data = datasets[1]
    test_data = datasets[2]
    final_y = datasets[3]

    def stack(x_train, y_train, x_test, models=models, code=code):
        def flatten_list(_2d_list):
            flat_list = []
            for element in _2d_list:
                if type(element) is list:
                    for item in element:
                        flat_list.append(item)
                else:
                    flat_list.append(element)
            return flat_list
        result = list()
        for i in list(range(len(models))):
            reg = models[i]
            reg.fit(x_train, y_train)
            test_pred = flatten_list(reg.predict(x_test).tolist())
            result.append(test_pred)
        result_df = pd.DataFrame()
        for i in list(range(len(code))):
            result_df[code[i]] = result[i]
        return result_df

    final_df = pd.DataFrame(columns=code)
    for i in range(0, len(train_data)):
        current_df = stack(train_data[i], y_data[i], test_data[i], models, code)
        final_df = pd.concat([final_df, current_df])
    final_test = stack(x_train, y_train, x_test, models, code)

    reg2 = final_layer
    reg2.fit(final_df, final_y)
    test_pred = reg2.predict(final_test)
    return test_pred

stack_pred = stackblend_reg(x_train1, y_train1, x_train2,
                            models=[LinearRegression(), DecisionTreeRegressor(),
                                    KNeighborsRegressor(), CatBoostRegressor(logging_level='Silent')],
                            code=['lin_reg', 'dtree_reg', 'Knn_reg', 'cat_reg'],
                            N=20,
                            final_layer=CatBoostRegressor(logging_level='Silent'))
mean_squared_error(stack_pred, y_train2)**0.5

Now that we have a function to get predictions directly, we can experiment with different types of final-layer models to see what works best. A similar function can be made for classification problems too. These functions should save a good amount of time and be easy to use when solving supervised learning problems.
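For classification, the same recipe carries over if the base learners output class probabilities and the final layer is a classifier. Below is a minimal, hypothetical sketch of how the per-chunk prediction helper could change; the `stack_clf` name and the binary-classification assumption are illustrative and not part of the code above:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def stack_clf(x_train, y_train, x_test, models, code):
    # each base classifier contributes one column of class-1 probabilities as a meta-feature
    result_df = pd.DataFrame()
    for name, clf in zip(code, models):
        clf.fit(x_train, y_train)
        result_df[name] = clf.predict_proba(x_test)[:, 1]
    return result_df

# the second layer would then be a classifier rather than a regressor, for example:
final_layer = LogisticRegression()
```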

**The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.**
