AdaBoost algorithm, introduced by Freund and Schapire in 1997, revolutionized ensemble modeling. Since its inception, AdaBoost has become a widely adopted technique for addressing binary classification challenges. This powerful algorithm enhances prediction accuracy by transforming a multitude of weak learners into robust, strong learners

The principle behind ada boosting algorithms is that we first build a model on the training dataset and then build a second model to rectify the errors present in the first model. This procedure is continued until and unless the errors are minimized and the dataset is predicted correctly. Ada Boosting algorithms work in a similar way, it combines multiple models (weak learners) to reach the final output (strong learners).

- To understand what the AdaBoost algorithm is and how it works.
- To understand what stumps are.
- To find out how boosting algorithms help increase the accuracy of ML models.

*This article was published as a part of the Data Science Blogathon*

There are many machine learning algorithms to choose from for your problem statements. One of these algorithms for predictive modeling is called AdaBoost.

AdaBoost algorithm, short for Adaptive Boosting, is **a Boosting technique used as an Ensemble Method in Machine Learning**. It is called Adaptive Boosting as the weights are re-assigned to each instance, with higher weights assigned to incorrectly classified instances.

What this algorithm does is that it builds a model and gives equal weights to all the data points. It then assigns higher weights to points that are wrongly classified. Now all the points with higher weights are given more importance in the next model. It will keep training models until and unless a lower error is received.

Let’s take an example to understand this, suppose you built a decision tree algorithm on the Titanic dataset, and from there, you get an accuracy of 80%. After this, you apply a different algorithm and check the accuracy, and it comes out to be 75% for KNN and 70% for Linear Regression.

When building different models on the same dataset, we observe variations in accuracy. However, leveraging the power of AdaBoost classifier, we can combine these algorithms to enhance the final predictions. By averaging the results from diverse models, Adaboost allows us to achieve higher accuracy and bolster predictive capabilities effectively.

If you want to understand this visually, I strongly recommend you go through this __article__.

Here we will be more focused on mathematics intuition.

There is another ensemble learning algorithm called the gradient ada boosting algorithm. In this algorithm, we try to reduce the error instead of wights, as in AdaBoost. But in this article, we will only be focussing on the mathematical intuition of Adaptive Boosting.

Let’s understand what and how this algorithm works under the hood with the following tutorial.

The Image shown below is the actual representation of our dataset. Since the target column is binary, it is a classification problem. First of all, these data points will be assigned some weights. Initially, all the weights will be equal.

The formula to calculate the sample weights is:

Where N is the total number of data points

Here since we have 5 data points, the sample weights assigned will be 1/5.

We start by seeing how well “*Gender*” classifies the samples and will see how the variables (Age, Income) classify the samples.

We’ll create a decision stump for each of the features and then calculate the ** Gini Index **of each tree. The tree with the lowest Gini Index will be our first stump.

Here in our dataset, let’s say ** Gender** has the lowest gini index, so it will be our first stump.

We’ll now calculate the **“Amount of Say” **or** “Importance” **or **“Influence” **for this classifier in classifying the data points using this formula:

The total error is nothing but the summation of all the sample weights of misclassified data points.

Here in our dataset, let’s assume there is 1 wrong output, so our total error will be 1/5, and the alpha (performance of the stump) will be:

**Note**: Total error will always be between 0 and 1.

0 Indicates perfect stump, and 1 indicates horrible stump.

From the graph above, we can see that when there is no misclassification, then we have no error (Total Error = 0), so the “amount of say (alpha)” will be a large number.

When the classifier predicts half right and half wrong, then the Total Error = 0.5, and the importance (amount of say) of the classifier will be 0.

If all the samples have been incorrectly classified, then the error will be very high (approx. to 1), and hence our alpha value will be a negative integer.

You might be wondering about the significance of calculating the Total Error (TE) and performance of an Adaboost stump. The reason is straightforward – updating the weights is crucial. If identical weights are maintained for the subsequent model, the output will mirror what was obtained in the initial model.

The wrong predictions will be given more weight, whereas the correct predictions weights will be decreased. Now when we build our next model after updating the weights, more preference will be given to the points with higher weights.

After finding the importance of the classifier and total error, we need to finally update the weights, and for this, we use the following formula:

The amount of, say (alpha) will be ** negative **when the sample is

The amount of, say (alpha) will be ** positive** when the sample is

There are four correctly classified samples and 1 wrong. Here, the ** sample weight** of that datapoint is

New weights for *correctly classified* samples are:

For *wrongly classified* samples, the updated weights will be:

See the sign of alpha when I am putting the values, the **alpha is negative** when the data point is correctly classified, and this *decreases the sample weight* from 0.2 to 0.1004. It is **positive** when there is **misclassification**, and this will *increase the sample weight* from 0.2 to 0.3988

We know that the total sum of the sample weights must be equal to 1, but here if we sum up all the new sample weights, we will get 0.8004. To bring this sum equal to 1, we will normalize these weights by dividing all the weights by the total sum of updated weights, which is 0.8004. So, after normalizing the sample weights, we get this dataset, and now the sum is equal to 1.

Now, we need to make a new dataset to see if the errors decreased or not. For this, we will remove the “sample weights” and “new sample weights” columns and then, based on the “new sample weights,” divide our data points into buckets.

We are almost done. Now, what the algorithm does is selects random numbers from 0-1. Since incorrectly classified records have higher sample weights, the probability of selecting those records is very high.

Suppose the 5 random numbers our algorithm take is 0.38,0.26,0.98,0.40,0.55.

Now we will see where these random numbers fall in the bucket, and according to it, we’ll make our new dataset shown below.

This comes out to be our new dataset, and we see the data point, which was wrongly classified, has been selected 3 times because it has a higher weight.

Now this act as our new dataset, and we need to repeat all the above steps i.e.

- Assign equal weights to all the data points.
- Find the stump that does the best job classifying the new collection of samples by finding their Gini Index and selecting the one with the lowest Gini index.
- Calculate the “Amount of Say” and “Total error” to update the previous sample weights.
- Normalize the new sample weights.

Iterate through these steps until and unless a low training error is achieved.

Suppose, with respect to our dataset, we have constructed 3 decision trees (DT1, DT2, DT3) in a ** sequential manner.** If we send our

for our test dataset.

You have finally mastered this algorithm if you understand each and every line of this article.

We started by introducing you to what Boosting is and what are its various types to make sure that you understand the Adaboost classifier and where AdaBoost algorithm falls exactly. We then applied straightforward math and saw how every part of the formula worked.

In the next article, I will explain Gradient Descent and Xtreme Gradient Descent algorithm, which are a few more important Boosting techniques to enhance the prediction power.

If you want to know about the python implementation for beginners of the AdaBoost classifier machine learning model from scratch, then visit this complete guide from analytics vidhya. This article mentions the difference between bagging and ada boosting, as well as the advantages and disadvantages of the AdaBoost algorithm.

- In this article, we understood how ada boosting works.
- We understood the maths behind adaboost.
- We learned how weak learners are used as estimators to increase accuracy.

A. Adaboost falls under the supervised learning branch of machine learning. This means that the training data must have a target variable. Using the adaboost learning technique, we can solve both classification and regression problems.

A. Lesser preprocessing is required, as you do not need to scale the independent variables. Each iteration in the AdaBoost algorithm uses decision stumps as individual models, so the preprocessing required is the same as decision trees. AdaBoost is less prone to overfitting as well. In addition to ada boost weak learners, we can also fine-tune hyperparameters(learning_rate, for example) in these ensemble techniques to get even better accuracy.

A. Much like random forests, decision trees, logistic regression, and svm classifiers, AdaBoost also requires the training data to have a target variable. This target variable could be either categorical or continuous. The scikit-learn library contains the Adaboost classifiers and regressors; hence we can use sklearn in python to create an adaboost model.

*The media shown in this article on Interactive Dashboard using Bokeh are not owned by Analytics Vidhya and are used at the Author’s discretion.*

Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Become a full stack data scientist
##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

##

Understanding Cost Function
Understanding Gradient Descent
Math Behind Gradient Descent
Assumptions of Linear Regression
Implement Linear Regression from Scratch
Train Linear Regression in Python
Implementing Linear Regression in R
Diagnosing Residual Plots in Linear Regression Models
Generalized Linear Models
Introduction to Logistic Regression
Odds Ratio
Implementing Logistic Regression from Scratch
Introduction to Scikit-learn in Python
Train Logistic Regression in python
Multiclass using Logistic Regression
How to use Multinomial and Ordinal Logistic Regression in R ?
Challenges with Linear Regression
Introduction to Regularisation
Implementing Regularisation
Ridge Regression
Lasso Regression

Introduction to Stacking
Implementing Stacking
Variants of Stacking
Implementing Variants of Stacking
Introduction to Blending
Bootstrap Sampling
Introduction to Random Sampling
Hyper-parameters of Random Forest
Implementing Random Forest
Out-of-Bag (OOB) Score in the Random Forest
IPL Team Win Prediction Project Using Machine Learning
Introduction to Boosting
Gradient Boosting Algorithm
Math behind GBM
Implementing GBM in python
Regularized Greedy Forests
Extreme Gradient Boosting
Implementing XGBM in python
Tuning Hyperparameters of XGBoost in Python
Implement XGBM in R/H2O
Adaptive Boosting
Implementing Adaptive Boosing
LightGBM
Implementing LightGBM in Python
Catboost
Implementing Catboost in Python

Introduction to Clustering
Applications of Clustering
Evaluation Metrics for Clustering
Understanding K-Means
Implementation of K-Means in Python
Implementation of K-Means in R
Choosing Right Value for K
Profiling Market Segments using K-Means Clustering
Hierarchical Clustering
Implementation of Hierarchial Clustering
DBSCAN
Defining Similarity between clusters
Build Better and Accurate Clusters with Gaussian Mixture Models

Introduction to Machine Learning Interpretability
Framework and Interpretable Models
model Agnostic Methods for Interpretability
Implementing Interpretable Model
Understanding SHAP
Out-of-Core ML
Introduction to Interpretable Machine Learning Models
Model Agnostic Methods for Interpretability
Game Theory & Shapley Values

Deploying Machine Learning Model using Streamlit
Deploying ML Models in Docker
Deploy Using Streamlit
Deploy on Heroku
Deploy Using Netlify
Introduction to Amazon Sagemaker
Setting up Amazon SageMaker
Using SageMaker Endpoint to Generate Inference
Deploy on Microsoft Azure Cloud
Introduction to Flask for Model
Deploying ML model using Flask

really it was prety much good explanation with appropriate example

Appreciated efforts. Its very easy to understand

Great n helpful articles. Easy language n understandable.

why we randomly select number in step 6 and not directly the incorrect classified?

What will happen if the total error is 0? How do we calculate performance of stump?

Nice article. Thank you

we're selecting new dataset by bootstrapping technique.

Is this section in this correct in this article: “ The amount of say (alpha) will be negative when the sample is correctly classified. The amount of say (alpha) will be positive when the sample is miss-classified.” Should not be it reversed ?

Thanks Anshul Saini that is a fantastice explanation

When the total error is 0, the model classify all data successfully. So I think the algorithm can be stopped.

I found this blog post very helpful in understanding the AdaBoost algorithm. I found the implementation to be straightforward and I am now able to master the algorithm.

I found this blog post very helpful in understanding the AdaBoost algorithm. I found the implementation to be straightforward and I am now able to master the algorithm.

because we should also take correctly classified instances into consideration along with misclassified instances which is not so convenient if we directly take the misclassified instances only hope it helps........