Gradient boosting is one of the most powerful algorithms in the field of machine learning. As we know, the errors in machine learning models are broadly classified into two categories, i.e. bias error and variance error. Since gradient boosting is a boosting algorithm, it is mainly used to minimize the bias error of the model.

Unlike the AdaBoost algorithm, the base estimator in the gradient boosting algorithm cannot be chosen by us; it is fixed, i.e. a *decision stump* (a shallow decision tree). Like AdaBoost, we can tune the n_estimators hyper-parameter of the gradient boosting algorithm. However, if we do not specify a value for n_estimators, its default value for this algorithm is 100.

The gradient boosting algorithm can be used to predict not only a continuous target variable (as a regressor) but also a categorical target variable (as a classifier). When it is used as a regressor, the cost function is mean squared error (MSE), and when it is used as a classifier, the cost function is log loss.
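For instance, in scikit-learn the two variants look as follows (a minimal sketch, not tied to the example that follows):

```python
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Same boosting machinery, different cost functions:
reg = GradientBoostingRegressor()   # minimizes squared error (MSE) by default
clf = GradientBoostingClassifier()  # minimizes log loss by default

# n_estimators defaults to 100 for both if we do not set it
print(reg.n_estimators, clf.n_estimators)  # 100 100
```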


**This article was published as a part of the Data Science Blogathon.**

Let us now understand the working of the gradient boosting algorithm with the help of an example. In the following example, Age is the target variable, whereas LikesExercising, GotoGym, and DrivesCar are independent variables. Since the target variable in this example is continuous, GradientBoostingRegressor is used here.

For estimator-1, the root level (level 0) consists of all the records. The predicted age at this level is the mean of the entire Age column, i.e. 41.33 (the sum of all values in the Age column divided by the number of records, i.e. 9). Let us find the MSE for this level. MSE is calculated as the mean of the squared errors, where the error is the actual age minus the predicted age. The predicted age for a particular node is always the mean of the Age records in that node. So, the MSE of the root node of the 1st estimator is calculated as given below.

**MSE = (∑(Age_i − μ)²)/9 = 577.11**

The cost function here is MSE, and the objective of the algorithm is to minimize it.
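The Age values themselves are not reproduced above, but a set consistent with every figure quoted in this example is [14, 15, 16, 26, 36, 50, 69, 72, 74]; treating that reconstruction as an assumption, the root-node computation can be checked as:

```python
import numpy as np

# Reconstructed Age column (assumption: consistent with the numbers quoted in the text)
age = np.array([14, 15, 16, 26, 36, 50, 69, 72, 74])

mu = age.mean()                      # prediction of the root node
mse_root = ((age - mu) ** 2).mean()  # mean squared error at level 0

print(round(mu, 2))        # 41.33
print(round(mse_root, 2))  # 577.11
```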

Now, one of the independent variables will be used by gradient boosting to create the decision stump. Let us suppose that LikesExercising is used here for prediction. So, the records with LikesExercising = False will go into one child node, and the records with LikesExercising = True will go into the other child node, as shown below.

Let us find the means and MSEs of both nodes at level 1. For the left node, the mean is 20.25 and the MSE is 83.19, whereas for the right node the mean is 58.2 and the MSE is 332.16. The total MSE for level 1 is the sum over all nodes of level 1, i.e. 83.19 + 332.16 = 415.35. We can see here that the cost function, i.e. the MSE, at level 1 is better than at level 0.
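Under the same assumed Age values as before (with LikesExercising = False for ages 14, 15, 16, and 36, a reconstruction rather than something stated explicitly in the text), these node statistics can be checked as:

```python
import numpy as np

# Reconstructed level-1 split (assumption): LikesExercising=False -> left node
left = np.array([14, 15, 16, 36])       # LikesExercising = False
right = np.array([26, 50, 69, 72, 74])  # LikesExercising = True

mse_left = ((left - left.mean()) ** 2).mean()
mse_right = ((right - right.mean()) ** 2).mean()

print(round(left.mean(), 2), round(mse_left, 2))    # 20.25 83.19
print(round(right.mean(), 2), round(mse_right, 2))  # 58.2 332.16
print(round(mse_left + mse_right, 2))               # 415.35
```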

Let us now find estimator-2. Unlike AdaBoost, in the gradient boosting algorithm the residuals (Age_i − μ) of the first estimator are taken as the root node, as shown below. Let us suppose that another independent variable is used for prediction in this estimator. So, the records with GotoGym = False will go into one child node, and the records with GotoGym = True will go into the other child node, as shown below.

The prediction of age here is slightly tricky. First, the age is predicted by estimator-1 as per the value of LikesExercising. Then, the residual mean of the matching node in estimator-2 is found with the help of the value of GotoGym, and that mean is added to the age predicted by the first estimator. This sum is the final prediction of gradient boosting with two estimators.

Let us consider the case where we want to predict the age for the following record:

Here, LikesExercising is equal to False, so the predicted age from the first estimator is 20.25 (i.e. the mean of the left node of the first estimator). Now we need to check the value of GotoGym for the second estimator, and its value is True. The mean of the GotoGym = True node in the 2nd estimator is -3.56. This is added to the prediction of the first estimator, i.e. 20.25. So the final prediction of this model is 20.25 + (-3.56) = 16.69.
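The same two-estimator prediction can be traced in code. The GotoGym assignments below are a reconstruction consistent with the quoted means (they are not listed explicitly in the text), so treat them as an assumption:

```python
import numpy as np

# Reconstructed records (assumption): (Age, LikesExercising, GotoGym),
# chosen to be consistent with the means quoted in the text.
records = [(14, False, True), (15, False, True), (16, False, True),
           (26, True, True), (36, False, True), (50, True, False),
           (69, True, True), (72, True, False), (74, True, False)]

# Estimator-1: stump on LikesExercising -> per-node means
mean_false = np.mean([a for a, le, _ in records if not le])  # 20.25
mean_true = np.mean([a for a, le, _ in records if le])       # 58.2

# Residuals of estimator-1, then estimator-2: stump on GotoGym
resid = [(a - (mean_true if le else mean_false), g) for a, le, g in records]
mean_gym_true = np.mean([r for r, g in resid if g])          # about -3.57

# Final prediction for LikesExercising=False, GotoGym=True
print(round(mean_false + mean_gym_true, 2))  # 16.68
```

At full precision this gives 16.68; the article's 16.69 comes from rounding the node mean to -3.56 before adding.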

Let us now predict the ages of all the records in the example:

Let us now find the final MSE for all 9 records above.

MSE = ((-2.69)² + (-1.69)² + (-0.69)² + (-28.64)² + (19.31)² + (-15.33)² + (14.36)² + (6.67)² + (7.67)²)/9 = 194.2478

So, we can see that the final MSE is much better than the MSE of the root node of the 1st estimator, and this is with only 2 estimators. There can be n estimators in the gradient boosting algorithm.
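That arithmetic can be verified directly from the listed errors:

```python
import numpy as np

# Final errors (actual age - predicted age) for the 9 records, as listed above
errors = np.array([-2.69, -1.69, -0.69, -28.64, 19.31,
                   -15.33, 14.36, 6.67, 7.67])

mse_final = (errors ** 2).mean()
print(round(mse_final, 4))  # 194.2478
```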

Let us now write Python code for the same.
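The original code is not reproduced here, so below is a minimal sketch of what it might look like. The feature matrix and target are a reconstruction consistent with the example's numbers (DrivesCar is omitted because its values cannot be recovered from the text), so treat the exact MSE values as illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Reconstructed example data (assumption): columns are LikesExercising, GotoGym
X = np.array([[0, 1], [0, 1], [0, 1],
              [1, 1], [0, 1], [1, 0],
              [1, 1], [1, 0], [1, 0]])
Y = np.array([14, 15, 16, 26, 36, 50, 69, 72, 74])

# Training MSE for an increasing number of estimators
for n in [2, 10, 50, 100]:
    model = GradientBoostingRegressor(n_estimators=n).fit(X, Y)
    mse = ((Y - model.predict(X)) ** 2).mean()
    print(f"n_estimators={n}, MSE={mse:.2f}")
```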

As we can see here, the MSE reduces as we increase the number of estimators. Eventually the MSE saturates, which means that even if we increase n_estimators further, there is no significant decrease in the MSE.

Let us now see how GridSearchCV can be used to find the best n_estimators value for the above example.

```
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# X and Y hold the example's independent variables and the Age column, as defined earlier
model = GradientBoostingRegressor()
params = {'n_estimators': range(1, 200)}
grid = GridSearchCV(estimator=model, cv=2, param_grid=params,
                    scoring='neg_mean_squared_error')
grid.fit(X, Y)
print("The best estimator returned by GridSearchCV is:", grid.best_estimator_)
# Output:
# The best estimator returned by GridSearchCV is: GradientBoostingRegressor(n_estimators=19)

# Find the MSE of the prediction for the best n_estimators value given by GridSearchCV
GB = grid.best_estimator_
GB.fit(X, Y)
Y_predict = GB.predict(X)
# Y_predict = [27.20639114, 18.98970027, 18.98970027, 46.66697477, 27.20639114,
#              58.34332496, 46.66697477, 58.34332496, 69.58721772]
MSE_best = (sum((Y - Y_predict) ** 2)) / len(Y)
print('MSE for best n_estimators:', MSE_best)
# Output: MSE for best n_estimators: 164.2298548605391
```

You may think that the MSE for n_estimators=50 is better than the MSE for n_estimators=19, yet GridSearchCV returns 19, not 50. What we can observe is that up to 19, each increment in n_estimators brought a significant reduction in MSE, but after 19 there is no significant decrease in the cross-validated MSE with further increments. So, n_estimators=19 was returned by GridSearchCV.

- The gradient boosting algorithm is generally used when we want to decrease the bias error.
- It can be used for regression as well as classification problems. In regression problems, the cost function is MSE, whereas in classification problems, the cost function is log loss.

In this article, I have tried to explain how gradient boosting actually works with the help of a simple example.


Here, an example of GradientBoostingRegressor is shown; GradientBoostingClassifier is also available for classification problems. In the regressor, MSE is used as the cost function, whereas in classification, log loss is used.
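As a minimal sketch on an assumed toy dataset (not the Age example above), the classifier is used the same way and reports class probabilities learned under log loss:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy binary-classification data (assumed purely for illustration)
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [1, 0], [0, 1]])
y = np.array([0, 1, 1, 0, 1, 0])

clf = GradientBoostingClassifier(n_estimators=10).fit(X, y)
print(clf.predict([[1, 0]]))        # predicted class label
print(clf.predict_proba([[1, 0]]))  # class probabilities from log-loss training
```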

The most important thing in this algorithm is to find the best value of n_estimators. In this article, we have seen how GridSearchCV can be used to achieve that.

After reading this article, you should first try to use this algorithm on practice problems, and after that you should learn how to tune its hyper-parameters.

**Q1. What is the gradient boosting algorithm?**

A. Gradient boosting, an ensemble machine learning technique, constructs a robust predictive model by sequentially combining multiple weak models. It minimizes errors using a gradient-descent-like approach during training.

**Q2. How does gradient boosting work in regression?**

A. In regression, gradient boosting fits a series of weak learners, typically decision trees, to the residuals, gradually reducing prediction errors through model updates.

**Q3. What is light gradient boosting?**

A. Light Gradient Boosting, or LightGBM, employs a tree-based learning algorithm, utilizing a histogram-based method for faster training and reduced memory usage.

**Q4. How does gradient boosting reduce bias?**

A. Gradient boosting reduces bias by iteratively adding weak models, focusing on the remaining errors and emphasizing previously mispredicted instances. This sequential process refines the predictions, ultimately minimizing bias in the final model.

*The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.*
