# How the Gradient Boosting Algorithm works?

*This article was published as a part of the Data Science Blogathon.*

## Introduction

Gradient boosting algorithm is one of the most powerful algorithms in the field of machine learning. As we know that the errors in machine learning algorithms are broadly classified into two categories i.e. Bias Error and Variance Error. As gradient boosting is one of the boosting algorithms it is used to minimize bias error of the model.

Unlike, Adaboosting algorithm, the base estimator in the gradient boosting algorithm cannot be mentioned by us. The base estimator for the Gradient Boost algorithm is fixed and i.e. *Decision Stump*. Like, AdaBoost, we can tune the n_estimator of the gradient boosting algorithm. However, if we do not mention the value of n_estimator, the default value of n_estimator for this algorithm is 100.

Gradient boosting algorithm can be used for predicting not only continuous target variable (as a Regressor) but also categorical target variable (as a Classifier). When it is used as a regressor, the cost function is Mean Square Error (MSE) and when it is used as a classifier then the cost function is Log loss.

**This article is going to cover the following topics related to Gradient Boosting Algorithm:**

1) Manual Example for understanding the algorithm

2) Python Code for the same example with different estimators.

3) Finding the best estimators using GridSearchCV.

4) Applications

5) Conclusion.

**1) **__Manual Example for understanding the algorithm:__

__Manual Example for understanding the algorithm:__

Let us now understand the working of the Gradient Boosting Algorithm with the help of one example. In the following example, Age is the Target variable whereas LikesExercising, GotoGym, DrivesCar are independent variables. As in this example, the target variable is continuous, GradientBoostingRegressor is used here.

*1*^{st}-Estimator

^{st}-Estimator

For estimator-1, the root level (level 0) will consist of all the records. The predicted age at this level is equal to the mean of the entire Age column i.e. 41.33(addition of all values in Age column divided by a number of records i.e. 9). Let us find out what is the MSE for this level. MSE is calculated as the mean of the square of errors. Here error is equal to actual age-predicted age. The predicted age for the particular node is always equal to the mean of age records of that node. So, the MSE of the root node of the 1^{st} estimator is calculated as given below.

** MSE=(∑(Age _{i }–mu)^{2} )/9=577.11**

The cost function hers is MSE and the objective of the algorithm here is to minimize the MSE.

Now, one of the independent variables will be used by the Gradient boosting to create the Decision Stump. Let us suppose that the LikesExercising

is used here for prediction. So, the records with False LikesExercising will go in one child node, and records with True LikesExercising will go in another child node as shown below.

Let us find the means and MSEs of both the nodes of level 1. For the left node, the mean is equal to 20.25 and MSE is equal to 83.19. Whereas, for the right node, the mean is equal to 58.2 and MSE is equal to 332.16. Total MSE for level 1 will be equal to the addition of all nodes of level 1 i.e. 83.19+332.16=415.35. We can see here, the cost function i.e. MSE of level 1 is better than level 0.

*2*^{nd}-Estimator:

^{nd}-Estimator:

Let us now find out the estimator-2. Unlike AdaBoost, in the Gradient boosting algorithm, residues (age_{i} – mu)of the first estimator are taken as root nodes as shown below. Let us suppose for this estimator another dependent variable is used for prediction. So, the records with False GotoGym

will go in one child node, and records with True GotoGym will go in another child node as shown below.

The prediction of age here is slightly tricky. First, the age will be predicted from estimator 1 as per the value of LikeExercising, and then the mean from the estimator is found out with the help of the value of GotoGym and then that means is added to age-predicted from the first estimator and that is the final prediction of Gradient boosting with two estimators.

Let us consider if we want to predict the age for the following records:

Here, LikesExercising is equal to False. So, the predicted age from the first estimator will be 20.25 (i.e. mean of the left node of the first estimator). Now we need to check what is the value of GotoGym for the second predictor and its value is True. So, the mean of True GotoGym in the 2^{nd} estimator is -3.56. This will be added to the prediction of the first estimator i.e. 20.25. So final prediction of this model will be 20.25+(-3.56) = 16.69.

Let us predict of ages of all records we have in the example.

Let us now find out the Final MSE for above all 9 records.

MSE= ((-2.69)^{2} +(-1.69)^{2 }+ (-0.69)^{2} +(-28.64)^{2} +(19.31)^{2} +(-15.33)^{2} + (14.36)^{2} +(6.67)^{2} +(7.67)^{2} )/9

= 194.2478

So, we can see that the final MSE is much better than the MSE of the root node of the 1^{st} Estimator. This is only for 2 estimators. There can be n number of estimators in gradient boosting algorithm.

*2) *__Python Code for the same:__

*2)*

__Python Code for the same:__**Let us now write a python code for the same. **

__Observation:__

__Observation:__As we can see here, MSE reduces as we increase the estimator value. The situation comes where MSE becomes saturated which means even if we increase the estimator value there will be no significant decrease in MSE.

*3) *__Finding the best estimator with GridSearchCV:__

*3)*

__Finding the best estimator with GridSearchCV:__Let us now see, how GridSearchCV can be used to find the best estimator for the above example.

from sklearn.model_selection import GridSearchCV model=GradientBoostingRegressor() params={'n_estimators':range(1,200)} grid=GridSearchCV(estimator=model,cv=2,param_grid=params,scoring='neg_mean_squared_error') grid.fit(X,Y) print("The best estimator returned by GridSearch CV is:",grid.best_estimator_) #Output #The best estimator returned by GridSearch CV is: GradientBoostingRegressor(n_estimators=19) GB=grid.best_estimator_ GB.fit(X,Y) Y_predict=GB.predict(X) Y_predict #output: #Y_predict=[27.20639114, 18.98970027, 18.98970027, 46.66697477, 27.20639114,58.34332496, 46.66697477, 58.34332496, 69.58721772] MSE_best=(sum((Y-Y_predict)**2))/len(Y) print('MSE for best estimators :',MSE_best) #Following code is used to find out MSE of prediction for Gradient boosting algorithm with best estimator value given by GridSearchCV #Output: MSE for best estimators : 164.2298548605391

__Observation:__

__Observation:__You may think that MSE for n_estimator=50 is better than MSE for n_estimator=19. Still GridSearchCV returns 19 not 50. Actually, we can observe here is until 19 with each increment in estimator value the reduction in MSE was significant, but after 19 there is no significant decrease in MSE with increment in estimators. So, n_estimator=19 was returned by GridSearchSV.

**4)**__Applications:__

__Applications:__

i) Gradient Boosting Algorithm is generally used when we want to decrease the Bias error.

ii) Gradient Boosting Algorithm can be used in regression as well as classification problems. In regression problems, the cost function is MSE whereas, in classification problems, the cost function is Log-Loss.

**5) **__Conclusion:__

__Conclusion:__

In this article, I have tried to explain how Gradient Boosting Actually works with the help of a simple example.

Gradient Boosting Algorithm is generally used when we want to decrease the Bias error.

Here, the example of GradientBoostingRegressor is shown. GradientBoostingClassfier is also there which is used for Classification problems. Here, in Regressor MSE is used as cost function there in classification Log-Loss is used as cost function.

The most important thing in this algorithm is to find the best value of n_estimators. In this article, we have seen how GridSearchCV can be used to achieve that.

After reading this article, you should first try to use this algorithm for practice problems and after that, you must learn how to tune the hyper-parameters for this algorithm.

*The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion. *