Tree based algorithms are considered to be one of the best and mostly used supervised learning methods. Tree based algorithms empower predictive models with high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They are adaptable at solving any kind of problem at hand (classification or regression).
Methods like tree model machine learning, random forest, gradient boosting are being popularly used in all kinds of data science problems. Hence, for every analyst (fresher also), it’s important to learn these algorithms and use them for modeling.
This tutorial aims to help beginners learn tree based algorithms from scratch. Upon successfully completing this tutorial, individuals are expected to become proficient at using tree based algorithms and building predictive models.
Decision tree is a type of supervised learning algorithm (having a predefined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on most significant splitter / differentiator in input variables.
Tree based modeling in R and Python
Example:-
Let’s say we have a sample of 30 students with three variables Gender (Boy/ Girl), Class( IX/ X) and Height (5 to 6 ft). 15 out of these 30 play cricket in leisure time. Now, I want to create a model to predict who will play cricket during leisure period? In this problem, we need to segregate students who play cricket in their leisure time based on highly significant input variable among all three.
This is where tree model machine learning helps, it will segregate the students based on all values of three variable and identify the variable, which creates the best homogeneous sets of students (which are heterogeneous to each other). In the snapshot below, you can see that variable Gender is able to identify best homogeneous sets compared to the other two variables.
As mentioned above, decision tree identifies the most significant variable and it’s value that gives best homogeneous sets of population. Now the question which arises is, how does it identify the variable and the split? To do this, decision tree uses various algorithms, which we will discuss in the following section.
Types of tree model machine learning is based on the type of target variable we have. It can be of two types:
Example:- Let’s say we have a problem to predict whether a customer will pay his renewal premium with an insurance company (yes/ no). Here we know that income of customer is a significant variable but insurance company does not have income details for all customers. Now, as we know this is an important variable, then we can build a decision tree to predict customer income based on occupation, product and various other variables. In this case, we are predicting values for continuous variable.
Let’s look at the basic terminology used with Decision tress in python:
These are the terms commonly used for decision trees in python. As we know that every algorithm has advantages and disadvantages, below are the important factors which one should know.
We all know that the terminal nodes (or leaves) lies at the bottom of the decision tree based classifiers. This means that decision trees typically draw upside down, with leaves at the bottom and roots at the top (as shown below).
The decision of making strategic splits heavily affects a tree’s accuracy. The decision criteria is different for classification and regression trees.
Decision trees use multiple algorithms to decide to split a node in two or more sub-nodes. The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other words, we can say that purity of the node increases with respect to the target variable. Decision tree splits the nodes on all available variables and then selects the split which results in most homogeneous sub-nodes.
The algorithm selection is also based on type of target variables. Let’s look at the four most commonly used algorithms in decision tree:
Gini says, if we select two items from a population at random then they must be of same class and probability for this is 1 if population is pure.
Steps to Calculate Gini for a split
Example: – Referring to example used above, where we want to segregate the students based on target variable ( playing cricket or not ). In the snapshot below, we split the population using two input variables Gender and Class. Now, I want to identify which split is producing more homogeneous sub-nodes using Gini .
Similar for Split on Class:
Above, you can see that Gini score for Split on Gender is higher than Split on Class, hence, the node split will take place on Gender.
You might often come across the term ‘Gini Impurity’ which is determined by subtracting the gini value from 1. So mathematically we can say,
Gini Impurity = 1-Gini
It is an algorithm to find out the statistical significance between the differences between sub-nodes and parent node. We measure it by sum of squares of standardized differences between observed and expected frequencies of target variable.
Example: Let’s work with above example that we have used to calculate Gini.
Split on Gender:
Perform similar steps of calculation for split on Class and you will come up with below table.
Above, you can see that Chi-square also identify the Gender split is more significant compare to Class.
Look at the image below and think which node can be described easily. I am sure, your answer is C because it requires less information as all values are similar. On the other hand, B requires more information to describe it and A requires the maximum information. In other words, we can say that C is a Pure node, B is less Impure and A is more impure.
Now, we can build a conclusion that less impure node requires less information to describe it. And, more impure node requires more information. Information theory is a measure to define this degree of disorganization in a system known as Entropy. If the sample is completely homogeneous, then the entropy is zero and if the sample is an equally divided (50% – 50%), it has entropy of one.
Entropy can be calculated using formula:-
Here p and q is probability of success and failure respectively in that node. Entropy is also used with categorical target variable. It chooses the split which has lowest entropy compared to parent node and other splits. The lesser the entropy, the better it is.
Example: Let’s use this method to identify best split for student example.
Above, you can see that entropy for Split on Gender is the lowest among all, so the tree will split on Gender. We can derive information gain from entropy as 1- Entropy.
Till now, we have discussed the algorithms for categorical target variable. Reduction in variance is an algorithm used for continuous target variables (regression problems). This algorithm uses the standard formula of variance to choose the best split. The split with lower variance is selected as the criteria to split the population:
Above X-bar is mean of the values, X is actual and n is number of values.
Example:- Let’s assign numerical value 1 for play cricket and 0 for not playing cricket. Now follow the steps to identify the right split:
Above, you can see that Gender split has lower variance compare to parent node, so the split would take place on Gender variable.
Until here, we learnt about the basics of decision trees and the decision making process involved to choose the best splits in building a tree model. As I said, decision tree can be applied both on regression and classification problems. Let’s understand these aspects in detail.
Overfitting is one of the key challenges faced while using tree based algorithms. If there is no limit set of a decision tree, it will give you 100% accuracy on training set because in the worse case it will end up making 1 leaf for each observation. Thus, preventing overfitting is pivotal while modeling a decision tree and it can be done in 2 ways:
Let’s discuss both of these briefly.
This can be done by using various parameters which are used to define a tree. First, lets look at the general structure of a decision tree:
The parameters used for defining a tree are further explained below. The parameters described below are irrespective of tool. It is important to understand the role of parameters used in tree modeling. These parameters are available in R & Python.
As discussed earlier, the technique of setting constraint is a greedy-approach. In other words, it will check for the best split instantaneously and move forward until one of the specified stopping condition is reached. Let’s consider the following case when you’re driving:
Let’s analyze these choice. In the former choice, you’ll immediately overtake the car ahead and reach behind the truck and start moving at 30 km/h, looking for an opportunity to move back right. All cars originally behind you move ahead in the meanwhile. This would be the optimum choice if your objective is to maximize the distance covered in next say 10 seconds. In the later choice, you sale through at same speed, cross trucks and then overtake maybe depending on situation ahead. Greedy you!
This is exactly the difference between normal decision tree & pruning. A decision tree with constraints won’t see the truck ahead and adopt a greedy approach by taking a left. On the other hand if we use pruning, we in effect look at a few steps ahead and make a choice.
Note that sklearn’s decision tree classifier does not currently support pruning. Advanced packages like xgboost have adopted tree pruning in their implementation. But the library rpart in R, provides a function to prune. Good for R users!
“If I can use logistic regression for classification problems and linear regression for regression problems, why is there a need to use trees”? Many of us have this question. And, this is a valid one too.
Actually, you can use any algorithm. It is dependent on the type of problem you are solving. Let’s look at some key factors which will help you to decide which algorithm to use:
For R users and Python users, decision tree based algorithm is quite easy to implement. Let’s quickly look at the set of codes that can get you started with this algorithm. For ease of use, I’ve shared standard codes where you’ll need to replace your data set name and variables to get started.
In fact, you can build the decision tree in Python right here! Here’s a live coding window for you to play around the code and generate results:
For R users, there are multiple packages available to implement decision tree such as ctree, rpart, tree etc.
> library(rpart)
> x <- cbind(x_train,y_train)
# grow tree
> fit <- rpart(y_train ~ ., data = x,method="class")
> summary(fit)
#Predict Output
> predicted= predict(fit,x_test)
In the code above:
For Python users, below is the code:
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini') # for classification, here you can change the algorithm as gini or entropy (information gain) by default it is gini
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)
The literary meaning of word ‘ensemble’ is group. Ensemble methods involve group of predictive models to achieve a better accuracy and model stability. Ensemble methods are known to impart supreme boost to tree based models.
Like every other model, a tree based algorithm also suffers from the plague of bias and variance. Bias means, ‘how much on an average are the predicted values different from the actual value.’ Variance means, ‘how different will the predictions of the model be at the same point if different samples are taken from the same population’.
You build a small tree and you will get a model with low variance and high bias. How do you manage to balance the trade off between bias and variance ?
Normally, as you increase the complexity of your model, you will see a reduction in prediction error due to lower bias in the model. As you continue to make your model more complex, you end up over-fitting your model and your model will start suffering from high variance.
A champion model should maintain a balance between these two types of errors. This is known as the trade-off management of bias-variance errors. Ensemble learning is one way to execute this trade off analysis.
Some of the commonly used ensemble methods include: Bagging, Boosting and Stacking. In this tutorial, we’ll focus on Bagging and Boosting in detail.
Bagging is an ensemble technique used to reduce the variance of our predictions by combining the result of multiple classifiers modeled on different sub-samples of the same data set. The following figure will make it clearer:
The steps followed in bagging are:
Note that, here the number of models built is not a hyper-parameters. Higher number of models are always better or may give similar performance than lower numbers. Theoretical analysis demonstrates that the variance of the combined predictions is reduced to 1/n (where n is the number of classifiers) of the original variance, under certain assumptions.
There are various implementations of bagging models. Random forest is one of them and we’ll discuss it next.
Random Forest is considered to be a panacea of all data science problems. On a funny note, when you can’t think of any algorithm (irrespective of situation), use random forest!
Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It also undertakes dimensional reduction methods, treats missing values, outlier values and other essential steps of data exploration, and does a fairly good job. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.
In Random Forest, we grow multiple trees as opposed to a single tree in CART model (see comparison between CART and Random Forest here, part1 and part2). To classify a new object based on attributes, each tree gives a classification and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest) and in case of regression, it takes the average of outputs by different trees.
It works in the following manner. Each tree is planted & grown as follows:
To understand more in detail about this algorithm using a case study, please read this article “Introduction to Random forest – Simplified“.
Random forests have commonly known implementations in R packages and Python scikit-learn. Let’s look at the code of loading random forest model in R and Python below:
R Code
> library(randomForest)
> x <- cbind(x_train,y_train)
# Fitting model
> fit <- randomForest(Species ~ ., x,ntree=500)
> summary(fit)
#Predict Output
> predicted= predict(fit,x_test)
Definition: The term ‘Boosting’ refers to a family of algorithms which converts weak learner to strong learners.
Let’s understand this definition in detail by solving a problem of spam email identification:
How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify ‘spam’ and ‘not spam’ emails using following criteria. If:
Above, we’ve defined multiple rules to classify an email into ‘spam’ or ‘not spam’. But, do you think these rules individually are strong enough to successfully classify an email? No.
Individually, these rules are not powerful enough to classify an email into ‘spam’ or ‘not spam’. Therefore, these rules called weak learner.
To convert weak learner to strong learner, we’ll combine the prediction of each weak learner using methods like:
For example: Above, we have defined 5 weak learners. Out of these 5, 3 are voted as ‘SPAM’ and 2 are voted as ‘Not a SPAM’. In this case, by default, we’ll consider an email as SPAM because we have higher(3) vote for ‘SPAM’.
Now we know that, boosting combines weak learner a.k.a. base learner to form a strong rule. An immediate question which should pop in your mind is, ‘How boosting identify weak rules?‘
To find weak rule, we apply base learning (ML) algorithms with a different distribution. Each time base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule. This is how the ensemble model is built.
Here’s another question which might haunt you, ‘How do we choose different distribution for each round?’
For choosing the right distribution, here are the following steps:
Step 1: The base learner takes all the distributions and assign equal weight or attention to each observation.
Step 2: If there is any prediction error causes by first base learning algorithm, then we pay higher attention to observations having prediction error. Then, we apply the next base learning algorithm.
Step 3: Iterate Step 2 till the limit of base learning algorithm is reached or higher accuracy is achieved.
Finally, it combines the outputs from weak learner and creates a strong learner which eventually improves the prediction power of the model. Boosting pays higher focus on examples which are mis-classified or have higher errors by preceding weak rules.
There are many boosting algorithms which impart additional boost to model’s accuracy. In this tutorial, we’ll learn about the two most commonly used algorithms i.e. Gradient Boosting (GBM) and XGboost.
I’ve always admired the boosting capabilities that xgboost algorithm. At times, I’ve found that it provides better result compared to GBM implementation, but at times you might find that the gains are just marginal. When I explored more about its performance and science behind its high accuracy, I discovered many advantages of Xgboost over GBM:
Standard GBM implementation has no regularization like XGBoost, therefore it also helps to reduce overfitting.
In fact, XGBoost is also known as ‘regularized boosting‘ technique.
XGBoost implements parallel processing and is blazingly faster as compared to GBM.
But hang on, we know that boosting is sequential process so how can it be parallelized? We know that each tree can only be built after the previous one, so what prevents us from creating a tree-based algorithm using all cores? I hope you get where I’m coming from. Check this link out to explore further.
XGBoost also supports implementation on Hadoop.
XGBoost allow users to define custom optimization objectives and evaluation criteria.
This adds a whole new dimension to the model and there is no limit to what we can do.
XGBoost has an in-built routine to handle missing values.
The user must supply a different value than the other observations and pass that as a parameter. XGBoost tries different things as it encounters a missing value on each node and learns which path to take for missing values in future.
A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.
XGBoost on the other hand make splits upto the max_depth specified and then start pruning the tree backwards and remove splits beyond which there is no positive gain.
Another advantage is that sometimes a split of negative loss say -2 may be followed by a split of positive loss +10. GBM would stop as it encounters -2. But XGBoost will go deeper and it will see a combined effect of +8 of the split and keep both.
XGBoost allows user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.
This is unlike GBM where we have to run a grid-search and only a limited values can be tested.
User can start training an XGBoost model from its last iteration of previous run. This can be of significant advantage in certain specific applications.
GBM implementation of sklearn also has this feature so they are even on this point.
Before we start working, let’s quickly understand the important parameters and the working of this algorithm. This will be helpful for both R and Python users. Below is the overall pseudo-code of GBM algorithm for 2 classes:
1. Initialize the outcome
2. Iterate from 1 to total number of trees
2.1 Update the weights for targets based on previous run (higher for the ones mis-classified)
2.2 Fit the model on selected subsample of data
2.3 Make predictions on the full set of observations
2.4 Update the output with current results taking into account the learning rate
3. Return the final output.
This is an extremely simplified (probably naive) explanation of GBM’s working. But, it will help every beginners to understand this algorithm.
I know its a long list of parameters but I have simplified it for you in an excel file which you can download from this GitHub repository.
I’ve shared the standard codes in R and Python. At your end, you’ll be required to change the value of dependent variable and data set name used in the codes below. Considering the ease of implementing GBM in R, one can easily perform tasks like cross validation and grid search with this package.
> library(caret)
> fitControl <- trainControl(method = "cv",
number = 10, #5folds)
> tune_Grid <- expand.grid(interaction.depth = 2,
n.trees = 500,
shrinkage = 0.1,
n.minobsinnode = 10)
> set.seed(825)
> fit <- train(y_train ~ ., data = train,
method = "gbm",
trControl = fitControl,
verbose = FALSE,
tuneGrid = gbmGrid)
> predicted= predict(fit,test,type= "prob")[,2]
XGBoost (eXtreme Gradient Boosting) is an advanced implementation of gradient boosting algorithm. It’s feature to implement parallel computing makes it at least 10 times faster than existing gradient boosting implementations. It supports various objective functions, including regression, classification and ranking.
R Tutorial: For R users, this is a complete tutorial on XGboost which explains the parameters along with codes in R. Check Tutorial.
Python Tutorial: For Python users, this is a comprehensive tutorial on XGBoost, good to get you started. Check Tutorial.
Practice is the one and true method of mastering any concept. Hence, you need to start practicing if you wish to master these algorithms.
Up to this point, you have acquired significant knowledge on tree-based algorithms along with their practical implementations. Therefore, it’s time that you start working on them. Here are open practice problems where you can participate and check your live rankings on the leaderboard:
Practice Problem: Food Demand Forecasting Challenge | Predict the demand of meals for a meal delivery company | |
Practice Problem: HR Analytics Challenge | Identify the employees most likely to get promoted | |
Practice Problem: Predict Number of Upvotes | Predict number of upvotes on a query asked at an online question & answer platform |
Tree based algorithms are important for every data scientist to learn. Indeed, tree models are known to deliver the best model performance among the entire family of machine learning algorithms. In this tutorial, we learnt until GBM and XGBoost. And with this, we come to the end of this tutorial.
We discussed about tree-based algorithms from scratch. Additionally, we learned the importance of decision trees and how that simplistic concept is being used in boosting algorithms. Therefore, for better understanding, I would suggest you to continue practicing these algorithms practically. Also, do keep note of the parameters associated with boosting algorithms. I’m hoping that this tutorial would enrich you with complete knowledge on tree based modeling.
Free Course: Ensemble Learning Course: Ensemble Learning and Ensemble Learning Techniques
Certified program: Certified AI & ML Blackbelt+ Program
A. A tree is a hierarchical data structure that represents and organizes data to facilitate easy navigation and search. It comprises nodes connected by edges, creating a branching structure. The topmost node is the root, and nodes below it are child nodes.
A. A tree search algorithm refers to techniques used to explore and traverse trees. One common example is the depth-first search (DFS), where we explore as far as possible along a branch before backtracking. Another is the breadth-first search (BFS), which explores nodes level by level. These algorithms are essential for tasks like finding paths, analyzing networks, and solving puzzles.
A. Tree traversal algorithms allow us to visit all nodes in a tree systematically. Here are two common methods:
a- Inorder Traversal: Visits nodes in the order left-child, root, right-child. Useful for binary search trees.
b- Preorder Traversal: Visits nodes in the order root, left-child, right-child. Often used for expression evaluation. Other traversal methods include postorder, level-order, and more. These algorithms are fundamental for processing tree-based data.
A. While there isn’t a single “top” tree algorithm, decision trees stand out. Decision trees are machine-learning models used for classification and regression tasks.
Lorem ipsum dolor sit amet, consectetur adipiscing elit,