GSplitting Decision Trees with Gini Impurity
In decision trees, making informed choices is pivotal for accurate and robust predictions. Selecting the optimal split to branch nodes significantly influences a decision tree’s effectiveness. One of the powerful methods employed for this purpose is the Gini Impurity. This article delves into the intricacies of utilizing Gini Impurity to discern the best split in decision trees. We will explore the concepts, calculations, and real-world implications, equipping you with a comprehensive understanding of how it enhances the precision and reliability of decision tree models. Whether you’re a novice or a seasoned data practitioner, uncovering the secrets behind this essential algorithm will empower you to harness the full potential of decision trees in your data analysis endeavors.
Table of contents
What is Gini Impurity?
Gini impurity is a measure used in decision tree algorithms to quantify a dataset’s impurity level or disorder. In binary classification problems, it assesses the likelihood of an incorrect classification when a randomly selected data point is assigned a class label based on the distribution of classes in a particular node. It ranges from 0 to 0.5, where 0 indicates a perfectly pure node (all instances belong to the same class), and 0.5 signifies maximum impurity (an equal distribution of classes). In decision trees, it aids in selecting the optimal split by identifying features that result in more homogeneous subsets of data, ultimately contributing to the creation of accurate and reliable predictive models.
Decision Tree Algorithm for Selecting the Best Split
There are multiple algorithms that are used by the decision tree to decide the best split for the problem. Let’s first look at the most common and popular out of all of them, which is Gini Impurity. It measures the impurity of the nodes and is calculated as:
Let’s first understand what Gini is and then I’ll show you how you can calculate the Gini impurity for split and decide the right split. Let’s say we have a node like this-
So what Gini says is that if we pick two points from a population at random, the pink ones highlighted here, then they must be from the same class. Let’s say we have a completely pure node-
Can you guess what would be the probability that a randomly picked point will belong to the same class? Well, obviously it will be 1 since all the points here belong to the same class. So no matter which two points you picked, they will always belong to that one class and hence the probability will always be 1 if the node is pure. And that is what we want to achieve using Gini.
Gini ranges from zero to one, as it is a probability and the higher this value, the more will be the purity of the nodes. And of course, a lesser value means lesser pure nodes.
Properties of Gini Impurity
Let’s look at its properties before we actually calculate the Gini impurity to decide the best split.
We decide the best split based on the Gini impurity and as we discussed before, Gini impurity is:
Here Gini denotes the purity and hence Gini impurity tells us about the impurity of nodes. Lower the Gini impurity we can safely infer the purity will be more and hence a higher chance of the homogeneity of the nodes.
Gini works only in those scenarios where we have categorical targets. It does not work with continuous targets.
A very important point to note to keep in mind. For example, if you want to predict the house price or the number of bikes that have been rented, Gini is not the right algorithm. It only performs binary splits either yes or no, success or failure, and so on. So it will only split a node into two sub-nodes. These are the properties of Gini impurity.
Steps to Calculate Gini Impurity for a Split
Let’s now look at the steps to calculate the Gini split.
Step 1: Calculate GI for Sub-nodes
First, we calculate the Gini impurity for sub-nodes, as you’ve already discussed Gini impurity is, and I’m sure you know this by now:
Gini impurity = 1 – Gini
Here is the sum of squares of success probabilities of each class and is given as:
Considering that there are n classes.
Step 2: Calculate GI of the Split
Once we’ve calculated the Gini impurity for sub-nodes, we calculate the Gini impurity of the split using the weighted impurity of both sub-nodes of that split. Here the weight is decided by the number of observations of samples in both the nodes. Let’s look at these calculations using an example, which will help you understand this even better.
For the split on the performance in the class, remember this is how the split was?
We have two categories, one is “above average” and the other one was “below average”.
When we focus on the above average, we have 14 students out of which 8 play cricket and 6 do not. The probability of playing cricket would be 8 divided by 14, which is around 0.57, and similarly, for not playing cricket, the probability will be 6 divided by 14, which will be around 0.43. Here for simplicity, I’ve rounded up the calculations rather than taking the exact number.
Similarly, when we look at below average, we calculated all the numbers and here they are- the probability of playing is 0.33 and of not playing is 0.67-
Let’s now calculate the GI of the sub-nodes for above average and here’s the calculation-
It will be, one minus the square of the probability of success for each category, which is 0.57 for playing cricket and 0.43 for not playing cricket. So after this calculation Gini comes out to be around 0.49. The Below average node will do the same calculation as Gini. For below average:
It comes up to be around 0.44. Just take a pause and analyze these numbers.
Now to calculate the Gini impurity of the split, we will take the weighted Gini impurities of both nodes, above average and below average. In this case, the weight of a node is the number of samples in that node divided by the total number of samples in the parent node. So for the above-average node here, the weight will be 14/20, as there are 14 students who performed above the average of the total 20 students that we had.
And the weight for below average is 6/20. So, the weighted Gini impurity will be the weight of that node multiplied by the Gini impurity of that node. The weighted Gini impurity for performance in class split comes out to be:
Step 3: Calculate GI for Split on Class
Similarly, here we have captured the Gini impurity for the split on class, which comes out to be around 0.32–
Now, if we compare the two Gini impurities for each split-
We see that the Gini impurity for the split on Class is less. And hence class will be the first split of this decision tree.
Similarly, for each split, we will calculate the Gini impurities and the split producing minimum Gini impurity will be selected as the split. And you know, that the minimum value of Gini impurity means that the node will be purer and more homogeneous.
In this article, we saw one of the most popular splitting algorithms in Decision Trees- Gini Impurity. It can be used for categorical target variables only. There are other algorithms as well used for splitting, which if you want to understand you can let me know in the comment section.
If you are looking to kick start your Data Science Journey and want every topic under one roof, your search stops here. Check out Analytics Vidhya’s Certified AI & ML BlackBelt Plus Program!
If you have any questions, let me know in the comments section!
Frequently Asked Questions
A. The Gini Impurity formula is: 1 – (p₁)² – (p₂)², where p₁ and p₂ represent the probabilities of the two classes in a binary classification problem.
A. A Gini Impurity of 0 indicates a perfectly pure node, where all instances belong to the same class.
A. The Gini coefficient measures income inequality, while Gini Impurity assesses impurity in decision tree nodes, helping to determine optimal splits for classifying data.
A. Entropy measures a set’s disorder level, while Gini impurity quantifies the probability of misclassifying instances. Both are used in decision trees to determine node splits, but Gini favors larger partitions.
The highest Gini impurity is 0.5. This indicates that a node has an equal distribution of classes, meaning that it is completely impure.