In this article, we will cover the difference between a random forest and a decision tree, along with a Python code implementation of decision trees.
There are various algorithms in Machine Learning for both regression and classification problems, but choosing the best and most efficient algorithm for a given dataset is the key step in building a good Machine Learning model.
One such algorithm, suitable for both classification and regression problems, is the decision tree.
Decision trees closely mimic human decision-making, which makes them easy to understand.
The logic behind a decision tree is easy to follow because it has a flow-chart-like, tree-shaped structure, which makes it simple to visualize and to extract information about the underlying process.
Table of Contents
What Is a Decision Tree?
Elements of a Decision Tree
How to Build Decision Trees from Scratch
How Does the Decision Tree Algorithm Work?
Acquaintance with EDA (Exploratory Data Analysis)
Decision Trees and Random Forests
Advantages of the Decision Tree
Disadvantages of the Decision Tree
Python Code Implementation
1. What is a Decision Tree?
A decision tree is a supervised Machine Learning algorithm used for both classification and regression problems. A decision tree is, quite literally, a tree of nodes: it keeps splitting the data into branches, based on a number of factors, until a stopping threshold is reached. A decision tree consists of a root node, child nodes, and leaf nodes.
Let's understand decision trees by taking a real-life scenario.
Imagine that you play football every Sunday and you always invite a friend to come and play with you. Sometimes your friend actually comes, and sometimes he doesn't.
Whether or not he comes depends on numerous factors, such as the weather, temperature, wind, and fatigue. You start taking all of these features into consideration and tracking them alongside your friend's decision to come and play or not.
You can use this data to predict whether or not your friend will come to play football. The technique you could use is a decision tree. Here's what the decision tree would look like after implementation:
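To get a feel for what such a tree encodes, here is a minimal sketch of that decision logic written as plain Python if/else rules. The feature names and thresholds (outlook, windy, temperature) are illustrative assumptions, not a learned model.

# Illustrative sketch only: these rules mimic what a small learned tree might encode.
def will_friend_play(outlook, windy, temperature):
    """Follow the tree from the root down to a leaf and return the prediction."""
    if outlook == "rainy":        # root split on the weather outlook
        return False              # leaf: too wet to play
    if windy:                     # next split on wind
        return temperature > 20   # leaf: he only comes if it is warm enough
    return True                   # calm, dry weather: he shows up

print(will_friend_play(outlook="sunny", windy=False, temperature=25))  # True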
2. Elements of a Decision Tree
Every decision tree consists of the following elements:
a) Nodes
b) Edges
c) Root
d) Leaves
a) Nodes: The points where the tree splits according to the value of some attribute/feature of the dataset.
b) Edges: They direct the outcome of a split to the next node. In the figure above, there are nodes for features such as outlook, humidity, and windy, and there is an edge for each potential value of each of those attributes/features.
c) Root: The node where the first split takes place.
d) Leaves: The terminal nodes that predict the outcome of the decision tree.
3. How to Build Decision Trees from Scratch?
While building a decision tree, the main task is to select the best attribute, from the full feature list of the dataset, for the root node as well as for the sub-nodes. This selection is achieved with a technique known as an Attribute Selection Measure (ASM).
With the help of an ASM, we can easily select the best feature for each node of the decision tree.
There are two common ASM techniques:
a) Information Gain
b) Gini Index
a) Information Gain:
1. Information gain is the measurement of the change in entropy after the dataset is split on an attribute.
2. It tells us how much information a feature/attribute provides about the class.
3. The splitting of nodes and the building of the decision tree are driven by the value of the information gain.
4. A decision tree always tries to maximize the information gain, and the node/attribute with the highest information gain is split first. Information gain can be calculated using the formula below:
Information Gain = Entropy(S) − [Weighted Avg × Entropy(each feature)]
Entropy: Entropy signifies the randomness in the dataset. It is defined as a metric to measure impurity. Entropy can be calculated as:
Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)
where:
S = the total set of samples
P(yes) = the probability of yes
P(no) = the probability of no
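To make these formulas concrete, here is a minimal sketch in plain Python that computes the entropy of a set of class labels and the information gain of a split. The toy counts are illustrative only.

import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p * log2(p)) over the class proportions in labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    """Entropy of the parent minus the weighted average entropy of the child splits."""
    total = len(parent_labels)
    weighted = sum((len(g) / total) * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

# Toy example: 9 "yes" / 5 "no" samples split on a hypothetical attribute.
parent = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(round(entropy(parent), 3))                  # ≈ 0.94
print(round(information_gain(parent, split), 3))  # ≈ 0.048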
b) Gini Index:
The Gini index is another measure of impurity/purity; it is used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
An attribute with a low Gini index value should be preferred over one with a high Gini index value.
It only produces binary splits, and the CART algorithm uses the Gini index to create them.
The Gini index can be calculated using the formula below:
Gini Index = 1 − Σⱼ (Pⱼ)²
where Pⱼ stands for the probability of class j.
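Here is a matching minimal sketch for the Gini index; the toy label lists are illustrative only.

from collections import Counter

def gini_index(labels):
    """Gini = 1 - sum(p_j ** 2) over the class proportions in labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

# A pure node has a Gini index of 0; a perfectly mixed binary node has 0.5.
print(gini_index(["yes"] * 10))              # 0.0
print(gini_index(["yes"] * 5 + ["no"] * 5))  # 0.5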
4. How Does the Decision Tree Algorithm Work?
The basic idea behind any decision tree algorithm is as follows:
1. Select the best feature, using an Attribute Selection Measure (ASM), to split the records.
2. Make that attribute/feature a decision node and break the dataset into smaller subsets.
3. Build the tree by repeating this process recursively for each child node until one of the stopping conditions is reached (a minimal recursive sketch follows this list), for example:
a) All tuples belong to the same attribute value.
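To make the recursion concrete, here is a minimal from-scratch sketch of these steps on a tiny, hypothetical "play football" dataset. The helper functions, feature names, and data are illustrative assumptions, not a production implementation.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_feature(rows, labels, features):
    """Step 1: pick the feature with the highest information gain."""
    def gain(feature):
        total = len(labels)
        weighted = 0.0
        for value in set(row[feature] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
            weighted += (len(subset) / total) * entropy(subset)
        return entropy(labels) - weighted
    return max(features, key=gain)

def build_tree(rows, labels, features):
    """Steps 2-3: make a decision node, split the data, and recurse on each child."""
    if len(set(labels)) == 1:      # stopping condition: all samples share one class
        return labels[0]
    if not features:               # no features left: return the majority class
        return Counter(labels).most_common(1)[0][0]
    feature = best_feature(rows, labels, features)
    remaining = [f for f in features if f != feature]
    node = {feature: {}}
    for value in set(row[feature] for row in rows):
        child_rows = [row for row in rows if row[feature] == value]
        child_labels = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        node[feature][value] = build_tree(child_rows, child_labels, remaining)
    return node

# Tiny hypothetical dataset: does the friend come to play?
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "yes", "no"]
print(build_tree(rows, labels, ["outlook", "windy"]))
# e.g. {'outlook': {'sunny': 'yes', 'rainy': {'windy': {'no': 'yes', 'yes': 'no'}}}}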
5. Decision Trees and Random Forests
Decision trees and random forests are both tree-based methods used in Machine Learning.
A decision tree is a single Machine Learning model that makes a prediction by passing a sample through its splits, feature by feature.
A random forest, on the other hand, is a collection of decision trees that are grouped and trained together, with each tree seeing a random subset of the samples and features in the given dataset.
Instead of relying on just one decision tree, the random forest takes a prediction from every tree and gives its final output based on the majority vote of those predictions. In other words, a random forest can be defined as a collection of multiple decision trees.
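The contrast is easy to see in code. Below is a minimal sketch, assuming scikit-learn and an illustrative built-in dataset, that trains a single decision tree and a random forest on the same split and compares their test accuracy.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# The forest aggregates the votes of its 100 trees, which usually makes it
# less prone to overfitting than a single fully grown tree.
print("Decision tree accuracy:", tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))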
6. Advantages of the Decision Tree
1. It is simple to implement, and it follows a flow-chart-like structure that resembles human decision-making.
2. It proves to be very useful for decision-related problems.
3. It helps to find all of the possible outcomes for a given problem.
4. There is very little need for data cleaning compared with other Machine Learning algorithms.
5. It handles both numerical and categorical values.
7. Disadvantages of the Decision Tree
1. Too many layers can sometimes make a decision tree extremely complex.
2. It may result in overfitting (which can be mitigated by using the Random Forest algorithm).
3. As the number of class labels grows, the computational complexity of the decision tree increases.
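Besides switching to a random forest, one common way to keep the first two issues in check is to limit how deep the tree is allowed to grow. Below is a minimal sketch, assuming scikit-learn and an illustrative built-in dataset, that compares shallow and fully grown trees with cross-validation.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for depth in (2, 5, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"max_depth={depth}: mean CV accuracy = {score:.3f}")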
8. Python Code Implementation

# Load the dataset (this assumes the Kyphosis dataset is available locally as kyphosis.csv)
import pandas as pd
raw_data = pd.read_csv('kyphosis.csv')

# Split the dataset into training data and test data
from sklearn.model_selection import train_test_split

x = raw_data.drop('Kyphosis', axis=1)
y = raw_data['Kyphosis']
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size=0.3)

# Train the decision tree model
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(x_training_data, y_training_data)

# Make predictions on the test data
predictions = model.predict(x_test_data)

# Measure the performance of the decision tree model
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test_data, predictions))
print(confusion_matrix(y_test_data, predictions))
With this, I finish this blog. Hello everyone, Namaste!
My name is Pranshu Sharma, and I am a Data Science enthusiast.
Thank you so much for taking your precious time to read this blog. Feel free to point out any mistakes (I'm a learner, after all) and to provide feedback or leave a comment.