Random forest is a widely used machine learning algorithm, developed by Leo Breiman and Adele Cutler, that combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems. In this article, we will look at how the random forest algorithm works, its advantages, how it differs from other algorithms, and how to use it.
You will learn what random forest is and how it operates by aggregating predictions from multiple decision trees. We will work through a practical example to demonstrate its application, and then look at variable importance and overall model performance.
This article was published as a part of the Data Science Blogathon.
To effectively use Random Forest, it is important to understand the underlying assumptions of the algorithm:
Random forest merges the outputs of numerous decision trees to produce a single outcome. Its popularity stems from its user-friendliness and versatility, which make it suitable for both classification and regression tasks. The algorithm’s strength lies in its ability to handle complex datasets and mitigate overfitting, making it a valuable tool for various predictive tasks in machine learning.
One of the most important features of the random forest algorithm is that it can handle data sets with a continuous target variable, as in regression, and with a categorical target variable, as in classification. It performs well on both kinds of task. In this tutorial, we will understand the working of random forest and implement it on a classification task.
Let’s dive into a real-life analogy to understand this concept further. A student named X wants to choose a course after his 10+2, and he can’t decide which course fits his skill set. So he decides to consult various people, such as his cousins, teachers, parents, degree students, and working people. He asks them varied questions, like why he should choose a given course, the job opportunities it offers, the course fee, etc. Finally, after consulting these people, he decides to take the course suggested by most of them.
Before understanding the working of the random forest algorithm in machine learning, we must look into the ensemble learning technique. Ensemble simply means combining multiple models. Thus a collection of models is used to make predictions rather than an individual model.
Ensemble uses two types of methods:
As mentioned earlier, Random forest Classifier works on the Bagging principle. Now let’s dive in and understand bagging in detail.
Bagging, also known as Bootstrap Aggregation, serves as the ensemble technique in the Random Forest algorithm. Here are the steps involved in Bagging:
Now let’s look at an example by breaking it down with the help of the following figure. Here the bootstrap samples (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) are drawn from the actual data with replacement, which means a sample is likely to contain some rows more than once. The models (Model 01, Model 02, and Model 03) are trained independently, each on its own bootstrap sample. Each model generates results as shown. The Happy emoji has a majority over the Sad emoji, so based on majority voting the final output is the Happy emoji.
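The bagging procedure in this example can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the figure: it assumes a synthetic binary dataset from scikit-learn's make_classification in place of the actual data, and the names n_models, models, votes, and final are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data stands in for the "actual data" (an assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)

n_models = 3  # Model 01, Model 02, Model 03
models = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement,
    # so the same row can appear more than once.
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Each model votes; the class with the most votes is the final output.
votes = np.stack([m.predict(X) for m in models])         # shape (n_models, n_samples)
final = (votes.sum(axis=0) > n_models / 2).astype(int)   # majority vote for labels 0/1
print(final[:10])
```

With an odd number of models and two classes there are no ties, which keeps the voting rule simple.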
Boosting is another technique that uses the concept of ensemble learning. A boosting algorithm combines multiple simple models (also known as weak learners or base estimators) to generate the final output. It does this by building the weak models in series, with each new model trying to correct the errors of the ones before it.
There are several boosting algorithms. AdaBoost was the first really successful boosting algorithm developed for binary classification. AdaBoost is short for Adaptive Boosting and is a prevalent boosting technique that combines multiple “weak classifiers” into a single “strong classifier.” Other boosting techniques include GBM, XGBoost, LightGBM, and CatBoost; for more, see “4 Boosting Algorithms You Should Know: GBM, XGBoost, LightGBM & CatBoost.”
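As a rough illustration of boosting with weak learners, the sketch below fits scikit-learn's AdaBoostClassifier, whose default base estimator is a depth-1 decision tree (a "stump"). The synthetic dataset and the choice of 50 estimators are assumptions made for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The weak learners are fitted one after another; each round reweights
# the training rows so that later learners focus on earlier mistakes.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X_train, y_train)

print("weak learners fitted:", len(ada.estimators_))
print("depth of first weak learner:", ada.estimators_[0].max_depth)
print("test accuracy:", ada.score(X_test, y_test))
```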
For example:
Consider the fruit basket as the data, as shown in the figure below. n samples are taken from the fruit basket, and an individual decision tree is constructed for each sample. Each decision tree generates an output, as shown in the figure. The final output is decided by majority voting. In the figure below, most of the decision trees output apple rather than banana, so the final output is apple.
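The same idea can be checked against a fitted forest. The sketch below is an illustration on synthetic data, not the fruit-basket example itself. One detail worth knowing: scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so the code reproduces the forest's prediction that way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for the fruit basket (an assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities predicted by the individual trees.
avg_proba = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# The class with the highest average probability is the forest's output.
manual_pred = forest.classes_[avg_proba.argmax(axis=1)]
print("matches forest.predict:", np.array_equal(manual_pred, forest.predict(X)))
```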
Random forest is a collection of decision trees; still, there are a lot of differences in their behavior.
Decision trees | Random Forest |
1. Decision trees normally suffer from overfitting if they are allowed to grow without any control. | 1. Random forests are built from subsets of the data, and the final output is based on averaging or majority voting, which reduces the problem of overfitting. |
2. A single decision tree is faster in computation. | 2. A random forest is comparatively slower. |
3. When a data set with features is taken as input by a decision tree, it formulates a set of rules to make predictions. | 3. Random forest randomly selects observations, builds a decision tree on each selection, and takes the average result. It doesn’t use any set of formulas. |
Thus random forests are much more successful than single decision trees, but only if the individual trees are diverse and each is reasonably accurate.
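The overfitting difference in the table can be observed with a small experiment. The sketch below assumes a synthetic dataset with 10% label noise (the flip_y argument); the exact scores depend on the data and the seed, so treat the output as indicative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A single tree grown without any depth limit, and a forest of 100 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# A large gap between train and test accuracy is a sign of overfitting.
for name, model in [("decision tree", tree), ("random forest", forest)]:
    print(f"{name}: train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```

An unrestricted tree memorizes the training rows, noise included, so comparing its train and test scores with those of the forest shows how much of that fit generalizes.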
Hyperparameters are used in random forests to either enhance the performance and predictive power of models or to make the model faster.
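As a quick reference, the sketch below sets some commonly tuned scikit-learn hyperparameters, grouped by whether they mainly affect predictive power or speed. The values are arbitrary examples rather than recommendations, and the comments paraphrase the scikit-learn parameter descriptions.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    # Mainly affect predictive power
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # features considered when looking for the best split
    min_samples_leaf=5,    # minimum number of samples required at a leaf node
    max_depth=None,        # None lets each tree grow until its leaves are pure
    # Mainly affect speed and reproducibility
    n_jobs=-1,             # use all available processors
    random_state=42,       # makes the bootstrap and feature sampling repeatable
    oob_score=True,        # score the model on the out-of-bag rows after fitting
)
print(rf.get_params()["max_features"])
```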
Now let’s implement Random Forest in scikit-learn.
# Importing the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('heart_v2.csv')
print(df.head())
sns.countplot(x='heart disease', data=df)
plt.title('Value counts of heart disease patients')
plt.show()
# Putting feature variable to X
X = df.drop('heart disease',axis=1)
# Putting response variable to y
y = df['heart disease']
# Now let's split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
X_train.shape, X_test.shape
from sklearn.ensemble import RandomForestClassifier
classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1, max_depth=5,
                                       n_estimators=100, oob_score=True)
%%time
classifier_rf.fit(X_train, y_train)
# checking the oob score
classifier_rf.oob_score_
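The out-of-bag (OOB) score works because every tree is trained on a bootstrap sample and therefore never sees some of the training rows. The sketch below is a small simulation, not part of the heart-disease workflow, that estimates how large that unseen share is. The value of n_rows is a hypothetical training-set size.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10_000  # hypothetical training-set size

# Draw one bootstrap sample: n_rows row indices with replacement.
sample = rng.integers(0, n_rows, size=n_rows)

# Rows that were never drawn are "out-of-bag" for the tree trained on this sample.
oob_fraction = 1 - len(np.unique(sample)) / n_rows

print(f"out-of-bag fraction: {oob_fraction:.3f}")
print(f"theoretical limit 1/e: {np.exp(-1):.3f}")
```

Each row is missed by a single draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The OOB score evaluates each row using only the trees that did not train on it.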
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
params = {
    'max_depth': [2, 3, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100, 200],
    'n_estimators': [10, 25, 30, 50, 100, 200]
}
from sklearn.model_selection import GridSearchCV
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv=4,
                           n_jobs=-1, verbose=1, scoring="accuracy")
%%time
grid_search.fit(X_train, y_train)
grid_search.best_score_
rf_best = grid_search.best_estimator_
rf_best
From hyperparameter tuning, we can fetch the best estimator, as shown. The best set of parameters identified was max_depth=5, min_samples_leaf=10, n_estimators=10.
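Once a best estimator has been selected, a natural follow-up is to score it on the held-out test split. The sketch below shows the pattern on synthetic data with a deliberately small grid so that it runs quickly. The dataset and grid values are assumptions and do not reproduce the heart-disease results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption), split 70/30 as in the walkthrough.
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# A reduced grid: 2 x 2 x 2 = 8 candidates, each cross-validated 4 times.
params = {"max_depth": [2, 5], "min_samples_leaf": [5, 10], "n_estimators": [10, 25]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           params, cv=4, scoring="accuracy")
grid_search.fit(X_train, y_train)

# best_params_ holds the winning combination as a dict.
print(grid_search.best_params_)

# The test split was not used during the search, so this is an unbiased check.
test_acc = accuracy_score(y_test, grid_search.best_estimator_.predict(X_test))
print(f"held-out test accuracy: {test_acc:.3f}")
```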
from sklearn.tree import plot_tree
# Plot two individual trees from the tuned forest.
# class_names are matched to the class labels in ascending order, so check
# this order against how 'heart disease' is encoded in the data.
plt.figure(figsize=(80,40))
plot_tree(rf_best.estimators_[5], feature_names=list(X.columns), class_names=['Disease', "No Disease"], filled=True);
plt.figure(figsize=(80,40))
plot_tree(rf_best.estimators_[7], feature_names=list(X.columns), class_names=['Disease', "No Disease"], filled=True);
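The trees created by estimators_[5] and estimators_[7] are different, because each tree is trained on its own bootstrap sample and considers a random subset of features at each split. Thus we can say that each tree is built independently of the others.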
rf_best.feature_importances_
imp_df = pd.DataFrame({
"Varname": X_train.columns,
"Imp": rf_best.feature_importances_
})
imp_df.sort_values(by="Imp", ascending=False)
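To get a feel for how to read these values, the sketch below fits a forest on synthetic data in which only the first two of five features carry signal. The dataset and the column names f0 to f4 are assumptions. scikit-learn's feature_importances_ are impurity-based and normalized, so they sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False the informative features come first (f0 and f1);
# the remaining three columns are pure noise.
X_arr, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                               n_redundant=0, shuffle=False, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each importance with its column name and sort, largest first.
imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(imp)
print("sum of importances:", imp.sum())
```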
Random forest is a great choice for building a model quickly and efficiently, and one of the best things about the random forest classifier is that it can handle missing values. It is a high-performing technique that is widely used across industries. It can handle binary, continuous, and categorical data. Overall, random forest is a fast, simple, flexible, and robust model, with some limitations.
Key Takeaways
A. Random forest is an ensemble learning method combining multiple decision trees, enhancing prediction accuracy, reducing overfitting, and providing insights into feature importance, widely used in classification and regression tasks.
A. Random forest works by first randomly selecting subsets of the training data. For each subset, it constructs decision trees, splitting the data at each node based on the best feature from a random subset of features. Each tree then makes a prediction for a given input. Finally, the random forest combines the predictions of all trees, using averaging for regression tasks or majority voting for classification tasks, to produce the final output.
A. Random forest keeps the low bias of fully grown decision trees, while bagging reduces their variance by averaging many trees. It works well even on a dataset with a large number of features, since each split considers only a subset of them. Moreover, training is fast because the trees are independent of each other, which makes the process parallelizable.
A. Random forest algorithms are used for their superior prediction accuracy, ability to handle large datasets, versatility in tasks, robustness to noise, and capability to provide feature importance insights.
A. Random forest is an ensemble learning method that combines multiple decision trees to improve accuracy and reduce overfitting, used for both classification and regression tasks. Regression, on the other hand, is a statistical technique that models the relationship between dependent and independent variables to predict continuous outcomes.
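For regression tasks, the forest's output is the average of the individual trees' predictions. The sketch below illustrates this with scikit-learn's RandomForestRegressor on a synthetic dataset. The data and the choice of 50 trees are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with a continuous target (an assumption).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Collect every tree's prediction, then average across trees.
tree_preds = np.stack([tree.predict(X) for tree in reg.estimators_])  # (50, 300)
manual_mean = tree_preds.mean(axis=0)

print("matches reg.predict:", np.allclose(manual_mean, reg.predict(X)))
```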
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.