While trying to build a better predictive model, we come across a famous ensemble technique in machine learning known as Random Forest. The Random Forest algorithm brings with it the concept of the Out-of-Bag score (OOB_Score).

Random Forest is a powerful ensemble technique for machine learning and data science, but most people tend to skip the concept of the OOB_Score while learning about the algorithm, and hence fail to appreciate the full importance of Random Forest as an ensemble method.

This blog will walk you through the OOB_Score concept with the help of examples.

- Gain an understanding of the motivations behind using Random Forest algorithms in **machine learning** models, including their advantages over other models.
- Learn about the concepts of bootstrapping and the Out-of-Bag (OOB) sample, and how they contribute to the formation and evaluation of a Random Forest.
- Understand how to calculate the Out-of-Bag score, its interpretation, and its role as an internal validation mechanism in Random Forest models.
- Analyze the benefits and limitations of using the Out-of-Bag score for model assessment, including how it compares to other model evaluation techniques.

One of the most interpretable models used for **supervised learning** is the Decision Tree, where the algorithm makes decisions and predicts values using if-else conditions, as shown in the example.

Decision Trees are easy to understand and interpret. However, they have one major issue:

- If you grow a tree to its maximum depth (the default setting), it will capture all the minute details of the training dataset.
- Applying it to the testing data then gives a high error due to high variance (overfitting of the training data).

Hence, to have the best of both worlds, that is, less variance and more interpretability, the Random Forest algorithm was introduced.

Random Forests, or Random Decision Forests, are an ensemble learning method for classification and regression problems. They operate by constructing a multitude of independent decision trees (using bootstrapping) at training time and outputting the majority prediction from all the trees as the final output.

Constructing many decision trees in a **Random Forest** helps the model generalize the data pattern rather than memorize it, and therefore reduces the variance (reduces overfitting).

But how do we select a training set for every new decision tree made in a Random Forest? This is where **Bootstrapping** kicks in!

We create new training sets for multiple decision trees in Random Forest using the concept of Bootstrapping, which is essentially random sampling with replacement.

Let us look at an example to understand how bootstrapping works:

Here, the main training dataset consists of five animals, and we now want to make different samples out of this one main training set:

- Fix the sample size.
- Randomly choose a data point for the sample.
- After selection, put it back in the main set (replacement).
- Choose another data point from the main training set for the sample and, after selection, put it back.
- Repeat the above steps until the sample reaches the specified size.
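The steps above amount to random sampling with replacement, which the Python standard library can do directly. Here is a minimal sketch using the five-animal training set from the example (the seed is arbitrary, chosen only for reproducibility):

```python
import random

random.seed(42)  # arbitrary seed, for reproducibility only

# Main training set of five animals, as in the example above
animals = ["Dog", "Cat", "Rat", "Cow", "Hen"]

def bootstrap_sample(data, size):
    """Draw `size` points randomly WITH replacement (bootstrapping)."""
    return [random.choice(data) for _ in range(size)]

sample = bootstrap_sample(animals, size=len(animals))
print("Bootstrap sample:", sample)

# Points that never made it into this sample (they will matter shortly)
left_out = set(animals) - set(sample)
print("Left out:", left_out)
```

Because draws are made with replacement, a typical sample repeats some animals and misses others entirely, exactly as the example describes.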

Note: Random Forest bootstraps both data points and features while making multiple independent decision trees.

The total number of trees in a Random Forest, also called estimators, can be set using the n_estimators parameter.

In the above example, you can observe that some animals are repeated in the sample, while others do not occur even once.

Here, Sample 1 does not have Rat and Cow, whereas Sample 3 has all the animals of the main training set.

While making the samples, data points were chosen randomly and with replacement. The data points that fail to be a part of a particular sample are known as **Out-of-Bag (OOB)** points.

Where does the OOB_Score come into the picture?

OOB_Score is a very powerful **validation technique**, used especially for the Random Forest algorithm to obtain the least-variance results.

Note: With a conventional validation split, data must be held out of training, and rows reused across model runs can leak information into the evaluation. The OOB approach avoids this: each row is evaluated only by the trees that never saw it during training, so there is no leakage, and it gives a better model with low variance. That is why we use the OOB_Score for validating the model.
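As a side note on how much data ends up out-of-bag: for a bootstrap sample of the same size as the training set, the probability that a given row is never drawn is (1 − 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick check in Python:

```python
import math

# Probability that a given row is never picked in a bootstrap sample
# of size n drawn (with replacement) from n rows: (1 - 1/n)^n
for n in (5, 100, 10_000):
    print(n, (1 - 1 / n) ** n)

# The limit is 1/e, so roughly a third of the rows are
# out-of-bag for each individual tree.
print("1/e =", math.exp(-1))
```

This is why every tree in the forest reliably has a pool of unseen rows to be validated on.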

Let’s understand OOB_Score through an example:

Here, we have a training set with 5 rows and a classification target variable indicating whether each animal is a domestic/pet animal.

In the random forest, we build multiple decision trees. Below, we show a bootstrapped sample for one particular decision tree, say DT_1.

Here, the Rat and Cat data have been left out. Since Rat and Cat are OOB for DT_1, we predict the values for Rat and Cat using DT_1. (Note: the Rat and Cat data were not seen by DT_1 while training the tree.)

Just like DT_1, there would be many more decision trees where either Rat or Cat, or both, were left out.

Let’s say that the 3rd, 7th, and 100th decision trees have ‘Rat’ as an OOB datapoint. This means that none of them saw the ‘Rat’ data before predicting the value for ‘Rat’.

So, we record all the predicted values for “Rat” from the trees DT_1, DT_3, DT_7, and DT_100.

We then see that the aggregated/majority prediction is the same as the actual value for “Rat”.

**(Note: none of these trees had seen the “Rat” data before, and they still predicted the value for this data point correctly.)**

Similarly, every data point is passed for prediction to the trees for which it is OOB, and an aggregated prediction is recorded for each row.
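The aggregation step described above is just a majority vote over the OOB trees. A minimal sketch, where the tree names and predicted labels are made up purely for illustration:

```python
from collections import Counter

# Hypothetical OOB predictions for the row "Rat" ("yes" = domestic/pet),
# coming only from the trees that did NOT see "Rat" during training.
oob_predictions = {"DT_1": "no", "DT_3": "no", "DT_7": "yes", "DT_100": "no"}

# Aggregated/majority prediction for "Rat" across its OOB trees
majority, votes = Counter(oob_predictions.values()).most_common(1)[0]
print(majority, votes)  # "no" wins with 3 of 4 votes
```

Comparing this majority prediction against the actual label for every row is exactly what the OOB_Score does.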

The OOB_Score is computed as the fraction of OOB samples that are predicted correctly, and the **OOB error is the fraction of OOB samples that are misclassified** (OOB error = 1 − OOB_Score).
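In scikit-learn, both quantities come almost for free when the forest is trained with `oob_score=True`; the dataset below is a synthetic stand-in for a real training set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy classification data, standing in for a real training set
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# oob_score=True makes the forest evaluate every row on the trees
# for which that row was out-of-bag; n_estimators is the number of trees.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

print("OOB score:", rf.oob_score_)      # fraction of OOB rows predicted correctly
print("OOB error:", 1 - rf.oob_score_)  # fraction of OOB rows misclassified
```

Note that `oob_score_` is computed during training itself, so no separate validation split is needed.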

- **No leakage of data:** Since the model is validated on the OOB samples, which were not used in any way while training it, there is no leakage of data, and this ensures a better predictive model.
- **Better predictive model:** The OOB_Score helps achieve the least variance, and hence it makes a much better predictive model than one validated with other techniques.
- **Less computation:** It requires less computation, as it allows you to test the data while the model is being trained.

- **Time-consuming:** Although you can test the data as you train, the process is a bit more time-consuming compared to other validation techniques.
- **Not good for large datasets:** Because the process can be time-consuming compared to other techniques, training the model on a huge dataset may take a lot more time.
- **Best for small and medium-sized datasets:** Even though the process is time-consuming, if the dataset is small or medium-sized, you should prefer the OOB_Score over other techniques for a much better predictive model.

- The Out-of-Bag score serves as a reliable validation score, offering an insight into the model’s prediction error without the need for a separate validation dataset.
- By bypassing the necessity for a distinct validation dataset or test set, the Out-of-Bag error provides an efficient means to estimate the prediction error, streamlining the model evaluation process.
- The accuracy of the Out-of-Bag validation score highlights its effectiveness in reflecting the prediction error, making it a valuable tool for assessing model performance in the absence of an external validation dataset.

Random Forest can be a very powerful technique for making better predictions if we use the OOB_Score technique. Even though you spend a bit more time training the Random Forest model with the oob_score parameter set to True, the predictions justify the time spent.

A. The out-of-bag error is a performance metric that estimates the performance of the Random Forest model using samples not included in the bootstrap sample for training.

A. In Random Forest classification, bagging, or bootstrap aggregation, combines predictions from multiple decision trees to reduce variance and avoid overfitting. By using different subsets of the training data (via sklearn’s RandomForestClassifier), it ensures that individual models generalize better. The model enhances its overall performance by making the final prediction based on a majority vote.

A. In a Random Forest model, each tree within the ensemble calculates the Out-of-Bag (OOB) error using the data samples it did not select for training during the bootstrap sampling process. These samples, referred to as “out-of-bag” samples, are the ones left out for each tree.

