Bayesian Optimization: bayes_opt or hyperopt

Rendyk 19 Jul, 2024

8 min read

Introduction

If you have experience in Machine Learning, specifically supervised learning, you should have known that hyper parameter-tuning is an important process to improve model accuracy. This process tunes hyperparameters in a Machine Learning algorithm.

As we have known, every algorithm requires input parameters from observed data and hyperparameters are not from the observed data. Hyperparameters are different for each algorithm. If we do not tune the hyperparameters, the default hyperparameters are used.

There are many ways to do hyper parameter-tuning. This article will later focus on Bayesian Optimization as this is my favorite. There are 2 packages that I usually use for Bayesian Optimization. They are “bayes_opt” and “hyperopt” (Distributed Asynchronous Hyper-parameter Optimization). We will simply compare the two in terms of the time to run, accuracy, and output.

But before that, we will discuss some basic knowledge of hyperparameter-tuning.

This article was published as a part of the Data Science Blogathon.

Introduction
What is Bayesian Optimization?
How to Perform Bayesian Optimization?
Hyperparameter-tuning
Hyper Parameter tuning Using GridSearchCV
- Disadvantages
Hyperparameter Tuning With Bayesian Optimization
Hyper Parameter-Tuning Using RandomizedSearchCV
- Problem With Uninformed Search
Baysian Optimization
- bayes_opt
- hyperopt
How Bayesian Optimization Works?
Conclusion

What is Bayesian Optimization?

Bayesian optimization is a technique used to find the best possible setting (minimum or maximum) for a function, especially when that function is complex, expensive to evaluate, or random. It’s a fancy way of saying it helps you efficiently find the best option by learning from previous evaluations.

How to Perform Bayesian Optimization?

Define the Objective Function:

This is the function you want to optimize. It takes a set of hyperparameters as input and returns a performance metric (e.g., accuracy, error rate) as output.

2. Specify the Search Space:

This defines the valid range of values for each hyperparameter you’re tuning.

3. Choose a Surrogate Model:

A surrogate model is a statistical model (like a Gaussian Process) that approximates the true objective function. It’s usually cheaper to evaluate than the actual function.

4. Select an Acquisition Function:

This function determines which set of hyperparameters to evaluate next. It balances exploring new areas of the search space and exploiting areas that seem promising based on past evaluations. Common choices include Expected Improvement (EI) and Upper Confidence Bound (UCB).

5. Run the Optimization Loop:

Iteration 1:Start with an initial set of hyperparameters (random sampling or informed guess).
Evaluate these hyperparameters using your objective function.
Train the surrogate model on the collected data (hyperparameters and their corresponding performance).
Use the acquisition function to find the next most promising set of hyperparameters.
Subsequent Iterations: Evaluate the new hyperparameters with the objective function.
Update the surrogate model with the new data point.
Use the acquisition function to find the following promising hyperparameter combination.

6. Stopping Criteria:

Define a stopping criteria based on your needs. This could be a maximum number of iterations, achieving a desired performance level, or running out of computational resources.

Hyperparameter-tuning

Hyperparameter-tuning is the process of searching the most accurate hyperparameters for a dataset with a Machine Learning algorithm. To do this, we fit and evaluate the model by changing the hyperparameters one by one repeatedly until we find the best accuracy.

The search methods can be uninformed search and informed search. Uninformed search tries sets of hyperparameters repeatedly and independently. Each search does not inform or suggest the other searches. Examples of uninformed search are GridSearchCV and RandomizedSearchCV.

Hyper Parameter tuning Using GridSearchCV

Now, I want to perform hyperparameter-tuning on GradientBoostingClassifier. The dataset is from Kaggle competition. The hyperparameters to tune are “max_depth”, “max_features”, “learning_rate”, “n_estimators”, and “subsample”.

Note that as mentioned above, these hyperparameters are only for GradientBoostingClassifier, not for the other algorithms. The accuracy metric is the accuracy score. I will run 5 fold cross-validation.

Below is the code for GridSearchCV. We can see that the value options for each hyperparameter are set in the “param_grid”.

For example, the GridSearchCV will try to run with n_estimators of 80, 100, and so on until 150. To know how many times the GridSearchCV will run, just multiply the number of value options in each hyperparameter with one another. It will be 8 x 3 x 3 x 5 x 3 = 1080. And for each of the 1080 GridSearchCV, there will be 5 fold cross-validation. That makes 1080 x 5 = 5400 models should be built to find which is the best.

# Load packages
from scipy.stats import uniform
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
# GridSearchCV
param_grid = {'max_depth':[3,4,5,6,7,8,9,10],
              'max_features':[0.8,0.9,1],
              'learning_rate':[0.01,0.1,1],
              'n_estimators':[80,100,120,140,150],
              'subsample': [0.8,0.9,1]}
grid = GridSearchCV(estimator=GradientBoostingClassifier(), param_grid=param_grid, scoring=acc_score, cv=5)
grid.fit(X_train.iloc[1:100,], y_train.iloc[1:100,])

Disadvantages

The disadvantage of this method is that we can miss good hyperparameter values not set in the beginning. For instance, we do not set an option for the max_features to be 0.85 or the learning_rate to be 0.05. We do not know if that combination can give better accuracy. To overcome this, we can try RandomizedSearchCV.

Hyperparameter Tuning With Bayesian Optimization

Hyperparameter tuning is a crucial step in machine learning where you find the best combination of settings (hyperparameters) for your model to achieve optimal performance. While there are different techniques for this, Bayesian optimization offers a more efficient and effective approach.

Advantages of Bayesian Optimization:

Reduced Evaluation Cost: By strategically choosing which hyperparameter combinations to evaluate, it saves time and resources compared to exhaustive search methods like grid search.
Efficient for Expensive Evaluations: When evaluating each hyperparameter combination is computationally expensive (e.g., training complex models), Bayesian optimization becomes even more valuable

Hyper Parameter-Tuning Using RandomizedSearchCV

Below is the code for that. Examine that the code sets a range of possible values for each hyperparameter. For example, the learning_rate can have any values from 0.01 to 1 distributed uniformly.

from sklearn.model_selection import RandomizedSearchCV
# RandomizedSearhCV
param_rand = {'max_depth':uniform(3,10),
              'max_features':uniform(0.8,1),
              'learning_rate':uniform(0.01,1),
              'n_estimators':uniform(80,150),
              'subsample':uniform(0.8,1)}
rand = RandomizedSearchCV(estimator=GradientBoostingClassifier(), param_distributions=param_rand, scoring=acc_score, cv=5)
rand.fit(X_train.iloc[1:100,], y_train.iloc[1:100,])

Problem With Uninformed Search

The problem with uninformed search is that it takes relatively a long time to build all the models. Informed search can solve this problem. In informed search, the previous models with a certain set of hyperparameter values can inform the later model which hyperparameter values better to select.

One of the methods to do this is coarse-to-fine. This involves running GridSearchCV or RandomizedSearchCV more than once. Each time, the hyperparameter value range is more specific.

For example, we start RandomizedSearchCV with learning_rate ranging from 0.01 to 1. Then, we find out that high accuracy models have their learning_rate around 0.1 to 0.3. Hence, we can run again GridSearchCV focusing on the learning_rate between 0.1 and 0.3. This process can continue until a satisfactory result is achieved. The first trial is coarse because the value range is large, from 0.01 to 1. The later trial is fine as the value range is focused on 0.1 to 0.3.

The drawback of the coarse-to-fine method is that we need to run the code repeatedly and observe the value range of hyperparameters-tuning. You might be thinking if there is a way to automate this. Yes, that is why my favorite is Bayesian Optimization.

Baysian Optimization

Bayesian Optimization also runs models many times with different sets of hyperparameter values, but it evaluates the past model information to select hyperparameter values to build the newer model. This is said to spend less time to reach the highest accuracy model than the previously discussed methods.

bayes_opt

As mentioned in the beginning, there are two packages in python that I usually use for Bayesian Optimization. The first one is bayes_opt. Here is the code to run it.

from bayes_opt import BayesianOptimization
# Gradient Boosting Machine
def gbm_cl_bo(max_depth, max_features, learning_rate, n_estimators, subsample):
    params_gbm = {}
    params_gbm['max_depth'] = round(max_depth)
    params_gbm['max_features'] = max_features
    params_gbm['learning_rate'] = learning_rate
    params_gbm['n_estimators'] = round(n_estimators)
    params_gbm['subsample'] = subsample
    scores = cross_val_score(GradientBoostingClassifier(random_state=123, **params_gbm),
                             X_train, y_train, scoring=acc_score, cv=5).mean()
    score = scores.mean()
    return score
# Run Bayesian Optimization
start = time.time()
params_gbm ={
    'max_depth':(3, 10),
    'max_features':(0.8, 1),
    'learning_rate':(0.01, 1),
    'n_estimators':(80, 150),
    'subsample': (0.8, 1)
}
gbm_bo = BayesianOptimization(gbm_cl_bo, params_gbm, random_state=111)
gbm_bo.maximize(init_points=20, n_iter=4)
print('It takes %s minutes' % ((time.time() - start)/60))

output:

|   iter    |  target   | learni... | max_depth | max_fe... | n_esti... | subsample |
-------------------------------------------------------------------------------------
|  1        |  0.7647   |  0.616    |  4.183    |  0.8872   |  133.8    |  0.8591   |
|  2        |  0.7711   |  0.1577   |  3.157    |  0.884    |  96.71    |  0.8675   |
|  3        |  0.7502   |  0.9908   |  4.664    |  0.8162   |  126.9    |  0.9242   |
|  4        |  0.7681   |  0.2815   |  6.264    |  0.8237   |  85.18    |  0.9802   |
|  5        |  0.7107   |  0.796    |  8.884    |  0.963    |  149.4    |  0.9155   |
|  6        |  0.7442   |  0.8156   |  5.949    |  0.8055   |  111.8    |  0.8211   |
|  7        |  0.7286   |  0.819    |  7.884    |  0.9131   |  99.2     |  0.9997   |
|  8        |  0.7687   |  0.1467   |  7.308    |  0.897    |  108.4    |  0.9456   |
|  9        |  0.7628   |  0.3296   |  5.804    |  0.8638   |  146.3    |  0.9837   |
|  10       |  0.7668   |  0.8157   |  3.239    |  0.9887   |  146.5    |  0.9613   |
|  11       |  0.7199   |  0.4865   |  9.767    |  0.8834   |  102.3    |  0.8033   |
|  12       |  0.7708   |  0.0478   |  3.372    |  0.8256   |  82.34    |  0.8453   |
|  13       |  0.7679   |  0.5485   |  4.25     |  0.8359   |  90.47    |  0.9366   |
|  14       |  0.7409   |  0.4743   |  8.378    |  0.9338   |  110.9    |  0.919    |
|  15       |  0.7216   |  0.467    |  9.743    |  0.8296   |  143.5    |  0.8996   |
|  16       |  0.7306   |  0.5966   |  7.793    |  0.8355   |  140.5    |  0.8964   |
|  17       |  0.772    |  0.07865  |  5.553    |  0.8723   |  113.0    |  0.8359   |
|  18       |  0.7589   |  0.1835   |  9.644    |  0.9311   |  89.45    |  0.9856   |
|  19       |  0.7662   |  0.8434   |  3.369    |  0.8407   |  141.1    |  0.9348   |
|  20       |  0.7566   |  0.3043   |  8.141    |  0.9237   |  94.73    |  0.9604   |
|  21       |  0.7683   |  0.02841  |  9.546    |  0.9055   |  140.5    |  0.8805   |
|  22       |  0.7717   |  0.05919  |  4.285    |  0.8093   |  92.7     |  0.9528   |
|  23       |  0.7676   |  0.1946   |  7.351    |  0.9804   |  108.3    |  0.929    |
|  24       |  0.7602   |  0.7131   |  5.307    |  0.8428   |  91.74    |  0.9193   |
=====================================================================================
It takes 20.90080655813217 minutes

params_gbm = gbm_bo.max['params']
params_gbm['max_depth'] = round(params_gbm['max_depth'])
params_gbm['n_estimators'] = round(params_gbm['n_estimators'])
params_gbm

Output:

{'learning_rate': 0.07864837617488214,
 'max_depth': 6,
 'max_features': 0.8723008386644597,
 'n_estimators': 113,
 'subsample': 0.8358969695415375}

The package bayes_opt takes 20 minutes to build 24 models. The best accuracy is 0.772.

hyperopt

Another package is hyperopt. Here is the code.

from hyperopt import hp, fmin, tpe
# Run Bayesian Optimization from hyperopt
start = time.time()
space_lr = {'max_depth': hp.randint('max_depth', 3, 10),
            'max_features': hp.uniform('max_features', 0.8, 1),
            'learning_rate': hp.uniform('learning_rate',0.01, 1),
            'n_estimators': hp.randint('n_estimators', 80,150),
            'subsample': hp.uniform('subsample',0.8, 1)}
def gbm_cl_bo2(params):
    params = {'max_depth': params['max_depth'],
              'max_features': params['max_features'],
              'learning_rate': params['learning_rate'],
              'n_estimators': params['n_estimators'],
              'subsample': params['subsample']}
    gbm_bo2 = GradientBoostingClassifier(random_state=111, **params)
    best_score = cross_val_score(gbm_bo2, X_train, y_train, scoring=acc_score, cv=5).mean()
    return 1 - best_score
gbm_best_param = fmin(fn=gbm_cl_bo2,
                space=space_lr,
                max_evals=24,
                rstate=np.random.RandomState(42),
                algo=tpe.suggest)
print('It takes %s minutes' % ((time.time() - start)/60))

Output:

100%|██████████| 24/24 [19:53<00:00, 49.74s/trial, best loss: 0.22769091027055077]
It takes 19.897333371639252 minutes

gbm_best_param

Output:

{'learning_rate': 0.03516615427790515,
 'max_depth': 6,
 'max_features': 0.8920776081423815,
 'n_estimators': 148,
 'subsample': 0.9981549036976672}

The package hyperopt takes 19.9 minutes to run 24 models. The best loss is 0.228. It means that the best accuracy is 1 – 0.228 = 0.772. The duration to run bayes_opt and hyperopt is almost the same.

The accuracy is also almost the same although the results of the best hyperparameters are different. But, there is another difference. The package bayes_opt shows the process of tuning the values of the hyperparameters. We can see which values are used for each iteration. The package hyperopt only shows one line of the progress bar, best loss, and duration.

In my opinion, I prefer bayes_opt because, in reality, we may feel the tuning process takes too long time and just want to terminate the process. After stopping the processing, we just want to take the best hyperparameter-tuning result. We can do that with bayes_opt, but not with hyperopt.

There are still other ways of automatic hyperparameter-tuning. Not only the hyperparameter-tuning, but choosing the Machine Learning algorithms also can be automated. I will discuss that next time. The above code is available here.

How Bayesian Optimization Works?

Imagine you’re searching for the best apple in a giant orchard. You don’t have a map, and some apples look greener from afar but turn out to be rotten. This is like tuning a machine learning model – you have various settings (hyperparameters) that affect the model’s performance, but evaluating each setting can be expensive or time-consuming.

Here’s how Bayesian Optimization helps you find the best apple (or hyperparameter value):

Initial Bite: You start by taking random bites (evaluating random hyperparameter combinations). This gives you a preliminary idea of where the good apples might be.
Learning from Tastes: After each bite, you don’t just know if the apple is good or bad, but also how good or bad it is compared to others. Bayesian Optimization uses this information to build a model of where the best apples are likely to be. This model is like a map of the orchard, but based on your taste tests so far.
Smart Tasting: Instead of random bites, you now use the map to guide your search. The model will suggest areas that are either:
- Promising but uncertain: Maybe you tasted a decent apple there, but there could be even better ones nearby. This is called exploration.
- Likely the best: The model might point to a region where you’ve tasted the best apples so far. This is called exploitation.
Refining the Map: With each new apple you taste, you update the model, making it more accurate. This iterative process continues until you’re confident you’ve found the tastiest apple (or the optimal hyperparameter setting).

Benefits of Bayesian Optimization:

Efficient: It focuses on areas likely to have the best results, saving time and resources.
Adapts to Complexity: Works well for problems where the relationship between hyperparameters and performance could be clearer.

Conclusion

Bayesian Optimization, a powerful method for hyperparameter-tuning, offers significant advantages over uninformed searches like GridSearchCV and RandomizedSearchCV. Leveraging tools like bayes_opt and hyperopt, it efficiently explores the parameter space. Despite some disadvantages, Bayesian Optimization remains a superior approach for optimizing model performance in machine learning.

About Author

Connect with me here.

The media shown in this article on Top Machine Learning Libraries in Julia are not owned by Analytics Vidhya and is used at the Author’s discretion.