OVO Tech Blog

Bayesian Optimisation for improving model performance

Introduction

Morley Coulson


Bayesian Optimisation for improving model performance

Posted by Morley Coulson on .
Featured

Bayesian Optimisation for improving model performance

Posted by Morley Coulson on .

Hello whoever is reading this, I'm here with a data science themed blog post for anyone interested in adding a new tool to their predictive modelling/ML toolkit. This post will focus on hyperparameter optimisation - these are the parameters of modelling techniques themselves that are specified by the user - for example the number of trees in a random forest or the the number of nodes in a neural network layer. Bayesian optimisation, then,  is one approach to tuning these hyperparameters in order to get the best performance out of your model. This isn't the only usage of Bayesian optimisation - it has also been applied by the Google Brain team to improve their chocolate chip cookies. While the full mathematical details are too much for this post, there are fortunately lots of libraries that have implemented Bayesian optimisation for us that can be easily used (with one example later in this post).

A machine learning workflow (a typical example illustrated below) involves many steps. It starts with data acquisition and cleaning the data, followed by a recurring loop of iteratively tweaking your data and model to improve its performance. Of these steps, the feature engineering to make your predictive variables as potent as possible is typically most impactful, along with acquiring more data if possible. However you reach a point where you’ve exhausted these options and tuning hyperparameters is the only remaining lever that can be moved to improve your model performance (and modern techniques have quite a number of these to tune).

Conventional strategies for hyperparameter tuning include things like grid search (bad, do not use) and random search, however these approaches become quite time consuming if your dataset or model complexity are large. Ideally you could adopt a more principled way of searching for the best hyperparameters, which is where Bayesian optimisation comes into play (there are other approaches that could be used, such as evolutionary algorithms)

Bayesian optimisation is a general optimisation technique founded on the idea that you have some generic black box function you want to optimise over a number of parameters, but you’re slightly scuppered in that a) it’s a black box function so you can only evaluate the parameters empirically and b) it’s expensive to query the black box function to find the output of the function for any given set of parameters. This is exactly the situation we have when optimising hyperparameters - in this case the black box function is the value of the loss or objective function from building a model with the hyperparameters as inputs:

How does it work?
Broadly, the approach works by approximating the black box function with a surrogate model, which you iteratively sample from using an acquisition function which proposes the next best set of hyperparameters to evaluate (see the sections at the end for detail into what the surrogate model and acquisition function actually are).

This is illustrated in the example below (gif from Wikipedia), which is using a Gaussian Process (GP) as the surrogate model. At each step the GP is updated based on the latest set of evaluated hyperparameters to better approximate your black box function (the approximation in this case is the blue line and shaded purple area representing uncertainty, with the true function (black line) shown for comparison). As more points are observed you can see that the GP more closely approximates the true function. Also shown below are the different acquisition functions that can be used to decide which point to evaluate next (the point that maximises the acquisition function is the suggested point to evaluate next).

How to use it?
Here at OVO we use a lot of python for data science, and fortunately there are a number of good python libraries that have implemented Bayesian optimisation for us already (if you’re interested in implementing from scratch, see this example). To work with most of them you have to specify the limits of your hyperparameters (e.g the number of trees in a random forest must be greater than 0) and create a function that takes as arguments the hyperparameters to optimise, and outputs a single number that evaluates those hyperparameters (e.g RMSE from a model built with them) - see for example the below code snippet using the BayesianOptimisation library:

from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

#General function for building a model
def rf_cv(X, y, **kwargs):
    estimator = RandomForestRegressor(**kwargs)
    cval = cross_val_score(estimator, X, y, scoring = 'neg_mean_squared_error', cv = 4, verbose = 2, n_jobs = -1)
    return cval.mean()


def bayesian_optimise_rf(X, y, n_iter = 100):
    def rf_crossval(n_estimators, max_features):
        #Wrapper of RandomForest cross validation.
        #Note the fixing of the inputs so they match the expected type
        #(e.g n_estimators must be an integer)
        return rf_cv(
            X = X,
            y = y,
            n_estimators = int(n_estimators),
            max_features = max(min(max_features, 0.999), 1e-3),
            bootstrap = True
        )
    
    optimizer = BayesianOptimization(
        f = rf_crossval,
        pbounds = {
            "n_estimators" : (10, 400),
            "max_features" : (0.1, 0.999),
        }
    )
    optimizer.maximize(n_iter = n_iter)
    print("Final result:", optimizer.max)

Here rf_crossval is the function we construct that builds the model, which we then simply plug into the BayesianOptimization function provided by the BayesianOptimisation library.

Does it work?
Below you can see the results from two relatively simple models I built recently:

  1. A forest model (left) - tuning 4 hyperparameters: number of trees, minimum terminal node size, proportion of variables evaluated for splitting & proportion of data sampled for each tree.
  2. A gradient boosted model (right) - tuning 6 hyperparameters: similar tree parameters to the random forest, but also now including a learning rate and a maximum tree depth.

As you can see, there was a lot of success with the random forest - a better result was obtained faster than using random search. However the GBM case is less clear cut - initially the Bayesian optimisation outperformed the random search (only just), but after a number of iterations the random search was able to find a better result. However this can be understood if we go back to the reason why you'd use Bayesian optimisation in the first place - it aims to find a good set of hyperparamaters with the smallest amount of tuning, but given enough time the results of a more brute force approach will be able to produce a comparable result as it will also be able to cover a large section of hyperparameter space. Another interpretation in this case would be that the random search got lucky (due to its random nature), and if we rerun the experiment we might get a different result.

Another trade-off is that Bayesian optimisation is a typically a sequential approach as the surrogate model must be updated after each iteration, while random search can be parallelised easily (however this can be addressed with a bit of modification to the acquisition function to allow several test sets to be suggested at once).

Closing remarks
As you can see here and in the examples linked above, Bayesian Optimisation outperformed the benchmark tuning method (random search) for the random forest, but not for the GBM model (although the results are close). Using Bayesian optimisation is relatively easy with modern implementations, so I recommend giving it a spin yourself.


What is the surrogate model?
In the case of Bayesian optimisation, the surrogate model is typically a Gaussian Process (the first half of this video is an excellent introduction to GPs, or this Distill article for a more detailed introduction - if you want to understand in more detail how this works I strongly recommend reading about Gaussian Processes first). Where parametric models attempt to learn the values/distributions of parameters in a model, Gaussian Processes are a non parametric technique which treat every single point in continuous space as a random variable (so they can be infinite dimensional) and attempt to learn a distribution over functions directly (the functions in this case being each of the random variables). This is Bayesian in the sense that you define some prior belief over the functions, this is in the form of assuming the random variables are Gaussian and defining a covariance (Kernel) function which controls ‘smoothness’ (ie how do each of the random variables relate to each other?). As you observe data (as you empirically evaluate certain sets of hyperparameters), you are able to combine the prior beliefs with these observations to update the posterior of the surrogate model at each iteration - in this case the posterior distribution tells us the predicted objective function value and uncertainty in that prediction at each point in hyperparameter space.

What is the acquisition function?
The acquisition function is a sampling function used to decide what point you should evaluate next based on your surrogate model. These borrow a lot from reinforcement learning, and trade off between exploration (evaluate areas of the surrogate model where there is large uncertainty) and exploitation (evaluate areas of the surrogate model where we think the value of the objective function will be high). There are 3 main acquisition functions used (see here for detail):

  1. Probability of improvement - this is a purely exploitative sampling function which suggests the set of hyperparameters which are predicted as most likely to be an improvement of your current best result.
  2. Expected improvement - this is similar to probability of improvement above, but rather than just taking the most likely improvement, it also factors in the magnitude of potential improvement. This is the most widely used acquisition function.
  3. Upper confidence bound - this acquisition function encourages exploration and is defined as the prediction plus the uncertainty at that point multiplied by some user specified tuning parameter.

Morley Coulson

View Comments...