(A cautionary tale, illustrated via a toy example. If you do Bayesian hyperparameter tuning, you will want to know about this. Here’s a notebook with all of my code.)

Here are two rotations of a function.

[Figure: the same function plotted under two parameterizations, Parameterization 1 and Parameterization 2, one a rotation of the other.]

We are going to examine how well a Gaussian Process (GP) models this function.

I have provided example axis labels, but feel free to substitute your own. In my example scenario, we are training a neural network, first running it for some number of epochs at a high learning rate, then running the remaining epochs at a low learning rate. We configure this training regime using two hyperparameters. The resulting accuracy is plotted in color. Both charts show that having more epochs is better, and that ideally we will drop the learning rate for only the final epoch. Having more than one low-learning-rate epoch leads to overfitting, shown as a reduction in accuracy left of the diagonal in Parameterization 1, or where “# of epochs at low learning rate” is 2 or more in Parameterization 2.

Of course, we don’t have easy access to this function. Evaluating it is expensive. So we try to model the function from a few observations so that we can predict which hyperparameters are worth testing. Below, a few observations of the function are shown, and the predictions of a GP for the remaining values are plotted on the right. I’ve fit this GP using the default configuration of Ax, a popular open-source library for Bayesian Optimization.
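For reference, a setup like this looks roughly as follows with Ax’s Service API. This is a minimal sketch, not my notebook’s actual code: the experiment name, hyperparameter names, bounds, and the `train_and_evaluate` stub are all invented for illustration, and the exact API surface varies a bit across Ax versions.

```python
from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties


def train_and_evaluate(params):
    # Stand-in for the real (expensive) training run, so the sketch runs.
    return 0.5 + 0.01 * params["total_epochs"]


# With the default configuration, Ax starts with quasi-random (Sobol)
# trials, then switches to a GP-based Bayesian optimization model.
ax_client = AxClient()
ax_client.create_experiment(
    name="lr_schedule_toy",
    parameters=[
        {"name": "total_epochs", "type": "range", "bounds": [1, 20]},
        {"name": "low_lr_epochs", "type": "range", "bounds": [0, 19]},
    ],
    objectives={"accuracy": ObjectiveProperties(minimize=False)},
)

for _ in range(15):
    params, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(
        trial_index=trial_index, raw_data=train_and_evaluate(params)
    )
```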

[Figure: for Parameterization 1, a few observed points and the GP’s predictions (blue) for the remaining values, alongside the actual function (orange).]

That didn’t turn out well. The blue predictions are pretty far removed from the actual function, shown in orange. Now let’s try again with the rotated version of the function.

[Figure: the same observations and GP predictions, this time for Parameterization 2, the rotated version of the function.]

Much better. As you can see, the parameterization matters a lot. Even just rotating the parameterization can make a big difference in whether the Gaussian Process accurately models the function. This may come as a surprise, since many optimization algorithms (e.g. gradient descent) are rotation-invariant, and multivariate Gaussians are known for having no preferred orientation. Why are Gaussian Processes different?

When you fit a GP to data, the model is not learning how the axes (hyperparameters) covary with the scalar output (accuracy). Instead, it is learning how accuracies for points in hyperparameter space covary with each other, and it typically does this by learning a metric for distances between points. Using this metric, it predicts values using a method similar to a weighted average. Here are some example distances from the metric that the GP would ideally learn, denoted “Near” and “Far”:

[Figure: example pairs of points in hyperparameter space labeled “Near” and “Far” under the metric the GP would ideally learn.]
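To make the weighted-average point concrete, here is a minimal numpy sketch of the standard GP posterior mean, with a per-axis-lengthscale (ARD) RBF kernel standing in for whatever kernel was actually fit:

```python
import numpy as np

def rbf_ard(A, B, lengthscales):
    # Squared-exponential kernel with one lengthscale per axis:
    # similarity decays with per-axis-normalized squared distance.
    diff = (A[:, None, :] - B[None, :, :]) / lengthscales
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_posterior_mean(X_train, y_train, X_test, lengthscales, noise=1e-6):
    K = rbf_ard(X_train, X_train, lengthscales) + noise * np.eye(len(X_train))
    # Each row of `weights` mixes the observed values for one test point;
    # observations that are "near" under the learned metric get the most weight.
    weights = rbf_ard(X_test, X_train, lengthscales) @ np.linalg.inv(K)
    return weights @ y_train
```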

Typically, fitting the GP to the data involves choosing a lengthscale for each axis, normalizing distance per axis. The problem with Parameterization 1 is that the most useful directions in hyperparameter space are not axis-aligned; they are diagonal. In Parameterization 1, changing either hyperparameter in isolation leads to a big change in the scalar output. In Parameterization 2, one of the hyperparameters causes very small changes, while the other causes large changes. A GP whose covariance kernel uses only lengthscales can fit Parameterization 2, but it can’t effectively fit Parameterization 1: it is unable to learn that the scalar output changes slowly when both hyperparameters are changed together.
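You can check this limitation directly: under a diagonal metric, no choice of lengthscales can make a diagonal step register as “near” while a same-sized axis-aligned step registers as “far”, because the diagonal step’s squared distance always includes the axis-aligned displacement plus an extra term. A quick sketch:

```python
import numpy as np

def sq_distance(x, y, lengthscales):
    # The squared distance an ARD kernel uses: per-axis scaling only.
    d = (np.asarray(x, float) - np.asarray(y, float)) / lengthscales
    return float(d @ d)

ls = np.array([1.0, 1.0])  # any positive lengthscales give the same ordering
print(sq_distance([3, 3], [5, 5], ls))  # 8.0: diagonal move, output barely changes
print(sq_distance([3, 3], [3, 5], ls))  # 4.0: axis-aligned move, output changes a lot
# The pair that should be "near" always measures as farther.
```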

This example is representative of a larger class of examples. You can imagine many other groups of hyperparameters that ought to change together: learning rate and batch size, batch size and number of epochs, number of channels in one layer and in another layer… the list goes on. Each of these indicates useful directions in hyperparameter space that aren’t axis-aligned.

A straightforward solution would be to use a more powerful covariance kernel that is capable of learning arbitrary linear transformations, rather than just the diagonal matrix of lengthscales. It would then be able to discover a useful set of axes, rather than just scaling the provided axes. Would this be a good default strategy, or are there downsides? I’m not sure. It would be great if fitting a GP to a set of experiment results would automatically find useful directions in hyperparameter space, rather than assuming that you know the useful directions a priori.
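Here is a sketch of the idea in GPyTorch (the library underlying Ax’s models). This is an illustration of the concept, not Ax’s default model: a full matrix `W`, learned jointly with the usual kernel hyperparameters, transforms the inputs before a standard RBF kernel, so the fitted metric can pick out diagonal directions. Constraining `W` to be diagonal would recover ordinary per-axis lengthscales.

```python
import torch
import gpytorch

class LinearTransformGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        dim = train_x.shape[-1]
        # Full input transform, initialized to the identity. Maximizing
        # the marginal likelihood (e.g. with ExactMarginalLogLikelihood)
        # optimizes W alongside the kernel's other hyperparameters.
        self.W = torch.nn.Parameter(torch.eye(dim))
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        z = x @ self.W.t()  # learned rotation/scaling of hyperparameter space
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z)
        )
```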

Regardless, a top-level takeaway here is that you should not just blindly trust your GP. Only trust its suggestions if it performs well on cross-validation.
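Ax ships leave-one-out cross-validation for the fitted model. A minimal sketch of checking the fit, continuing from the `ax_client` above (module paths are from Ax’s modelbridge layer and may differ between versions):

```python
from ax.modelbridge.cross_validation import compute_diagnostics, cross_validate

# The current model (a GP once Ax has moved past its Sobol phase).
model = ax_client.generation_strategy.model
cv_results = cross_validate(model)       # refit with observations held out
print(compute_diagnostics(cv_results))   # fit metrics, e.g. how well the
# held-out predictions correlate with the actual observations
```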