Here is a toy function. (To see the code and more plots, check out this notebook.)

A plot showing a set of dots making an arch shape, with some stochastic drop-offs toward the edges of the arch, and a very high orange point among these drop-offs.

Figure 1: 80 random observations of a deterministic function (black) and the predicted maximal point in that function (orange), according to a Gaussian process trained on those 80 observations.

Intuitively, it seems clear that this function’s highest value probably occurs when x is in the center region. But a Gaussian Process (GP) thinks the highest value is out in a more mediocre region. This isn’t just an explore-exploit trade-off; the GP thinks the expected value is high at that orange point – higher than any value the GP has ever seen before!

I ran into this issue while exploring a real loss function, tuning hyperparameters with Bayesian Optimization. I found that my GP kept insisting there would be excellent results at unpromising points like the one above. The model sent me through hundreds of iterations of whack-a-mole, so I investigated and came up with this toy example that makes the problem obvious.

This issue occurs because:

  1. Gaussian Processes extrapolate, and they do it over an inferred distance, or “lengthscale”.
  2. Sudden drop-offs in the function cause the GP to choose a low lengthscale. (This issue is most pronounced when testing deterministic functions. With noisy functions, these sudden drop-offs can be characterized as noise.)

So the GP makes local extrapolations that appear goofy given more global context.
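To make this concrete, here is a minimal sketch of the setup in Python. The actual function and GP configuration live in the notebook; the arch-with-drop-offs function, the scikit-learn model, and every parameter value below are stand-ins I made up for illustration:

```python
# A made-up toy function: a smooth arch with sudden deterministic drop-offs
# toward the edges, observed at 80 random points.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

rng = np.random.default_rng(0)

def toy_function(x):
    arch = 1.0 - x**2                                        # peak at x = 0
    dropoff = (np.abs(x) > 0.6) & (np.sin(37.0 * x) > 0.3)   # deterministic spikes
    return np.where(dropoff, arch - 0.8, arch)

X = rng.uniform(-1.0, 1.0, size=(80, 1))
y = toy_function(X[:, 0])

gp = GaussianProcessRegressor(kernel=ConstantKernel() * Matern(nu=2.5),
                              alpha=1e-6, normalize_y=True).fit(X, y)

# Where does the GP think the maximum is, and how high does it think it goes?
X_grid = np.linspace(-1.0, 1.0, 1001).reshape(-1, 1)
mean = gp.predict(X_grid)
print(f"predicted max: f({X_grid[np.argmax(mean), 0]:.3f}) ~ {mean.max():.3f}; "
      f"best observed value: {y.max():.3f}")
```

Depending on the random draw, you may see the predicted maximum land next to one of the drop-offs rather than at the center of the arch, with a predicted mean above anything actually observed.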

Intuition: Why Gaussian Processes extrapolate

Consider a simple scenario with two observations and one prediction.

Three random variables along an axis. A and B are black dots increasing upward. C has no observation, just an orange rectangle with a question mark.

Figure 2: If you were a Gaussian process, what would you predict? You are given three random variables, A, B, and C, and the observed values for the first two. What is your prediction for C?

Gaussian Processes work by modeling this as a 3D multivariate Gaussian, using distance to determine correlation between variables. This clever trick transports us from thinking about functions over arbitrary spaces to thinking about Gaussian distributions with a finite number of variables.
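As a tiny, made-up illustration of “distance determines correlation”: placing A, B, and C on an axis and running the pairwise distances through a squared-exponential kernel gives the 3-by-3 covariance matrix of that multivariate Gaussian.

```python
import numpy as np

def sq_exp(x1, x2, lengthscale=1.0):
    # Squared-exponential kernel: correlation decays with squared distance.
    return np.exp(-0.5 * (x1 - x2) ** 2 / lengthscale**2)

positions = {"A": 0.0, "B": 1.0, "C": 2.0}   # A and B close, C farther from A
names = list(positions)
Sigma = np.array([[sq_exp(positions[i], positions[j]) for j in names]
                  for i in names])
print(Sigma)   # cov(A,B) and cov(B,C) are high (~0.61), cov(A,C) is lower (~0.14)
```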

Imagine another multivariate Gaussian that matches this correlation structure. Consider the performance of athletes in three different sports: Swimming (A), Cycling (B), and Running (C). Suppose correlations match distances in the chart above, with cov(A, B) = high, cov(B, C) = high, cov(A, C) = medium-to-low. In other words, elite swimmers are usually good cyclists but aren’t always great at running. Elite cyclists are usually good swimmers and runners. Elite runners are usually good cyclists but might not be great at swimming. A large set of unobserved latent causal factors underlie A, B, and C, and we don’t attempt to model these explicitly.

Suppose an athlete is a good cyclist. If we are given the extra context that she is a bad swimmer, does that increase the probability that she is a good runner? It depends on the actual covariance values, but there are certainly possible realities where the answer is yes. She lacks the latent causal factors that make a good swimmer, and those same factors tend to make a good cyclist; yet she is a good cyclist anyway, so we expect her to have an extra share of the latent causal factors that make a good runner, to compensate.

If B is a medium value and A is low, we expect C to compensate for A. That's why Gaussian Processes extrapolate. (You can also show this by inverting a covariance matrix, but I like this playful explanation. See the appendix for something more rigorous.)
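Here is that compensation effect computed directly, with made-up covariance numbers matching the “high / high / medium-to-low” structure above (the formula is just the conditional mean of a zero-mean Gaussian; see the appendix):

```python
import numpy as np

#                     A    B    C
Sigma = np.array([[1.0, 0.8, 0.3],
                  [0.8, 1.0, 0.8],
                  [0.3, 0.8, 1.0]])
S_obs = Sigma[:2, :2]      # covariance of the observed pair (A, B)
S_cross = Sigma[2, :2]     # covariance of C with (A, B)

def predict_C(a, b):
    # E[C | A=a, B=b] for a zero-mean Gaussian: Sigma_C,obs @ inv(Sigma_obs) @ [a, b]
    return S_cross @ np.linalg.solve(S_obs, np.array([a, b]))

print(predict_C(a=1.0, b=1.0))    # good swimmer, good cyclist -> C ~ 0.61
print(predict_C(a=-1.0, b=1.0))   # bad swimmer, good cyclist  -> C ~ 2.5
```

Lowering a while holding b fixed raises the prediction for C, which is exactly the compensation described above.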

This is a good thing, usually.

Why extrapolation sometimes causes goofy predictions

Let’s zoom in on the section with the goofy prediction.

The same plot as the first figure, with an indicator overlaying it showing the region we are zooming into: the space around the orange point. Below this top plot is the zoomed plot, which shows only five points: four black points and the high orange point. Together, they make a bit of an arch shape.

Figure 3: Zooming into the previous chart, we can see that the high prediction can be viewed locally as an extrapolation. Moreover, it is extrapolating from both left and right, which explains why the predicted value is extra high.

The GP is essentially making predictions using only local data. It does this because fitting to this dataset pushed the GP to treat everything extremely locally, i.e. to select a very small “lengthscale”. It chose a small lengthscale because the function contains discrete drop-off points: tiny changes in x sometimes yield large changes in f(x), and the GP is trying to accommodate this.
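A quick way to see this (again with assumed toy data, not the notebook’s) is to fit the same GP to a smooth arch and to the same arch with drop-offs added, then compare the learned lengthscales:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(80, 1))

smooth = 1.0 - X[:, 0] ** 2
dropoffs = (np.abs(X[:, 0]) > 0.6) & (np.sin(37.0 * X[:, 0]) > 0.3)
with_dropoffs = np.where(dropoffs, smooth - 0.8, smooth)

for name, y in [("smooth arch", smooth), ("arch with drop-offs", with_dropoffs)]:
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True).fit(X, y)
    print(f"{name}: {gp.kernel_}")   # compare the learned length_scale values
```

The drop-offs typically pull the fitted length_scale far below what the smooth version selects.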

This leads to something that is undesirable in hyperparameter tuning: if you ever discover a sudden drop-off in performance from mediocre results to very bad results, the GP gets excited that maybe there are very good results immediately on the opposite side of the mediocre ones. In my experience, it ends up spending nearly all of its time testing these crazy theories.

What does this mean? What can I do?

I’m not yet sure of the best way to handle this.

One solution is to switch to a Matern kernel with \(\nu = 0.5\), rather than the often-used value \(\nu = 2.5\). This specifies that the underlying function is not differentiable, allowing for sharp changes of direction in the predictions.
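In scikit-learn terms (an assumption on my part; the post doesn’t say which GP library was used), the change is just a kernel parameter:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

X = np.array([[0.0], [0.1], [0.2], [0.3], [0.4]])
y = np.array([0.5, 0.6, -0.3, 0.7, 0.8])       # a sharp dip at x = 0.2

# nu=0.5 is the exponential (Ornstein-Uhlenbeck) kernel: sample paths are
# continuous but not differentiable, so the posterior mean can bend sharply.
gp_rough = GaussianProcessRegressor(kernel=Matern(nu=0.5),
                                    normalize_y=True).fit(X, y)
print(gp_rough.predict(np.array([[0.25]]), return_std=True))
```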

A plot where the predictions are much more closely aligned with the observations.

Figure 4: Modifying the Matern kernel allows for sharp changes in the prediction’s slope.

Here’s a nice blog post on that subject. However, making this change runs the risk of hurting predictions – after all, these sudden drop-offs only occur in some parts of the space, and often those parts of the space are irrelevant.

It might be worth adopting the following view: a deterministic function with sudden discrete changes in output is bad news for a Gaussian Process. A broad solution is to figure out how to get rid of those changes.

You could do this using clamping / winsorizing tricks on your data; imagine taking the whole bottom row of points from Figure 1 and clamping them to be equal to the middle row.
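A sketch of that clamping idea (the 25th-percentile floor here is an arbitrary choice for illustration):

```python
import numpy as np

y = np.array([0.90, 0.85, 0.80, 0.10, 0.75, 0.05, 0.80])   # two sudden drop-offs
floor = np.quantile(y, 0.25)         # pick a floor; anything below is "just bad"
y_clamped = np.clip(y, floor, None)  # drop-offs are raised to the floor
print(floor, y_clamped)
```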

Alternatively, you could change your function so that it is not deterministic, for example by randomly initializing parameters or introducing something analogous to stochastic gradient descent. Then, presumably, these drop-offs will occur randomly across a whole range rather than at specific points, and the GP will capture this as observation noise. Moving away from a deterministic function has other benefits anyway, e.g. greatly reducing the risk of overfitting your hyperparameters to the validation set.
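If the evaluations do become noisy, the GP needs somewhere to put that noise. In scikit-learn (again, an assumed implementation), that can be a `WhiteKernel` term added to the kernel, or a larger `alpha`, so that scattered drop-offs get absorbed as observation noise instead of forcing a tiny lengthscale:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# The noise level is fitted alongside the lengthscale during gp.fit(X, y).
kernel = Matern(nu=2.5) + WhiteKernel(noise_level=1e-2)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
```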

One takeaway is that you really need to pay attention to what your Gaussian Process is doing. It will not always automatically do what you consider intuitive.

Appendix

In case you’re interested, here are the GP predictions at other points, with standard deviations:

The same plot as Figure 1, now with a continuous prediction line and standard deviations.

To get more rigorous with the simple A, B, C example, suppose we assign constant prior mean 0, as is common in Gaussian Processes. Denoting \(\Sigma_{XY} := \operatorname{cov}(X,Y)\), the expected value for C is:

\[E\left[C \mid A=a, B=b\right] = \frac{a(\Sigma_{AC}\Sigma_{BB} - \Sigma_{AB}\Sigma_{BC}) + b(\Sigma_{BC}\Sigma_{AA} - \Sigma_{BA}\Sigma_{AC})}{\Sigma_{AA}\Sigma_{BB} - \Sigma_{AB}^2}\]

This can be derived from these equations, specifically the one for \(\boldsymbol \mu_*\).

It is difficult to understand this function at a glance, but it is interesting to simply ask, “What happens as \(a\) increases?” We see that if the correlative chain from \(A \to B \to C\) is strong, but the direct correlation between \(A\) and \(C\) is weak, i.e. \(\Sigma_{AB}\Sigma_{BC} > \Sigma_{AC}\Sigma_{BB}\), then increasing \(a\) will decrease \(E[C \mid ...]\). Thus, reducing \(a\) in isolation will increase \(E[C \mid ...]\), which is an example of extrapolation.
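Plugging the same made-up covariances as in the earlier snippet into this closed form reproduces the compensation effect:

```python
S_AA, S_BB, S_AB, S_AC, S_BC = 1.0, 1.0, 0.8, 0.3, 0.8  # chain strong, direct link weak

def E_C(a, b):
    # The closed-form conditional mean written out above.
    det = S_AA * S_BB - S_AB**2
    return (a * (S_AC * S_BB - S_AB * S_BC) + b * (S_BC * S_AA - S_AB * S_AC)) / det

print(E_C(a=1.0, b=1.0), E_C(a=-1.0, b=1.0))   # lowering a raises E[C | A, B]
```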

(This work was supported by ML Collective via their donated GCP compute resources which have been super helpful. Thanks, also, to Rosanne Liu for useful feedback on early drafts of this post.)