Marcus Lewis

Extended material for 'Expressions are Pragmatic Model Visualizations'

Wed, 10 Jan 2024 01:00:00 -0800

This is the extended material for Expressions are Pragmatic Model Visualizations.

Followup on Example 1

Let’s further loosen the prior on the parameters.

Model Visualization

Gaussian Process with the following kernel.

Mean: constant

Covariance: Use distance between points as follows:

 * sum([
   
   # Kernel: Factorized scalar vs choice parameters
   * sum([
     
     # Scalar parameters
     * matern_25(
      norm_l2([
         compare('log_epochs') / ,
         compare('log_batch_size') / ,
         compare('log_conv1_weight_decay') / ,
         compare('log_conv2_weight_decay') / ,
         compare('log_conv3_weight_decay') / ,
         compare('log_dense1_weight_decay') / ,
         compare('log_dense2_weight_decay') / ,
         compare('log_1cycle_initial_lr_pct') / ,
         compare('log_1cycle_final_lr_pct') / ,
         compare('log_1cycle_pct_warmup') / ,
         compare('log_1cycle_max_lr') / ,
         compare('log_1cycle_momentum_max_damping_factor') / ,
         compare('log_1cycle_momentum_min_damping_factor_pct') / ,
         compare('log_1cycle_beta1_max_damping_factor') / ,
         compare('log_1cycle_beta1_min_damping_factor_pct') / ,
         compare('log_beta2_damping_factor') / ,
         compare('log_conv1_channels') / ,
         compare('log_conv2_channels') / ,
         compare('log_conv3_channels') / ,
         compare('log_dense1_units') / ])),
     
     # Choice parameters
     * exp(
      -norm_l1([
         compare('choice_nhot0') / ,
         compare('choice_nhot1') / ,
         compare('choice_nhot2') / ,
         compare('choice_nhot3') / ]))]),
   
   # Kernel: Joint scalar and choice parameters
   * prod([
     matern_25(
      norm_l2([
         compare('log_epochs') / ,
         compare('log_batch_size') / ,
         compare('log_conv1_weight_decay') / ,
         compare('log_conv2_weight_decay') / ,
         compare('log_conv3_weight_decay') / ,
         compare('log_dense1_weight_decay') / ,
         compare('log_dense2_weight_decay') / ,
         compare('log_1cycle_initial_lr_pct') / ,
         compare('log_1cycle_final_lr_pct') / ,
         compare('log_1cycle_pct_warmup') / ,
         compare('log_1cycle_max_lr') / ,
         compare('log_1cycle_momentum_max_damping_factor') / ,
         compare('log_1cycle_momentum_min_damping_factor_pct') / ,
         compare('log_1cycle_beta1_max_damping_factor') / ,
         compare('log_1cycle_beta1_min_damping_factor_pct') / ,
         compare('log_beta2_damping_factor') / ,
         compare('log_conv1_channels') / ,
         compare('log_conv2_channels') / ,
         compare('log_conv3_channels') / ,
         compare('log_dense1_units') / ])),
     exp(
      -norm_l1([
         compare('choice_nhot0') / ,
         compare('choice_nhot1') / ,
         compare('choice_nhot2') / ,
         compare('choice_nhot3') / ]))])])

When comparing a point to itself, add noise value: (log scale)

Visualization 3: Model parameters now that the priors on the "lengthscales" have been loosened even further.

Compared to Visualization 2 in the original post, the parameters are more free. The discrete change in behavior as the dataset grows has been further reduced. Parameters are now reaching higher absolute values than before.

How were results impacted?

Chart 1: Hold-one-out cross-validation results, now adding a third configuration.

There is a significant benefit for very small datasets. However, for most dataset sizes, this leads to worse results. It seems that having priors is important, especially as the dataset gets larger.

Followup on Example 2

Here’s a plot of the training run, using batch training again. This time I overlay the mean of all of the models’ losses.

Chart 2: Each of the 60 models’ negative losses after each step of training.

We find something even more interesting. Yes, the vast majority of the models converge hundreds of steps before the final model converges. But we also see that individual models often get worse on single training steps; only the average of all scores is monotonically improving. This may seem normal if you are accustomed to optimization in neural networks, but BoTorch’s optimization (scipy.optimize with L-BFGS-B) is only ever supposed to take a step if it improves the model.

BoTorch trains multiple models in parallel by adding all of their loss functions to create a single scalar loss. This type of composability is a valid strategy with optimizers like SGD or Adam, since those follow the gradient wherever it leads. But other optimizers, even if they are gradient-based, decide whether to accept a parameter update depending on the loss at the destination point. For these optimizers, you can’t simply add loss functions, or you will get a change in the optimizer’s behavior. After adding losses, only the sum of those losses is guaranteed to improve monotonically, while updates can harm the individual losses. For some model types like neural networks, we might describe this as “regularization” and treat it as a good thing, but in those cases we should just use a different optimizer.
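To make this concrete, here is a minimal sketch of the effect using plain Python. It is not BoTorch's or SciPy's code: a simple backtracking gradient descent stands in for L-BFGS-B, and the two quadratic "models", their curvatures, and their starting points are all hypothetical. The only thing it shares with the real setup is the acceptance rule, which checks the summed loss and nothing else.

```python
def loss(k, x):
    """Quadratic loss 0.5 * k * x**2 for a single one-parameter model."""
    return 0.5 * k * x * x


def joint_step(ks, xs):
    """One gradient step on the summed loss with a shared backtracking step size.

    The step is accepted as soon as the *sum* of losses decreases, which is
    the only quantity this optimizer ever checks.
    """
    grads = [k * x for k, x in zip(ks, xs)]
    total = sum(loss(k, x) for k, x in zip(ks, xs))
    t = 1.0
    while t > 1e-12:
        candidate = [x - t * g for x, g in zip(xs, grads)]
        if sum(loss(k, x) for k, x in zip(ks, candidate)) < total:
            return candidate
        t *= 0.5  # backtrack: halve the step and try again
    return xs  # no improving step found


ks = [100.0, 1.0]  # hypothetical curvatures: one stiff model, one flat model
xs = [0.1, 10.0]   # hypothetical starting parameters

history = [[loss(k, x) for k, x in zip(ks, xs)]]
for _ in range(20):
    xs = joint_step(ks, xs)
    history.append([loss(k, x) for k, x in zip(ks, xs)])

totals = [sum(step) for step in history]
model0 = [step[0] for step in history]

# The sum never increases, yet the first model's own loss does.
sum_is_monotone = all(b <= a for a, b in zip(totals, totals[1:]))
model0_got_worse = any(b > a for a, b in zip(model0, model0[1:]))
print(sum_is_monotone, model0_got_worse)
```

On the very first step the accepted update lowers the summed loss while raising the stiff model's loss, because the step size that suits the flat model overshoots the stiff one.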

So, not only are we evaluating converged models unnecessarily; we are taking indirect paths to the optimum. When I count model evaluations, 18,817 total evaluations happen when training sequentially, while 92,820 happen when training in parallel, so we are doing approximately 5 times too many operations. In my experiments, on GPUs it is still worth training in batch rather than sequentially, but on CPUs it is better to simply loop over models and optimize them independently.

The effect of all of this is that cross-validation in BoTorch is much slower than it needs to be. As you increase the size of the dataset, the following slowdowns occur:

1. More models are trained.
2. Each of those models is more expensive, since it has a larger dataset.
3. The optimization trajectory becomes longer and longer as more models are trained in parallel. (Unnecessary)
4. More pointless evaluations of converged models occur, due to the increased likelihood of a single random slow-to-converge model. (Unnecessary)

These four factors multiply to create a slow experience.

I am inclined to implement batch training differently, maybe by implementing a single training run and then using something like JAX’s vmap. This would eliminate factor 3, and maybe it could be used in conjunction with JAX’s while_loop to also solve factor 4.

(That’s it for the appendix! Here’s a link back to the main post.)

Expressions are Pragmatic Model Visualizations

Wed, 10 Jan 2024 01:00:00 -0800

Most machine learning models are never visualized. Visualizing a model and its parameters often leads to immediate insights or bugfixes, but getting a good visual requires a lot of one-off work.

How do we get useful visualizations without requiring too much human overhead? I think code is underrated as a visualization. In this blog post I show a family of pragmatic visualizations that are each created by simply printing code expressions and rendering their parameters inline. Often, the printed code is not runnable, but is instead a visually optimized version of the model’s code.

I start with two real-life examples where these visualizations provided valuable insights, using these examples to demonstrate the visualizations. Then I turn to my main focus: eliminating the one-off work. I share rows2prose, a Python library that generates these visuals from a dataframe of model parameters and any styled text (not just code). This leaves only the problems of tracing model parameters to a dataframe and printing visually optimized model code. I show how scientific computing frameworks can help with this, in this case using a Lisp-macro-like approach which I demonstrate with Vexpr.

Example 1: The Model That Was Good, Then Bad, Then Good

Last year I ran a machine learning experiment and noticed that the baseline model from BoTorch had a weird curve.

Chart 1: Scaling curve for a Gaussian Process performing hold-one-out cross-validation. Given a dataset, the model is trained on all but one point in the dataset, then it predicts the output for the held-out point. I repeat the experiment on 50 different random subsets of a larger dataset and plot the 10th percentile, 90th percentile, and geometric mean.

The model is a Gaussian Process, but that detail really doesn’t matter for this blog post. This is a strange scaling curve for any machine learning model. We would expect a model to continually get better at prediction as it receives larger datasets, with diminishing returns only at the end. Instead, this one has a big lull in the middle, giving it an almost staircase shape. What could this mean? (Consider pausing here and guessing the reason. I had a guess, and my guess turned out to be wrong.)

An obvious next step is to look at the model itself, not just the model’s results. How do the final trained model parameters change as we scale up the size of the dataset? I visualize each individual parameter in diagrams like this:

The vertical axis separates different experiments (in this case, different dataset sizes) from top to bottom. The horizontal axis shows parameter values across different repetitions of that experiment. To visualize the entire model, I print a simplified version of the model’s code, rendering each parameter in-place using one of these diagrams for each parameter.

Here’s what the model looked like for each point in Chart 1. Feel free to zoom in, but don’t get bogged down in the details, just observe that each parameter had interesting changes about halfway down, coinciding with the interesting changes from the experiment.

Model Visualization

Predict using a Gaussian Process with the following covariance and mean.

Covariance kernel: Use distance between points as follows:

 * sum([
   
   # Kernel: Factorized scalar vs choice parameters
   * sum([
     
     # Scalar parameters
     * matern_25(
      norm_l2([
         compare('log_epochs') / ,
         compare('log_batch_size') / ,
         compare('log_conv1_weight_decay') / ,
         compare('log_conv2_weight_decay') / ,
         compare('log_conv3_weight_decay') / ,
         compare('log_dense1_weight_decay') / ,
         compare('log_dense2_weight_decay') / ,
         compare('log_1cycle_initial_lr_pct') / ,
         compare('log_1cycle_final_lr_pct') / ,
         compare('log_1cycle_pct_warmup') / ,
         compare('log_1cycle_max_lr') / ,
         compare('log_1cycle_momentum_max_damping_factor') / ,
         compare('log_1cycle_momentum_min_damping_factor_pct') / ,
         compare('log_1cycle_beta1_max_damping_factor') / ,
         compare('log_1cycle_beta1_min_damping_factor_pct') / ,
         compare('log_beta2_damping_factor') / ,
         compare('log_conv1_channels') / ,
         compare('log_conv2_channels') / ,
         compare('log_conv3_channels') / ,
         compare('log_dense1_units') / ])),
     
     # Choice parameters
     * exp(
      -norm_l1([
         compare('choice_nhot0') / ,
         compare('choice_nhot1') / ,
         compare('choice_nhot2') / ,
         compare('choice_nhot3') / ]))]),
   
   # Kernel: Joint scalar and choice parameters
   * prod([
     matern_25(
      norm_l2([
         compare('log_epochs') / ,
         compare('log_batch_size') / ,
         compare('log_conv1_weight_decay') / ,
         compare('log_conv2_weight_decay') / ,
         compare('log_conv3_weight_decay') / ,
         compare('log_dense1_weight_decay') / ,
         compare('log_dense2_weight_decay') / ,
         compare('log_1cycle_initial_lr_pct') / ,
         compare('log_1cycle_final_lr_pct') / ,
         compare('log_1cycle_pct_warmup') / ,
         compare('log_1cycle_max_lr') / ,
         compare('log_1cycle_momentum_max_damping_factor') / ,
         compare('log_1cycle_momentum_min_damping_factor_pct') / ,
         compare('log_1cycle_beta1_max_damping_factor') / ,
         compare('log_1cycle_beta1_min_damping_factor_pct') / ,
         compare('log_beta2_damping_factor') / ,
         compare('log_conv1_channels') / ,
         compare('log_conv2_channels') / ,
         compare('log_conv3_channels') / ,
         compare('log_dense1_units') / ])),
     exp(
      -norm_l1([
         compare('choice_nhot0') / ,
         compare('choice_nhot1') / ,
         compare('choice_nhot2') / ,
         compare('choice_nhot3') / ]))])])

When comparing a point to itself, add noise value: (log scale)

Mean: constant

Visualization 1: All of the parameters of a Gaussian Process model rendered in context. The covariance kernel at the top contains a few parameter types: a multiplicative positive scale (top-left), four multiplicative mixing weights (the other four parameters along the left side), while the rest are "lengthscale" parameters that are used to scale distances. Additionally there are noise and mean parameters (bottom). The noise parameter always ends up being very low, due to the default prior pushing it toward zero, and because all the points in the dataset are spaced apart by pseudorandom Sobol generation so the model is never forced to incorporate variance at a single location.

The sudden jump in accuracy in Chart 1 corresponds to a number of jumps in the parameters. Perhaps the most dramatic change was in the top half of the covariance kernel, where we see that a number of parameters stay fixed at about 0.3, until they suddenly jump up to large values.

This strongly suggests a theory: the model’s priors are too strong. With small dataset sizes, the gradients from better predicting the dataset are not powerful enough to overpower the gradient from the priors. After the dataset size crosses some threshold, the parameters are able to break free.

Let’s loosen the prior on the parameters and see if the issue is solved.

Chart 2: Hold-one-out cross-validation results, comparing the BoTorch baseline to a new configuration with a weaker lengthscale prior.

The model improved significantly. How do its parameters look?

Model Visualization

Predict using a Gaussian Process with the following covariance and mean.

Covariance kernel: Use distance between points as follows:

 * sum([
   
   # Kernel: Factorized scalar vs choice parameters
   * sum([
     
     # Scalar parameters
     * matern_25(
      norm_l2([
         compare('log_epochs') / ,
         compare('log_batch_size') / ,
         compare('log_conv1_weight_decay') / ,
         compare('log_conv2_weight_decay') / ,
         compare('log_conv3_weight_decay') / ,
         compare('log_dense1_weight_decay') / ,
         compare('log_dense2_weight_decay') / ,
         compare('log_1cycle_initial_lr_pct') / ,
         compare('log_1cycle_final_lr_pct') / ,
         compare('log_1cycle_pct_warmup') / ,
         compare('log_1cycle_max_lr') / ,
         compare('log_1cycle_momentum_max_damping_factor') / ,
         compare('log_1cycle_momentum_min_damping_factor_pct') / ,
         compare('log_1cycle_beta1_max_damping_factor') / ,
         compare('log_1cycle_beta1_min_damping_factor_pct') / ,
         compare('log_beta2_damping_factor') / ,
         compare('log_conv1_channels') / ,
         compare('log_conv2_channels') / ,
         compare('log_conv3_channels') / ,
         compare('log_dense1_units') / ])),
     
     # Choice parameters
     * exp(
      -norm_l1([
         compare('choice_nhot0') / ,
         compare('choice_nhot1') / ,
         compare('choice_nhot2') / ,
         compare('choice_nhot3') / ]))]),
   
   # Kernel: Joint scalar and choice parameters
   * prod([
     matern_25(
      norm_l2([
         compare('log_epochs') / ,
         compare('log_batch_size') / ,
         compare('log_conv1_weight_decay') / ,
         compare('log_conv2_weight_decay') / ,
         compare('log_conv3_weight_decay') / ,
         compare('log_dense1_weight_decay') / ,
         compare('log_dense2_weight_decay') / ,
         compare('log_1cycle_initial_lr_pct') / ,
         compare('log_1cycle_final_lr_pct') / ,
         compare('log_1cycle_pct_warmup') / ,
         compare('log_1cycle_max_lr') / ,
         compare('log_1cycle_momentum_max_damping_factor') / ,
         compare('log_1cycle_momentum_min_damping_factor_pct') / ,
         compare('log_1cycle_beta1_max_damping_factor') / ,
         compare('log_1cycle_beta1_min_damping_factor_pct') / ,
         compare('log_beta2_damping_factor') / ,
         compare('log_conv1_channels') / ,
         compare('log_conv2_channels') / ,
         compare('log_conv3_channels') / ,
         compare('log_dense1_units') / ])),
     exp(
      -norm_l1([
         compare('choice_nhot0') / ,
         compare('choice_nhot1') / ,
         compare('choice_nhot2') / ,
         compare('choice_nhot3') / ]))])])

When comparing a point to itself, add noise value: (log scale)

Mean: constant

Visualization 2: Model parameters now that the priors on the "lengthscales" have been loosened. Specifically, I changed the prior on the lengthscales—the parameters along the right side—from its default value Gamma(3.0, 6.0) to Gamma(1.125, 0.375), a distribution with the same mode but higher variance, so during training there is a weaker gradient pushing each parameter toward the mode. (The model tunes its mixing weights to favor the top half of the kernel, so the parameters in the top half are the ones that increase, while those in the bottom are still pulled toward 0.3.)
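The claim about the two priors can be checked with a few lines of arithmetic. This sketch assumes the two numbers are (concentration, rate), which is consistent with the parameters being pulled toward roughly 0.3; the helper function names are mine, not part of BoTorch.

```python
def gamma_mode(concentration, rate):
    """Mode of Gamma(concentration, rate); valid for concentration >= 1."""
    return (concentration - 1.0) / rate


def gamma_variance(concentration, rate):
    """Variance of Gamma(concentration, rate)."""
    return concentration / rate ** 2


def log_prior_gradient(x, concentration, rate):
    """d/dx of the Gamma log-density: (concentration - 1) / x - rate."""
    return (concentration - 1.0) / x - rate


default_prior = (3.0, 6.0)      # assumed (concentration, rate)
looser_prior = (1.125, 0.375)   # assumed (concentration, rate)

for name, (a, b) in [("default", default_prior), ("looser", looser_prior)]:
    print(name,
          "mode:", gamma_mode(a, b),
          "variance:", gamma_variance(a, b),
          "gradient at x=1:", log_prior_gradient(1.0, a, b))
```

Both priors have their mode at 1/3, but the variance grows from about 0.083 to 8, and the pull on a lengthscale sitting at 1.0 shrinks from -4 to -0.25.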

The discrete change in parameters is now mostly gone. I ran further experiments with even weaker priors to see whether the discrete change would shrink further. It did, and results for small datasets improved again, but results for large datasets began to suffer.

So I’ve learned that with this model, I’ll get the best results if I use weaker priors, at least for small-to-medium datasets. BoTorch / Ax’s built-in priors did not serve me well. That doesn’t mean the default priors are wrong; rather, it suggests that users need to be willing to look closely at their machine learning models if they want to get good results. If using a machine learning model always gave users visualizations like these, I think many more people would use them well.

Example 2: The Pitfalls of Parallel Cross-Validation

I think people ought to always see their model, including while it trains. I built this experience for myself, and it quickly provided an insight. Here is a visualization I watched in realtime as my model above trained. This is a batch cross-validation task with 60 datapoints, so I am training 60 models in parallel and visualizing their parameters (hence, multiple dots per parameter). Click the button below to watch the models train.

Model Visualization

Predict using a Gaussian Process with the following covariance and mean.

Covariance kernel: Use distance between points as follows:

 * sum([
   
   # Kernel: Factorized scalar vs choice parameters
   * sum([
     
     # Scalar parameters
     * matern_25(
      norm_l2([
         compare('log_epochs') / ,
         compare('log_batch_size') / ,
         compare('log_conv1_weight_decay') / ,
         compare('log_conv2_weight_decay') / ,
         compare('log_conv3_weight_decay') / ,
         compare('log_dense1_weight_decay') / ,
         compare('log_dense2_weight_decay') / ,
         compare('log_1cycle_initial_lr_pct') / ,
         compare('log_1cycle_final_lr_pct') / ,
         compare('log_1cycle_pct_warmup') / ,
         compare('log_1cycle_max_lr') / ,
         compare('log_1cycle_momentum_max_damping_factor') / ,
         compare('log_1cycle_momentum_min_damping_factor_pct') / ,
         compare('log_1cycle_beta1_max_damping_factor') / ,
         compare('log_1cycle_beta1_min_damping_factor_pct') / ,
         compare('log_beta2_damping_factor') / ,
         compare('log_conv1_channels') / ,
         compare('log_conv2_channels') / ,
         compare('log_conv3_channels') / ,
         compare('log_dense1_units') / ])),
     
     # Choice parameters
     * exp(
      -norm_l1([
         compare('choice_nhot0') / ,
         compare('choice_nhot1') / ,
         compare('choice_nhot2') / ,
         compare('choice_nhot3') / ]))]),
   
   # Kernel: Joint scalar and choice parameters
   * prod([
     matern_25(
      norm_l2([
         compare('log_epochs') / ,
         compare('log_batch_size') / ,
         compare('log_conv1_weight_decay') / ,
         compare('log_conv2_weight_decay') / ,
         compare('log_conv3_weight_decay') / ,
         compare('log_dense1_weight_decay') / ,
         compare('log_dense2_weight_decay') / ,
         compare('log_1cycle_initial_lr_pct') / ,
         compare('log_1cycle_final_lr_pct') / ,
         compare('log_1cycle_pct_warmup') / ,
         compare('log_1cycle_max_lr') / ,
         compare('log_1cycle_momentum_max_damping_factor') / ,
         compare('log_1cycle_momentum_min_damping_factor_pct') / ,
         compare('log_1cycle_beta1_max_damping_factor') / ,
         compare('log_1cycle_beta1_min_damping_factor_pct') / ,
         compare('log_beta2_damping_factor') / ,
         compare('log_conv1_channels') / ,
         compare('log_conv2_channels') / ,
         compare('log_conv3_channels') / ,
         compare('log_dense1_units') / ])),
     exp(
      -norm_l1([
         compare('choice_nhot0') / ,
         compare('choice_nhot1') / ,
         compare('choice_nhot2') / ,
         compare('choice_nhot3') / ]))])])

When comparing a point to itself, add noise value: (log scale)

Mean: constant

Visualization 3: Cross-validation with dataset size 60. I used the looser lengthscale prior from above, and I used a different noise prior that doesn't endlessly push noise toward 0.

As I watched this in my notebook, I got a sudden impression: many of these points are converging much faster than others. In fact, I think there are hundreds of steps where 59 of the 60 models have converged, and we’re just waiting for the last one. This is particularly evident if you watch the parameters in the lower half of the kernel, where one single faint blue dot slowly approaches the cluster of overlapping dots. This is concerning because all 60 models are being evaluated on every step, even though the last few hundred steps are unnecessary for most of the models.
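The cost of waiting for one straggler is easy to estimate. In this sketch the step counts are hypothetical, chosen only to mimic "59 fast models and one slow one", and it assumes each model needs the same number of steps whether it is trained alone or in the batch.

```python
# Hypothetical number of optimizer steps each model needs to converge:
# 59 models finish quickly and one straggler takes far longer.
steps_to_converge = [120] * 59 + [600]

n_models = len(steps_to_converge)

# Training one model at a time: each model is evaluated only until it converges.
sequential_evals = sum(steps_to_converge)

# Training as one batch: every model is evaluated on every step until the
# slowest model converges.
batch_evals = n_models * max(steps_to_converge)

wasted = batch_evals - sequential_evals
print(sequential_evals, batch_evals, wasted)
```

With these made-up numbers, the batch performs 36,000 model evaluations where 7,680 would do, so most of the work is spent re-evaluating models that have already converged.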

I tested this theory by comparing the two training approaches.

Chart 3: Training trajectory for 60 models trained in batch, compared to that of training each of them separately.

The problem is worse than I thought. Not only do some optimizations finish well before others, but every optimization takes many more steps when trained in batch. When I count model evaluations, 18,817 total evaluations happen when training models one at a time, while 92,820 happen when training in parallel, so we are doing approximately 5 times too many operations. I describe this in more depth in this post’s extended material. I am inclined to implement batch training differently, maybe by implementing a single training run and then using something like JAX’s vmap in conjunction with JAX’s while_loop.

I wouldn’t have noticed this problem if I hadn’t been able to see my model during training. Of course, other standard visualizations could have revealed this issue; Chart 3 is fairly standard, and it would have also done the job. But I didn’t have Chart 3, I didn’t know I should be building it, and building it is actually difficult and inefficient with BoTorch. I think a visualized expression is a useful jumping-off point, and maybe we should try to always have it available to us.

A Recipe for Pragmatic Model Visualization

How do we give ourselves a playful environment where our models are always visualized by default?

I solve part of the problem with a new Python library called rows2prose.

You give rows2prose two things:

a dataframe containing scalars that should be visualized
a visually-useful string that describes the model (for example, the model’s code), including placeholder text for visualized values
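To give a feel for the shape of these two inputs, here is a text-only stand-in written in plain Python. It does not call rows2prose and does not reflect its actual function names or arguments; the column names, values, placeholder syntax, and the render_text helper are all hypothetical, and it prints a median where the real visuals render a small diagram.

```python
from statistics import median

# Input 1 (stand-in for the dataframe): one row per experiment repetition,
# with one column per scalar to visualize. All names and values are made up.
rows = [
    {"dataset_size": 20, "lengthscale_log_epochs": 0.31, "noise": 0.001},
    {"dataset_size": 20, "lengthscale_log_epochs": 0.29, "noise": 0.002},
    {"dataset_size": 80, "lengthscale_log_epochs": 1.70, "noise": 0.001},
    {"dataset_size": 80, "lengthscale_log_epochs": 2.10, "noise": 0.003},
]

# Input 2: text describing the model, with a placeholder where each value belongs.
template = "matern_25(compare('log_epochs') / {lengthscale_log_epochs}) + noise {noise}"


def render_text(rows, template, group_key):
    """Fill the template once per group, using the median of each column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    lines = []
    for key in sorted(groups):
        columns = {name: median(r[name] for r in groups[key])
                   for name in groups[key][0] if name != group_key}
        lines.append(f"{group_key}={key}: " + template.format(**columns))
    return lines


for line in render_text(rows, template, "dataset_size"):
    print(line)
```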

You can use rows2prose in Jupyter notebooks, and it can output HTML visualization files from arbitrary Python scripts. It generated every visual in this blog post, and you can use it today.

The remaining gap has multiple solutions

How do you take your model and get a visually-useful summary of it? There are, of course, many ways to do this, including crazy new approaches like asking an LLM to generate one for you.

But here’s how I did it.

I created Vexpr, a Python library that takes inspiration from Lisp. In Vexpr, you build up expression data structures (“Vexprs”) similar to Lisp S-expressions. The expressions you see in these visualizations are simply printed Vexprs.

(JAX fans may be familiar with “Jaxprs”; Vexprs are similar, but they are more user-facing. A user of Vexpr is intentionally building up an elegant Vexpr, while a user of JAX doesn’t really care what their Jaxpr looks like. Vexprs and Jaxprs solve two different problems and I plan to use them together.)

Just like Lisp, Vexpr lets you use macros to modify these expressions. For my previous post, I used macros to vectorize expressions, and for this post, I used them to visually optimize the expressions. For example, the actual Vexpr program for my model uses an elementwise division between two arrays to divide N different distances by N different lengthscales, but I wanted to visualize this as N different divisions, with each parameter rendered next to its corresponding “compare” feature. I changed the code using a macro; in other words, I unvectorized the division operations to make them prettier. Here’s another example: my printed Vexpr was more verbose than I wanted it to be, so I used macros to convert it into pseudocode! Each visualization above contains the function call compare('log_epochs'), which doesn’t actually exist. In the runnable expression, this compare term is replaced with a larger subexpression that extracts a "log_epochs" feature from two vectors (x1 and x2) and computes the distance. I wanted a succinct visual, so I “visually optimized” the expression, removing those details, and I never implemented compare. Code can be much more succinct when it doesn’t actually need to run.
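Here is a minimal sketch of the "unvectorize the division" idea. It does not use Vexpr or its API: expressions are plain nested tuples, and the names transform, unvectorize_divide, and show are hypothetical helpers written for this illustration.

```python
def transform(node, rewrite):
    """Apply `rewrite` bottom-up to every tuple node in the expression."""
    if isinstance(node, list):
        return [transform(n, rewrite) for n in node]
    if isinstance(node, tuple):
        op, *args = node
        return rewrite((op, *[transform(a, rewrite) for a in args]))
    return node  # strings and numbers are leaves


def unvectorize_divide(node):
    """Macro: turn divide(vector([...]), param) into a vector of divisions."""
    op, *args = node
    if (op == "divide"
            and isinstance(args[0], tuple) and args[0][0] == "vector"
            and isinstance(args[1], tuple) and args[1][0] == "param"):
        items, param = args[0][1], args[1]
        return ("vector", [("divide", item, ("index", param, i))
                           for i, item in enumerate(items)])
    return node


def show(node):
    """Print an expression as compact pseudocode."""
    if isinstance(node, list):
        return "[" + ", ".join(show(n) for n in node) + "]"
    if not isinstance(node, tuple):
        return repr(node)
    op, *args = node
    if op == "divide":
        return f"{show(args[0])} / {show(args[1])}"
    if op == "index":
        return f"{show(args[0])}[{args[1]}]"
    if op == "param":
        return args[0]
    if op == "vector":
        return show(args[0])
    return f"{op}({', '.join(show(a) for a in args)})"


expr = ("norm_l2",
        ("divide",
         ("vector", [("compare", "log_epochs"), ("compare", "log_batch_size")]),
         ("param", "lengthscale")))

print(show(expr))
print(show(transform(expr, unvectorize_divide)))
```

The first line printed shows one array divided by one array; the second shows each compare term next to its own lengthscale entry, which is the form that lets a parameter be rendered beside its feature.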

In addition to macros, another useful idea that filled this gap was partial evaluation. I take a Vexpr, plug in its parameters, then evaluate all parts of the expression that are ready to be evaluated. This computes the “unvectorize” from the previous paragraph, taking arrays of divisors and indexing into them. This is also useful when machine learning models put constraints on parameters; often they implement constraints by storing “raw” versions of the parameters and passing them through an exp or sigmoid to move them into a constrained interval. Partial evaluation moves the values into the constrained interval so that they are ready to be visualized. One thing that made me laugh, putting the ideas of these two paragraphs together: even after converting my runnable Vexpr into pseudocode, I still ran partial evaluation on it, evaluating all expressions that could be evaluated. It feels funny when you tell your computer to “evaluate the parts of this code that are not pseudocode”.
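Partial evaluation can be sketched in the same spirit. Again, this is not Vexpr's implementation: the tuple encoding, the partial_eval helper, and the raw_lengthscale parameter name are hypothetical. The example passes a "raw" parameter through exp to constrain it, and leaves the pseudocode compare call untouched because it cannot be evaluated.

```python
import math


def partial_eval(node, values):
    """Evaluate every subexpression whose inputs are all known numbers.

    `values` maps parameter names to numbers. Anything that cannot be
    evaluated (here, the pseudocode `compare` call) is left symbolic.
    """
    if not isinstance(node, tuple):
        return node
    op, *args = node
    if op == "param":
        return values.get(args[0], node)
    args = [partial_eval(a, values) for a in args]
    known = all(isinstance(a, (int, float)) for a in args)
    if op == "exp" and known:
        return math.exp(args[0])
    if op == "divide" and known:
        return args[0] / args[1]
    return (op, *args)


# compare('log_epochs') / exp(raw_lengthscale)
expr = ("divide",
        ("compare", "log_epochs"),
        ("exp", ("param", "raw_lengthscale")))

result = partial_eval(expr, {"raw_lengthscale": 0.0})
print(result)
```

The exp subexpression collapses to the constrained value 1.0, while the surrounding division stays symbolic because one of its inputs is still pseudocode.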

Putting all of this together, here is the final architecture underlying these visualizations.

The “Macros + Partial Evaluation” functionality I used is all present in Vexpr, but the ideas are still baking. In a future post I might try to convince you to use them.

But what about Deep Learning?

This blog post featured human-comprehensible machine learning models like Gaussian Processes. In these models each parameter has a very clear meaning. Is this blog post applicable to Deep Learning?

First, let me appeal to you that comprehensible models are important, and I think people playing with Deep Learning ought to be among the most enthusiastic users of comprehensible models. Suppose I grant you the extreme position that a deep learning model is a black box that isn’t worth looking into. In that extreme, you have a great use case for comprehensible models: exploring the space of Deep Learning architectures and training regimes. You get to take the giant space of models and regimes, design your own hand-engineered features of that space like “learning rate” or “number of attention heads”, generate your own datasets of experiment results, and conduct symphonies of computers to explore the space. Deep Learning system design is what got me into these models in the first place.

Regardless, I think it’s possible to build useful, pragmatic visuals for Deep Networks. My main design goals would be: (1.) enable the user to detect when something in the network is broken / not being used, and (2.) put an expression in front of the user to encourage playful tweaking of the architecture.

Conclusion

These visualizations were immediately useful, and they are pragmatic because they are not specific to any model type. If you can extract a text description of your model, and if you can trace a set of useful-to-visualize scalars, then you can visualize your model.

I think that somehow we should give all users of machine learning models access to visuals like these. Using some combination of our shared frameworks, our example code, and our crazy LLM tools, we should take on the responsibility of not only performing our desired computation, but also rendering a useful expression of it.

(This post has an appendix. This project is supported by a GCP cloud compute grant from ML Collective, which has been super helpful. Thanks, also, to Rosanne Liu for useful feedback on drafts of this post.)

What happens when you vectorize wide PyTorch expressions?

Thu, 19 Oct 2023 07:00:00 -0700

In scientific computing, code is often naturally expressed as wide, tree-like expressions. Often different branches of that tree contain similar chunks of logic, so there is potential to run many different branches together in parallel vectorized operations. What happens when you take your nice tree-like code and mangle it into hard-to-read vectorized code? How would a person do that?

I created Vexpr and used it to take real experiment code and convert its readable expressions into vectorized expressions at runtime. In this post, I present the results. Topics include:

What is the immediate impact?
How does this relate to torch.compile?
What is the more detailed breakdown of the impact on the GPU and CPU?

Introduction: Wide expressions

Mathematical expressions naturally form trees. Here’s a toy example.

\[\sqrt{a^2 + b^2} + \sqrt{c^2 + d^2}\]

Here’s Python code implementing this expression and highlighting its wide tree-like structure:

sum([math.sqrt(sum([a ** 2,
                    b ** 2])),
     math.sqrt(sum([c ** 2,
                    d ** 2]))])

For this simple function, we can write vectorized PyTorch code by hand:

torch.tensor([a, b, c, d]).pow(2).view((2, 2)).sum(dim=1).sqrt().sum()

The former code calls pow 4 times, calls sum twice, calls sqrt twice, then calls sum one last time. The latter code flattens the tree into a single pipeline so that one operation occurs for each level of the tree. Previously we ran one Python function call per node of the tree, and afterward we run one call per level of the tree – a number that is exponentially smaller.
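As a quick sanity check, the two forms can be compared numerically. This sketch uses NumPy as a stand-in for PyTorch so that it runs without a GPU stack, and the input values are hypothetical. Each row of the reshaped array holds the operands of one inner sum, so the reduction runs along axis 1 (dim=1 in PyTorch).

```python
import math

import numpy as np

a, b, c, d = 3.0, 4.0, 5.0, 12.0  # hypothetical inputs

# Tree form: one Python call per node.
tree = sum([math.sqrt(sum([a ** 2, b ** 2])),
            math.sqrt(sum([c ** 2, d ** 2]))])

# Pipelined form: one array operation per level of the tree.
squares = np.array([a, b, c, d]) ** 2       # level 1: all four squares at once
inner = squares.reshape(2, 2).sum(axis=1)   # level 2: rows are (a, b) and (c, d)
pipelined = np.sqrt(inner).sum()            # levels 3 and 4: both roots, then the outer sum

print(tree, pipelined)
```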

PyTorch is often used in a pipelined way as shown above, but that’s usually because the expression is inherently deep; neural networks are the obvious example. Here we are concerned with expressions that are wide.

Imagine scaling up this toy example: use actual, larger expressions; let each variable represent a list of vectors, not just a number; within the expression, call functions like SciPy’s cdist which compute pairwise distances and return large matrices; do all of this on batches of inputs. With these changes, we now have a wide expression that is worth running on a GPU.

This scenario comes up often in scientific computing. For example, the kernel of a Gaussian Process (GP) takes in two lists of vectors and returns pairwise similarities. We can encode some of our intuition into these kernels by composing them via weighted sums and products. This leads to giant tree-like expressions, and because those expressions have many similar operations in each parallel branch, there is a lot of potential for vectorization.

Writing vectorized code for giant expressions is hard, so I created Vexpr to make it easy. Vexpr takes readable-but-slow PyTorch, NumPy, and JAX expressions and compiles them into fast, delightfully ugly vectorized expressions.

How to vectorize one level of an expression tree

Wide expressions naturally form a tree. When we vectorize that tree, we create a new narrower expression that invokes a series of Python functions, one for each level of the original tree.

How do we collapse a tree and reduce operations to be one per level? Each of the operations now will have to take in a tensor that contains every input to that level and return a tensor that contains all of the outputs.

We can see some operations already support such a change.

Elementwise operations like pow and sqrt can run on multiple inputs as-is.
Parallel sums of equal length can be implemented as .view(...).sum(dim=-1)

What about more difficult cases? For example, what if the parallel sums are of different lengths? On GPUs, fast parallel reductions only work when inputs all have the same length. Would providing a single-input single-output interface still have a benefit? The answer is yes, even if the operation internally just loops over the sums. Collapsing one level of the tree allows subsequent levels to be collapsed, and low-hanging fruit tends to appear in the deeper levels. Moreover, we can do better than looping over the sums, even when the lengths are different. Vexpr’s vectorizer groups the inputs by length and performs a reduced number of operations—one for each unique length. For example, this decreases the number of torch.cdist operations in my hands-on kernel from 49 to 5. These 5 calls happen invisibly inside of a cdist_multi function that the code calls once.
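The group-by-length idea can be sketched in a few lines of NumPy. This is not Vexpr's actual implementation, just an illustration under the assumption that the inputs to all the sums are stored back to back in one flat array; grouped_sums is a hypothetical name.

```python
import numpy as np

def grouped_sums(values, lengths):
    """Sum consecutive segments of `values`, one batched sum per unique length."""
    values = np.asarray(values, dtype=float)
    lengths = np.asarray(lengths)
    # Start offset of each segment within the flat array.
    starts = np.concatenate([[0], np.cumsum(lengths)[:-1]])
    out = np.empty(len(lengths))
    for n in np.unique(lengths):
        idx = np.flatnonzero(lengths == n)  # which segments have this length
        # Index matrix of shape (len(idx), n) selecting those segments.
        gather = starts[idx][:, None] + np.arange(n)[None, :]
        out[idx] = values[gather].sum(axis=1)  # one reduction for the whole group
    return out

# Segments: [1, 2], [3, 4, 5], [6, 7]
sums = grouped_sums([1, 2, 3, 4, 5, 6, 7], lengths=[2, 3, 2])
```

Three sums are computed with two reductions here, because two of the segments share a length. The gather and scatter steps are the kind of extra work that the next paragraph describes as overhead.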

These collapsed operations often introduce some overhead. Batch operations often require permuting tensors (e.g. for batched pairwise distances) or reordering values (e.g. putting equal-length sums next to each other). For operations that are grouped by size, we must split input tensors then re-concatenate the output tensors. Thus, while vectorizing has many benefits, it also introduces extra work for the GPU that is not necessary when using a non-vectorized expression. Is this overhead worth it? Let’s turn to experiments to find out.

Vectorizing leads to 4x-7x speed-up on one set of benchmarks

Here I run a pair of benchmarks on an NVIDIA V100 GPU. I describe the benchmarks within the context of real Gaussian Process use cases, but you don’t need to understand Gaussian Processes to understand these results; I am simply running this big vectorized expression on inputs of different shapes.

When you use Gaussian Processes for Bayesian Optimization, you first fit a GP’s parameters to the training data, then you optimize a set of candidate points to maximize expected improvement. Both of these steps include backpropagating gradients through the GP kernel.

Benchmark 1: Fit the kernel’s parameters. I run the forward and backward pass of the kernel, using x1.shape == x2.shape == (379, 26), i.e. a training set of 379 26-dimensional vectors. I repeat 100 times, then wait for a final CUDA synchronize. I test with and without torch.compile.

	Don’t compile	Compile
Baseline	6.43s	3.0s	Compile speed-up: 2.1x
Vectorized	0.99s	0.76s	Compile speed-up: 1.3x
	Vectorize speed-up: 6.5x	Vectorize speed-up: 3.9x	Combined speed-up: 8.5x

Benchmark 2: The optimization loop. In this experiment, x1.shape == (60, 2, 26) and x2.shape == (60, 381, 26). That means we’re searching for 60 single candidate points, with an additional point included in each (hence the 2, not 1) because the Noisy Expected Improvement algorithm also gets predictions for a set of previous potentially best points. These two points are appended to 379 training points to give us 381 points. Again I run the kernel’s forward and backward pass 100 times.

	Don’t compile	Compile
Baseline	6.03s	2.57s	Compile speed-up: 2.3x
Vectorized	0.83s	0.63s	Compile speed-up: 1.3x
	Vectorize speed-up: 7.2x	Vectorize speed-up: 4.1x	Combined speed-up: 9.6x

In these experiments, vectorizing my kernel gave a roughly 4x speed-up with torch.compile, and a roughly 7x speed-up with plain PyTorch code.

…but this speed-up doesn’t happen in all benchmarks

Now I test the kernel’s performance in another scenario: leave-one-out cross-validation. During cross-validation, we fit \(N\) models on \(N\) slightly different datasets. We can test this by just rerunning Benchmark 1 on \(N\) models in parallel. To demonstrate a surprising phenomenon, I set \(N=20\).

Benchmark 3: Cross-validation. Repeat Benchmark 1, but train 20 models in parallel rather than 1. So we have x1.shape == x2.shape == (20, 379, 26), and the shapes of every parameter, e.g. lengthscale, have (20,) prepended to them.

	Don’t compile	Compile
Baseline	15.6s	7.50s	Compile speed-up: 2.1x
Vectorized	17.7s	6.15s	Compile speed-up: 2.9x
	Vectorize slow-down: 1.1x	Vectorize speed-up: 1.2x	Combined speed-up: 2.5x

This result surprised me. The vectorized version is sometimes slower than the baseline, at least when you examine only the GP kernel in an isolated benchmark. The end-to-end performance of the larger system is still often faster due to the freed up CPU (see next section), but it is noteworthy that vectorizing code doesn’t always directly speed up that code.

To understand where the change occurs, I rerun this benchmark for different \(N\). Note that \(N=1\) and \(N=20\) correspond to the original Benchmark 1 and Benchmark 3 results, respectively. I used the same hardware, but I enabled some additional profiling, hence the slower times compared to above.

The vectorized kernel initially has a huge advantage over the baseline, but this advantage diminishes as we give it more parallel work, and without torch.compile the vectorized kernel is eventually slower when viewed in isolation. Why does this happen?

The CPU is a bottleneck. Vectorizing removes that bottleneck.

To really understand the performance impact of vectorization, we need to understand the CPU and GPU usage before and after. First, note that CUDA / PyTorch are built on an asynchronous relationship between the CPU and GPU, where the CPU ideally should always run ahead, always queueing the GPU’s future work, while the GPU is always working through its queue. However, there are two events that cause the CPU to wait for the GPU:

When the CPU chooses to read a value from the GPU. This is a decision made by your code.
When the CPU gets too far ahead of the GPU. This is a decision made by CUDA.

I profiled the kernel using NVIDIA’s nsys, and I used that trace to obtain GPU and CPU active time. My so-called CPU “active” time is actually an inferred value; PyTorch uses CUDA in a way that spins the CPU 100% constantly, even when the CPU is just waiting for the GPU, so I use heuristics to detect these waits and subtract them from the actual active time. (2023-10-31 update: Thanks to gregjm for pointing out that CUDA offers a way to avoid this CPU-spinning, and that this is actually a PyTorch issue. The initial version of this blog post blamed CUDA for this unnecessary power consumption.)

Here is the same plot from above, but with the CPU and GPU time overlaid.

First, to understand the result of Benchmark 1 above, look at the left side of both plots. Both the baseline and vectorized models have low total active GPU time. But the baseline model puts a much larger workload on the CPU, which is responsible for orchestrating the set of operations that are sent to the GPU. Thus, the GPU spends the vast majority of the time idle, waiting for the CPU to give it more work. For the vectorized model the amount of CPU work is almost always less than the amount of GPU work, so the GPU is almost never idle; the benchmark time is roughly equal to the GPU active time.

Now we focus on the surprising result from Benchmark 3, where the baseline model did slightly better than the vectorized model. As we scale up the number of models, we see an interesting phenomenon. When training many models simultaneously, even the baseline model is able to keep the GPU busy. Once the GPU active time exceeds CPU active time, one of the key selling points of vectorized code is eliminated, because the nonvectorized code becomes good enough. This was a fun, surprising fact; even unvectorized code can outrun the GPU if you pass in large enough tensors. I expect this phenomenon to occur in other large-batch scenarios like training on very large datasets or doing Bayesian Optimizations with very large sets of candidate points. Of course, the vectorized code is still superior when using torch.compile, and in all cases its CPU usage is far superior.
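A toy model helps show why the two regimes behave so differently. The sketch below assumes the CPU enqueues operations one at a time and the GPU runs them in order; it ignores CUDA's limit on how far ahead the CPU may run, and it ignores synchronization points. All costs are invented numbers in microseconds, not measurements from the benchmarks above.

```python
def pipeline_time(n_ops, cpu_launch_cost, gpu_op_cost):
    """Finish time when the CPU enqueues ops one by one and the GPU runs them in order."""
    cpu_clock = 0.0
    gpu_clock = 0.0
    for _ in range(n_ops):
        # The CPU finishes enqueueing this op.
        cpu_clock += cpu_launch_cost
        # The GPU starts it once the op is queued and the previous op is done.
        gpu_clock = max(gpu_clock, cpu_clock) + gpu_op_cost
    return gpu_clock

# Small inputs: each op is cheap for the GPU, so launch cost dominates.
small = {
    "baseline": pipeline_time(n_ops=1000, cpu_launch_cost=10.0, gpu_op_cost=2.0),
    "vectorized": pipeline_time(n_ops=100, cpu_launch_cost=10.0, gpu_op_cost=25.0),
}

# Large inputs: each op is expensive for the GPU, and the vectorized version
# is assumed to carry some extra GPU overhead.
large = {
    "baseline": pipeline_time(n_ops=1000, cpu_launch_cost=10.0, gpu_op_cost=40.0),
    "vectorized": pipeline_time(n_ops=100, cpu_launch_cost=10.0, gpu_op_cost=450.0),
}
```

With these made-up costs, the vectorized version is about four times faster in the small case because the baseline is limited by CPU launch cost. In the large case both versions are limited by the GPU, and the vectorized version ends up slightly slower because of its assumed overhead.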

Finally, let’s look at the GPU workload. Independent of the effect on CPU, what is the impact of vectorization on the total amount of work that the GPU has to do? Looking at slopes of the “GPU time” lines, we see that for some models, e.g. my non-compiled model, vectorization increases the total workload, and for other models it decreases the workload. I studied the CUDA traces closely and found that vectorization does indeed reduce many aspects of the GPU workload, greatly reducing the number of operations and decreasing the total amount of time spent on the fundamental computations of the algorithm. However, it also introduces overhead (mentioned above) by interspersing operations that permute and reorder the tensors, or splitting them into groups then concatenating results. Sometimes the reduced “fundamental” time outweighs the additional overhead, while other times the overhead outweighs the reduction in fundamental time.

So we see that vectorization has three effects:

It lets us keep the GPU busy even when inputs are small.
It frees up the CPU to do other work.
It can slightly change the total amount of GPU work, sometimes for the worse.

The speed-up from vectorization can be great, but it can be underwhelming in scenarios where none of the benefits are needed.

The benefits of vectorization increase as GPU speed increases

Here is a point that follows naturally from everything above, but it might not be immediately obvious. It is quite striking when you experience it.

As I built Vexpr, I tested on an NVIDIA T4. Then for this experiment, I upgraded to a much-faster NVIDIA V100, and the benefits of vectorization greatly improved.

You always want your CPU to stay ahead of the GPU so that the GPU is never idle. Code that is good enough to stay ahead of this year’s GPU might not be good enough for next year’s GPU. With each upgrade, you need the dotted CPU line from these charts to be lower and lower.

This means that vectorizing your code is a good strategy for future-proofing it.

GPyTorch’s “structure” kernels show similar results

I also tested GPyTorch’s limited support for vectorization. GPyTorch lets you take sets of identically-shaped kernels and run them as a single vectorized kernel. This capability is easy to use when summing single-feature kernels, so I created a partially vectorized kernel by replacing sums of single-feature Matern kernels with single Additive Structure kernels.

(Don’t focus too much on comparing absolute speed of the Vexpr and the GPyTorch kernels. There are many small differences between the two that have nothing to do with vectorization. For example, unlike my Vexpr kernel, GPyTorch has a nice optimization of putting this line before this line.)

Vexpr has been optimized much more for this use case than GPyTorch, but we see that the same fundamental phenomena occur. Vectorization leads to wins at small batch sizes, but the advantage diminishes at large batch sizes. In this experiment, vectorization increased the total GPU workload, so with large batch sizes the only advantage of vectorization is a freed up CPU.

Interestingly, neither GPyTorch kernel ever reaches a point where GPU time is equal to the benchmark time. The GPU always spends at least 1-2 seconds idle. This happens because GPyTorch’s kernels do a synchronous equality check on every call, forcing a GPU synchronize, which causes the CPU to fall behind the GPU immediately afterward. So when you use GPyTorch kernels on GPUs, you don’t get the ideal fully asynchronous execution that you’re supposed to get with CUDA / PyTorch. (There is good news: PyTorch has recently introduced torch.cuda.set_sync_debug_mode(2) which detects these unwanted synchronize events. Like others, I think every PyTorch library developer should become friends with this API.)

Closing thoughts on compilation

Everybody agrees that vectorization is good. Is it worth doing even in cases where it requires extra work, like with wide expressions? The results above suggest: most of the time, yes, but with interesting nuances. Vectorization is especially great for making the most of a GPU when your task is running many iterations on smaller batches of input data. For larger-batch scenarios, maybe you’ll only benefit from vectorization after you add JIT compilation, or after you upgrade your GPU, or after you’ve found some use for all the newfound idle CPU time. But, to a first approximation, vectorization is good.

The real open question for the field remains: for “wide” computation graphs, what is a practical strategy for getting vectorized code?

My experiments above show that this is not solved by JIT tools like torch.compile, nor do I expect it to be. I think a compiler would need to become a huge slow unreliable hairball to solve this type of auto-vectorization problem. Vexpr’s vectorizer is a compiler that solves this, but it does so by making the programmer meet the compiler in the middle. The programmer gives Vexpr a tree-like expression and tells Vexpr, “You should try vectorizing this.” Both of those pieces of information are valuable: the programmer structures the logic in a certain way, and the programmer indicates that there is opportunity for vectorization there. The programmer doesn’t need to do anything too difficult, and neither does the compiler.

I think this “meet-in-the-middle” approach between programmer and compiler is a good design principle that leads to good systems. And I think this will become even more true in the age of Large Language Models; rather than cramming too much magic into our compilers, let’s rely on humans-with-LLMs to meet the compiler in the middle.

(This project is supported by a GCP cloud compute grant from ML Collective, which has been super helpful. Thanks, also, to Rosanne Liu for useful feedback on drafts of this post.)

Gaussian Processes Extrapolate, Sometimes in Goofy Ways

Tue, 28 Mar 2023 02:00:00 -0700

Here is a toy function. (To see the code and more plots, check out this notebook.)

Figure 1: 80 random observations of a deterministic function (black) and the predicted maximal point in that function (orange), according to a Gaussian process trained on those 80 observations.

Intuitively, it seems clear that this function’s highest value probably occurs when x is in the center region. But a Gaussian Process (GP) thinks the highest value is out in a more mediocre region. This isn’t just an explore-exploit trade-off; the GP thinks the expected value is high at that orange point – higher than any value the GP has ever seen before!

I ran into this issue while exploring a real loss function, tuning hyperparameters with Bayesian Optimization. I found that my GP kept insisting there would be excellent results at unpromising points like the one above. The model sent me through hundreds of iterations of whack-a-mole, so I investigated the issue and came up with this toy example that makes the issue obvious.

This issue occurs because:

Gaussian Processes extrapolate, and they do it over an inferred distance, or “lengthscale”.
Sudden drop-offs in the function cause the GP to choose a low lengthscale. (This issue is most pronounced when testing deterministic functions. With noisy functions, these sudden drop-offs can be characterized as noise.)

So the GP makes local extrapolations that appear goofy given more global context.

Intuition: Why Gaussian Processes extrapolate

Consider a simple scenario with two observations and one prediction.

Figure 2: If you were a Gaussian process, what would you predict? You are given three random variables, A, B, and C, and the observed values for the first two. What is your prediction for C?

Gaussian Processes work by modeling this as a 3D multivariate Gaussian, using distance to determine correlation between variables. This clever trick transports us from thinking about functions over arbitrary spaces to thinking about Gaussian distributions with a finite number of variables.

Imagine another multivariate Gaussian that matches this correlation structure. Consider the performance of athletes in three different sports: Swimming (A), Cycling (B), and Running (C). Suppose correlations match distances in the chart above, with cov(A, B) = high, cov(B, C) = high, cov(A, C) = medium-to-low. In other words, elite swimmers are usually good cyclists but aren’t always great at running. Elite cyclists are usually good swimmers and runners. Elite runners are usually good cyclists but might not be great at swimming. A large set of unobserved latent causal factors underlie A, B, and C, and we don’t attempt to model these explicitly.

Suppose an athlete is a good cyclist. If we are given the extra context that she is a bad swimmer, does that increase the probability that she is a good runner? It depends on the actual covariance values, but there are definitely possible realities where that answer is yes. She doesn’t have the latent causal factors that make a good swimmer, and these latent factors also tend to make a good cyclist, yet she is still a good cyclist, so we expect she especially has the latent causal factors that make a good runner, to compensate.

If B is a medium value and A is low, we expect C to compensate for A. That's why Gaussian Processes extrapolate. (You can also show this by inverting a covariance matrix, but I like this playful explanation. See the appendix for something more rigorous.)

This is a good thing, usually.

Why extrapolation sometimes causes goofy predictions

Let’s zoom in on the section with the goofy prediction.

Figure 3: Zooming into the previous chart, we can see that the high prediction can be viewed locally as an extrapolation. Moreover, it is extrapolating from both left and right, which explains why the predicted value is extra high.

The GP is essentially making predictions using only local data. It is doing this because fitting the GP on this dataset caused the GP to treat everything extremely locally, i.e. it selected a very small “lengthscale”. It chose a small lengthscale because the function contains discrete drop-off points. Tiny changes in x sometimes yield large changes in f(x), and the GP is trying to accommodate this.

This leads to something that is undesirable in hyperparameter tuning: if ever you discover a sudden drop-off in performance from mediocre results to very-bad results, the GP will get excited that maybe there are very-good results immediately on the opposite side of the mediocre results. In my experience, it ends up spending nearly all of its time testing these crazy theories.

What does this mean? What can I do?

I’m not yet sure what the best way to handle this is.

One solution is to switch to using a Matern kernel with \(\nu = 0.5\), rather than the often-used value \(\nu = 2.5\). This specifies that the underlying function is not differentiable, allowing for sharp changes in direction on predictions.

Figure 4: Modifying the Matern kernel allows for sharp changes in the prediction’s slope.

Here’s a nice blog post on that subject. However, making this change runs the risk of hurting predictions – after all, these sudden dropoffs only occur in some parts of the space, and often those parts of the space are irrelevant.
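For reference, the two kernels being compared have simple closed forms as a function of the distance r between two points. The sketch below writes them in plain Python and uses a finite difference to show the difference in smoothness at r = 0; the function names are my own.

```python
import math

def matern_05(r, lengthscale=1.0):
    """Matern kernel with nu = 0.5, also known as the exponential kernel."""
    return math.exp(-r / lengthscale)

def matern_25(r, lengthscale=1.0):
    """Matern kernel with nu = 2.5."""
    s = math.sqrt(5.0) * r / lengthscale
    return (1.0 + s + s * s / 3.0) * math.exp(-s)

# Slope of each kernel just to the right of r = 0, by finite difference.
eps = 1e-6
slope_05 = (matern_05(eps) - matern_05(0.0)) / eps  # close to -1: a kink at r = 0
slope_25 = (matern_25(eps) - matern_25(0.0)) / eps  # close to 0: smooth at r = 0
```

The nonzero slope at r = 0 is what lets the \(\nu = 0.5\) kernel produce predictions with sharp corners, while the \(\nu = 2.5\) kernel is flat there and produces smooth predictions.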

It might be worth adopting the following view: deterministic functions with sudden discrete changes in output are bad news for a Gaussian Process. A broad solution is: figure out how to get rid of them.

You could do this using clamping / winsorizing tricks on your data; imagine taking the whole bottom row of points from Figure 1 and clamping them to be equal to the middle row.
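A clamping step can be as small as the sketch below. The observations and the floor value are invented for illustration; in practice you would pick the floor by looking at where the mediocre-but-not-terrible results sit.

```python
def clamp_from_below(ys, floor):
    """Raise every observation below `floor` up to `floor`."""
    return [max(y, floor) for y in ys]

# Hypothetical observed scores: a top row, a middle row near 0.5,
# and a bottom row near 0.1.
observations = [0.91, 0.88, 0.52, 0.10, 0.93, 0.09, 0.50]
clamped = clamp_from_below(observations, floor=0.5)
```

After clamping, the bottom row is indistinguishable from the middle row, so the GP no longer sees a cliff between them.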

Alternately, you could change your function so that it is not deterministic, for example by randomly initializing parameters or introducing something analogous to stochastic gradient descent. Then, presumably, these random dropoffs will occur randomly across a whole range, not at specific points, and the GP will capture this as observation noise. Moving away from a deterministic function has other benefits anyway, e.g. greatly reducing the risk of overfitting your hyperparameters to the validation set.

One takeaway is that you really need to pay attention to what your Gaussian Process is doing. It will not always automatically do what you consider intuitive.

Appendix

In case you’re interested, here are the GP predictions at other points, with standard deviations:

To get more rigorous with the simple A, B, C example, suppose we assign constant prior mean 0, as is common in Gaussian Processes. Denoting \(\Sigma_{XY} := \mathrm{cov}(X,Y)\), the expected value for C is:

\[E\left[C \mid A=a, B=b\right] = \frac{a(\Sigma_{AC}\Sigma_{BB} - \Sigma_{AB}\Sigma_{BC}) + b(\Sigma_{BC}\Sigma_{AA} - \Sigma_{BA}\Sigma_{AC})}{\Sigma_{AA}\Sigma_{BB} - \Sigma_{AB}^2}\]

This can be derived from these equations, specifically the one for \(\boldsymbol \mu_*\).

It is difficult to understand this function at a glance, but it is interesting to simply ask, “What happens as \(a\) increases?” We see that: if the correlative chain from \(A \to B \to C\) is strong, but the direct correlation between \(A \to C\) is weak, i.e. \(\Sigma_{AB}\Sigma_{BC} > \Sigma_{AC}\Sigma_{BB}\), then increasing \(a\) will decrease \(E[C \mid ...]\). Thus, reducing \(a\) in isolation will increase \(E[C \mid ...]\), which is an example of extrapolation.
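Plugging in numbers makes this concrete. The sketch below evaluates the formula above with covariances that I picked to match the story (strong A-B and B-C links, weaker A-C link); the specific values are assumptions, not fitted quantities.

```python
def expected_c(a, b, s_aa, s_bb, s_ab, s_ac, s_bc):
    """E[C | A=a, B=b] for a zero-mean Gaussian, following the formula above."""
    numerator = (a * (s_ac * s_bb - s_ab * s_bc)
                 + b * (s_bc * s_aa - s_ab * s_ac))
    return numerator / (s_aa * s_bb - s_ab ** 2)

# Unit variances, strong chain A-B-C, weaker direct A-C covariance.
cov = dict(s_aa=1.0, s_bb=1.0, s_ab=0.8, s_bc=0.8, s_ac=0.5)

# B is observed at a medium value in both cases; only A changes.
low_a = expected_c(a=-1.0, b=0.5, **cov)  # roughly 0.94
mid_a = expected_c(a=0.5, b=0.5, **cov)   # roughly 0.36
```

With B held at 0.5, lowering A from 0.5 to -1.0 raises the prediction for C from about 0.36 to about 0.94, which is higher than either observed value.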

(This work was supported by ML Collective via their donated GCP compute resources which have been super helpful. Thanks, also, to Rosanne Liu for useful feedback on early drafts of this post.)

Maybe Bayesian Optimization Should Be Harder, Not Easier

Wed, 30 Nov 2022 01:00:00 -0800

(2023-11-01 updates: Refreshed the charts, other small tweaks, posted reproducible experiments.)

This post has two parts:

A fun point of view that I find compelling
A project that follows from that point of view, with some early results

Point of view

When you tune an AI model’s design and its training regime, you are exploring a search space. Bayesian Optimization is a framework that naturally arises when you try to automate your intuitive process of exploring a search space. The promise of Bayesian Optimization is: “If I can tell a computer my beliefs about this search space, then the computer can perform the search better than I can.”

From what I’ve seen, AI / ML people tend to feel burned by their experiences with Bayesian Optimization. Often I hear stories where they once invested a bunch of time into trying it on toy projects, but it never managed to cross over and be useful in their actual work where they continue to use manual search and non-sophisticated brute force searches. Often they feel guilty about all the time they wasted trying Bayesian Optimization.

Here is a slightly contrarian view on why Bayesian Optimization has not yet become ubiquitous.

Lots of people have tried to increase adoption of Bayesian Optimization by making it easier, creating products and libraries that hide the messy details. My hunch is that existing Bayesian Optimization solutions block / discourage the user from taking ownership of the process. In particular, there is a lot of unrealized potential in giving users more powerful ways to state their beliefs about a search space. I think what is needed is a set of tools / documentation / cookbooks designed for people who are willing to put in the time to master their tools. Existing tools could grow to fill this role.

I think Bayesian Optimization ought to be less one-size-fits-all. I think it should be more hands-on.

Taking ownership of your search

If you use a one-size-fits-all Bayesian model, your search will be data-intensive and will often be much less efficient than a manual search. Results will be much better if you invest some time telling the model your priors about your parameters. This is more fun than you might expect.

Consider: when you manually explore a search space, you have some intuition about which details of the model are “orthogonal”. For example, hyperparameters describing a neural network’s architecture seem like they are roughly independent of hyperparameters that describe the training regime, and we can afford to optimize these groups of parameters almost independently. Meanwhile, hyperparameters like “momentum” and “learning rate” are intertwined, and it is important to explore their joint space. A one-size-fits-all Bayesian Optimization model totally lacks this intuition. Fortunately, it is easy to input these beliefs to a Gaussian Process (GP). If you have two groups of parameters that are independent, simply dedicate a kernel to each one, then add the kernels together. If they are dependent, multiply the kernels together. If it’s somewhere in between, do both, and allow the model to adapt to the data. If you run your own Bayesian Optimization code locally, this is all trivial to implement, but it is impossible in any Bayesian Optimization products that I’ve seen. (Some products use a fully AutoML approach to infer these dependencies, but this is a data-intensive process.)
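Here is a minimal sketch of the add-versus-multiply distinction in plain Python. To keep it short, each parameter group is reduced to a single number and the base kernel is a simple squared-exponential; the function names and the fixed weights are assumptions for illustration, and in a real GP the weights would be learned from data.

```python
import math

def rbf(x, y, lengthscale=1.0):
    """Squared-exponential similarity between two scalars."""
    return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)

def additive_kernel(p, q):
    # Independent groups: each group contributes similarity on its own.
    return 0.5 * (rbf(p["arch"], q["arch"]) + rbf(p["train"], q["train"]))

def product_kernel(p, q):
    # Dependent groups: points are similar only if both groups are close.
    return rbf(p["arch"], q["arch"]) * rbf(p["train"], q["train"])

def mixed_kernel(p, q, w_add=0.5, w_prod=0.5):
    # Somewhere in between: do both and weight them.
    return w_add * additive_kernel(p, q) + w_prod * product_kernel(p, q)

# Two configurations with identical training settings but very different architectures.
p = {"arch": 0.0, "train": 0.0}
q = {"arch": 5.0, "train": 0.0}

k_add = additive_kernel(p, q)   # about 0.5: the shared training settings still count
k_prod = product_kernel(p, q)   # nearly 0: the architecture mismatch wipes it out
k_mix = mixed_kernel(p, q)
```

Under the additive kernel, what the model learned about these training settings transfers across architectures. Under the product kernel it does not, so every architecture has to be explored from scratch.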

In owning your search, another important degree of freedom is choosing useful features for the GP kernel. Your hyperparameters might not be a good basis for a GP. You might want to configure your GP to transform the hyperparameter vector into a basis that enables the GP to make better predictions with fewer data points, maybe simply by adding a few redundant features to the hyperparameter vectors that may prove useful. This insight can be combined with the “orthogonal” trick above to greatly reduce the effective size of the search space.
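As one example of a redundant feature, the sketch below takes several size-like hyperparameters, adds the log of their geometric mean as an "overall size" feature, and re-expresses each one relative to that mean. This is my own plain-Python illustration, not the outerloop implementation; the helper name add_gmean_features is hypothetical.

```python
import math

def add_gmean_features(params, names, gmean_name):
    """Return the log geometric mean of `names` plus each log value relative to it."""
    logs = {f"log_{n}": math.log(params[n]) for n in names}
    # The mean of the logs is the log of the geometric mean.
    log_gmean = sum(logs.values()) / len(logs)
    features = {gmean_name: log_gmean}
    for key, value in logs.items():
        features[f"{key}_div_gmean"] = value - log_gmean
    return features

params = {"conv1_channels": 16, "conv2_channels": 32,
          "conv3_channels": 64, "dense1_units": 128}
features = add_gmean_features(params, list(params),
                              "log_gmean_channels_and_units")
```

A kernel can then treat "how big is the network overall" and "how is that size distributed across layers" as separate groups of features.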

Opportunities also lie in having fine-grained control over the optimization loop. For example, I have found it useful to create a “tick-tock” optimization loop that alternates between multiple objectives (accuracy and training time).

The project: Is hands-on Bayesian Optimization worthwhile?

This question has multiple facets, each involving wearing a different hat:

Is it usable? What is the right user interface? What is the overall on-ramp / learning process for a person motivated to learn hands-on Bayesian Optimization?
Is it fast? I am proposing we use more complicated kernels. Is that tractable, from a performance standpoint?
Does it work? Are there considerable benefits in hands-on Bayesian Optimization, relative to a one-size-fits-all approach? Is it much better than random search?

Is it usable?

There is a user interface problem here that needs solving. The first UI to build is a first-class programming interface. Later we can consider whether a graphical UI is needed.

I draw inspiration from D3, a library for data visualization. In a world of people trying to build “easy” visualization tools, D3 took a different approach. D3 doesn’t generate visualizations; D3 helps you program your own visualizations. Rather than exposing some magic function that outputs a customized pre-built visualization, D3 provides a useful set of primitives, then it gets out of the way and lets you write code. It takes time to learn, but once you have learned it, you are powerful. It’s usually a tortoise-and-the-hare phenomenon; the person who chooses D3 ends up winning the race (and winning with style).

I think BoTorch (with gpytorch) is a good start at filling this role for Bayesian Optimization, but it’s not finished. It provides the core algorithmic tools of Bayesian Optimization, but doesn’t provide a way for describing a search space (for example). When you read botorch tutorials, they essentially recommend that you don’t use botorch directly, but instead use a wrapper like Ax, but Ax is more of a one-size-fits-all framework, adding a lot of unnecessary friction to hands-on Bayesian Optimization. I would rather see botorch grow outward into a usable library, giving you all the pieces you need to write your own Bayesian Optimization. They could aim to be the D3 of Bayesian Optimization. I found myself needing to “finish” botorch with my own makeshift library that adds search spaces with conditional parameters, an expanded set of input transform utilities, and random search utilities.
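To give a feel for what "search spaces with conditional parameters" means, here is a small standard-library sketch of random search over such a space. It is not the outerloop API; the tuple layout, SPACE, and sample are all made up for this example.

```python
import random

# Each entry is (name, sampler, condition). A condition, if present,
# receives the values chosen so far and decides whether the parameter applies.
SPACE = [
    ("optimizer", lambda rng: rng.choice(["adam", "sgd"]), None),
    ("nesterov", lambda rng: rng.choice([True, False]),
     lambda chosen: chosen["optimizer"] == "sgd"),
    ("epochs", lambda rng: rng.randint(2, 60), None),
]

def sample(space, rng):
    """Draw one configuration, skipping parameters whose condition is not met."""
    chosen = {}
    for name, sampler, condition in space:
        if condition is None or condition(chosen):
            chosen[name] = sampler(rng)
    return chosen

rng = random.Random(0)
configs = [sample(SPACE, rng) for _ in range(20)]
```

Every sampled configuration contains "nesterov" only when the optimizer is "sgd", so downstream code never sees a meaningless value for an inactive parameter.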

Is it fast?

I am proposing that we take one-size-fits-all GP kernels and replace them with customized compositions of kernels. I was worried that these compositions of kernels would be slow, and, yes, by default, they are. But if you write them efficiently, they are very fast. I ended up writing a WeightedSPSMaternKernel, i.e. a “weighted sum of products of sums Matern kernel” which runs many kernels in parallel. This satisfied my needs.

Otherwise, I’m also proposing we start performing transformations on feature vectors. This is already a supported use case and is fast.

Optional reading: In case you want details, here is my code that describes a 22-dimensional hyperparameter space and how it is transformed into a search space (for random search / optimization) and then into a feature space (for the GP). Vexpr and outerloop are libraries I created. Vexpr makes it possible to write fast, readable compositional GP kernels, and outerloop attempts to “finish” botorch.


import botorch
import gpytorch
import outerloop as ol
import outerloop.vexpr.torch as ovt
import torch
import vexpr as vp
import vexpr.torch as vtorch
import vexpr.custom.torch as vctorch

parameter_space = [
    ol.Choice("optimizer", ["adam", "sgd"]),
    ol.Choice("nesterov", [True, False],
              condition=lambda choices: choices["optimizer"] == "sgd"),
    ol.Int("epochs", 2, 60),
    ol.Int("batch_size", 16, 4096),

    ol.Scalar("conv1_weight_decay", 1e-7, 3e-1),
    ol.Scalar("conv2_weight_decay", 1e-7, 3e-1),
    ol.Scalar("conv3_weight_decay", 1e-7, 3e-1),
    ol.Scalar("dense1_weight_decay", 1e-7, 3e-1),
    ol.Scalar("dense2_weight_decay", 1e-7, 3e-1),

    ol.Scalar("1cycle_initial_lr_pct", 1/80, 1/2),
    ol.Scalar("1cycle_final_lr_pct", 1/30000, 1/100),
    ol.Scalar("1cycle_pct_warmup", 0.01, 0.5),
    ol.Scalar("1cycle_max_lr", 0.01, 20.0),
    ol.Scalar("1cycle_max_momentum", 0, 0.9999,
              condition=lambda choices: choices["optimizer"] == "sgd"),
    ol.Scalar("1cycle_min_momentum_pct", 0.0, 1.0,
              condition=lambda choices: choices["optimizer"] == "sgd"),

    ol.Int("conv1_channels", 4, 64),
    ol.Int("conv2_channels", 8, 128),
    ol.Int("conv3_channels", 16, 256),
    ol.Int("dense1_units", 8, 256),
]

# Transform to log-space
xform = ol.transforms.ToScalarSpace(
    parameter_space,
    ol.transforms.log({
        "epochs": "log_epochs",
        "batch_size": "log_batch_size",
        "1cycle_initial_lr_pct": "log_1cycle_initial_lr_pct",
        "1cycle_final_lr_pct": "log_1cycle_final_lr_pct",
        "1cycle_max_lr": "log_1cycle_max_lr",
        "1cycle_pct_warmup": "log_1cycle_pct_warmup",
        "1cycle_momentum_max_damping_factor":
        "log_1cycle_momentum_max_damping_factor",
        "1cycle_momentum_min_damping_factor_pct":
        "log_1cycle_momentum_min_damping_factor_pct",
        "1cycle_beta1_max_damping_factor":
        "log_1cycle_beta1_max_damping_factor",
        "1cycle_beta1_min_damping_factor_pct":
        "log_1cycle_beta1_min_damping_factor_pct",
        "beta2_damping_factor": "log_beta2_damping_factor",
        "conv1_weight_decay": "log_conv1_weight_decay",
        "conv2_weight_decay": "log_conv2_weight_decay",
        "conv3_weight_decay": "log_conv3_weight_decay",
        "dense1_weight_decay": "log_dense1_weight_decay",
        "dense2_weight_decay": "log_dense2_weight_decay",
        "conv1_channels": "log_conv1_channels",
        "conv2_channels": "log_conv2_channels",
        "conv3_channels": "log_conv3_channels",
        "dense1_units": "log_dense1_units",
    })
)


class VexprHandsOnLossModel(botorch.models.SingleTaskGP):
    def __init__(self, ...):

        # Transforms for the kernel

        xforms += [
            ol.transforms.append_mean(
                ["log_conv1_channels", "log_conv2_channels",
                 "log_conv3_channels", "log_dense1_units"],
                "log_gmean_channels_and_units"),
            ol.transforms.subtract(
                {"log_conv1_channels": "log_conv1_channels_div_gmean",
                 "log_conv2_channels": "log_conv2_channels_div_gmean",
                 "log_conv3_channels": "log_conv3_channels_div_gmean",
                 "log_dense1_units": "log_dense1_units_div_gmean"},
                "log_gmean_channels_and_units"),
            ol.transforms.add(
                {"log_1cycle_initial_lr_pct": "log_1cycle_initial_lr",
                 "log_1cycle_final_lr_pct": "log_1cycle_final_lr"},
                "log_1cycle_max_lr"),
            ol.transforms.add(
                {"log_1cycle_momentum_min_damping_factor_pct":
                 "log_1cycle_momentum_min_damping_factor"},
                "log_1cycle_momentum_max_damping_factor"),
            ol.transforms.add(
                {"log_1cycle_beta1_min_damping_factor_pct":
                 "log_1cycle_beta1_min_damping_factor"},
                "log_1cycle_beta1_max_damping_factor"),
            ol.transforms.append_mean(
                ["log_conv1_weight_decay", "log_conv2_weight_decay",
                 "log_conv3_weight_decay", "log_dense1_weight_decay",
                 "log_dense2_weight_decay"],
                "log_gmean_weight_decay"),
            ol.transforms.subtract(
                {"log_conv1_weight_decay": "log_conv1_wd_div_gmean",
                 "log_conv2_weight_decay": "log_conv2_wd_div_gmean",
                 "log_conv3_weight_decay": "log_conv3_wd_div_gmean",
                 "log_dense1_weight_decay": "log_dense1_wd_div_gmean",
                 "log_dense2_weight_decay": "log_dense2_wd_div_gmean"},
                "log_gmean_weight_decay"),
            partial(ol.transforms.ChoiceNHotProjection,
                    out_name=N_HOT_PREFIX)
        ]

        # ...


def make_handson_kernel(space, batch_shape=()):
    """
    This kernel attempts to group parameters into orthogonal groups, while
    also always allowing for the model to learn to use the joint space.
    """
    zero_one_exclusive = partial(gpytorch.constraints.Interval,
                                 1e-6,
                                 1 - 1e-6)

    state = State(batch_shape)

    ialloc = IndexAllocator()

    lengthscale = vp.symbol("lengthscale")
    x1 = vp.symbol("x1")
    x2 = vp.symbol("x2")

    def index_for_name(name):
        return next(i for i, p in enumerate(space) if p.name == name)

    def scalar_kernel(names):
        ls_indices = ialloc.allocate(len(names))
        indices = [index_for_name(name) for name in names]
        return ovt.matern(
            vtorch.cdist(x1[..., indices] / lengthscale[ls_indices],
                         x2[..., indices] / lengthscale[ls_indices],
                         p=2),
            nu=2.5)

    def choice_kernel(names):
        ls_indices = ialloc.allocate(len(names))
        indices = [index_for_name(name) for name in names]
        return ovt.matern(
            vtorch.cdist(x1[..., indices] / lengthscale[ls_indices],
                         x2[..., indices] / lengthscale[ls_indices],
                         p=1),
            nu=2.5)


    def scalar_factorized_and_joint(names, suffix):
        w_additive = vp.symbol("w_additive" + suffix)
        alpha_factorized_or_joint = vp.symbol("alpha_factorized_or_joint"
                                              + suffix)
        state.allocate(w_additive, (len(names),),
                       zero_one_exclusive(),
                       ol.priors.DirichletPrior(torch.full((len(names),), 2.0)))
        state.allocate(alpha_factorized_or_joint, (),
                       zero_one_exclusive(),
                       ol.priors.BetaPrior(4.0, 1.0))
        return vtorch.sum(
            vctorch.heads_tails(alpha_factorized_or_joint)
            * vtorch.stack([
                vtorch.sum(
                    w_additive
                    * vtorch.stack([scalar_kernel([name])
                                    for name in names],
                                   dim=-1),
                    dim=-1),
                scalar_kernel(names),
            ], dim=-1),
            dim=-1
        )

    def regime_kernels():
        return [
            # kernel: regime choice parameters
            choice_kernel([f"{N_HOT_PREFIX}{i}"
                           for i in range(4)]),

            # kernel: lr schedule
            scalar_factorized_and_joint(
                ["log_1cycle_initial_lr", "log_1cycle_final_lr",
                 "log_1cycle_max_lr", "log_1cycle_pct_warmup"],
                "_lr"),

            # kernel: momentum schedule
            scalar_factorized_and_joint(
                ["log_1cycle_momentum_max_damping_factor",
                 "log_1cycle_momentum_min_damping_factor",
                 "log_1cycle_beta1_max_damping_factor",
                 "log_1cycle_beta1_min_damping_factor",
                 "log_beta2_damping_factor"],
                "_momentum"),

            # kernel: relative weight decay
            scalar_factorized_and_joint(
                ["log_conv1_wd_div_gmean", "log_conv2_wd_div_gmean",
                 "log_conv3_wd_div_gmean", "log_dense1_wd_div_gmean",
                 "log_dense2_wd_div_gmean"],
                "_wd"),
        ]

    regime_joint_names = ["log_epochs", "log_batch_size",
                          "log_gmean_weight_decay"]

    def architecture_kernels():
        return [
            # kernel: relative channel and unit counts
            scalar_factorized_and_joint(["log_conv1_channels_div_gmean",
                                         "log_conv2_channels_div_gmean",
                                         "log_conv3_channels_div_gmean",
                                         "log_dense1_units_div_gmean"],
                                        "_units_channels"),
        ]

    architecture_joint_names = ["log_gmean_channels_and_units"]

    regime_kernel = vctorch.fast_prod_positive(
        vtorch.stack(([scalar_kernel(regime_joint_names)]
                      + regime_kernels()),
                     dim=-1),
        dim=-1)
    architecture_kernel = vctorch.fast_prod_positive(
        vtorch.stack(([scalar_kernel(architecture_joint_names)]
                      + architecture_kernels()),
                     dim=-1),
        dim=-1)
    joint_kernel = vctorch.fast_prod_positive(
        vtorch.stack(([scalar_kernel(regime_joint_names
                                     + architecture_joint_names)]
                      + regime_kernels()
                      + architecture_kernels()),
                     dim=-1),
        dim=-1)

    alpha_regime_vs_architecture = vp.symbol("alpha_regime_vs_architecture")
    alpha_factorized_vs_joint = vp.symbol("alpha_factorized_vs_joint")
    scale = vp.symbol("scale")

    state.allocate(alpha_regime_vs_architecture, (),
                   zero_one_exclusive(),
                   ol.priors.BetaPrior(2.0, 2.0))
    state.allocate(alpha_factorized_vs_joint, (),
                   zero_one_exclusive(),
                   ol.priors.BetaPrior(4.0, 1.0))
    state.allocate(scale, (),
                   gpytorch.constraints.GreaterThan(1e-4),
                   gpytorch.priors.GammaPrior(2.0, 0.15))

    kernel = (scale
              * vtorch.sum(vctorch.heads_tails(alpha_factorized_vs_joint)
                           * vtorch.stack(
                               [vtorch.sum(
                                   vctorch.heads_tails(
                                       alpha_regime_vs_architecture)
                                   * vtorch.stack([
                                       regime_kernel,
                                       architecture_kernel
                                   ], dim=-1),
                                   dim=-1),
                                joint_kernel],
                               dim=-1),
                           dim=-1))

    state.allocate(lengthscale, (ialloc.count,),
                   gpytorch.constraints.GreaterThan(1e-4),
                   gpytorch.priors.GammaPrior(3.0, 6.0))

    return kernel, state.modules

You can read the rest of the code here.

Does it work?

My initial results are promising, but my baseline (the default BoTorch model) has some issues that might need to be ironed out.

I ran 160 semi-random MNIST LeNet training experiments on a 22-dimensional hyperparameter space. I fed the results to a Gaussian Process and checked its predictive power for held out experiments, performing leave-one-out cross-validation.

For a baseline one-size-fits-all model, I was hoping to just use BoTorch’s default MixedSingleTaskGP off the shelf, but it does not use priors on its parameters and is hence prone to overfitting and to assigning extremely low log probabilities to held-out results. I added priors from one of their other models, and this partially solved the issue.

For the hands-on model, I designed a kernel off the top of my head, trying to capture some of my intuition about my hyperparameters. This was without any iteration: I just guessed at what kernel might work, chose priors that seemed sensible, and now I’m sharing my first results.

As a sanity check, here is the raw data for cross-validation on 60 of the 160 experiments. To generate each point, the model is trained on 59 configurations and their outputs, then must predict the 60th output given its configuration. (Click or zoom to see details.)
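The leave-one-out procedure described above can be sketched in a few lines. This is a minimal illustration, not the code behind the charts: `constant_model` is a hypothetical stand-in for the GP (it ignores the configuration), and the configurations and accuracies are made up.

```python
import math
import statistics


def gaussian_log_density(y, mean, std):
    """Log density of y under a normal distribution N(mean, std^2)."""
    return (-0.5 * math.log(2 * math.pi * std ** 2)
            - (y - mean) ** 2 / (2 * std ** 2))


def constant_model(train_x, train_y, test_x):
    """Hypothetical stand-in for a GP: ignores the configuration entirely."""
    return statistics.mean(train_y), statistics.stdev(train_y)


def leave_one_out(configs, outputs, fit_and_predict):
    """Hold out each experiment once; return (predicted mean, log density)."""
    results = []
    for i in range(len(outputs)):
        train_x = configs[:i] + configs[i + 1:]
        train_y = outputs[:i] + outputs[i + 1:]
        mean, std = fit_and_predict(train_x, train_y, configs[i])
        results.append((mean, gaussian_log_density(outputs[i], mean, std)))
    return results


# Made-up configurations and accuracies, for illustration only.
configs = [[0.1], [0.2], [0.3], [0.4], [0.5]]
outputs = [0.90, 0.92, 0.95, 0.93, 0.91]

results = leave_one_out(configs, outputs, constant_model)
mean_log_prob = statistics.mean([log_p for _, log_p in results])
mean_abs_error = statistics.mean(
    [abs(mean - y) for (mean, _), y in zip(results, outputs)])
print(mean_log_prob, mean_abs_error)
```

The two numbers printed at the end correspond to the two kinds of success metric discussed here: the mean log probability of the held-out output, and the error of a single point prediction.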

Here are the results in aggregate, using two different success metrics, and plotted for different subsets of the 160 experiments. I gathered this on 50 different shuffles of the data and I plot the mean.

On average, for small-to-medium datasets the “hands-on” model assigned higher probability density to the “correct” output for the held-out data. (It is important to use geometric mean, i.e. take the mean of the log probability, so that we are sure to penalize very low probabilities.) When using the model in MLE mode (maximum likelihood estimator) to output a single prediction, the hands-on model on average has lower prediction error.
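To see why the geometric mean matters, here is a small sketch with made-up density values (they are not taken from the experiment). An overconfident model that is badly wrong once can still look good under the arithmetic mean, while the mean of the log probability penalizes it heavily.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """exp of the mean log value, i.e. the mean taken in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical predictive densities assigned to five held-out results.
cautious = [0.5, 0.5, 0.5, 0.5, 0.5]
overconfident = [0.9, 0.9, 0.9, 0.9, 1e-6]  # one confident miss

print("arithmetic:", arithmetic_mean(cautious), arithmetic_mean(overconfident))
print("geometric: ", geometric_mean(cautious), geometric_mean(overconfident))
```

Under the arithmetic mean the overconfident model ranks higher; under the geometric mean the single near-zero density drags it well below the cautious model.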

Toward the end, the hands-on model falls behind the baseline. The MLE predictions aren’t obviously worse, but the mean log probability is, suggesting that the hands-on model gives more confident predictions than it ought to. Thus, there is room for improvement. Still, it is promising that my first try was this good. I suspect a little bit of iteration on the kernel would lead to one that is superior for all dataset sizes.

Early in the chart, the baseline displays a strange phenomenon where it fails to improve as it receives more data. It is worth doing some due diligence here and trying to understand what is going on. Maybe the baseline has a low-hanging possible improvement, and maybe that improvement could also be brought to the hands-on model.

My conclusions from this experiment are:

The initial results are promising, but not conclusive.
The weird baseline phenomenon illustrates my point that you should be careful about treating Bayesian Optimization as a black box. I suspect Bayesian Optimization is only worthwhile if we figure out how to make users take ownership of their search.

(This project is supported by a GCP cloud compute grant from ML Collective, which has been super helpful.)

Imagine A Deep Network That Performs Successive Cheap Queries On Its Input

Fri, 08 Jul 2022 02:00:00 -0700

(A writeup of a side project that I’m working on. I’ve posted working code on GitHub for both PyTorch and TensorFlow.)

Imagine a deep neural network that has many chances to query information from its input, rather than just one chance, a network that reacts to the currently accumulated information to select its next query. How might this network work differently from today’s neural networks?

I think this type of deep network would be more opportunistic and symbiotic with the input. Rather than copying all the needed information from an input image into neural activations, the network would let the image continue to store the bulk of the information, and the network would perform a series of economical queries on the image. Neural representations would not store the answers to every conceivable question about the image. Instead, they would provide the minimal information needed to quickly answer questions via subsequent image queries, when that answer is needed.

The core benefit: Doing more with less

This type of network can do more with less. It iteratively gathers some information, then decides what information to gather next. Standard feedforward networks gather all the information they might need up front, and as a result a lot of processing is spent gathering information that does not end up being used.

This is especially evident on trivial toy data. Consider classifying which integer on a number line a point is nearest to, between \(1\) and \(N\). The neural network must output a unique binary representation for each integer, and it can only use weights, biases, and nonlinearities. (This may seem like a bad example of the type of task performed by a neural network, but it captures the core complexity of inferring the class of an observation generated by a mixture of high-dimensional Gaussians.) A standard neural network will need to perform \((N - 1)\) queries. An iterative network only needs to perform \(\log_2 N\).
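The query counts can be checked with a small sketch. It treats each "query" as a single threshold comparison, which is an assumption made for illustration; the function names are hypothetical and this is plain Python, not a neural network.

```python
def classify_static(x, n):
    """Evaluate every threshold up front, as a non-reactive network must."""
    answers = [x > k + 0.5 for k in range(1, n)]
    return 1 + sum(answers), len(answers)


def classify_iterative(x, n):
    """Choose each threshold based on the answers so far (binary search)."""
    lo, hi = 1, n
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if x > mid + 0.5:
            lo = mid + 1
        else:
            hi = mid
    return lo, queries


print(classify_static(11.3, 16))     # (nearest integer, queries used)
print(classify_iterative(11.3, 16))
```

With 16 classes the static version always spends 15 comparisons, while the iterative version reaches the same answer in 4.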

Because the traditional static parallel approach can’t adapt its queries in response to the result of other queries, it has to do a lot more work.

This principle scales up to real-world data. Consider any task that involves observing an image. A sensible approach might involve first performing a low-resolution initial pass on the image, looking for regions of interest, followed by more detailed queries of those regions of interest. This will be more efficient and powerful than the non-reactive version, which would need to perform many different detailed queries on every part of the image. Yes, I may be underestimating the cleverness of a deep network, but I hope you can see the appeal of the idea.

Traditionally, Deep Learning emphasizes parallel distributed processing. This reactive approach tries to bring back some sequential (as opposed to distributed) processing, getting the best of both worlds.

How to perform a reactive query

There are many ways to make a neural network dynamically query its input. One way is via a form of attention that mimics a saccading vision system. I opted, instead, to use dynamically generated weights. Given a context vector, the model passes that vector into a normal neural network to output a set of weights, which are then run on the input image. This weight-generating network is called a hypernetwork. For image processing the model might generate the weights of a 4-layer convolutional neural network with a relatively small number of channels. This dynamically generated network is run on the image and outputs a new context vector, which can then be used to generate a new network, and so on. Context vectors are aggregated over time, either through residual connections or LSTM/GRU-like gates.

Hardware efficiency

For deep learning, it is prudent to use algorithms that support training in batches. If training in batches isn’t possible, you face an uphill battle. (This is part of why Transformers have overtaken recurrent neural networks for sequence processing.)

I was happy to find that this algorithm supports efficient training in batches, despite the fact that it departs from traditional deep networks by applying different weights to each input image. This operation can still be performed in parallel by implementing dynamic linear layers as batch matrix multiplications and dynamic convolutional layers as group convolutions.
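Here is a minimal PyTorch sketch of those two tricks, assuming per-example weights are already available as tensors. The function names and tensor shapes are illustrative assumptions, not taken from the project's code; only `torch.bmm` and `torch.nn.functional.conv2d` with its `groups` argument are relied on.

```python
import torch
import torch.nn.functional as F


def dynamic_linear(x, weights):
    """Apply a different weight matrix to each example.

    x: (B, D_in), weights: (B, D_out, D_in) -> (B, D_out)
    """
    return torch.bmm(weights, x.unsqueeze(-1)).squeeze(-1)


def dynamic_conv2d(x, weights):
    """Apply a different convolution kernel to each example.

    x: (B, C_in, H, W), weights: (B, C_out, C_in, kH, kW)
    The batch is folded into the channel dimension, and groups=B keeps
    each example's channels paired with that example's own filters.
    """
    b, c_in, h, w = x.shape
    c_out = weights.shape[1]
    out = F.conv2d(x.reshape(1, b * c_in, h, w),
                   weights.reshape(b * c_out, c_in, *weights.shape[3:]),
                   groups=b)
    return out.reshape(b, c_out, *out.shape[2:])


torch.manual_seed(0)
images = torch.randn(3, 2, 8, 8)
kernels = torch.randn(3, 4, 2, 3, 3)
print(dynamic_conv2d(images, kernels).shape)
```

Both functions give the same result as looping over the batch and applying each example's weights separately, but they run as a single batched operation.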

This algorithm has a mix of performance costs and performance benefits that may balance out. On one hand, using different weights for every input image is more expensive. On the other hand, the dynamic approach enables each query to be significantly less complex, so each weight tensor and activation tensor is much smaller.

The neuroscience angle

This algorithm was partly inspired by a theory from neuroscience, and it provides a concrete illustration of what types of neural representations we might find in the brain.

Dana Ballard and colleagues have long written about how the most economical solution to the inference problem changes entirely when you consider systems that are capable of actively responding to the world. They have criticized the classic Marr view that the human visual system builds up a representation of “what is where”. (This classic view overlaps heavily with some of my previous neuroscience work with Numenta.) Instead, Ballard and colleagues write:

the visual system is used to subserve problem-solving behaviors and such behaviors often do not require an accurate model of the world in the traditional sense of remembering positions of people and objects in the room

and

in a dynamic world, the cost of maintaining the correspondence between the representation and the world becomes prohibitive. For this reason animate vision systems may have to travel light and depend on highly adaptive behaviors that can quickly discover how to use current context

This algorithm uses a simple form of behavior – a dynamic query on the input image – and it “travels light” by using a small context vector whose job is to customize these dynamic queries and help them succeed in answering questions about the input image. This type of neural representation – a population whose job is to help the network figure out what actions it should take to answer subsequent questions – is an enticing model of what might be happening in the brain. Many thousands of careers have been spent trying to understand neural representations, especially in the hippocampus and mammalian neocortex. Some cortical representations have been partially explained, but most have been difficult to interpret. The neurons in this algorithm, similarly, would be hard to interpret, because they are used to complement the outside world and make it easy to query, rather than coding it directly.

Why it might not work

This adaptive query architecture might have too much potential to overfit. In terms of raw potential, traditional deep networks are inferior to these networks, but that inferiority may constrain them to only learn algorithms that successfully generalize outside the training set. By switching to an iterative reactive approach, we give the network potential to discover more powerful and economical algorithms, but gradient descent may be prone to finding solutions that don’t generalize. Maybe traditional deep architectures and traditional gradient descent are pair-bonded, and more advanced architectures will require slightly different learning algorithms.

Current status

The algorithm works reasonably well on toy data and MNIST (>99% accuracy, etc.) using fewer total convolutional FLOPS than a comparable static network. Preliminarily, it seems to work fine on CIFAR10, but I haven’t pushed it very hard. I’m still early in the tweaking / hyperparameter tuning process. Feel free to take a look at the code and run it yourself.

(Thanks to Eric Frank for helpful discussions on hypernetworks. Also, I decided to share this unfinished project after reading a Twitter thread from Rosanne Liu.)

Bayesian Optimization Is More Basis-Dependent Than You Might Think

Sun, 26 Jun 2022 03:00:00 -0700

(A cautionary tale, illustrated via a toy example. If you do Bayesian hyperparameter tuning, you will want to know about this. Here’s a notebook with all of my code.)

Here are two rotations of a function.

We are going to examine how well a Gaussian Process (GP) models this function.

I have provided example axis labels, but feel free to substitute your own. In my example scenario, we are training a neural network, first running it for some number of epochs at a high learning rate, then running the remainder of epochs at a low learning rate. We configure this training regime using two hyperparameters. The resulting accuracy is plotted in color. Either chart shows that having more epochs is better, and that ideally we will drop the learning rate for only the final epoch. Having more than one low-learning-rate epoch leads to overfitting, shown as a reduction in accuracy left of the diagonal (or when “# of epochs at low learning rate” is 2).

Of course, we don’t have easy access to this function. Evaluating it is expensive. So we try to model the function from a few observations so that we can predict which hyperparameters are worth testing. Below, a few observations of the function are shown, and the predictions of a GP for the remaining values are plotted on the right. I’ve fit this function using the default configuration of Ax, a popular open-source library for Bayesian Optimization.

That didn’t turn out well. The blue predictions are pretty far removed from the orange actual function. Now let’s try again with the rotated version of the function.

Much better. As you can see, the parameterization matters a lot. Even just rotating the parameterization can make a big difference in whether the Gaussian Process accurately models the function. This may come as a surprise, since many optimization algorithms (e.g. gradient descent) are rotation-invariant, and multivariate Gaussians are known for having no preferred orientation. Why are Gaussian Processes different?

When you fit a GP to data, the model is not learning how the axes (hyperparameters) covary with the scalar output (accuracy). Instead, it is learning how accuracies for points in hyperparameter space covary with each other, and it typically does this by learning a metric for distances between points. Using this metric, it predicts values using a method similar to a weighted average. Here are some example distances from the metric that the GP would ideally learn, denoted “Near” and “Far”:

Typically, fitting the GP to the data involves choosing a lengthscale for each axis, normalizing distance per axis. The problem with Parameterization 1 is that the most useful directions in hyperparameter space are not axis-aligned; they are diagonal. In Parameterization 1, changing either hyperparameter in isolation leads to a big change in the scalar output. In Parameterization 2, one of the hyperparameters causes very small changes, while the other causes large changes. A GP whose covariance kernels use only lengthscales can’t effectively fit Parameterization 1, but it can fit Parameterization 2. In Parameterization 1, it is unable to learn that the scalar output changes slowly when both hyperparameters are changed together.
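A small sketch makes the limitation concrete. It assumes the metric is a Euclidean distance after dividing each axis by its lengthscale, and it uses two hypothetical displacements from the origin: one along the diagonal (the direction in which the output should change slowly) and one across it.

```python
import math


def scaled_distance(a, b, lengthscales):
    """Euclidean distance after dividing each axis by its lengthscale."""
    return math.sqrt(sum(((ai - bi) / ls) ** 2
                         for ai, bi, ls in zip(a, b, lengthscales)))


def rotate_45(p):
    """Re-express a point in axes aligned with the diagonals."""
    x, y = p
    return ((x + y) / math.sqrt(2), (y - x) / math.sqrt(2))


origin = (0.0, 0.0)
along_diagonal = (1.0, 1.0)    # should be "Near"
across_diagonal = (1.0, -1.0)  # should be "Far"

# Original axes: no choice of per-axis lengthscales separates the two.
for ls in [(1.0, 1.0), (10.0, 0.1), (0.3, 7.0)]:
    print(ls,
          scaled_distance(origin, along_diagonal, ls),
          scaled_distance(origin, across_diagonal, ls))

# Rotated axes: one long and one short lengthscale does the job.
ls = (10.0, 0.1)
near = scaled_distance(rotate_45(origin), rotate_45(along_diagonal), ls)
far = scaled_distance(rotate_45(origin), rotate_45(across_diagonal), ls)
print(near, far)
```

In the original axes the two displacements differ only in sign, and squaring removes the sign, so they always come out equally far. After the rotation, each displacement lies along its own axis and a lengthscale per axis is enough.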

This example is representative of a larger class of examples. You can imagine many other groups of hyperparameters that ought to change together. Learning rate and batch size, batch size and number of epochs, number of channels in one layer and in another layer… the list goes on. Each of these indicates useful directions in hyperparameter space that aren’t axis-aligned.

A straightforward solution would be to use a more powerful covariance kernel that is capable of learning arbitrary linear transformations, rather than just the diagonal matrix of lengthscales. It would then be able to discover a useful set of axes, rather than just scaling the provided axes. Would this be a good default strategy, or are there downsides? I’m not sure. It would be great if fitting a GP to a set of experiment results would automatically find useful directions in hyperparameter space, rather than assuming that you know the useful directions a priori.

Regardless, a top-level takeaway here is that you should not just blindly trust your GP. Only trust its suggestions if it performs well on cross-validation.

Likely ≠ Typical: A Viewpoint On Why We Perturb Neural Networks

Fri, 28 Jan 2022 02:00:00 -0800

Here is a fun explanation of why neural networks do better when trained with noise. There are multiple existing explanations, but I particularly like this one.

First, an aside: Given 100 flips of an unfair coin with 60% probability of heads, the single-most-likely sequence is 100 consecutive heads, but a typical sequence will have about 60 heads. “Likely” and “typical” are two different things, and often what you really want is the “typical”.
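The coin example is easy to check numerically. This sketch uses only the numbers stated above (100 flips, 60% heads); the window of 55 to 65 heads is an arbitrary choice made here to stand in for "about 60".

```python
import math

P_HEADS = 0.6
N_FLIPS = 100


def sequence_log_prob(n_heads):
    """Log probability of one specific sequence containing n_heads heads."""
    return (n_heads * math.log(P_HEADS)
            + (N_FLIPS - n_heads) * math.log(1 - P_HEADS))


def count_prob(n_heads):
    """Probability that the number of heads is n_heads, over all orderings."""
    return (math.comb(N_FLIPS, n_heads)
            * P_HEADS ** n_heads * (1 - P_HEADS) ** (N_FLIPS - n_heads))


# The all-heads sequence beats any single sequence with 60 heads...
print(sequence_log_prob(100), sequence_log_prob(60))

# ...but there are so many 60-ish-heads sequences that they hold the mass.
print(count_prob(100), sum(count_prob(k) for k in range(55, 66)))
```

The first line compares individual sequences (density); the second compares sets of sequences (mass). The all-heads sequence wins the first comparison and loses the second by many orders of magnitude.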

Let’s apply this idea to neural networks.

Background

When training a neural network, results are often improved by randomly perturbing the network’s weights or activations. Two popular examples of this technique are Dropout, which randomly zeros activations, and Variational Dropout, which simulates noisy weights by multiplying or adding noise into activations.

These techniques may all work for the same fundamental reason. The authors of Variational Dropout made the case that, via the Central Limit Theorem, applying Dropout to activations is approximately the same as multiplying noise into the weights of the downstream layer, which is approximately the same as multiplying noise into the activations of the downstream layer. In this view, these techniques all have a unified explanation… but what is it?

One explanation is the Minimum Description Length principle (Hinton and van Camp 1993, Graves 2011, Kingma et al. 2015). When a weight is noisy, it contains less information. By Occam’s razor, a model containing less information is better and more likely to generalize. This view unifies Dropout and Variational Dropout with other forms of perturbation like Weight Decay.

Another explanation (Srivastava et al. 2014) is that Dropout forces the neural network to train many different subnetworks within the network, so that the network consists of a set of subnetworks forming consensus.

These explanations are intriguing, but they’re just stories that we tell ourselves. The Minimum Description Length principle is a heuristic, not provably correct unless we make some assumptions about priors (MacKay 1992). The ensemble-of-subnetworks is another fun idea. Here’s a third fun idea that makes very few assumptions.

The idea

We want a representative sample of the posterior \(P(W \mid D)\), not the most likely sample.

The following two images give the core intuition. The text describes it more rigorously.

From the standpoint of probability theory, when we train a neural network we are evaluating weights \(W\) based on their probability of generating the dataset \(D\), i.e. \(P(D \mid W)\). By Bayes’ Rule, optimizing the likelihood function \(P(D \mid W)\) w.r.t. \(W\), is equivalent to optimizing the posterior \(P(W \mid D)\). In other words, selecting the weights that are most likely to generate the dataset is the same as inferring the most likely weights. This leap is strange; I find it helpful to append an explicit hypothesis term \(H\), for example \(H=\) “this dataset can be generated by a particular convolutional neural network architecture”, giving us \(P(W \mid D, H)\). It’s under this hypothesis \(H\) that inferring \(W\) makes any sense at all.

(When the task is classification, the “dataset” is a set of labels \(Y\) for some known set of inputs \(X\), and we optimize \(P(Y \mid X, W)\). Also, to keep things simple, I’m not discussing priors. Inserting priors doesn’t change anything fundamental.)

In the schematic image above, red dots denote a set of \(N=6\) samples of a posterior. Note that it’s not actually tractable to sample this posterior. Instead, we optimize \(W\) to find a single good point in this posterior distribution. Depending on our search process and the actual shape of \(P(D \mid W)\), we may find ourselves at a local maximum (at the center of the red dots), or at a global maximum (blue). Here I show why the local maximum is often objectively better.

It’s important to realize that our fundamental goal is not to select the weights that were most likely to generate the dataset. Our goal is to infer an effective posterior predictive distribution \(P(\tilde{x} \mid D, H)\), or \(P(\tilde{y} \mid \tilde{x}, X, Y, H)\) for classification. In other words, we want to make predictions about observations outside of \(D\).

The optimal posterior predictive distribution is a weighted vote across all possible \(W\). For example, when performing classification, for each \(W\) we evaluate \(P(\tilde{y} \mid \tilde{x}, W, H)\) by passing \(\tilde{x}\) through the neural network, then we combine all of these results according to the weight \(W\)’s likelihood on the training set, \(P(Y \mid X, W)\).

In neural networks, we never perform this optimal approach. Typically we select one set of weights \(W\) and use that as our posterior predictive model. So it is important to select a \(W\) that is representative of a large portion of the probability mass, rather than simply selecting a \(W\) with a large probability density.

When we perturb neural networks, we build in an assumption that there exist large pockets of probability mass, and we constrain ourselves to find these. By selecting a point that is surrounded by more probability mass, we build a better estimator of the optimal \(P(\tilde{x} \mid D, H)\). There is nothing inherently wrong with the narrow spiky parts of \(W\)-space. Ideally they would have some influence on the model’s predictions, but on their own they’ll tend to be a poor approximation of the optimal \(P(\tilde{x} \mid D, H)\).
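A one-dimensional toy makes the density-versus-mass distinction concrete. Everything in this sketch is an assumption chosen for illustration: the posterior is a broad bump plus a tall narrow spike, `predict` is a hypothetical model output for a single test input, and the weighted vote is approximated on a grid.

```python
import math


def posterior_density(w):
    """Unnormalized toy posterior: broad bump at -2, narrow spike at 3."""
    broad = 1.0 * math.exp(-0.5 * ((w + 2.0) / 1.0) ** 2)
    spike = 3.0 * math.exp(-0.5 * ((w - 3.0) / 0.01) ** 2)
    return broad + spike


def predict(w):
    """Hypothetical model output for one test input, e.g. P(class 1)."""
    return 1.0 / (1.0 + math.exp(-w))


step = 0.001
grid = [-8.0 + i * step for i in range(16001)]
density = [posterior_density(w) for w in grid]

# Weighted vote across all w on the grid (the expensive, optimal answer).
total_mass = sum(density)
bayes_prediction = sum(d * predict(w)
                       for d, w in zip(density, grid)) / total_mass

w_most_likely = grid[density.index(max(density))]  # top of the spike
w_typical = -2.0                                   # center of the broad bump

error_most_likely = abs(predict(w_most_likely) - bayes_prediction)
error_typical = abs(predict(w_typical) - bayes_prediction)
print(w_most_likely, error_most_likely, error_typical)
```

The spike has the highest density but holds only a few percent of the mass, so the single point at the center of the broad bump lands much closer to the weighted vote than the global maximum does.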

Another way of saying this

Rather than talking about neural network weights, we can talk about computer programs.

Consider the set of all computer programs. You might denote each program by a unique binary string. You might limit them to a certain length and to a certain architecture / interpreter.

Our task is: given training data, decide how to process test data. Adopting the Bayesian point of view, the optimal strategy is: for every program, measure how well it reproduces the training data, then use this measure to decide how to weight this program’s vote. Run every program on the test data, then combine their weighted votes.

This optimal strategy is prohibitively expensive. To approximate it, we select a small number of programs, possibly just one. Naively, we may think that it’s best to choose the one with the highest weight. Suppose, instead, we sort the programs by similarity, and we look for clusters of relatively high weight in this sorted list. We now select a program that is at the center of one of these clusters. This program is probably a better approximation of the optimal solution, because it “speaks” for many legitimate programs.

Neural network weights \(W\) are a continuous space of computer programs, loosely sorted by similarity.

Some discussion

This view suggests that a good set of weights will have relatively high probability density and a relatively small curvature (second derivative). All other things equal, the lower the curvature’s magnitude, the more probability mass that this \(W\) probably “speaks” for. This idea is closely related to model comparison.
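One hedged way to make this concrete is the Laplace approximation, written here in one dimension. It is a standard approximation added for illustration rather than something derived in this post, and it measures curvature on the log of the posterior; \(W^*\) (a local maximum) and \(A\) are notation introduced only for this note.

\[
\int_{\text{near } W^*} P(W \mid D, H)\, dW \;\approx\; P(W^* \mid D, H)\, \sqrt{\frac{2\pi}{A}},
\qquad
A = -\left.\frac{\partial^2}{\partial W^2} \log P(W \mid D, H)\right|_{W = W^*}
\]

For a fixed density at the peak, a smaller \(A\) (a flatter peak) means the point \(W^*\) stands in for more probability mass.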

All of these ideas follow naturally from probability theory, so Bayesian techniques like Variational Dropout benefit from this phenomenon. Rather than measuring the second derivative, Variational Dropout reduces curvature w.r.t. each weight by pushing each weight to have higher variance.

It is well-known that using maximum likelihood as an optimization target can lead to goofy outcomes. Much of MacKay’s book focuses on how the single-most-likely outcome of a distribution is often atypical. Repeating my opening point: “likely” and “typical” are two different things, and often what you really want is the “typical”.

Thanks to Ian Eisenberg for reading drafts of this.

Some "Causal Inference" intuition

Thu, 04 Nov 2021 15:00:00 -0700

(2022-03-23: This is the second iteration of this post. The original post was an off-the-cuff Slack message that I posted here, and it described causal inference as “black magic”. Since writing that, I have read “The Book Of Why” by Judea Pearl and I now have a clearer understanding.)

Does smoking cause lung cancer? Here are three approaches to answering this question.

Approach 3 is particularly interesting.

Approach 1: Intervene

The gold standard for answering causal questions is randomized controlled experiments. Choose a population, split them into three random groups, force one group to smoke, force one group not to smoke, and use the other as a control group. We can denote the results of these experiments as \(P(\text{cancer} \mid do(\text{smoke}))\), \(P(\text{cancer} \mid do(\neg \text{smoke}))\), and \(P(\text{cancer})\), respectively. This technique always works (as long as you’re careful) but is often impractical / unethical.

In terms of a causal diagram, this intervention removes one of the arrows.

Thus, with intervention, we can be confident that any correlation between smoking and cancer indicates causation (in some direction), not confounding.

Is it possible to infer causality from observational data? Sometimes, yes, depending on the causal graph. Read on.

Approach 2: The back-door adjustment

The most well-known strategy for answering causal questions using observational data is to find all of the possible confounding factors and control for them. The simplest way to control for these factors is to split the population into subpopulations, one for each value of the “Confounders” variable. Then, each subpopulation has its own causal diagram

and any correlation will indicate causation (in some direction), because the confounders don’t vary within the subpopulation.

We can combine the results for subpopulations to predict the results of a randomized controlled trial for the full population:

\[\begin{align} P(y \mid do(x)) & = \sum\limits_{z}{P(z) P(y \mid do(x), z)} \\ & = \sum\limits_{z}{P(z) P(y \mid x, z)} \end{align}\]
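To see the adjustment formula at work, here is a minimal sketch with a made-up causal model: one binary confounder `z` that influences both `x` and `y`. All of the probabilities are hypothetical numbers chosen for illustration. The functions only look at the observational joint distribution, and `back_door` applies the second line of the formula.

```python
import numpy as np

# Hypothetical ground truth: Z -> X, Z -> Y, X -> Y (all variables binary).
p_z = np.array([0.7, 0.3])                 # P(z)
p_x_given_z = np.array([[0.8, 0.2],        # P(x | z=0)
                        [0.2, 0.8]])       # P(x | z=1)
p_y1_given_zx = np.array([[0.1, 0.3],      # P(y=1 | z=0, x)
                          [0.5, 0.7]])     # P(y=1 | z=1, x)

# Observational joint P(z, x, y), indexed as joint[z, x, y].
p_y_given_zx = np.stack([1 - p_y1_given_zx, p_y1_given_zx], axis=-1)
joint = p_z[:, None, None] * p_x_given_z[:, :, None] * p_y_given_zx

def naive(joint, x):
    """P(y=1 | x): plain conditioning, which mixes in the confounder."""
    return float(joint[:, x, 1].sum() / joint[:, x, :].sum())

def back_door(joint, x):
    """sum_z P(z) P(y=1 | x, z): the back-door adjustment."""
    p_z = joint.sum(axis=(1, 2))
    p_y1_given_xz = joint[:, x, 1] / joint[:, x, :].sum(axis=1)
    return float((p_z * p_y1_given_xz).sum())

print("naive     P(y=1 | x=1):    ", naive(joint, 1))
print("back-door P(y=1 | do(x=1)):", back_door(joint, 1))  # about 0.42
print("back-door P(y=1 | do(x=0)):", back_door(joint, 0))  # about 0.22
```

With these numbers, plain conditioning overstates the effect of `x`, because people with `z = 1` are both more likely to have `x = 1` and more likely to have `y = 1`.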

Accounting for all possible confounders is difficult. Are smokers also more likely to drink alcohol? Is their socioeconomic status lower? We must find all of these relevant factors and test whether smoking correlates with lung cancer within each subpopulation. Exhaustively listing these factors is sometimes possible, but often is not an option. As Fisher pointed out, it’s possible there is a genetic factor that causes cancer and also makes you more likely to smoke. How would you control for that?

Approach 3: The front-door adjustment

Here is a clever alternate strategy that allows us to infer causality from purely observational data. Rather than observing upstream factors like alcohol or genetic factors, we observe downstream factors. We find some other observable factor that is within the causal chain between “smoking” and “lung cancer”, for example, “tar in lungs”. If “tar in lungs” is a noisy mediator between smoking and lung cancer, then we can now infer causality by observing all three variables {smoking, tar in lungs, cancer} for the population. For “tar in lungs” to be a mediator, it must be the case that all causal influence of smoking on lung cancer is by way of tar in the lungs.

There are a few useful ways of describing this technique:

We use “smoking” to signal whether the unknown upstream factor is present, and we use “tar in lungs” as the more direct test of smoking-induced cancer.
Rather than directly attacking the question “Does smoking cause cancer?”, we answer the question “Does tar in the lungs cause cancer?”, and we perform the back-door adjustment (Approach 2) treating “smoking” as a confounder. We combine this information with “Does smoking cause tar in the lungs?” to get our final answer.
We let the universe perform a randomized controlled trial for us! The variable “tar in lungs” is a noisy mediator – it won’t always be equal to the truth value of the “smoking” variable. Some smokers won’t have tar. Perhaps even some non-smokers will have tar (due to non-modeled causes like pollution or job hazards). According to our causal diagram, there are no confounders causing these random occurrences that also influence lung cancer, so within a smoking or non-smoking subpopulation, this random noise is functionally equivalent to random interventions.

We can estimate the result of a randomized controlled trial as follows:

\[\begin{align} P(y \mid do(x)) & = \sum\limits_{z \in \{\text{tar}, \neg \text{tar}\}}{P(z \mid do(x)) P(y \mid do(z))} \\ & = \sum\limits_{z \in \{\text{tar}, \neg \text{tar}\}}{P(z \mid x)\sum\limits_{x' \in \{ \text{smoke}, \neg \text{smoke} \}}{P(x')P(y \mid x', z)}} \end{align}\]

and each of the terms in this expanded summation can be measured from observational data.
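As a sanity check, here is a minimal sketch with a made-up model that has a hidden confounder `u` (think of Fisher's hypothetical gene), a noisy mediator `z`, and no direct path from `x` to `y`. All numbers are hypothetical. The `front_door` function sees only the observed joint over `(x, z, y)`, while `truth` cheats by using the hidden variable.

```python
import numpy as np

# Hypothetical ground truth: U -> X, U -> Y, X -> Z, Z -> Y (all binary).
p_u = np.array([0.5, 0.5])                    # P(u), hidden from the analyst
p_x_given_u = np.array([[0.8, 0.2],           # P(x | u=0)
                        [0.2, 0.8]])          # P(x | u=1)
p_z_given_x = np.array([[0.9, 0.1],           # P(z | x=0): noisy mediator
                        [0.1, 0.9]])          # P(z | x=1)
p_y1_given_zu = np.array([[0.1, 0.4],         # P(y=1 | z=0, u)
                          [0.5, 0.8]])        # P(y=1 | z=1, u)

# Full joint P(u, x, z, y), then marginalize out the hidden variable u.
p_y_given_zu = np.stack([1 - p_y1_given_zu, p_y1_given_zu], axis=-1)
full = np.einsum("u,ux,xz,zuy->uxzy", p_u, p_x_given_u, p_z_given_x, p_y_given_zu)
observed = full.sum(axis=0)                   # P(x, z, y), indexed [x, z, y]

def front_door(observed, x):
    """sum_z P(z | x) sum_x' P(x') P(y=1 | x', z), from observed data only."""
    p_x = observed.sum(axis=(1, 2))           # P(x')
    p_xz = observed.sum(axis=2)               # P(x', z)
    p_z_cond = p_xz[x] / p_xz[x].sum()        # P(z | x)
    p_y1_given_xz = observed[:, :, 1] / p_xz  # P(y=1 | x', z)
    p_y1_do_z = (p_x[:, None] * p_y1_given_xz).sum(axis=0)  # P(y=1 | do(z))
    return float((p_z_cond * p_y1_do_z).sum())

def truth(x):
    """P(y=1 | do(x)) computed with access to the hidden confounder."""
    p_y1_do_z = p_y1_given_zu @ p_u
    return float(p_z_given_x[x] @ p_y1_do_z)

print("front-door:", front_door(observed, 1), front_door(observed, 0))
print("truth:     ", truth(1), truth(0))      # about 0.61 and 0.29
```

In this example the front-door estimate matches the true interventional probabilities even though `u` is never observed, whereas plain conditioning on `x = 1` would give about 0.70.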

Note that the mediator must be at least somewhat noisy. Expanding this formula, we need estimates of \(P(\text{cancer} \mid \text{tar}, \neg \text{smoke})\) and \(P(\text{cancer} \mid \neg \text{tar}, \text{smoke})\), which requires observing people in each of those populations. If we just have one of these, we can use it to estimate the other. If we have neither, then “tar in lungs” is a redundant variable, identical to “smoking”, and it is of no use. In that case we would need to find a different mediator.

Noisy mediators give us interesting capabilities. They can come in many forms. In a medical study, patients may decide to receive a treatment, then be prevented by a car breakdown. To practitioners of causal inference, car breakdowns are gifts from the universe.

Grid cells: Visualizing the CAN model

Sun, 16 Apr 2017 05:00:00 -0700

I simulated the Continuous Attractor Network model of grid cells. Here’s an initial look.

Here are 4096 groups of 4 neurons. Below, I hover over a cell to show its inhibitory output. The red cells are inhibited by the selected cell.

These cells are arranged into groups of 4, with one cell devoted to each direction. The “preferred direction” of the top cell in each group is North, the right cell East, the bottom cell South, and the left cell West. The North cell inhibits a ring of cells slightly “north” of the cell. The North cell also receives excitatory feedforward input when the animal moves north (this isn’t depicted). The same is true of the East, South, and West cells.
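The description above can be sketched in code. This is an illustration, not the simulation's actual implementation: the Gaussian ring profile, the constants `RING_RADIUS`, `RING_WIDTH`, and `SHIFT`, and the convention that "north" means decreasing row index are all assumptions of mine. The grid is 64 x 64 (4096 groups) and wraps around at the edges.

```python
import numpy as np

N = 64               # 64 x 64 = 4096 groups, matching the count in the text
RING_RADIUS = 8.0    # hypothetical radius of the inhibitory ring
RING_WIDTH = 2.0     # hypothetical thickness of the ring
SHIFT = 2.0          # hypothetical offset in the preferred direction

# Assumed convention: (d_col, d_row) with rows increasing downward on screen.
DIRECTIONS = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}

def wrapped_delta(a, b, n=N):
    """Signed shortest displacement from b to a on a ring of n positions."""
    return (a - b + n / 2) % n - n / 2

def inhibition_from(col, row, direction):
    """Inhibitory weight from one cell onto every group position.

    Returns an (N, N) array indexed [row, col]. The ring is centered
    slightly ahead of the source cell in its preferred direction.
    """
    d_col, d_row = DIRECTIONS[direction]
    cols, rows = np.meshgrid(np.arange(N), np.arange(N))
    delta_col = wrapped_delta(cols, col + SHIFT * d_col)
    delta_row = wrapped_delta(rows, row + SHIFT * d_row)
    distance = np.hypot(delta_col, delta_row)
    return np.exp(-0.5 * ((distance - RING_RADIUS) / RING_WIDTH) ** 2)

# The North cell of the group at column 32, row 32.
weights = inhibition_from(32, 32, "N")
print(weights.shape, weights.max())
```

The same function with `"E"`, `"S"`, or `"W"` gives the rings for the other three cells in a group, each displaced in its own preferred direction.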

In the simulation below, I give each neuron a random initial firing rate (shown in black), then I let some time pass. The velocity input is 0. Each cell gets an equal amount of feedforward input.

It forms a near-perfect hexagonal lattice. The lattice is always oriented parallel to either the horizontal or vertical axis.

I kinda understand why the grid prefers to align with one of these axes. It has something to do with the periodic map of cells. The space here is similar to a game of “Asteroids” – when your ship goes off the right edge, it appears on the left side. If you go straight left or straight up, you’ll soon reach your starting point, but this isn’t true of other directions. If the topology were more like the surface of a sphere, rather than this “Asteroids” model, then all orientations would be equally likely.

Now I give it some velocity input. Below, I give extra feedforward input to every S and E cell.

This causes the lattice of active cells to shift. If you record an individual cell, you’ll find that it has a hexagonal firing field, just like a grid cell.