In scientific computing, code is often naturally expressed as wide, tree-like expressions. Often different branches of that tree contain similar chunks of logic, so there is potential to run many different branches together in parallel vectorized operations. What happens when you take your nice tree-like code and mangle it into hard-to-read vectorized code? How would a person do that?

I created Vexpr and used it to take real experiment code and convert its readable expressions into vectorized expressions at runtime. In this post, I present the results. Topics include:

  • What is the immediate impact?
  • How does this relate to torch.compile?
  • What is the more detailed breakdown of the impact on the GPU and CPU?

Introduction: Wide expressions

Mathematical expressions naturally form trees. Here’s a toy example.

\[\sqrt{a^2 + b^2} + \sqrt{c^2 + d^2}\]

Here’s Python code implementing this expression and highlighting its wide tree-like structure:

sum([math.sqrt(sum([a ** 2,
                    b ** 2])),
     math.sqrt(sum([c ** 2,
                    d ** 2]))])

For this simple function, we can write vectorized PyTorch code by hand:

torch.tensor([a, b, c, d]).pow(2).view((2, 2)).sum(dim=1).sqrt().sum()

The former code calls pow 4 times, calls sum twice, calls sqrt twice, then calls sum one last time. The latter code flattens the tree into a single pipeline so that one operation occurs for each level of the tree. Before, we ran one Python function call per node of the tree; now we run one call per level of the tree, a number that is exponentially smaller.
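As a sanity check, here is a minimal runnable sketch confirming that the flat pipeline computes the same value as the tree-shaped code (the reduction runs along dim=1 so that each row of the reshaped tensor holds the terms of one sum):

```python
import math

import torch

a, b, c, d = 3.0, 4.0, 5.0, 12.0

# Tree-shaped version: one Python call per node of the tree.
tree = sum([math.sqrt(sum([a ** 2, b ** 2])),
            math.sqrt(sum([c ** 2, d ** 2]))])

# Vectorized version: one operation per level of the tree.
flat = torch.tensor([a, b, c, d]).pow(2).view((2, 2)).sum(dim=1).sqrt().sum()

assert abs(tree - flat.item()) < 1e-6  # both give 5.0 + 13.0 = 18.0
```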

PyTorch is often used in a pipelined way as shown above, but that’s usually because the expression is inherently deep; neural networks are the obvious example. Here we are concerned with expressions that are wide.

Imagine scaling up this toy example: use actual, larger expressions; let each variable represent a list of vectors, not just a number; within the expression, call functions like SciPy’s cdist which compute pairwise distances and return large matrices; do all of this on batches of inputs. With these changes, we now have a wide expression that is worth running on a GPU.

This scenario comes up often in scientific computing. For example, the kernel of a Gaussian Process (GP) takes in two lists of vectors and returns pairwise similarities. We can encode some of our intuition into these kernels by composing them via weighted sums and products. This leads to giant tree-like expressions, and because those expressions have many similar operations in each parallel branch, there is a lot of potential for vectorization.
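To make this concrete, here is a hedged sketch of what such a composed kernel might look like. The `rbf` helper, the feature splits, and the weights are hypothetical stand-ins for illustration, not the actual kernel used in the experiments below:

```python
import torch

def rbf(x1, x2, lengthscale):
    """Pairwise RBF similarity between two sets of vectors."""
    d = torch.cdist(x1 / lengthscale, x2 / lengthscale)
    return torch.exp(-0.5 * d ** 2)

# A composed kernel: a weighted sum of per-feature-group kernels.
# Each branch is structurally similar, so the branches are natural
# candidates for being vectorized together.
def composed_kernel(x1, x2, w1, w2, ls1, ls2):
    return (w1 * rbf(x1[..., :3], x2[..., :3], ls1)
            + w2 * rbf(x1[..., 3:], x2[..., 3:], ls2))

x1 = torch.randn(5, 6)
x2 = torch.randn(7, 6)
K = composed_kernel(x1, x2, 0.7, 0.3, 1.0, 2.0)
assert K.shape == (5, 7)
```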

Writing vectorized code for giant expressions is hard, so I created Vexpr to make it easy. Vexpr takes readable-but-slow PyTorch, NumPy, and JAX expressions and compiles them into fast, delightfully ugly vectorized expressions.

How to vectorize one level of an expression tree

Wide expressions naturally form a tree. When we vectorize that tree, we create a new narrower expression that invokes a series of Python functions, one for each level of the original tree.

(Figure: trees collapsing.)

How do we collapse a tree so that we run one operation per level? Each operation will now have to take in a tensor that contains every input to that level and return a tensor that contains all of the outputs.

Some operations already support such a change:

  • Elementwise operations like pow and sqrt can run on multiple inputs as-is.
  • Parallel sums can be implemented as .view(...).sum(dim=0)
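The second bullet can be sketched as follows, with the terms of each parallel sum laid out along dim 0 so that one reduction replaces several Python-level sums:

```python
import torch

# Four parallel sums of length 3, flattened into one tensor. The
# vectorizer is free to choose the layout; here column j holds the
# terms of sum j, so one reduction computes all four sums at once.
terms = torch.arange(12.0).view(3, 4)
parallel_sums = terms.sum(dim=0)   # one op replaces four Python-level sums

assert torch.equal(parallel_sums,
                   torch.tensor([0., 1., 2., 3.])
                   + torch.tensor([4., 5., 6., 7.])
                   + torch.tensor([8., 9., 10., 11.]))
```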

What about more difficult cases? For example, what if the parallel sums have different lengths? On GPUs, fast parallel reductions only work when all inputs have the same length. Would providing a single-input, single-output interface still have a benefit? The answer is yes, even if the operation internally just loops over the sums. Collapsing one level of the tree allows subsequent levels to be collapsed, and low-hanging fruit tends to appear in the deeper levels. Moreover, we can do better than looping over the sums, even when the lengths differ. Vexpr’s vectorizer groups the inputs by length and performs a reduced number of operations: one for each unique length. For example, this decreases the number of torch.cdist operations in my kernel from 49 to 5. These 5 calls happen invisibly inside a cdist_multi function that the code calls once.
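Here is a minimal sketch of that grouping idea. `sum_groups` is a hypothetical illustration of grouping by unique length, not Vexpr's actual vectorizer:

```python
from collections import defaultdict

import torch

def sum_groups(groups):
    """Sum each 1-D tensor in `groups`, using one batched reduction
    per unique length instead of one reduction per group.
    (A sketch of the idea, not Vexpr's implementation.)"""
    by_length = defaultdict(list)
    for i, g in enumerate(groups):
        by_length[g.shape[0]].append(i)
    out = [None] * len(groups)
    for length, indices in by_length.items():
        # One batched reduction covers every group of this length.
        stacked = torch.stack([groups[i] for i in indices])
        sums = stacked.sum(dim=1)
        for row, i in enumerate(indices):
            out[i] = sums[row]
    return torch.stack(out)

groups = [torch.ones(3), torch.ones(5), torch.ones(3), torch.ones(5)]
print(sum_groups(groups))  # tensor([3., 5., 3., 5.])
```

Four sums of two distinct lengths become two batched reductions, mirroring how 49 cdist calls collapse to 5 (one per unique shape).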

These collapsed operations often introduce some overhead. Batch operations often require permuting tensors (e.g. for batched pairwise distances) or reordering values (e.g. putting equal-length sums next to each other). For operations that are grouped by size, we must split input tensors then re-concatenate the output tensors. Thus, while vectorizing has many benefits, it also introduces extra work for the GPU that is not necessary when using a non-vectorized expression. Is this overhead worth it? Let’s turn to experiments to find out.
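The split-then-concatenate overhead can be seen in a toy sketch, where a doubling op stands in for the real computation:

```python
import torch

x = torch.arange(10.0)

# Grouping by size forces extra data movement around the "real" work:
pieces = torch.split(x, [4, 6])       # split inputs into size groups
results = [p * 2 for p in pieces]     # the fundamental computation
y = torch.cat(results)                # re-concatenate the outputs

assert torch.equal(y, x * 2)          # same answer, but with extra copies
```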

Vectorizing leads to 4x-7x speed-up on one set of benchmarks

Here I run a pair of benchmarks on an NVIDIA V100 GPU. I describe the benchmarks within the context of real Gaussian Process use cases, but you don’t need to understand Gaussian Processes to understand these results; I am simply running this big vectorized expression on inputs of different shapes.

When you use Gaussian Processes for Bayesian Optimization, you first fit a GP’s parameters to the training data, then you optimize a set of candidate points to maximize expected improvement. Both of these steps include backpropagating gradients through the GP kernel.

Benchmark 1: Fit the kernel’s parameters. I run the forward and backward pass of the kernel, using x1.shape == x2.shape == (379, 26), i.e. a training set of 379 26-dimensional vectors. I repeat 100 times, then wait for a final CUDA synchronize. I test with and without torch.compile.
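A sketch of this timing methodology, with a stand-in `kernel` (here plain `torch.cdist`) in place of the actual GP kernel, and falling back to CPU when no GPU is present:

```python
import time

import torch

def benchmark(kernel, x1, x2, iters=100):
    """Timing sketch matching the methodology above: queue `iters`
    forward+backward passes, then synchronize once at the end."""
    start = time.time()
    for _ in range(iters):
        out = kernel(x1, x2)
        out.sum().backward()
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # wait for all queued GPU work
    return time.time() - start

# Benchmark 1's shapes: a training set of 379 26-dimensional vectors.
x1 = torch.randn(379, 26, requires_grad=True)
x2 = torch.randn(379, 26, requires_grad=True)
elapsed = benchmark(lambda a, b: torch.cdist(a, b), x1, x2)
```

Synchronizing only at the end matters: timing each iteration individually would itself force CPU/GPU synchronization and distort the result.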

                     Don’t compile   Compile
Baseline             6.43s           3.0s      compile speed-up: 2.1x
Vectorized           0.99s           0.76s     compile speed-up: 1.3x
Vectorize speed-up   6.5x            3.9x      combined speed-up: 8.5x

Benchmark 2: The optimization loop. In this experiment, x1.shape == (60, 2, 26) and x2.shape == (60, 381, 26). That means we’re searching for 60 single candidate points, with an additional point included in each (hence the 2, not 1) because the Noisy Expected Improvement algorithm also gets predictions for a set of previous potentially best points. These two points are appended to 379 training points to give us 381 points. Again I run the kernel’s forward and backward pass 100 times.
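These batched shapes plug directly into PyTorch's batched pairwise-distance support; a quick sketch of the shape arithmetic:

```python
import torch

# Benchmark 2's shapes: 60 parallel optimization problems, each
# comparing 2 candidate points against 381 points in 26 dimensions.
x1 = torch.randn(60, 2, 26)
x2 = torch.randn(60, 381, 26)

# Batched pairwise distances: one (2, 381) matrix per problem.
d = torch.cdist(x1, x2)
assert d.shape == (60, 2, 381)
```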

                     Don’t compile   Compile
Baseline             6.03s           2.57s     compile speed-up: 2.3x
Vectorized           0.83s           0.63s     compile speed-up: 1.3x
Vectorize speed-up   7.2x            4.1x      combined speed-up: 9.6x

In these experiments, vectorizing my kernel yielded a ~4x speed-up when I used torch.compile, and a ~7x speed-up with plain PyTorch code.

…but this speed-up doesn’t happen in all benchmarks

Now I test the kernel’s performance in another scenario: leave-one-out cross-validation. During cross-validation, we fit \(N\) models on \(N\) slightly different datasets. We can test this by rerunning Benchmark 1 on \(N\) models in parallel. To demonstrate a surprising phenomenon, I set \(N=20\).

Benchmark 3: Cross-validation. Repeat Benchmark 1, but train 20 models in parallel rather than 1. So we have x1.shape == x2.shape == (20, 379, 26), and the shapes of every parameter, e.g. lengthscale, have (20,) prepended to them.

                     Don’t compile    Compile
Baseline             15.6s            7.50s           compile speed-up: 2.1x
Vectorized           17.7s            6.15s           compile speed-up: 2.9x
Vectorize effect     1.1x slow-down   1.2x speed-up   combined speed-up: 2.5x

This result surprised me. The vectorized version is sometimes slower than the baseline, at least when you examine only the GP kernel in an isolated benchmark. The end-to-end performance of the larger system is still often faster due to the freed up CPU (see next section), but it is noteworthy that vectorizing code doesn’t always directly speed up that code.

To understand where the change occurs, I rerun this benchmark for different \(N\). Note that \(N=1\) and \(N=20\) correspond to the original Benchmark 1 and Benchmark 3 results, respectively. I used the same hardware, but I enabled some additional profiling, hence the slower times compared to above.

(Figure: benchmark showing diminishing improvement from vectorization.)

The vectorized kernel initially has a huge advantage over the baseline, but this advantage diminishes as we give it more parallel work, and without torch.compile the vectorized kernel is eventually slower when viewed in isolation. Why does this happen?

The CPU is a bottleneck. Vectorizing removes that bottleneck.

To really understand the performance impact of vectorization, we need to understand the CPU and GPU usage before and after. First, note that CUDA / PyTorch are built on an asynchronous relationship between the CPU and GPU, where the CPU ideally should always run ahead, always queueing the GPU’s future work, while the GPU is always working through its queue. However, there are two events that cause the CPU to wait for the GPU:

  1. When the CPU chooses to read a value from the GPU. This is a decision made by your code.
  2. When the CPU gets too far ahead of the GPU. This is a decision made by CUDA.
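The first case is entirely under your control. For example, reading a scalar out of a GPU tensor with `.item()` forces the CPU to block until all the queued work producing that value has finished:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1000, device=device)

y = (x * 2).sum()   # queued asynchronously on the GPU (if one is present)
value = y.item()    # case 1: the CPU reads a GPU value, so it must block
                    # here until the GPU catches up
```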

I profiled the kernel using NVIDIA’s nsys, and I used that trace to obtain GPU and CPU active time. My so-called CPU “active” time is actually an inferred value; PyTorch uses CUDA in a way that spins the CPU at 100% constantly, even when the CPU is just waiting for the GPU, so I use heuristics to detect these waits and subtract them from the actual active time. (2023-10-31 update: Thanks to gregjm for pointing out that CUDA offers a way to avoid this CPU-spinning, and that this is actually a PyTorch issue. The initial version of this blog post blamed CUDA for this unnecessary power consumption.)

Here is the same plot from above, but with the CPU and GPU time overlaid.

(Figure: benchmark with CPU and GPU time included. The baseline is CPU-bound until we reach larger input sizes.)

First, to understand the result of Benchmark 1 above, look at the left side of both plots. Both the baseline and vectorized models have low total active GPU time. But the baseline model puts a much larger workload on the CPU, which is responsible for orchestrating the set of operations that are sent to the GPU. Thus, the GPU spends the vast majority of the time idle, waiting for the CPU to give it more work. For the vectorized model the amount of CPU work is almost always less than the amount of GPU work, so the GPU is almost never idle; the benchmark time is roughly equal to the GPU active time.

Now we focus on the surprising result from Benchmark 3, where the baseline model did slightly better than the vectorized model. As we scale up the number of models, we see an interesting phenomenon. When training many models simultaneously, even the baseline model is able to keep the GPU busy. Once the GPU active time exceeds the CPU active time, one of the key selling points of vectorized code is eliminated, because the nonvectorized code becomes good enough. This was a fun, surprising fact: even unvectorized code can outrun the GPU if you pass in large enough tensors. I expect this phenomenon to occur in other large-batch scenarios like training on very large datasets or doing Bayesian Optimization with very large sets of candidate points. Of course, the vectorized code is still superior when using torch.compile, and in all cases its CPU usage is far superior.

Finally, let’s look at the GPU workload. Independent of the effect on CPU, what is the impact of vectorization on the total amount of work that the GPU has to do? Looking at slopes of the “GPU time” lines, we see that for some models, e.g. my non-compiled model, vectorization increases the total workload, and for other models it decreases the workload. I studied the CUDA traces closely and found that vectorization does indeed reduce many aspects of the GPU workload, greatly reducing the number of operations and decreasing the total amount of time spent on the fundamental computations of the algorithm. However it also introduces overhead (mentioned above) by interspersing operations that permute and reorder the tensors, or splitting them into groups then concatenating results. Sometimes the reduced “fundamental” time outweighs the additional overhead, while other times the overhead outweighs the reduction in fundamental time.

So we see that vectorization has three effects:

  1. It lets us keep the GPU busy even when inputs are small.
  2. It frees up the CPU to do other work.
  3. It can slightly change the total amount of GPU work, sometimes for the worse.

The speed-up from vectorization can be great, but it can be underwhelming in scenarios where none of the benefits are needed.

The benefits of vectorization increase as GPU speed increases

Here is a point that follows naturally from everything above, but it might not be immediately obvious. It is quite striking when you experience it.

As I built Vexpr, I tested on an NVIDIA T4. Then for this experiment, I upgraded to a much-faster NVIDIA V100, and the benefits of vectorization greatly improved.

You always want your CPU to stay ahead of the GPU so that the GPU is never idle. Code that is good enough to stay ahead of this year’s GPU might not be good enough for next year’s GPU. With each upgrade, you need the dotted CPU line from these charts to be lower and lower.

This means that vectorizing your code is a good strategy for future-proofing it.

GPyTorch’s “structure” kernels show similar results

I also tested GPyTorch’s limited support for vectorization. GPyTorch lets you take sets of identically-shaped kernels and run them as a single vectorized kernel. This capability is easy to use when summing single-feature kernels, so I created a partially vectorized kernel by replacing sums of single-feature Matern kernels with single Additive Structure kernels.

(Figure: benchmark showing the vectorized version is faster than the baseline, but this advantage goes away at higher input sizes.)

(Don’t focus too much on comparing the absolute speed of the Vexpr and GPyTorch kernels. There are many small differences between the two that have nothing to do with vectorization. For example, unlike my Vexpr kernel, GPyTorch has a nice optimization in the ordering of two of its operations.)

Vexpr has been optimized much more for this use case than GPyTorch, but we see that the same fundamental phenomena occur. Vectorization leads to wins at small batch sizes, but the advantage diminishes at large batch sizes. In this experiment, vectorization increased the total GPU workload, so with large batch sizes the only advantage of vectorization is a freed up CPU.

Interestingly, neither GPyTorch kernel ever reaches a point where GPU time is equal to the benchmark time. The GPU always spends at least 1-2 seconds idle. This happens because GPyTorch’s kernels do a synchronous equality check on every call, forcing a GPU synchronize, which causes the CPU to fall behind the GPU immediately afterward. So when you use GPyTorch kernels on GPUs, you don’t get the ideal fully asynchronous execution that you’re supposed to get with CUDA / PyTorch. (There is good news: PyTorch has recently introduced torch.cuda.set_sync_debug_mode(2), which detects these unwanted synchronize events. Like others, I think every PyTorch library developer should become friends with this API.)
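For example, here is a sketch assuming a recent PyTorch (1.10+), where the API is `torch.cuda.set_sync_debug_mode`; mode 2 raises an error on any operation that implicitly synchronizes:

```python
import torch

if torch.cuda.is_available():
    torch.cuda.set_sync_debug_mode(2)      # error on any implicit sync
    x = torch.randn(10, device="cuda")
    try:
        # nonzero() must read the result's size back on the CPU, which
        # typically forces a synchronize, so mode 2 flags it here.
        x.nonzero()
    except RuntimeError:
        print("caught an accidental synchronize")
    finally:
        torch.cuda.set_sync_debug_mode(0)  # back to the default
```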

Closing thoughts on compilation

Everybody agrees that vectorization is good. Is it worth doing even in cases where it requires extra work, like with wide expressions? The results above suggest: most of the time, yes, but with interesting nuances. Vectorization is especially great for making the most of a GPU when your task is running many iterations on smaller batches of input data. For larger-batch scenarios, maybe you’ll only benefit from vectorization after you add JIT compilation, or after you upgrade your GPU, or after you’ve found some use for all the newfound idle CPU time. But, to a first approximation, vectorization is good.

The real open question for the field remains: for “wide” computation graphs, what is a practical strategy for getting vectorized code?

My experiments above show that this is not solved by JIT tools like torch.compile, nor do I expect it to be. I think a compiler would need to become a huge slow unreliable hairball to solve this type of auto-vectorization problem. Vexpr’s vectorizer is a compiler that solves this, but it does so by making the programmer meet the compiler in the middle. The programmer gives Vexpr a tree-like expression and tells Vexpr, “You should try vectorizing this.” Both of those pieces of information are valuable: the programmer structures the logic in a certain way, and the programmer indicates that there is opportunity for vectorization there. The programmer doesn’t need to do anything too difficult, and neither does the compiler.

I think this “meet-in-the-middle” approach between programmer and compiler is a good design principle that leads to good systems. And I think this will become even more true in the age of Large Language Models; rather than cramming too much magic into our compilers, let’s rely on humans-with-LLMs to meet the compiler in the middle.

(This project is supported by a GCP cloud compute grant from ML Collective, which has been super helpful. Thanks, also, to Rosanne Liu for useful feedback on drafts of this post.)