Here is a fun explanation of why neural networks do better when trained with noise. There are multiple existing explanations, but I particularly like this one.

First, an aside: Given 100 flips of an unfair coin with 60% probability of heads, the single-most-likely sequence is 100 consecutive heads, but a typical sequence will have about 60 heads. “Likely” and “typical” are two different things, and often what you really want is the “typical”.
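A quick numerical check of this, as a small Python sketch (the numbers are specific to this coin example and nothing else):

```python
from math import comb

p, n = 0.6, 100

# The single most likely sequence: 100 consecutive heads.
p_all_heads = p ** n                          # ~6.5e-23

# Any one particular sequence with exactly 60 heads is far less likely than that...
p_one_60_head_seq = p ** 60 * (1 - p) ** 40   # ~5.9e-30

# ...but there are C(100, 60) such sequences, so a head count near 60 is what
# you actually expect to observe.
p_exactly_60_heads = comb(n, 60) * p_one_60_head_seq   # ~0.081, the modal head count

print(p_all_heads, p_one_60_head_seq, p_exactly_60_heads)
```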

Let’s apply this idea to neural networks.

Background

When training a neural network, results are often improved by randomly perturbing the network’s weights or activations. Two popular examples of this technique are Dropout, which randomly zeros activations, and Variational Dropout, which simulates noisy weights by multiplying or adding noise to the activations.
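To make the two kinds of perturbation concrete, here is a minimal numpy sketch (shapes and rates are illustrative choices of mine, not any particular library’s defaults):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(32, 128))   # a batch of activations (illustrative shape)

# Dropout: randomly zero activations, rescaling survivors so the expected value is unchanged.
p_drop = 0.5
mask = rng.random(h.shape) > p_drop
h_dropout = h * mask / (1.0 - p_drop)

# Multiplicative Gaussian noise (the continuous relaxation used in Variational Dropout):
# multiply each activation by noise with mean 1 and variance p/(1-p), which matches the
# mean and variance of the Bernoulli mask above.
alpha = p_drop / (1.0 - p_drop)
h_gaussian = h * rng.normal(1.0, np.sqrt(alpha), size=h.shape)
```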

These techniques may all work for the same fundamental reason. The authors of Variational Dropout made the case that, via the Central Limit Theorem, applying Dropout to activations is approximately the same as multiplying noise into the weights of the downstream layer, which is approximately the same as multiplying noise into the activations of the downstream layer. In this view, these techniques all have a unified explanation… but what is it?

One explanation is the Minimum Description Length principle (Hinton and van Camp 1993, Graves 2011, Kingma et al. 2015). When a weight is noisy, it contains less information. By Occam’s razor, a model containing less information is better and more likely to generalize. This view unifies Dropout and Variational Dropout with other forms of perturbation like Weight Decay.

Another explanation (Srivastava et al. 2014) is that Dropout forces the neural network to train many different subnetworks, so that the full network behaves like a set of subnetworks forming a consensus.

These explanations are intriguing, but they’re just stories that we tell ourselves. The Minimum Description Length principle is a heuristic, not provably correct unless we make some assumptions about priors (MacKay 1992). The ensemble-of-subnetworks view is another fun idea. Here’s a third fun idea that makes very few assumptions.

The idea

We want a representative sample of the posterior \(P(W \mid D)\), not the most likely sample.

The following two images give the core intuition. The text describes it more rigorously.

[Figures: a schematic likelihood function is plotted, showing one wide, roughly Gaussian-shaped curve and three very narrow curves. The narrow curves are taller, but the wide one contains much more likelihood mass. A companion image marks red dots (samples of the posterior) and a blue dot (the global maximum).]

From the standpoint of probability theory, when we train a neural network we are evaluating weights \(W\) based on their probability of generating the dataset \(D\), i.e. \(P(D \mid W)\). By Bayes’ Rule, optimizing the likelihood function \(P(D \mid W)\) w.r.t. \(W\) is equivalent to optimizing the posterior \(P(W \mid D)\). In other words, selecting the weights that are most likely to generate the dataset is the same as inferring the most likely weights. This leap is strange; I find it helpful to append an explicit hypothesis term \(H\), for example \(H=\) “this dataset can be generated by a particular convolutional neural network architecture”, giving us \(P(W \mid D, H)\). It’s under this hypothesis \(H\) that inferring \(W\) makes any sense at all.

(When the task is classification, the “dataset” is a set of labels \(Y\) for some known set of inputs \(X\), and we optimize \(P(Y \mid X, W)\). Also, to keep things simple, I’m not discussing priors. Inserting priors doesn’t change anything fundamental.)

In the schematic image above, red dots denote a set of \(N=6\) samples of a posterior. Note that it’s not actually tractable to sample this posterior. Instead, we optimize \(W\) to find a single good point in this posterior distribution. Depending on our search process and the actual shape of \(P(D \mid W)\), we may find ourselves at a local maximum (at the center of the red dots), or at a global maximum (blue). Here I show why the local maximum is often objectively better.
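Here is a small numpy sketch of that picture: a wide bump with a modest peak next to a narrow spike with a tall peak. The curves and numbers are invented for illustration; the only point is that peak density and total mass can disagree.

```python
import numpy as np

w = np.linspace(-10.0, 10.0, 200001)
dw = w[1] - w[0]

def bump(w, center, width, height):
    # An unnormalized, Gaussian-shaped piece of a likelihood surface.
    return height * np.exp(-0.5 * ((w - center) / width) ** 2)

wide   = bump(w, -3.0, 2.0, 1.0)    # broad local maximum
narrow = bump(w,  4.0, 0.01, 5.0)   # tall but very narrow global maximum

print(wide.max(), narrow.max())                 # 1.0 vs 5.0   -> the spike wins on density
print(np.sum(wide) * dw, np.sum(narrow) * dw)   # ~5.0 vs ~0.13 -> the wide bump wins on mass
```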

It’s important to realize that our fundamental goal is not to select the weights that were most likely to generate the dataset. Our goal is to infer an effective posterior predictive distribution \(P(\tilde{x} \mid D, H)\), or \(P(\tilde{y} \mid \tilde{x}, X, Y, H)\) for classification. In other words, we want to make predictions about observations outside of \(D\).

The optimal predictive posterior is a weighted vote across all possible \(W\). For example, when performing classification, for each \(W\) we evaluate \(P(\tilde{y} \mid \tilde{x}, W, H)\) by passing \(\tilde{x}\) through the neural network, then we combine all of these results, weighting each by that \(W\)’s likelihood on the training set, \(P(Y \mid X, W)\).
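In code, that weighted vote looks roughly like the sketch below (the function and variable names are mine, and of course in practice we never have a tractable set of candidate weights to average over):

```python
import numpy as np

def posterior_predictive(x_new, weight_samples, train_log_liks, predict):
    """Approximate P(y_new | x_new, X, Y, H) by a weighted vote over weight settings.

    weight_samples : candidate weight settings W (however they were obtained)
    train_log_liks : log P(Y | X, W) on the training set, one value per candidate
    predict        : function (x_new, W) -> vector of class probabilities
    """
    # Turn training log-likelihoods into normalized vote weights
    # (implicitly assuming a uniform prior over the candidates).
    log_v = np.asarray(train_log_liks, dtype=float)
    votes = np.exp(log_v - log_v.max())
    votes /= votes.sum()

    # Each candidate W casts its prediction; the answer is the weighted average.
    preds = np.stack([predict(x_new, W) for W in weight_samples])
    return votes @ preds
```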

Probability Mass

With neural networks, we never actually carry out this optimal procedure. Typically we select one set of weights \(W\) and use that as our posterior predictive model. So it is important to select a \(W\) that is representative of a large portion of the probability mass, rather than simply a \(W\) with a large probability density.

When we perturb neural networks, we build in an assumption that there exist large pockets of probability mass, and we constrain ourselves to find them. By selecting a point that is surrounded by a lot of probability mass, we build a better estimator of the optimal \(P(\tilde{x} \mid D, H)\). There is nothing inherently wrong with the narrow spiky parts of \(W\)-space. Ideally they would have some influence on the model’s predictions, but on their own they tend to be a poor approximation of the optimal \(P(\tilde{x} \mid D, H)\).
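A toy way to see why perturbation steers us toward mass rather than density: training under weight noise effectively optimizes the expected loss under that noise, and sharp minima stop looking attractive. The loss below is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # A toy loss surface: a very sharp minimum at w=0 (value 0.0) and a wide,
    # slightly worse minimum at w=5 (value 0.2).
    sharp = 50.0 * w ** 2
    wide = 0.2 + 0.5 * (w - 5.0) ** 2
    return np.minimum(sharp, wide)

noise = rng.normal(0.0, 0.5, size=100_000)

print(loss(np.array([0.0, 5.0])))                          # noiseless: [0.0, 0.2], the spike "wins"
print(loss(0.0 + noise).mean(), loss(5.0 + noise).mean())  # under noise: the wide minimum wins easily
```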

Another way of saying this

Rather than talking about neural network weights, we can talk about computer programs.

Consider the set of all computer programs. You might denote each program by a unique binary string. You might limit them to a certain length and to a certain architecture / interpreter.

Our task is: given training data, decide how to process test data. Adopting the Bayesian point of view, the optimal strategy is: for every program, measure how well it reproduces the training data, then use this measure to decide how to weight this program’s vote. Run every program on the test data, then combine their weighted votes.

This optimal strategy is prohibitively expensive. To approximate it, we select a small number of programs, possibly just one. Naively, we may think that it’s best to choose the one with the highest weight. Suppose, instead, we sort the programs by similarity, and we look for clusters of relatively high weight in this sorted list. We now select a program that is at the center of one of these clusters. This program is probably a better approximation of the optimal solution, because it “speaks” for many legitimate programs.

Neural network weights \(W\) are a continuous space of computer programs, loosely sorted by similarity.

Some discussion

This view suggests that a good set of weights will have relatively high probability density and relatively low curvature (second derivative). All other things being equal, the lower the curvature’s magnitude, the more probability mass this \(W\) probably “speaks” for. This idea is closely related to model comparison.
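One standard way to make the curvature-to-mass link quantitative is the Laplace approximation (the same machinery behind MacKay-style model comparison). Around a mode \(W^*\) in \(d\) dimensions, with \(A\) the Hessian of \(-\log P(W \mid D, H)\) at \(W^*\), the mass contained in the mode is roughly

\[
\int_{\text{mode}} P(W \mid D, H)\, dW \;\approx\; P(W^* \mid D, H)\,(2\pi)^{d/2}\,(\det A)^{-1/2},
\]

so at a fixed peak density, lower curvature directly means more mass.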

All of these ideas follow naturally from probability theory, so Bayesian techniques like Variational Dropout benefit from this phenomenon. Rather than measuring the second derivative, Variational Dropout reduces curvature w.r.t. each weight by pushing each weight to have higher variance.

It is well-known that using maximum likelihood as an optimization target can lead to goofy outcomes. Much of MacKay’s book focuses on how the single-most-likely outcome of a distribution is often atypical. Repeating my opening point: “likely” and “typical” are two different things, and often what you really want is the “typical”.

Thanks to Ian Eisenberg for reading drafts of this.