### Why is it hard to train deep neural networks? Degeneracy, not vanishing gradients, is the key

In this post, I will try to address a common misunderstanding about the difficulty of training deep neural networks. It seems to be a widely held belief that this difficulty is mainly, if not completely, due to the vanishing (and/or exploding) gradients problem. “Vanishing gradients” refers to the gradient norms becoming exponentially smaller for parameters deeper in the network. Smaller gradients mean parameters changing ever so slowly, and so learning gets stuck until the gradients become large enough, which could take exponentially long. This idea goes back, at least, to Bengio et al. (1994) and still seems to be everybody’s favorite explanation for why it is hard to train deep neural networks.

Let’s first consider a simple scenario: a deep linear network being trained to learn a linear mapping. Of course, deep linear networks aren’t interesting from a computational perspective, but Saxe et al. (2013) showed that learning dynamics in such networks can still be informative about the learning dynamics in nonlinear networks. So, let’s start  with this simple scenario. Here’s the learning curve and the initial gradient norms (before any training) for a 30-layer network (errorbars are standard errors over 10 independent runs).

I will shortly explain why these results are labeled “Fold 0” in the figure. The gradients here are with respect to layer activations (gradients with respect to parameters behave similarly). The network weights are initialized with the standard $1/\sqrt{n}$ initialization. The training loss decreases rapidly at first, but then quickly asymptotes at a suboptimal value. The gradients certainly don’t vanish (or explode), at least initially! They do become smaller as training progresses, but that is to be expected and it’s not clear at all that the gradients here are “too small” in any sense:

To show that convergence to a suboptimal solution here doesn’t have anything to do with the size of the gradient norms per se, I will now introduce a manipulation that will increase the gradient norms, but will worsen the performance. Here it is (in blue):

So, what did I do? I simply changed the initialization in a pretty minimal way. Each initial weight matrix in the original networks is a $64 \times 64$ matrix (initialized with the standard $1/\sqrt{n}$ initialization). In the networks shown in blue here, I just copied the first half of each initial weight matrix to the second half (the initial weight matrix is “folded” once, so that’s why I denote these as “Fold 1” networks). This reduces the rank of the initial weight matrices, and makes them more degenerate. Note that this manipulation is done only on the initial weight matrices, so no other constraint is imposed on learning once the training gets started. Here’s how the gradient norms look after the first few epochs:

So, I introduced a manipulation that increased the size of the gradient norms overall, yet the performance got significantly worse. Conversely, I will now introduce another manipulation that will shrink the size of the gradient norms, yet will improve performance substantially. Here it is (in green):

As you may have guessed from the label, this new manipulation initializes the weight matrices to be orthogonal. Let’s recall that orthogonal matrices are the least degenerate matrices among matrices of a fixed (Frobenius) norm, where degeneracy can be measured in different ways, for example, as the fraction of singular values smaller than a given constant. Here’s how the gradients look after the first few epochs in this case:

In fact, I can even multiply the initial orthogonal matrices by some factor smaller than $1$ (e.g. by $0.8$ in the following example) and deliberately create vanishing gradient norms in the initial network without much effect on the training performance (magenta):

There’s another, slightly subtler, reason for why the size of the gradient norms can’t explain the bulk of the training difficulties with deep neural networks: modern adaptive SGD variants typically used in practice these days, such as Adam, apply updates that are approximately invariant to the size of the gradient norms. So, if the only problem with training deep networks were a simple scaling problem, these methods would have easily dealt with that issue.

So, what’s going on? If the size of the gradient norms per se isn’t responsible for training difficulties, then what is? The answer is that it’s the degeneracy of the model that, by and large, determines the training performance. Why is degeneracy harmful for training performance? Intuitively, the reason is that learning slows down substantially along degenerate directions in the parameter space, hence degeneracies reduce the effective dimensionality of the model. So, you might think that you’re fitting a model with $\sim 130\mathrm{K}$ parameters, as in the above examples, but in reality you’re effectively fitting a model with substantially fewer degrees of freedom because of the degeneracies. So, the problem with the “Fold 0” and “Fold 1” networks above was that although the gradient norms are fine, the network’s available degrees of freedom contribute extremely unevenly to those norms: while a handful of degrees of freedom (non-degenerate ones) explain almost all of it, the vast majority (degenerate ones) don’t contribute anything at all (a conceptually helpful, but not strictly accurate, way of thinking about this may be to imagine only a few of the hidden units in each layer changing their activity in response to different inputs, while the vast majority just responding the same way independently of the input).

This reduction in the effective degrees of freedom can be quite substantial, as in the $1/\sqrt{n}$-initialized random matrices above. As shown by Saxe et al. (2013), the product of such matrices becomes increasingly degenerate as the number of matrices multiplied (i.e. network depth) increases. Here’s an example with 1-layer, 10-layer and 100-layer networks respectively (adapted from Saxe et al. (2013)):

As depth increases, the singular values of the product matrix become increasingly concentrated around $0$, except for a vanishingly small fraction of singular values that become arbitrarily large. This result isn’t just relevant for linear networks. A similar thing happens in nonlinear networks too: as depth increases, the responses of the hidden units in a given layer become increasingly lower-dimensional, i.e. increasingly degenerate. In fact, this “degeneration process” can proceed much more quickly with depth in nonlinear networks with hard-saturating boundaries as in ReLU networks.

A nice visualization of this degeneration process is presented in a paper by Duvenaud et al. (2014):

As depth increases, the input space (shown on the top left) gets distorted into increasingly thin filaments with only a single direction (orthogonal to the filament) at each point in the input space affecting the network’s response (it may be a bit hard to extrapolate from this figure how input spaces of more than two dimensions will behave under a similar mapping, but it turns out they become “hyper-pancakey” locally, i.e. there is a single direction at each point, orthogonal to the pancake surface, the network is sensitive to). Along that sensitive direction, the network in fact becomes hyper-sensitive to variations.

Finally, I can’t resist mentioning my own paper (with Xaq Pitkow) at this point. In this paper, through a series of experiments, we argue that the degeneracy problem I discussed in this post severely afflicts training in deep nonlinear networks, and that one of the ways (and quite possibly the single most important way) in which skip connections help training in deep networks is through breaking such degeneracies. I suspect that other methods like batch normalization or layer normalization that also help training in deep networks also work at least partly through a similar degeneracy-breaking mechanism, in addition to any other potentially independent mechanisms such as reducing the internal covariate shift as originally proposed. It is well-known, for example, that divisive normalization is an effective way of decorrelating the responses of hidden units, which in turn can be seen as a degeneracy-breaking mechanism.

Update (1/5/18): I should also mention this important recent paper by Pennington, Schoenholz & Ganguli. Orthogonal initialization completely eliminates degeneracies in linear networks, but not in nonlinear networks. In this paper, they provide a method to calculate the entire singular value distribution of the Jacobian of a nonlinear network and show that a depth-independent, non-degenerate singular value distribution can be achieved, with careful initialization, in networks with hard-tanh nonlinearities, but not in ReLU networks. The empirical results show that networks with depth-independent, non-degenerate singular value distributions train orders of magnitude faster than networks whose singular value distributions become wider (higher variance) and more degenerate with depth. This is a powerful demonstration of the importance of eliminating degeneracies and controlling the entire singular value distribution of a network, not just its mean.