The paper claims to discover two distinct phases of stochastic gradient descent (SGD): error minimization and representation compression. During the error minimization phase, training (and test) error goes down substantially, whereas during the representation compression phase, SGD basically pumps noise into the model and washes out the irrelevant bits of information about the input (irrelevant for the task at hand). The transition between these two phases seems to precisely coincide with a transition in the magnitudes of the means and variances of the gradients: the error minimization phase is dominated by the mean gradients and the compression phase is dominated by the variances of the gradients.
The authors formalize these ideas in terms of information, so error minimization becomes maximizing the mutual information between the labels and the representation , i.e. and representation compression becomes minimizing the mutual information between the input and the representation , i.e. . The authors seem to imply that the compression phase is essential to understanding successful generalization in deep neural networks.
Unfortunately, I think information is not the right tool to use if our goal is to understand generalization in deep neural networks. Information is just too general a concept to be of much use in this context. I think a satisfactory explanation of generalization in deep networks would have to be both architecture- and data-dependent (see this and this). I have several specific issues with the paper:
1) They have results from a toy model only. It’s not clear if their results generalize to “realistic” networks and datasets. The talk mentions some results from a convolutional net trained on MNIST, but larger networks and more complex datasets (e.g. CIFAR-10 or CIFAR-100 at the very least) have to be tested in order to be sure that the results are general.
2) Networks right before they enter the compression phase already achieve good generalization performance (in fact, the transition to the compression phase seems to take place at a point where the test error starts to asymptote). This makes it questionable whether compression can really explain the bulk of the generalization performance. Relatedly, the compression phase seems to take place over an unrealistically long period. In practical problems, networks are rarely, if ever, trained for that long, which again questions the applicability of these results to more realistic settings.
3) I tried to replicate the drift vs. diffusion dominated phases of SGD claimed in the paper, but was not successful. Unfortunately, the paper is not entirely clear (another general weakness of the paper: lots of details are left out) about how they generate the crossing gradient mean vs std plot (Fig. 4) –what does even mean?–. So, I did something that seemed most reasonable to me, namely computed the norms of the mean and standard deviation (std) of gradients (across a number of mini-batches) at each epoch. Comparing these two should give an indication of the relative size of the mean gradient compared to the size of the noise in the gradient. The dynamics seems to be always diffusion (noise) dominated with no qualitatively distinct phases during training:
The network here is a simple fully-connected network with 3 hidden layers (128 units in each hidden layer) trained on MNIST (left) or CIFAR-100 (right). At the beginning of each epoch, the gradient means and stds were calculated from 100 estimates of the gradient (from 100 minibatches of size 500). The gradients are gradients with respect to the first layer weights. I didn’t normalize the gradients by the norm of the weights, but this would just scale the mean and std of the gradients the same way, so it shouldn’t change the qualitative pattern.
4) As long as the feedforward transformations are invertible, the networks never really lose any information either about the input or about the labels. So, they have to do something really ad hoc like adding noise or discretizing the activities to mold the whole thing into something that won’t give nonsensical results from an information-theoretic viewpoint. To me, this just shows the unsuitability of information-theoretic notions for understanding neural networks. In the paper, they use bounded tanh units in the hidden layers. To be able to make sense of information-theoretic quantities, they discretize the activations by dividing the activation range into equal-length bins. But then the whole thing basically depends on the discretization and it’s not even clear how this would work for unbounded ReLU units.
Update (3/5/18): A new ICLR paper by Andrew Saxe et al. argues that the compression phase observed in the original paper is an artifact of the double-sided saturating nonlinearity used in that study. Saxe et al. observe no compression with the more widely used relu nonlinearity (nor in the linear case). Moreover, they show that even in cases where compression is observed, there is no causal relationship between compression and generalization (which seems to be consistent with the results from a more recent paper) and that in cases where compression is observed, it is not caused by the stochasticity of SGD, thus pretty much refuting all claims of the original paper. There’s some back and forth between the authors of the original paper and the authors of this new study on the openreview page for those interested, but I personally find the results of this new study convincing, as most of them align with my own objections to the original paper.