### No, information bottleneck (probably) doesn’t open the “black-box” of deep neural networks

#### by exactnature

This paper has been making the rounds recently. It was posted on arXiv in March, but seems to have attracted a lot of attention only lately, thanks to a talk by the second author.

The paper claims to discover two distinct phases of stochastic gradient descent (SGD): error minimization and representation compression. During the error minimization phase, training (and test) error goes down substantially, whereas during the representation compression phase, SGD basically pumps noise into the model and washes out the irrelevant bits of information about the input (irrelevant for the task at hand). The transition between these two phases seems to precisely coincide with a transition in the magnitudes of the means and variances of the gradients: the error minimization phase is dominated by the mean gradients and the compression phase is dominated by the variances of the gradients.

The authors formalize these ideas in terms of information, so error minimization becomes maximizing the mutual information between the labels $Y$ and the representation $T$, i.e. $I(T;Y)$, and representation compression becomes minimizing the mutual information between the input $X$ and the representation $T$, i.e. $I(X;T)$. The authors seem to imply that the compression phase is essential to understanding successful generalization in deep neural networks.
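For reference, the two quantities can be written out explicitly. The symbols here follow the usual information bottleneck convention and are an assumption about notation: $X$ is the input, $Y$ the label, $T$ a hidden-layer representation, and $\beta > 0$ a trade-off parameter. For discrete variables, mutual information is

$$
I(X;T) = \sum_{x,t} p(x,t)\,\log\frac{p(x,t)}{p(x)\,p(t)},
$$

and the standard information bottleneck objective trades compression against prediction:

$$
\min_{p(t\mid x)} \; I(X;T) - \beta\, I(T;Y).
$$

In this picture, "error minimization" corresponds to increasing $I(T;Y)$ and "compression" to decreasing $I(X;T)$.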

Unfortunately, I think information is not the right tool to use if our goal is to understand generalization in deep neural networks. Information is just too general a concept to be of much use in this context. I think a satisfactory explanation of generalization in deep networks would have to be both architecture- and data-dependent (see this and this). I have several specific issues with the paper:

**1)** They have results from a toy model only. It’s not clear if their results generalize to “realistic” networks and datasets. The talk mentions some results from a convolutional net trained on MNIST, but larger networks and more complex datasets (e.g. CIFAR-10 or CIFAR-100 at the very least) have to be tested in order to be sure that the results are general.

**2)** Networks right before they enter the compression phase already achieve good generalization performance (in fact, the transition to the compression phase seems to take place at a point where the test error starts to asymptote). This makes it questionable whether compression can really explain the bulk of the generalization performance. Relatedly, the compression phase seems to take place over an unrealistically long period. In practical problems, networks are rarely, if ever, trained for that long, which again calls into question the applicability of these results to more realistic settings.

**3)** I tried to replicate the drift vs. diffusion dominated phases of SGD claimed in the paper, but was not successful. Unfortunately, the paper is not entirely clear (another general weakness of the paper: lots of details are left out) about how they generate the crossing gradient mean vs std plot (Fig. 4) –what does even mean?–. So, I did what seemed most reasonable to me, namely computed the norms of the mean and standard deviation (std) of gradients (across a number of mini-batches) at each epoch. Comparing these two should give an indication of the relative size of the mean gradient compared to the size of the noise in the gradient. The dynamics seem to be diffusion (noise) dominated throughout, with no qualitatively distinct phases during training:

The network here is a simple fully-connected network with 3 hidden layers (128 units in each hidden layer) trained on MNIST (left) or CIFAR-100 (right). At the beginning of each epoch, the gradient means and stds were calculated from 100 estimates of the gradient (from 100 minibatches of size 500). The gradients are gradients with respect to the first layer weights. I didn’t normalize the gradients by the norm of the weights, but this would just scale the mean and std of the gradients the same way, so it shouldn’t change the qualitative pattern.
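The computation described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for the figure. The function name `gradient_mean_and_std_norms`, the flattened-list input format, and the use of the population variance (dividing by the number of minibatches) are all assumptions.

```python
import math

def gradient_mean_and_std_norms(grads):
    """Summarize a set of minibatch gradient estimates.

    grads: list of gradient estimates, one per minibatch, each a flat
           list of floats of equal length (e.g. first-layer weights).
    Returns (norm of the element-wise mean, norm of the element-wise std).
    """
    n = len(grads)
    dim = len(grads[0])
    # Element-wise mean across minibatches.
    mean = [sum(g[j] for g in grads) / n for j in range(dim)]
    # Element-wise population variance across minibatches (assumption).
    var = [sum((g[j] - mean[j]) ** 2 for g in grads) / n for j in range(dim)]
    mean_norm = math.sqrt(sum(m * m for m in mean))
    # The L2 norm of the std vector equals sqrt(sum of variances).
    std_norm = math.sqrt(sum(var))
    return mean_norm, std_norm

# Two toy "minibatch gradients" for a 2-parameter model.
mean_norm, std_norm = gradient_mean_and_std_norms([[1.0, 0.0], [3.0, 0.0]])
print(mean_norm, std_norm)
```

Multiplying every gradient by the same constant multiplies both returned values by that constant, so the ratio between them, which is what distinguishes a drift-dominated regime from a diffusion-dominated one, is unchanged.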

**4)** As long as the feedforward transformations are invertible, the networks never really lose any information either about the input or about the labels. So, they have to do something really *ad hoc* like adding noise or discretizing the activities to mold the whole thing into something that won’t give nonsensical results from an information-theoretic viewpoint. To me, this just shows the unsuitability of information-theoretic notions for understanding neural networks. In the paper, they use bounded tanh units in the hidden layers. To be able to make sense of information-theoretic quantities, they discretize the activations by dividing the activation range into equal-length bins. But then the whole thing basically depends on the discretization and it’s not even clear how this would work for unbounded ReLU units.
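To make the dependence on discretization concrete, here is a small sketch of a binning-based estimate of $I(X;T)$ for a deterministic layer with bounded activations. It is not the paper's estimator: the function names, the activation range of $[-1, 1]$ (as for tanh), and the toy data are assumptions for illustration.

```python
import math
from collections import Counter

def bin_index(a, n_bins, lo=-1.0, hi=1.0):
    """Map an activation in [lo, hi] to one of n_bins equal-width bins."""
    i = int((a - lo) / (hi - lo) * n_bins)
    return min(max(i, 0), n_bins - 1)  # clamp the edge case a == hi

def entropy_bits(symbols):
    """Empirical Shannon entropy (in bits) of a list of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def binned_mutual_information(inputs, activations, n_bins):
    """Estimate I(X;T) in bits, with T the tuple of binned activations."""
    t = [tuple(bin_index(a, n_bins) for a in act) for act in activations]
    joint = list(zip(inputs, t))
    # I(X;T) = H(X) + H(T) - H(X,T)
    return entropy_bits(inputs) + entropy_bits(t) - entropy_bits(joint)

# Four distinct inputs, each mapped to a different value of one tanh-like unit.
inputs = [0, 1, 2, 3]
activations = [[-0.9], [-0.1], [0.1], [0.9]]
print(binned_mutual_information(inputs, activations, n_bins=2))
print(binned_mutual_information(inputs, activations, n_bins=4))
```

The network and the data are identical in both calls, yet the coarser binning merges pairs of activations and reports less information about the input than the finer one. The estimate is a property of the bin width as much as of the network.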

**Update (3/5/18):** A new ICLR paper by Andrew Saxe et al. argues that the compression phase observed in the original paper is an artifact of the double-sided saturating nonlinearity used in that study. Saxe et al. observe no compression with the more widely used ReLU nonlinearity (nor in the linear case). Moreover, they show that even in cases where compression *is* observed, there is no causal relationship between compression and generalization (which seems to be consistent with the results from a more recent paper), and that in those cases it is not caused by the stochasticity of SGD. This pretty much refutes all claims of the original paper. There’s some back and forth between the authors of the original paper and the authors of this new study on the openreview page for those interested, but I personally find the results of this new study convincing, as most of them align with my own objections to the original paper.

1. Saying that “Information is too general” to explain Deep Learning is similar to saying that energy and entropy are too general to explain physics… fortunately, they do. The mutual information quantities we suggest act very similar to energy and entropy in statistical physics and crucially depend on the data and architecture, much more than standard simpler numbers like VC dimensions. See also my longer talk: https://youtu.be/RKvS958AqGY

2. The discretization of the units is inherent to any realistic network, either through noise, finite accuracy, or effective binning. The real maps to the layers are never lossless!

3. We actually repeated our analysis for both MNIST and CIFAR and when correctly measured we clearly see both the compression phase and the corresponding two gradient phases in both cases, even when using ReLU nonlinearities and CNN architectures. The generalization error improves through both phases, we never said otherwise. But the final and most important generalization improvement happens during the compression phase.

4. The two gradient phases were independently discovered and reported by others. See: https://medium.com/intuitionmachine/the-peculiar-behavior-of-deep-learning-loss-surfaces-330cb741ec17

4. The numerical results reported here are clearly incorrect as the gradients must decrease eventually when the training error saturates.

Naftali Tishby

Regarding the first point numbered four, that Medium post does not include an independent discovery of the two gradient phases as far as I can tell.

Agreed.

I believe this is Ravid Shwartz-Ziv’s code for the paper: https://github.com/ravidziv/IDNNs

The file `idnns/plots/plot_gradients.py` operationalizes the meanings of the metrics plotted to show the two gradient phases. Both are calculated over the gradients in an epoch. The norm of the mean gradient is the “gradient mean” and the thing labeled as if it is standard deviation is the norm of the element-wise variances (see lines 55-75, roughly).

I’m having difficulty using Shwartz-Ziv’s code to replicate the results of the paper, but I hope to correspond with him to resolve this.

Thanks! This is very useful. That’s basically how I calculated the mean and std of gradients in the figure in the post as well (I took the elementwise stds, instead of variances.) I’ll make my code available soon so people can replicate it.

I should have read further in the code: things are done on a cumulative sum basis by layer, so the value for layer two is the sum of the metric for layer one and layer two – that’s why the lines stack so nicely. And after that summing there is a square root of the sum of sums of variances (at the point of plotting) so I guess that’s why they call it a standard deviation.
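One way to read the stacking described above is sketched below. This is an illustration of the description in this comment only, not code from the IDNNs repository; the function name `stacked_std_values` and the per-layer input format are hypothetical.

```python
import math
from itertools import accumulate

def stacked_std_values(layer_variance_sums):
    """Turn per-layer sums of element-wise gradient variances into
    cumulative, square-rooted values, one per layer.

    layer_variance_sums[k] is the sum of element-wise variances for
    layer k. The value returned for layer k covers layers 0..k, so the
    sequence is non-decreasing and the plotted lines stack.
    """
    cumulative = accumulate(layer_variance_sums)
    # The square root is applied after the cumulative sum, at plotting time.
    return [math.sqrt(c) for c in cumulative]

print(stacked_std_values([4.0, 5.0, 16.0]))  # [2.0, 3.0, 5.0]
```

Under this reading, the curve for a later layer is not that layer's own gradient noise but the combined noise of it and every earlier layer, which matters when comparing curves across layers.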

Also, the y axis labels seem to get misaligned somehow, so be careful trying to read values from the axis labels.

I did some more work on this and reached out directly to Shwartz-Ziv and Tishby about my questions, but I never heard back from them. The write-up is here: https://planspace.org/20180213-failure_to_replicate_schwartz-ziv_and_tishby/
