Review from Emin Orhan about Benna and Fusi (2016).

Also, the y-axis labels seem to get misaligned somehow, so be careful when trying to read values off them.

The file `idnns/plots/plot_gradients.py` defines the metrics plotted to show the two gradient phases. Both are computed over the gradients collected within an epoch: the "gradient mean" is the norm of the mean gradient, and the quantity labeled as a standard deviation is actually the norm of the element-wise variances (see lines 55-75, roughly).
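As a concrete reading of that description, here is a minimal sketch of the two quantities. This is not the repository's actual code, and the input format (`epoch_grads` as a batches × parameters array of flattened per-batch gradients) is an assumption:

```python
import numpy as np

def gradient_phase_metrics(epoch_grads):
    """Compute the two metrics plotted for the gradient phases.

    epoch_grads: array of shape (n_batches, n_params), the flattened
    per-batch gradients collected during one epoch (hypothetical
    input format; the original code stores them per layer).
    """
    epoch_grads = np.asarray(epoch_grads)
    # "gradient mean": norm of the mean gradient vector over the epoch
    grad_mean = np.linalg.norm(epoch_grads.mean(axis=0))
    # the quantity labeled as "std": norm of the element-wise variances
    grad_var_norm = np.linalg.norm(epoch_grads.var(axis=0))
    return grad_mean, grad_var_norm
```

Note that the second quantity is the norm of the variances, not of the standard deviations, which is why reading it as a true "std" can be misleading.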

I’m having difficulty using Shwartz-Ziv’s code to replicate the results of the paper, but I hope to correspond with him to resolve this.

2. The discretization of the units is inherent to any realistic network, whether through noise, finite precision, or effective binning. The real-valued maps computed by the layers are never lossless!
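To make "effective binning" concrete, here is a minimal sketch of the uniform discretization typically used in binned mutual-information estimates. The bin count and the [-1, 1] range are assumptions (appropriate, e.g., for tanh units):

```python
import numpy as np

def bin_activations(acts, n_bins=30, low=-1.0, high=1.0):
    """Discretize real-valued layer activations into n_bins uniform bins.

    Returns an integer bin index in [0, n_bins - 1] for each activation;
    the joint distribution of these indices with the input/label is what
    a binned mutual-information estimate is computed from.
    """
    edges = np.linspace(low, high, n_bins + 1)
    # interior edges only, so indices run from 0 to n_bins - 1
    return np.digitize(acts, edges[1:-1])
```

The choice of `n_bins` matters: with too few bins the map looks artificially lossy, and with too many it looks artificially lossless, which is one reason binned estimates are sensitive to this parameter.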

3. We actually repeated our analysis for both MNIST and CIFAR, and when correctly measured we clearly see both the compression phase and the corresponding two gradient phases in both cases, even with ReLU nonlinearities and CNN architectures. The generalization error improves through both phases; we never said otherwise. But the final and most important generalization improvement happens during the compression phase.

4. The two gradient phases were independently discovered and reported by others. See: https://medium.com/intuitionmachine/the-peculiar-behavior-of-deep-learning-loss-surfaces-330cb741ec17

5. The numerical results reported here are clearly incorrect, as the gradients must decrease eventually once the training error saturates.

Naftali Tishby

[will add this to the next arxiv update. Thanks for pointing it out!]

re: “why this happens”. I agree, and it is clear that, theoretically speaking, feedforward networks are more expressive because they lack the constraint imposed by weight-shared recurrence; the challenge is getting them to train to do what you want. The hypothetical case with conflicting losses you mentioned does indeed appear to be (at least partially) happening, as evidenced by satisfactory performance for either early or endpoint predictions, but not both (above experiment). We noticed that adding many more parameters to feedback-like networks does not easily lead to overfitting; recurrence/weight-sharing plus episodic losses appears to act as a good internal regularizer, whereas feedforward networks appeared unnaturally `too unconstrained` to cope. See Table 6, where the larger feedforward nets start to overfit, while feedback nets with significantly higher parameter counts (i.e., larger physical depth) did not experience a considerable drop (supplementary).
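The parameter-count contrast behind that regularization argument can be sketched with a toy model. Everything here is hypothetical (square width×width weight matrices, biases ignored, `physical_depth` standing for the number of unrolled steps), but it shows why weight sharing acts as a hard capacity constraint:

```python
def param_count(width, physical_depth, weight_shared):
    """Count weight-matrix parameters for a stack of `physical_depth`
    width x width layers (toy model, biases ignored).

    With weight sharing (the feedback/recurrent case) one matrix is
    reused at every step, so the count is independent of physical depth;
    without it (the feedforward case) each step has its own matrix.
    """
    per_layer = width * width
    return per_layer if weight_shared else per_layer * physical_depth
```

So doubling the physical depth of a weight-shared feedback net adds no new parameters, while the matched feedforward net's parameter count grows linearly with depth, which is consistent with the feedforward nets being the ones that start to overfit at scale.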
