Recurrence as an efficient way to achieve depth in neural networks
by Emin Orhan
We’ve recently discussed this paper on “feedback networks” by Zamir et al. in a journal club. The paper temporalizes static image recognition tasks and shows that recurrent nets trained to do the task at every time step perform quite well even early on during inference, and naturally implement a coarse-to-fine classification strategy in which the early outputs of the network correspond to coarse classifications that get refined over time. I really like the paper overall, and I especially appreciate and applaud their efforts to probe and understand how recurrence changes the nature of the representations learned in deep networks.

I have two main criticisms of the paper. The first is mostly terminological: I think the use of the term “feedback” in “feedback networks” is misleading. What they really mean is just “recurrence” (as in vanilla recurrent neural networks or LSTMs), whereas “feedback” implies something more specific: a set of connections that is functionally and architecturally (i.e., structurally) distinct from feedforward and lateral (within-layer) connections. The second criticism is that, again contrary to what the title implies, the crucial manipulation in the paper has nothing to do with the architecture of the network per se, but with how it is trained: specifically, whether the outputs of the intermediate layers are also trained to do the task. This is the unambiguous conclusion of the results reported in the crucial Table 4, and this particular training method can be implemented in any type of network, be it purely feedforward or recurrent.
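To make this training manipulation concrete, here is a minimal PyTorch-style sketch of training a weight-shared recurrent net with the task loss attached to the output at every time step rather than only at the final one. The layer sizes, readout, and unrolling length are purely illustrative assumptions of mine, not the paper’s architecture:

```python
import torch
import torch.nn as nn

class UnrolledRecurrentClassifier(nn.Module):
    """Illustrative recurrent classifier unrolled for a fixed number of steps."""
    def __init__(self, in_dim=784, hidden=256, n_classes=10, steps=4):
        super().__init__()
        self.inp = nn.Linear(in_dim, hidden)
        self.rec = nn.Linear(hidden, hidden)          # weights shared across all steps
        self.readout = nn.Linear(hidden, n_classes)   # shared readout applied at every step
        self.steps = steps

    def forward(self, x):
        h = torch.tanh(self.inp(x))
        outputs = []
        for _ in range(self.steps):
            h = torch.tanh(self.rec(h))
            outputs.append(self.readout(h))           # an output at every time step
        return outputs

model = UnrolledRecurrentClassifier()
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))

# Train the network "at all times": sum the task loss over every intermediate output.
loss = sum(criterion(out, y) for out in model(x))
loss.backward()
```

Dropping all but the final term of the summed loss recovers the standard setup in which only the final output is trained.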
This second point made me think about what actual advantages (or disadvantages) a recurrent architecture specifically bestows on a neural net, and I realized something that I perhaps knew only implicitly: recurrence is an efficient way of achieving depth in a neural network. It’s well known that a recurrent net unrolled in time is equivalent to a feedforward network with weight sharing across layers. So, with $n$ units per layer, a recurrent net achieves depth $d$ with $n$ neurons and on the order of $n^2$ synapses, whereas a feedforward network achieves the same depth with $dn$ neurons and on the order of $dn^2$ synapses. Of course, the recurrent net is less expressive than the corresponding feedforward net due to weight sharing across layers, but both the feedback network paper and an earlier paper by Liao & Poggio show that one doesn’t lose much in terms of performance by sacrificing this extra bit of expressivity.

Intriguingly, this could also explain why even in highly visual animals such as ourselves and other primates, one doesn’t find very deep visual cortices. Instead, one finds only a handful of hierarchical visual areas (~5 in primates), but lots of recurrence both within the same area and across areas. This then raises the opposite question: if recurrence is so efficient, why isn’t the whole visual cortex, or even the entire brain, a fully recurrent net? I suspect that there is an interesting trade-off between expressivity and efficiency, and our visual cortices might be striking a balance between the two. But fleshing out this idea requires some work.
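To put the counting argument above in concrete terms, here is a minimal PyTorch sketch comparing the parameter count of a single weight-shared layer reused d times against a stack of d untied layers of the same width. The width n and depth d are arbitrary illustrative values, not taken from either paper:

```python
import torch.nn as nn

n, d = 512, 10  # layer width and effective depth (illustrative values)

# Recurrent: one n x n weight matrix reused at every unrolled step.
recurrent = nn.Linear(n, n)
rec_params = sum(p.numel() for p in recurrent.parameters())

# Feedforward: d distinct n x n layers, one per level of depth.
feedforward = nn.Sequential(*[nn.Linear(n, n) for _ in range(d)])
ff_params = sum(p.numel() for p in feedforward.parameters())

print(rec_params)               # ~n^2 synapses (plus n biases)
print(ff_params)                # ~d * n^2 synapses
print(ff_params / rec_params)   # roughly a d-fold difference for the same depth
```

The ratio comes out to roughly d, which is just the weight-sharing argument restated in code.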
Great insights! Enjoyed reading through them.
A few comments on your points:
re: the first criticism on feedback networks, perhaps Sec. 3 of the supplementary material (http://goo.gl/5Ttx08) or Table 4 in the paper could add insights. They both argue that recurrence alone wouldn’t bring about the same results as feedback. It appears that indeed both the ‘recurrence’ and the ‘output->input’ characteristics are essential (the two requirements listed at the beginning of Sec. 3 of the paper), and removing either one meaningfully changes the results.
re: the second criticism, I basically view existing LSTMs/RNNs as one way of instantiating feedback-based learning if they’re employed in a particular way, since they can fulfill the basic requirements of feedback (Sec. 3). Indeed, feedback networks don’t bring any new mechanism that fundamentally didn’t exist in RNNs (at least in the current instantiation), though existing recurrent models may not be the best fit for instantiating feedback (see Sec. 6 of the supplementary). However, I would argue that this kind of “feedback”-like training can’t be realized through pure feedforward networks unless core changes like mid-network losses and weight-sharing are employed, in which case I don’t really call the network purely feedforward anymore. The efforts towards gaining the same effects of feedback by adding mid-network losses to feedforward networks failed, suggesting that weight-sharing (i.e. ~recurrence) plays a fundamental role in conjunction with mid-way losses (i.e. ~output->input).
re: your last point/question, “if recurrence is so efficient, why isn’t the whole visual cortex, or even the entire brain, a fully recurrent net?”: that’s a very interesting catch. I would suggest you see Sec. 5.2 of the supplementary material, where the experiments suggest that neither a purely feedforward nor a purely recurrent network is the best choice, and that a certain minimum physical depth and iteration count are required for the best performance. Current neural networks are nowhere close to the brain, but interestingly the observations match in this case.
Thanks for the very helpful clarifications! You said that “the efforts towards gaining the same effects of feedback by adding mid-network losses to feedforward networks failed, suggesting that weight-sharing (i.e. ~recurrence) plays a fundamental role in conjunction with mid-way losses (i.e. ~output->input).” I was wondering if this result is reported in the paper: i.e. results for a feedforward net with depth L, trained at M layers (rather than just at the top)? I was interested in precisely this result, but couldn’t find it in the paper.
I find this result quite fascinating and would like to understand it better. Feedforward nets are clearly more expressive than recurrent nets because of weight sharing in recurrent nets. So, my naive expectation would be that they should work at least as well as recurrent nets. So, why doesn’t mid-layer training work as well in feedforward nets? I don’t think it’s a matter of overfitting. But, it could be that when weights aren’t shared, losses at different layers push the weights in conflicting directions, whereas they are more coordinated when weights are shared. It would be nice to have a simple model of this effect.
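A first stab at such a simple model (purely a toy probe of my own, not an experiment from the paper) might be to compare how aligned the gradients induced by an early-output loss and by the final-output loss are on the first layer’s weights, with and without weight tying:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_alignment(tied: bool, width=64, depth=4, n_classes=10):
    """Cosine similarity between early-loss and final-loss gradients on the first layer."""
    torch.manual_seed(0)
    if tied:
        layers = [nn.Linear(width, width)] * depth                 # one shared weight matrix
    else:
        layers = [nn.Linear(width, width) for _ in range(depth)]   # untied weights
    readout = nn.Linear(width, n_classes)
    x, y = torch.randn(32, width), torch.randint(0, n_classes, (32,))

    h, outs = x, []
    for layer in layers:
        h = torch.tanh(layer(h))
        outs.append(readout(h))          # an output (and hence a loss) at every layer

    # Gradients of the earliest and the final losses w.r.t. the first layer's weights.
    w = layers[0].weight
    g_early = torch.autograd.grad(F.cross_entropy(outs[0], y), w, retain_graph=True)[0].flatten()
    g_late = torch.autograd.grad(F.cross_entropy(outs[-1], y), w)[0].flatten()
    return F.cosine_similarity(g_early, g_late, dim=0).item()

print("tied:  ", grad_alignment(tied=True))
print("untied:", grad_alignment(tied=False))
```

If the tied case consistently shows higher cosine similarity than the untied one, that would be at least weak evidence for the coordination story above.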
re: “results for a feedforward net with depth L, trained at M layers (rather than just at the top)?”. This is the feedforward network with auxiliary losses (Sec. 4.1 and Fig. 6), but with the entire network trained/fine-tuned using all the auxiliary losses. We did that experiment, but we didn’t report it because, no matter how much we tried, the results were always poorer than the ones currently reported in Fig. 6. The best feedforward results were (comparable to the curves in Fig. 6): 6.8%, 10.2%, 13.1%, 13.0%, 59.8%, 66.3%, 68.5% for depths 8, 12, 16, 20, 24, 28, and 32, respectively. The fully trained feedforward net with auxiliary losses always sacrificed either the early or the endpoint performance (which is not too surprising). I think similar observations were made in the Inception papers.
[Will add this to the next arXiv update. Thanks for pointing it out!]
re: “why this happens”. I agree, and it is clear that, theoretically speaking, feedforward networks are more expressive because they don’t have the constraint imposed by weight-shared recurrence, but getting them to train to do what you want them to do is the challenge. The hypothetical case with conflicting losses that you mentioned does indeed appear to be (at least partially) happening, as evinced by the networks yielding satisfactory performance for either the early or the endpoint predictions, but not both (the experiment above). We noticed that throwing many more parameters into feedback-like networks doesn’t easily lead to overfitting; recurrence/weight-sharing plus episodic losses appears to act as a good internal regularizer, whereas feedforward networks appeared unnaturally “too unconstrained” to cope. See Table 6 in the supplementary, where the larger feedforward nets start to overfit, while feedback nets with significantly higher parameter counts (i.e. larger physical depth) didn’t experience a considerable drop.