We’ve recently discussed this paper on “feedback networks” by Zamir et al. in a journal club. The paper temporalizes static image recognition tasks and shows that recurrent nets that are trained to do the task at all times perform quite well even early on during inference and naturally implement a coarse-to-fine classification strategy where early outputs of the network correspond to coarse classifications that get refined over time. I really like the paper overall. I especially appreciate and applaud their efforts to probe and understand how recurrence changes the nature of the representations learned in deep networks. I basically have two criticisms of the paper. The first is mostly terminological: I think the use of the term “feedback” in feedback networks is misleading. What they rather mean is just “recurrence” (as in recurrence in vanilla recurrent neural networks or in LSTMs), whereas “feedback” implies something more specific, i.e. a set of connections functionally and architecturally (architecturally=having a different structure) distinct from feedforward and lateral (within-layer) connections. The second criticism is that, again unlike what the title implies, the crucial manipulation of the paper has actually nothing to do with the architecture of the network per se, but how it is trained: specifically whether the outputs of the intermediate layers are also trained to do the task or not. This is the unambiguous conclusion of the results reported in the crucial Table 4. This particular training method can be implemented in any type of network be it purely feedforward or recurrent.
This second point made me think about what actual advantages (or disadvantages) a recurrent architecture specifically bestows on a neural net and I realized something that I perhaps knew only implicitly: recurrence is an efficient way of achieving depth in a neural network. It’s well known that a recurrent net unrolled in time is equivalent to a feedforward network with weight sharing across layers. So, a recurrent net achieves depth with neurons and synapses, whereas a feedforward network achieves the same depth with neurons and synapses. Of course, the recurrent net is less expressive than the corresponding feedforward net due to weight sharing across layers, but both the feedback network paper and an earlier paper by Liao & Poggio show that one doesn’t lose much in terms of performance by sacrificing this extra bit of expressivity. Intriguingly, this could also explain why even in highly visual animals such as ourselves and other primates, one doesn’t find very deep visual cortices. Instead, one finds only a handful of hierarchical visual areas (~5 in primates), but lots of recurrence both within the same area and across areas. This then raises the opposite question: if recurrence is so efficient, why isn’t the whole visual cortex, or even the entire brain, a fully recurrent net? I suspect that there is an interesting trade-off between expressivity and efficiency and our visual cortices might be striking a balance between the two. But, fleshing out this idea requires some work.