Backpropagation works fine with random feedback weights

Neural networks are trained with the backpropagation algorithm. The backpropagation algorithm has a nice and simple interpretation in terms of backpropagating errors from the top layer to the lower layers of the network during training. For example, in the case of multi-layer feedforward linear networks, the update for a particular weight between layers l and l+1 can be written as:

\Delta W_{ij} \propto \gamma^\top W_{l+1}^\top W_{l+2}^\top \ldots W_{L-1}^\top \mathbf{e}

Here W_m are the forward weights, \mathbf{e} is the error vector and \gamma is a vector that depends only on the activities of the pre- and post-synaptic neurons of this particular connection. We can easily interpret this expression as the backpropagation of the error vector \mathbf{e} from the top of the network to the rest of it through feedback weights W_m^\top. Interpreting it this way has some biological appeal, since the brain is known to contain an abundance of feedback projections whose function is rather unclear. However, we have a problem. In order to make this idea work, we need to assume that feedback projections are always symmetric to the feed-forward projections, which is not particularly realistic for the brain. So, perhaps the first thing one would try is to see how important it is to maintain this precise symmetry between the forward and the backward projections. For example, what if we replace W_m^\top in the backward projections by some random and fixed matrices B_m. This is precisely what they do in this recent paper by Lillicrap et al. (2016). This is such a simple and beautiful idea that I’m surprised nobody tried it before!

Amazingly, this works just fine. In fact, it seems to work slightly better than standard backprop in most cases they consider in the paper. This could be because the algorithm with fixed random feedback has elements of second-order optimization (in particular Gauss-Newton), the authors argue. Formally, they show that in linear networks, the error feedback projected by fixed, random feedback weights is always a positive scalar multiple of the error feedback that would be sent by the pseudo-inverse of the forward weights, W^+. The significance of this is that if the feedback weights exactly matched the pseudo-inverse, W^+, they would be implementing an approximation to the Gauss-Newton method. Of course this argument is heuristic: the algorithm doesn’t really implement a second-order descent algorithm; it doesn’t even correspond to minimizing any loss function as the dynamics it describes is non-conservative, as the authors note.

It’s interesting to note that during learning forward weights come to resemble (but not exactly match) the pseudo-inverse of feedback weights (more than they come to resemble the transpose of feedback weights). Had the forward weights exactly matched the pseudo-inverse of feedback weights, that would mean that you don’t really need to train the networks, because the pseudo-inverse of a random matrix is a random matrix! So, the network seems to try to satisfy two conditions at the same time: on the one hand it tries to align the forward weights to the pseudo-inverse of feedback weights so that the feedback weights can backpropagate useful error signals, but on the other hand, it tries to do the task which requires that the feed-forward weights be non-random (and indeed orthogonal to any random matrix on average). So, there seems to be an interesting trade-off here. Of course, this “the-network-tries-to” language is a bit misleading, because as mentioned above, the dynamics doesn’t really correspond exactly to minimizing any objective, but effectively that seems to be what’s happening.

Like all good papers, this paper opens up a lot of interesting questions. Here are some I’m particularly interested in:

  • Although the authors’ efforts to analyze the algorithm are laudable, it would be nice to have a better, more rigorous understanding of what exactly the algorithm is doing.
  • Except for the spiking network example, they ignore any nonlinearities in the backward projection path (even in non-linear networks). This is not biologically plausible. It would be interesting to explore what happens when non-linearities in the backward pathway are included.
  • Can we somehow do better than the completely random feedback matrices B? What is the optimal fixed feedback matrix?
  • Most importantly, feedback connections are plastic too. Does learning become even faster with plastic feedback weights?