A note on fast weights: do they really work as advertised?

In fields where one cannot prove one’s claims mathematically, it is immensely important to be as rigorous as possible. Among other things, this implies considering all plausible alternative hypotheses to one’s own hypothesis and making sure that the results cannot be explained by those alternative hypotheses.  People working in experimental fields are usually quite careful about this. Machine learning is also, by and large, an experimental field (in the sense that usually one cannot prove one’s claims with mathematical rigour), so similar standards of experimental rigour should apply there as well.

I was reminded of this when I was re-reading the “fast weights” paper by Ba, Hinton et al. In this paper, they extend the standard vanilla recurrent neural network architecture with some form of Hebbian short-term synaptic plasticity. This Hebbian connectivity maintains a dynamically changing short-term memory of the recent history of the activities of the units in the network. They call this Hebbian connectivity “fast weights” as opposed to the standard “slow” recurrent connectivity.

If I haven’t made a mistake, the equation governing the model can be condensed into the following form (I’m assuming that their inner loop of Hebbian plasticity is unrolled for one step only, i.e. S=1):

\mathbf{h}_{t+1} = f\Big( \mathcal{LN}\Big[\mathbf{W}_h \mathbf{h}_t + \mathbf{W}_x \mathbf{x}_t + \big(\eta \sum_{\tau=1}^{\tau=t-1}\lambda^{t-\tau-1}\mathbf{h}_\tau \mathbf{h}_\tau^{\top}\big) f(\mathbf{W}_h \mathbf{h}_t + \mathbf{W}_x \mathbf{x}_t) \Big] \Big)

where \mathcal{LN}[\cdot] refers to layer normalization. The fast weight matrix here is given by the Hebbian matrix \eta \sum_{\tau=1}^{\tau=t-1}\lambda^{t-\tau-1}\mathbf{h}_\tau \mathbf{h}_\tau^{\top}.

The authors show that this model performs better than a number of other recurrent models in a couple of tasks. They seem to think that the Hebbian fast weight matrix is crucial (and mainly responsible) for the superior performance of the model. However, there are two obvious controls that are sorely missing in the paper and that makes a fair assessment of the contribution of fast weights difficult. First, the model has more depth than the standard architectures (note the nested f(\cdot)s in the equation above), so maybe there’s nothing special about the Hebbian fast weight matrix, rather the advantage is mainly due to the added depth. This could be tested by replacing the Hebbian matrix by another matrix, e.g. a random, fixed matrix or the identity matrix. In this connection, it’s interesting to observe that the authors chose this fairly complicated architecture rather than simpler ways of incorporating a Hebbian component to the recurrent connectivity, for example something like this:

\mathbf{h}_{t+1} = f\Big( \mathcal{LN} \Big[ \big[\mathbf{W}_{h} + \eta \sum_{\tau=1}^{\tau=t-1}\lambda^{t-\tau-1}\mathbf{h}_\tau \mathbf{h}_\tau^{\top} \big] \mathbf{h}_{t} + \mathbf{W}_{x} \mathbf{x}_{t} \Big] \Big)

which doesn’t increase the effective depth and would in fact be closer to the biological idea of short-term synaptic plasticity (this, in fact, appears to be how fast weights were originally defined in Hinton’s old “fast weights” paper). This may be because they tried something simple like this and it didn’t work well, so they found the added complexity necessary. But then it would be important to understand why fast weights don’t work in this simple context and require the more complicated machinery.

Secondly, layer normalization by itself improves the performance of vanilla recurrent networks, so we know that part of the improvement is due to normalization, again the question is how much. The necessary control to run here is a vanilla recurrent network with layer normalization.

I’m currently working on running these (and some other) controls, so I should hopefully have some answers to these questions soon.