Optimal synaptic memory consolidation

One of the fundamental challenges facing the brain is a trade-off between what we might loosely call learning and memory. On the one hand, we want our synapses to be highly plastic so that we can learn quickly from our experiences. On the other hand, we want our memories to be stable, so that they are not easily overwritten by the constant barrage of experiences we face every day, and this stability seems to require rigid synapses. So, what is the optimal way to strike a balance between these two conflicting requirements?

Building on a series of previous papers, Benna & Fusi (2016) address this question in the context of a highly simplified model of synaptic modifications. In this model, the current value of a synapse is assumed to be a linear superposition of all the past modifications made to that synapse, each weighted by a function of the time elapsed since that modification:

w_a(t) = \sum_{t^\prime} \Delta w_a (t^\prime) r(t-t^\prime)

Here a indexes the synapse, \Delta w_a (t^\prime) is the synaptic modification made at time t^\prime and r(t-t^\prime) is the weighting function that determines the effect of a modification made at time t^\prime on the current value of the synapse. Another important assumption is that the modifications are unit-magnitude, equal-probability depression or potentiation events, i.e. \Delta w_a(t^\prime)=\pm 1, that arrive at the synapse at a constant rate and are uncorrelated across time. It is assumed that there are N such synapses in total, uncorrelated with each other.
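As a sanity check on this setup, here is a minimal simulation of the superposition model (my own sketch, not the authors' code; the synapse count, the number of events and the decay kernel are arbitrary placeholder choices):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1000   # number of synapses
T = 500    # number of past modification events (one per time step here)

# Random +/-1 modification patterns, uncorrelated across time and across synapses
delta_w = rng.choice([-1.0, 1.0], size=(T, N))

def r(tau):
    """Placeholder decay kernel: weight of a modification made tau steps ago (tau >= 1).
    Which choice of r is optimal is exactly the question the paper asks."""
    return np.exp(-tau / 100.0)

# Current synaptic values at time T: w_a(T) = sum_{t'} Delta w_a(t') * r(T - t')
ages = np.arange(T, 0, -1)                    # age T - t' of each past modification
w = (r(ages)[:, None] * delta_w).sum(axis=0)  # shape (N,)
```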

To formalize the plasticity-rigidity trade-off, they define a signal-to-noise ratio that quantifies how well we can decode a past synaptic modification event from the current values of the synapses. The signal is defined as the mean overlap between the current values of the synapses and the past synaptic modification pattern:

\mathcal{S}_{t^\prime}(t) \equiv \frac{1}{N}  \langle \sum_{a=1}^N w_a(t) \Delta w_a(t^\prime)  \rangle

and the noise is defined as the standard deviation of this overlap:

\mathcal{N}_{t^\prime}(t) \equiv \sqrt{ \frac{1}{N^2} \langle (\sum_{a=1}^N w_a(t) \Delta w_a(t^\prime) )^2 \rangle - \mathcal{S}^{2}_{t^\prime}(t) }

where the angle brackets denote averages over the stochastic modifications. With these assumptions, it is then straightforward to derive that the signal-to-noise ratio for a memory induced at time t^\prime is proportional to:

SNR_{t^\prime} (t) \propto r(t-t^\prime) / \sqrt{\sum_{l: t_l<t} r(t-t_l)^2}

Heuristically, we can see from this equation that the slowest-decaying r(t) we can afford before the variance term diverges is r(t) \approx 1 / \sqrt{t} (to see this, note that plugging it into the denominator gives the harmonic series). One can make this argument more formal by writing down an objective (e.g. the area under the SNR(t) curve above an arbitrary threshold), optimizing with respect to r(t), and finding that r(t) \approx 1 / \sqrt{t} indeed gives the correct answer, but I won't do that here. This is the first main result of the paper (and to me the more important one). One can also show that a system displaying the 1/\sqrt{t} decay achieves an almost extensive N / \log N memory capacity (capacity is defined as the time at which SNR(t) drops to an arbitrary fixed value, which they take to be 1 in the paper).
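As a quick numerical check of both the SNR expression and the 1/\sqrt{t} heuristic, here is a small Monte Carlo estimate (again my own sketch; N, T, the probed memory and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

N, T = 2000, 200   # number of synapses, number of modification events
t_mem = 50         # index of the past memory we try to decode
n_trials = 200     # Monte Carlo trials used to estimate the signal and the noise

def r(tau):
    return 1.0 / np.sqrt(tau)   # the (claimed optimal) decay kernel

ages = np.arange(T, 0, -1)      # age T - t' of each past modification at time T

overlaps = []
for _ in range(n_trials):
    delta_w = rng.choice([-1.0, 1.0], size=(T, N))
    w = (r(ages)[:, None] * delta_w).sum(axis=0)   # current synaptic values
    overlaps.append((w * delta_w[t_mem]).mean())   # (1/N) sum_a w_a(T) Delta w_a(t_mem)

signal = np.mean(overlaps)   # estimate of the signal S
noise = np.std(overlaps)     # estimate of the noise (std of the overlap)
print("empirical SNR :", signal / noise)

# Prediction from the formula above, including the sqrt(N) prefactor that
# comes from averaging over N independent synapses
pred = np.sqrt(N) * r(T - t_mem) / np.sqrt((r(ages) ** 2).sum())
print("predicted SNR :", pred)
```

The two numbers should agree up to Monte Carlo error.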

The rest of the paper is devoted to coming up with a hand-crafted, tractable dynamical system (which can be interpreted as describing the internal dynamics of a complex synapse model) that would display the desired 1/\sqrt{t} decay. The solution they come up with is based on the heat equation:

\frac{\partial u}{\partial t} = D \frac{\partial^2 u}{\partial x^2}

where D \propto g/C is the diffusion coefficient, with C the heat capacity and g the conductivity. Recall that the solution of the heat equation with a localized initial perturbation at x=0 already displays the required 1/\sqrt{t} decay at x=0. However, this naive solution is not good enough, because any discretized version of it would require too many variables (\sim \sqrt{N} of them) to sustain the 1/\sqrt{t} decay over a sufficiently long time. The “patch” they offer for this problem is an inhomogeneous heat equation with an exponentially increasing heat capacity, C(x) = \exp(\beta x), and an exponentially decreasing conductivity, g(x) = \exp(-\beta x). This slows down the diffusion along the x direction and requires only \sim \log N variables to implement the desired 1/\sqrt{t} decay in a discretized version.
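To make the discretized, inhomogeneous version concrete, here is a minimal sketch of a chain of interacting variables in the spirit of their model (my own discretization, not their exact parameterization; the number of variables m, the growth factor \beta, the time step and the duration are placeholder choices):

```python
import numpy as np

m = 8                  # number of interacting variables (~ log of the covered timescale range)
beta = np.log(2.0)     # capacities grow, conductances shrink, by a factor of 2 per level
C = np.exp(beta * np.arange(m))       # "heat capacities" C_k = exp(beta * k)
g = np.exp(-beta * np.arange(m - 1))  # "conductances" g_k = exp(-beta * k), linking k and k+1

dt = 0.01
n_steps = 200_000
u = np.zeros(m)
u[0] = 1.0             # a single potentiation event at t = 0, deposited into the first variable

trace = []
for _ in range(n_steps):
    flux = g * (u[1:] - u[:-1])   # flux flowing from variable k+1 into variable k
    du = np.zeros(m)
    du[:-1] += flux
    du[1:] -= flux
    u += dt * du / C              # Euler step of C_k du_k/dt = sum of fluxes into variable k
    trace.append(u[0])

# u[0] plays the role of the synaptic weight; over a wide intermediate range of t it
# should decay roughly like 1/sqrt(t), before the chain eventually equilibrates.
```

The geometric scaling of C and g is what lets a handful of variables cover exponentially many timescales; with homogeneous parameters, covering the same range would take on the order of \sqrt{N} variables, as noted above.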

Although I find this model informative for elucidating the kinds of mechanisms required to achieve extensive memory capacity efficiently under the studied scenario (e.g. multiple variables with exponentially differing time scales interacting with each other), I have two basic issues with the paper. First, the assumptions and simplifications they make to render the problem tractable also make it somewhat uninteresting from a practical viewpoint: uncorrelated events arriving at uncorrelated synapses is not a practically relevant scenario. The studied scenario also takes no account of the importance of a synapse for prior experiences (cf. Kirkpatrick et al., 2017). Incidentally, Kirkpatrick et al. (2017) show that any synapse that optimizes their elastic weight consolidation objective (which combines the loss in the current episode with a penalty on deviations from the synapse's current value, weighted by the importance of that synapse for prior episodes) automatically satisfies the optimal 1/\sqrt{t} decay under a scenario similar to that studied in Benna & Fusi (2016). This result suggests that the 1/\sqrt{t} decay might naturally fall out of other objectives that the synapse has to optimize.
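For reference, the elastic weight consolidation objective of Kirkpatrick et al. (2017) has the form (in their notation, with \mathcal{L}_B(\theta) the loss on the current task, \theta^{*}_{A} the parameters learned on the previous task, F_i the Fisher information measuring the importance of parameter i, and \lambda a trade-off constant):

\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2} F_i \left(\theta_i - \theta^{*}_{A,i}\right)^2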

Secondly, the desire to come up with a tractable model also restricts the authors to a linear model (i.e. the heat equation). But there is no reason synapses should be linear; nonlinear models might, in fact, perform better. There is not much room for improvement over the linear model proposed in the paper, since both the 1/\sqrt{t} decay and the nearly extensive memory capacity are optimal under the studied scenario. But perhaps the \sim \log N discrete variables required by the linear model could be reduced with a nonlinear model, perhaps the robustness of the model could be improved, or perhaps a nonlinear model could outperform any linear model under more realistic scenarios not studied in this paper.

So, an alternative approach would be to model the synapse as a nonlinear dynamical system (e.g. an RNN) and optimize its parameters under more realistic scenarios (one can think of more sophisticated variations on this idea to learn more interpretable solutions). I find this approach very useful because (i) the hand-crafted solutions we humans come up with usually tend to be highly special (e.g. an attractor with all eigenvalues equal to 1), whereas nature prefers more generic solutions, and (ii) with a hand-crafted model we can rarely beat, or even match, the performance of an optimized nonlinear model even in relatively simple problems, so such optimized models are useful for delineating the contours of what is achievable and for giving us valuable hints about more general principles.