### The relative value of learning over memorizing

#### by Emin Orhan

At the end of my last post, I mentioned the possibility that a large episodic memory might obviate the need for sophisticated learning algorithms. As a fun and potentially informative exercise, I decided to quantify this argument with a little experiment. Specifically, given a finite amount of data, I wanted to quantify the relative value of learning from that data (*i.e.* by updating the parameters of a model using that data) vs. just memorizing the data.

To do this, I compared models that employ a mixture of learning and memorizing strategies. Given a finite amount of “training” data, a *k*%-learner uses *k*% of this data for learning and memorizes the rest of the data using a simple key-value cache memory. A 100%-learner is a pure learner, the typical case in machine learning. For the learning model, I used a ResNet-32 and for the memory model, I used the cache model described in this paper. The predictions of a *k*%-learner are given by a linear combination of the predictions obtained from the learner (ResNet-32) and the predictions obtained from the cache memory:

*prediction = w × (prediction from the learning model) + (1 − w) × (prediction from the cache memory)*

where *w* is a hyper-parameter that is estimated separately for each *k*%-learner (I assume that the cost of learning a single hyper-parameter is negligible compared to the cost of learning the parameters of a model).
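To make the mixture concrete, here is a minimal NumPy sketch of a *k*%-learner's prediction step. It is only an illustration under stated assumptions, not the exact cache model from the paper: the similarity-weighted vote in `cache_predict`, the sharpness parameter `theta`, the grid search over *w*, and all function names are hypothetical choices of mine. The keys stand in for features extracted by the partially trained network.

```python
import numpy as np

def cache_predict(query_keys, cache_keys, cache_labels, num_classes, theta=10.0):
    """Similarity-weighted vote over memorized (key, label) pairs.

    Assumption: keys are feature vectors (e.g. taken from the partially
    trained network) and theta is a hypothetical sharpness parameter.
    """
    # Normalize so that the dot product is a cosine similarity.
    q = query_keys / np.linalg.norm(query_keys, axis=1, keepdims=True)
    k = cache_keys / np.linalg.norm(cache_keys, axis=1, keepdims=True)
    sims = q @ k.T  # shape: (n_query, n_cache)
    # Subtract the row-wise max before exponentiating for numerical stability.
    weights = np.exp(theta * (sims - sims.max(axis=1, keepdims=True)))
    one_hot = np.eye(num_classes)[cache_labels]  # shape: (n_cache, num_classes)
    votes = weights @ one_hot
    return votes / votes.sum(axis=1, keepdims=True)

def mix_predictions(p_learner, p_cache, w):
    """prediction = w * learner prediction + (1 - w) * cache prediction."""
    return w * p_learner + (1.0 - w) * p_cache

def fit_mixing_weight(p_learner, p_cache, labels, grid=None):
    """Pick w by grid search on held-out accuracy (an assumed procedure)."""
    grid = np.linspace(0.0, 1.0, 21) if grid is None else np.asarray(grid)
    accs = [
        np.mean(mix_predictions(p_learner, p_cache, w).argmax(axis=1) == labels)
        for w in grid
    ]
    best = int(np.argmax(accs))
    return float(grid[best]), float(accs[best])
```

Setting *w* = 1 recovers the pure learner and *w* = 0 a pure cache lookup; because only this one scalar is tuned, a coarse grid on a small held-out set should be enough.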

Suppose I have already used *k*% of the data to train my ResNet-32 model, and this achieves a generalization accuracy of *x*. The question is what to do with the rest of the data. I can use it to continue training the model, which leads to a 100%-learner; let’s say this achieves an accuracy of *y*. Alternatively, I can just memorize the remaining data by caching it (with the help of my partially trained ResNet-32 model), which leads to a *k*%-learner; let’s say this achieves an accuracy of *z*. Then, given that I have already used *k*% of the data for learning, the relative value of learning the remaining data over just memorizing it is defined by:

*relative_value_of_learning(k) = (y-x) / (z-x)*

that is, the improvement in accuracy achieved by a 100%-learner divided by the improvement in accuracy achieved by the *k*%-learner. A large value here indicates that learning is much more valuable than memorizing (*i.e.* it pays off to learn from the remaining data rather than just memorizing it) and a value of 1 would indicate that learning and memorizing are equally valuable. In the latter case, given that learning is usually computationally much more expensive than memorizing, we would probably be inclined to memorize rather than learn.
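As a quick worked example of this ratio, here is a small helper. The accuracies in the usage lines are made-up numbers chosen only to illustrate the arithmetic; they are not results from the experiment.

```python
def relative_value_of_learning(x, y, z):
    """Ratio of the accuracy gain from learning to the gain from memorizing.

    x: accuracy after learning from k% of the data
    y: accuracy of the 100%-learner (keeps learning from the rest)
    z: accuracy of the k%-learner (memorizes the rest in the cache)
    """
    gain_learning = y - x
    gain_memorizing = z - x
    if gain_memorizing == 0:
        raise ValueError("memorizing gave no improvement; the ratio is undefined")
    return gain_learning / gain_memorizing

# Hypothetical accuracies: x = 0.80, y = 0.90, z = 0.81.
# Learning gains 10 points, memorizing gains 1 point, so the ratio is roughly 10.
ratio = relative_value_of_learning(0.80, 0.90, 0.81)
```

Note that the ratio is only easy to interpret when memorizing actually helps (*z* > *x*); if the cache gives no gain or hurts accuracy, the measure is undefined or changes sign.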

The following figure shows the *relative_value_of_learning(k)* as a function of *k* for the CIFAR-10 benchmark.

So, by this measure, learning is ~10 times as valuable as memorizing in this task. There appears to be a decreasing trend in the value of learning as *k* becomes larger, but the data is a bit noisy (ideally, I should have run this simulation multiple times to get more reliable estimates).

Is this result surprising? It was surprising to me! I was expecting the relative value of learning to be smaller and the curve shown above to approach 1 much more quickly. After this exercise, I am a bit less skeptical of the growing literature on biologically plausible analogues of backpropagation. There is definitely a lot of value in learning good representations (much more value than I had initially thought).

Some caveats: this exercise is specific to a particular task and particular learning and memorizing models. The results might be different in different setups. Given that much of the effort in machine learning is directed toward coming up with better pure learning models (rather than better memory models), I expect that the relative value of learning estimated here is an overestimate, in the sense that one can improve the performance of memorizing models by using more sophisticated memory models than the simple key-value cache model assumed in this exercise.

Finally, an analysis like this should help us perform a cost-benefit analysis for learning vs. memorizing in both natural and artificial agents. Coming up with cost estimates is probably easier in artificial agents: for example, one can estimate the FLOPs (floating-point operations) involved in learning vs. memorizing a given amount of data, or one can include memory costs as well. Depending on our exact cost function, the optimal strategy would involve a specific mix, or a specific trajectory, of learning vs. memorizing during the lifetime of the agent.