### A simple cache model for image recognition

I just posted a new paper on arxiv titled “A Simple Cache Model for Image Recognition”. In this paper, I make a very basic observation: in image recognition tasks, the layers of a deep network close to the output layer contain independent, easily extractable class-relevant information that is not already contained in the output layer itself. I then propose to read out this extra information using a simple continuous key-value cache memory that is directly inspired by Grave et al. (2017), who add a similar continuous cache memory to recurrent sequence-to-sequence models.

The really nice thing about this model is that by properly setting only two hyper-parameters, it is possible to significantly improve the accuracy of a pre-trained model at test time. For example, here are some error rates on the CIFAR-10 and CIFAR-100 test sets:

where ResNet32 ($\lambda=0$) is the baseline model without a cache component, ResNet32-Cache3 is a model that linearly combines the predictions of the cache component and the baseline model (as you may have guessed, $\lambda$ represents the mixture weight of the cache component), and ResNet32-Cache3-CacheOnly ($\lambda=1$) is a model that uses the predictions of the cache component only.

I also found that a cache component substantially improves the robustness of baseline models against adversarial attacks. Here are the classification accuracies of the same three models above on adversarial images generated from the CIFAR-10 test set by four different attack methods applied to the baseline ResNet32 model:

Interestingly, the cache-only model is more robust against adversarial perturbations than the other models. It is intuitively easy to understand why cache models improve robustness against adversarial examples: they effectively extend the range in the input space over which the model behaves similarly to the way it behaves near training points and recent work has shown that neural networks behave more regularly near training points than elsewhere in the input space (for example, as measured by the Jacobian norm or the number of “linear regions” in the neighborhood of a given point).

Indeed, I confirmed this by looking at the Jacobian of the models at the test points:

Both cache models reduce the mean Jacobian norm over the baseline model (a). The two cache models, however, have somewhat different characteristics: the linear-combination model (ResNet32-Cache3) has the smallest first singular value but slightly increased lower-order singular values compared to the baseline model, whereas the cache-only model reduces the singular values (and the Jacobian norms per trial) more consistently. This pattern appears to be related to the differential generalization behavior of the two models: the linear-combination cache has the higher test accuracy, but the cache-only model is more robust against adversarial perturbations. So, the conjecture is that within-sample generalization performance is mostly determined by the first singular value (or the first few singular values), whereas the out-of-sample generalization performance, e.g. sensitivity of the models against adversarial perturbations, is also significantly affected by the lower-order singular values.