Deep learning is woefully sample inefficient compared to humans. Sample inefficiency is one of the most important challenges facing deep learning today, possibly even more so than the generalization issues, which might be resolved more or less automatically if we could successfully address the sample inefficiency problem. I’ve recently estimated that our current best self-supervised learning algorithms would need the equivalent of billions of years of human-like visual experience to reach human-level accuracy and robustness in visual object recognition. The situation appears to be similar in language: deep learning models seem to demand unrealistically large amounts of data to acquire at least some linguistic constructions. In this post, I’d like to share my thoughts on whether it’ll ever be possible to reach human-level sample efficiency with variations on current deep learning techniques (without any fundamental changes to their minimal inductive bias philosophy) and, if so, what it’ll take to achieve that. I’ll focus exclusively on the visual domain here, since it’s the domain I know better and, as mentioned above, the one I’ve already done some work on. Some of the points and claims I make below for the visual domain may generalize to language, but my sense is that achieving human-level sample efficiency in language may require fundamentally different methods.
First, to calibrate ourselves to how much quantitative improvement over current methods is needed to achieve human-level sample efficiency in visual object recognition, let me bring up the following figure from my paper (it comes from a more recent version of the paper that hasn’t been updated on arXiv yet):
The figure shows the amount of natural, human-like video data needed to achieve human-level accuracy (indicated by the red zone at the top) on ImageNet under different extrapolation functions (please see the paper for details), using one of the best self-supervised visual representation learning algorithms available today (namely, DINO). The developmentally relevant timescale of 10 years is marked by a vertical arrow. To achieve human-level sample efficiency, we need to be close to that red zone at the top by around the time of this mark. To do that, I estimate that we need to be close to the big black dot at the maximum amount of natural video data I used for this experiment (a few thousand hours of natural video). That’s roughly 30% higher than where we are right now in absolute terms (the rightmost red dot). So, it seems like we need a pretty big improvement! An improvement of comparable size (in fact, a slightly larger one) was achieved in self-supervised learning on ImageNet over the last couple of years, mainly through algorithmic advances. Can we achieve a similar improvement in self-supervised learning from human-like natural videos with relatively generic algorithms (and without introducing additional modalities)?
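For readers who want a concrete picture of what such an extrapolation looks like, here is a minimal sketch, assuming a simple log-linear fit of accuracy against hours of video. The functional form and all of the numbers below are illustrative placeholders, not the actual data points or extrapolation functions used in the paper:

```python
# Illustrative sketch only: placeholder numbers, one assumed functional form.
import numpy as np
from scipy.optimize import curve_fit

hours = np.array([10., 100., 1000., 5000.])   # hours of natural video (placeholders)
acc = np.array([0.05, 0.12, 0.20, 0.24])      # ImageNet top-1 accuracy (placeholders)

def log_linear(h, a, b):
    # accuracy assumed to grow linearly in log(hours)
    return a + b * np.log10(h)

(a, b), _ = curve_fit(log_linear, hours, acc)

# roughly 10 years of waking visual experience at ~12 hours/day
dev_hours = 10 * 365 * 12
print(f"extrapolated accuracy at {dev_hours} hours: {log_linear(dev_hours, a, b):.3f}")
```

The paper compares several such extrapolation functions (more and less optimistic ones); the point of the figure is simply where these curves sit relative to the red zone at the developmentally relevant timescale.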
My hunch is that we can. I predict that this can be accomplished through a combination of simple, relatively unexciting, but effective developments. I think that scaling and hyperparameter optimization, in particular, will be key to these developments. Let me now elaborate on these points.
First, scaling. The human retina has something like 6 million cones, densely packed in and around the fovea. By contrast, in computer vision we still typically work with relatively low-resolution images, like 224×224 or 256×256 pixels, which is roughly two orders of magnitude lower in resolution. Especially in more naturalistic, non-photographic images/frames, where the objects of interest can be small and are not necessarily centered in the image, low spatial resolution can severely limit the amount of useful information we can extract about the objects. So, we need to move toward bigger images, more like 2048×2048 pixels (4.2 MP), to have a spatial resolution comparable to that of the human retina. We know from empirical work that increasing the image resolution significantly improves recognition accuracy, especially when it is incorporated into a compound scaling scheme as in EfficientNets. For example, the following figure from the EfficientNet paper shows how much carefully tuned compound scaling can improve the performance of a model with a fixed number of parameters (the effect is likely to be bigger for models that are farther from ceiling performance):
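To make the compound scaling idea concrete, here is a small sketch of the rule from the EfficientNet paper: depth, width, and input resolution are scaled jointly by a single coefficient. The base coefficients (alpha, beta, gamma) are the ones reported in that paper; the base architecture values below are placeholders, not any particular model:

```python
def compound_scale(phi, base_depth=18, base_width=64, base_res=224,
                   alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution) scaled by compound coefficient phi.

    alpha * beta**2 * gamma**2 ~= 2, so each increment of phi roughly
    doubles the FLOPs of the scaled model.
    """
    depth = round(base_depth * alpha ** phi)
    width = round(base_width * beta ** phi)
    res = round(base_res * gamma ** phi)
    return depth, width, res

# Reaching retina-like ~2048x2048 inputs from a 224x224 base needs
# phi ~ log(2048/224) / log(1.15) ~ 16, i.e. a much bigger model overall.
for phi in (0, 4, 8, 16):
    print(phi, compound_scale(phi))
```

The relevant point for this post is that resolution is not scaled in isolation: pushing toward retina-scale inputs under this scheme also implies substantially deeper and wider models.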
I further suspect that some architectural improvements to our current models may be possible in the near term. As I have argued before, I find it very unlikely that, in such a short time, we have already hit the jackpot with the standard transformer architecture and found the architecture with optimal (or near-optimal) scaling properties (in both data and model size). Some of these improvements may come simply in the form of better hyperparameter choices. For instance, I suspect that the hyperparameter choices commonly used for the ViT architecture may be suboptimal for computer vision applications. As noted in the original ViT paper, these choices were directly borrowed from the BERT model for masked language modeling:
But there’s no reason to expect that the hyperparameter choices that were optimal for NLP (assuming they were optimal or near-optimal in the first place) would also be optimal for computer vision applications. For instance, since the visual world is arguably richer than language in terms of informational content, the embedding dimensionality may need to be correspondingly larger in ViTs (perhaps at the expense of depth), or it may need to be distributed differently across the model (e.g., lower-dimensional in early layers, higher-dimensional in later ones).
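To make this concrete, the first config below lists the ViT-Base hyperparameters, which mirror BERT-Base; the second is a purely hypothetical “vision-tuned” alternative of the kind gestured at above. Its values are made up for illustration and are not recommendations or tuned settings:

```python
# ViT-Base hyperparameters, borrowed from BERT-Base in the original ViT paper.
vit_base = dict(depth=12, embed_dim=768, mlp_dim=3072, num_heads=12)

# Hypothetical vision-tuned variant (illustrative only): shallower, but with
# wider embeddings overall, and width growing from early to late stages.
vision_tuned = dict(
    depth=8,
    embed_dims=[512, 768, 1024, 1536],
    num_heads=[8, 12, 16, 24],
    mlp_ratio=4,
)
```

Whether anything like the second configuration actually helps is an empirical question; the point is simply that the BERT-derived defaults occupy a tiny corner of a large hyperparameter space that has barely been explored for vision.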
More substantive improvements to the transformer architecture may also be possible. For example, I find models like the Edge Transformer, which incorporate a “third-order” attention mechanism, quite intriguing (I’ve been experimenting with a model like this myself recently, with pretty encouraging preliminary results). It’s important to note that these models incorporate, at best, very soft inductive biases and hence are still very generic models.
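For intuition, here is a toy sketch of what a “third-order” attention update over edge representations might look like. It is a simplified illustration in the spirit of the Edge Transformer’s triangular attention, not its exact formulation (the class name and all design choices here are my own):

```python
import torch
import torch.nn as nn

class ThirdOrderAttention(nn.Module):
    """Toy third-order attention over edge states x of shape [n, n, d].

    The update for edge (i, j) attends over intermediate nodes k,
    combining the edges (i, k) and (k, j).
    """
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v1 = nn.Linear(d, d)
        self.v2 = nn.Linear(d, d)
        self.scale = d ** -0.5

    def forward(self, x):                        # x: [n, n, d]
        q = self.q(x)                            # query for edge (i, j)
        k = self.k(x)                            # key for edge (i, k)
        logits = torch.einsum('ijd,ikd->ijk', q, k) * self.scale
        att = logits.softmax(dim=-1)             # attention over intermediate node k
        # value for (i, j, k) combines edges (i, k) and (k, j) elementwise
        return torch.einsum('ijk,ikd,kjd->ijd', att, self.v1(x), self.v2(x))

x = torch.randn(16, 16, 64)                      # 16 nodes, 64-dim edge states
out = ThirdOrderAttention(64)(x)                 # -> [16, 16, 64]
```

Note that the interaction is still entirely content-based and permutation-symmetric over nodes, which is what I mean by a very soft inductive bias: the model is given the capacity to compose pairwise relations, but nothing domain-specific is hard-wired.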
Finally, this is a bold prediction, but I do not expect major algorithmic improvements in the sample efficiency of the self-supervised learning algorithms themselves. My intuition is that, in terms of sample efficiency, algorithms like masked autoencoders (or generative models like Image GPT and VQGAN) are probably as good as any generic algorithm could hope to be: they essentially try to predict everything from everything else, so they might be expected to squeeze every bit of useful information out of a given image. On the other hand, better optimization of the hyperparameter choices in these algorithms could again lead to significant improvements, especially in a novel domain like natural, egocentric headcam videos, where the original hyperparameter choices made for static, photographic images may be suboptimal. For example, the crop sizes and their locations (instead of being chosen completely randomly) may need to be tuned differently for natural videos. Along these lines, I have recently seen it suggested that the random crops used in contrastive self-supervised learning algorithms like MoCo or SimCLR may need to be larger when these algorithms are applied to natural videos, or that they may benefit from a lightweight object detection model that keys in on regions of the image likely to contain objects (somewhat similar to foveation via eye movements in human vision). Similar considerations may apply to other self-supervised learning algorithms.
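As a small example of the kind of hyperparameter adjustment I have in mind, here is a sketch of the random-crop augmentation used in contrastive methods like SimCLR or MoCo, with the lower bound on the crop scale raised for egocentric video frames. The specific scale values are hypothetical, not tuned settings:

```python
from torchvision import transforms

# A common ImageNet-style setting: crops can cover as little as 8% of the image.
photo_crop = transforms.RandomResizedCrop(224, scale=(0.08, 1.0))

# Hypothetical adjustment for egocentric video frames: enforce larger crops
# so that small, off-center objects are less likely to be cropped out entirely.
video_crop = transforms.RandomResizedCrop(224, scale=(0.4, 1.0))

augment = transforms.Compose([
    video_crop,
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])
```

The object-detection-guided variant mentioned above would replace the purely random crop locations with boxes proposed by a lightweight detector, which is a more involved change but conceptually the same kind of domain-specific retuning.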
I’d like to revisit this post in about a year and see if the predictions I’ve made here will be borne out.
Update: After I wrote this post and re-read it a couple of times, I realized that its opening might be a bit misleading. Sample efficiency may depend on the distribution from which the “samples” are drawn, so it’s possible for an algorithm to be much more sample efficient with respect to a certain type of distribution, say photographic images as in ImageNet, and much less so with respect to a different type of distribution, say frames from natural, egocentric videos. Perhaps this is the case with our current self-supervised learning methods: they work quite well for static, photographic, ImageNet-like images, but not so well for frames from natural, egocentric videos. If this is really the case, it would make the sample inefficiency problem discussed at the opening of this post somewhat less dramatic and less significant from a practical point of view. These methods are probably not yet very close to the Bayes error rate on ImageNet, so they potentially still have quite a bit of room for improvement there, but they may already be quite good on ImageNet in terms of sample efficiency. In any case, it would obviously be highly desirable to have self-supervised learning algorithms that are sample efficient with respect to as wide a range of visual stimuli as possible, and maybe that’s what we should really mean by the “sample inefficiency problem of deep learning”.
Update (11/11/2022): Here is a recent paper on Atari demonstrating how one can drastically improve the sample efficiency of a reference model (here Agent57) with a few simple, “unexciting” tricks (along the lines suggested in this post for visual object recognition).