What will it take to achieve human-level sample efficiency in deep learning?

Deep learning is woefully sample inefficient compared to humans. Sample inefficiency is one of the most important challenges facing deep learning today, possibly even more so than the generalization issues, which might be resolved more or less automatically if we could successfully address the sample inefficiency problem. I’ve recently estimated that our current best self-supervised learning algorithms would need the equivalent of billions of years of human-like visual experience in order to reach human-level accuracy and robustness in visual object recognition. The situation appears to be similar in language: deep learning models seem to demand unrealistically large amounts of data to acquire at least some linguistic constructions. In this post, I’d like to share my thoughts on whether it’ll ever be possible to reach human-level sample efficiency with variations on current deep learning techniques (without any fundamental changes to the minimal inductive bias philosophy of the current techniques) and, if so, what it’ll take to achieve that. I’ll focus exclusively on the visual domain here, since this is the domain I know more about and, as mentioned above, I’ve already done some work on it. Some of the points and claims I’ll make below for the visual domain may generalize to language, but my sense is that achieving human-level sample efficiency may need fundamentally different methods in language.

First, to calibrate ourselves to the amount of quantitative improvement needed over current methods in order to achieve human-level sample efficiency in visual object recognition, let me bring up this figure from my paper (this figure is from a more recent version of the paper that hasn’t been updated on arxiv yet):

The figure shows the amount of natural human-like video data necessary to achieve human-level accuracy (indicated by the red zone at the top) on ImageNet under different extrapolation functions (please see the paper for details) and using one of the best self-supervised visual representation learning algorithms available today (namely, DINO). The developmentally relevant timescale of 10 years is marked by a vertical arrow. In order to achieve human-level sample efficiency, we need to be close to that red zone up top around the time of this mark. To do that, I estimate that we need to be close to the big black dot at the maximum amount of natural video data I used for this experiment (that is, a few thousand hours of natural video). That’s roughly 30% higher than where we are right now in absolute numbers (the rightmost red dot). So, it seems like we need a pretty big improvement! An improvement comparable in size (in fact a slightly larger improvement) was achieved over the last couple of years in self-supervised learning on ImageNet, mainly through algorithmic advances. Can we achieve a similar improvement in self-supervised learning from human-like natural videos with relatively generic algorithms (and without introducing additional modalities)?

My hunch is that we can. I predict that this can be accomplished through a combination of simple, relatively unexciting, but effective developments. I think that scaling and hyperparameter optimization, in particular, will be key to these developments. Let me now elaborate on these points.

First, scaling. The human retina has something like 6M cones, densely packed in and around the fovea. By contrast, in computer vision, we still typically work with relatively low resolution images, like 224×224 or 256×256 pixels, which is roughly 2 orders of magnitude lower in resolution. Especially in more naturalistic, non-photographic images/frames, where the objects of interest can be small and are not necessarily centered on the image, low spatial resolution can severely limit the amount of useful information we can extract about the objects from the image. So, we need to move toward bigger images that are more like 2048×2048 pixels in size (4.2MP) to have a spatial resolution comparable to the human retina. We know from empirical work that increasing the image resolution significantly improves recognition accuracy, especially when incorporated into a compound scaling scheme as in EfficientNets. For example, the following figure from the EfficientNet paper shows how much one can improve the performance of a model with a fixed number of parameters with carefully tuned compound scaling (the effect is likely to be bigger for models that are farther away from ceiling performance):
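To make the compound scaling idea concrete, here is a minimal sketch using the α, β, γ coefficients reported in the EfficientNet paper; the base depth, width, and resolution below are illustrative placeholders rather than the values of any particular model:

```python
# EfficientNet-style compound scaling. alpha, beta, gamma are the coefficients
# reported in the EfficientNet paper; they satisfy alpha * beta**2 * gamma**2 ~ 2,
# so each increment of phi roughly doubles the FLOP count.
alpha, beta, gamma = 1.2, 1.1, 1.15  # depth, width, resolution multipliers

def compound_scale(base_depth, base_width, base_resolution, phi):
    """Scale depth, width, and input resolution jointly by the compound
    coefficient phi, instead of scaling any one dimension in isolation."""
    return (round(base_depth * alpha**phi),
            round(base_width * beta**phi),
            round(base_resolution * gamma**phi))

# Scaling a hypothetical base model (18 layers, width 64, 224x224 input):
for phi in range(0, 7, 2):
    print(phi, compound_scale(18, 64, 224, phi))
```

Note how resolution grows alongside depth and width: under this scheme, reaching retina-like resolutions is not a separate intervention but part of a balanced scaling recipe.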

I further suspect that some architectural improvements to our current models may be possible in the near term. As I have argued before, I find it very unlikely that with the standard transformer architecture we have already hit the jackpot and found the architecture with the optimal (or near-optimal) scaling properties (in both data and model size) in such a short time. Some of these improvements may come simply from better hyperparameter choices. For instance, I suspect that the hyperparameter choices commonly used for the ViT architecture may be suboptimal for computer vision applications. As noted in the original ViT paper, these choices were actually directly borrowed from the BERT model for masked language modeling:

But, there’s no reason to expect that the hyperparameter choices that were optimal for NLP (assuming they were optimal or near-optimal in the first place) would also be optimal for computer vision applications. For instance, since the visual world is arguably richer than language in terms of informational content, the embedding dimensionality may need to be correspondingly larger in ViTs (perhaps at the expense of depth) or it may need to be distributed differently across the model (e.g. lower dimensional in early layers, higher dimensional in later ones).

More substantive improvements to the transformer architecture may also be possible. For example, I find models like the edge transformers that incorporate a “third-order” attention mechanism quite intriguing (I’ve been experimenting with a model like this myself recently, with pretty encouraging preliminary results). It’s important to note that these models incorporate, at best, very soft inductive biases and hence are still very generic models.

Finally, this is a bold prediction, but I do not expect major algorithmic improvements in the sample efficiency of self-supervised learning algorithms themselves. My intuition suggests that in terms of sample efficiency, algorithms like masked autoencoders (or generative models like Image GPT and VQGAN) are probably as good as any generic algorithm could hope to be, because these algorithms essentially try to predict everything from everything else, hence they might be expected to squeeze every bit of useful information from a given image. On the other hand, better optimization of the hyperparameter choices in these algorithms could again lead to significant improvements, especially in a novel domain like natural, egocentric, headcam videos where the original hyperparameter choices made for static photographic images may be suboptimal. For example, the crop sizes and their locations (instead of being chosen completely randomly) may need to be tuned differently for natural videos. Along these lines, I have recently seen it suggested that the random crops used in contrastive self-supervised learning algorithms like MoCo or SimCLR may need to be larger in size when these algorithms are applied to natural videos or that they may benefit from a lightweight object detection model that keys in on regions in the image that are likely to contain objects (somewhat similar to foveation via eye movements in human vision). Similar considerations may apply to other self-supervised learning algorithms.
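As a concrete illustration of the kind of retuning I have in mind, here is a minimal pure-Python sketch of the crop-parameter sampling behind the random resized crops used by algorithms like SimCLR and MoCo. The larger minimum crop scale below is my own illustrative guess at what natural video frames might need (versus the (0.08, 1.0) range commonly used for ImageNet), not a validated setting:

```python
import random

def sample_crop(img_w, img_h, scale=(0.4, 1.0), ratio=(3/4, 4/3), rng=random):
    """Sample a random crop box (x, y, w, h), in the style of the random
    resized crops used in contrastive self-supervised learning. The default
    scale range here is deliberately larger than the usual ImageNet setting,
    as a hypothetical adjustment for egocentric video frames."""
    area = img_w * img_h
    for _ in range(10):
        target_area = rng.uniform(*scale) * area
        aspect = rng.uniform(*ratio)
        w = int(round((target_area * aspect) ** 0.5))
        h = int(round((target_area / aspect) ** 0.5))
        if w <= img_w and h <= img_h:
            x = rng.randint(0, img_w - w)
            y = rng.randint(0, img_h - h)
            return x, y, w, h
    return 0, 0, img_w, img_h  # fall back to the full frame

x, y, w, h = sample_crop(640, 480)
print(x, y, w, h)
```

A crop-location prior biased toward object-containing regions (the foveation analogy above) would slot in naturally here, replacing the uniform `rng.randint` sampling of (x, y).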

I’d like to revisit this post in about a year and see if the predictions I’ve made here will be borne out.

Update: After I wrote this post and re-read it a couple of times, I realized that the opening to this post might be a bit misleading. Sample efficiency may depend on the distribution from which the “samples” are drawn, so it’s possible for an algorithm to be much more sample efficient with respect to a certain type of distribution, say photographic images as in ImageNet, and much less so with respect to a different type of distribution, say frames from natural, egocentric videos. Perhaps, this is the case with our current self-supervised learning methods: they work quite well for static, photographic, ImageNet-like images, but not so well for frames from natural, egocentric videos. If this is really the case, it would make the problem of sample inefficiency discussed at the opening of this post somewhat less dramatic and less significant from a practical point of view. These methods are probably not yet very close to the Bayes error rate on ImageNet, so they potentially still have quite a bit of room for improvement even on ImageNet, but they may already be quite good (in terms of sample efficiency) on ImageNet. In any case, it would obviously be highly desirable to have self-supervised learning algorithms that are sample efficient with respect to as wide a range of visual stimuli as possible and maybe that’s what we should really mean by the “sample inefficiency problem of deep learning”.

Update (11/11/2022): Here is a recent paper on Atari demonstrating how one can drastically improve the sample efficiency of a reference model (here Agent57) with a few simple, “unexciting” tricks (along the lines suggested in this post for visual object recognition).

The value of incremental, cumulative improvements is underestimated in AI/ML

Many people working in AI/ML have a mental model of AI progress in which surpassing human-level performance in most practically important real-world tasks will require several new big ideas. People, for instance, often talk about human-level AI (whatever that means) being several “transformer” level breakthroughs away. This way of thinking about progress seems to assume a “heroic inventor” model of innovation: i.e. there are only a handful of very big ideas out there that will prove to be crucial in the long run and everybody tries to be one of those handful of heroic inventors who will discover at least one of those really important ideas (the annoying proliferation of the “All you need is X” titles in AI/ML papers points to this being quite a common view at least implicitly held by many practitioners).

But what if this view of AI progress is fundamentally misguided and mistaken? What if reaching human-level AI (whatever that means exactly) —or any other important benchmark for that matter— requires not a handful of very big ideas, but a million (maybe more) very small ideas instead, a million incremental improvements? A marginal revolution of sorts in AI/ML! Indeed, examples of innovation and progress we’re familiar with from other domains strongly suggest that the incremental, cumulative model might be a much more realistic model of progress than “the heroic inventor” model with its small number of big, qualitative jumps:

1) For example, this is how biological evolution almost always comes up with its innovations: even very complex organs like camera eyes very likely evolved through many many small, incremental improvements over time and not through a small number of big breakthroughs.

2) Ironically, optimization of neural networks (and other complex systems) also works most successfully in this way: we optimize these models through local search, i.e. through gradient descent, by taking many many small steps, each of which improves the model only a tiny bit.

3) Similarly, if you take a look at any book on the history of technology or culture (e.g. George Basalla’s The Evolution of Technology, Henry Petroski’s The Evolution of Useful Things, Brian Arthur’s The Nature of Technology, or Matt Ridley’s excellent book How Innovation Works), one of the main messages it is most likely to hammer home is that “the heroic inventor” is almost always a myth and that technological progress almost always happens very gradually and cumulatively instead, by combining existing ideas and/or refining them and elaborating on them over many iterations.

The following passages from Ridley’s book are representative in this respect (from p. 28 and p. 35 of the book, respectively; Chapter 8 of Ridley’s book contains two entire sections titled “innovation is gradual” and “innovation is recombinant”):

Or consider this passage from another book I’ve been reading recently, Kevin Laland’s thought-provoking book Darwin’s Unfinished Symphony, where the author discusses a computational model of the emergence and growth of cumulative culture (p. 172):

It’s surprising to me that there are very few works in AI/ML these days trying to do this kind of integrative work, combining and consolidating very many incremental improvements to achieve bigger improvements. The new ML startup MosaicML (with their main project Composer) seems to explicitly pursue a goal like this (kudos to them!). Another example that comes to my mind is a paper from a group at DeepMind that came out a few years ago combining several then newly proposed ideas to improve the training and generalization efficiency of model-free reinforcement learning. But it’s hard to think of many more examples of this kind of integrative work and I think there should be a lot more of it: at least a couple of high-profile papers like this every year, combining and integrating the most promising ideas proposed that year to see how far one can push the state of the art in a given domain. A back-of-the-envelope calculation suggests that if there are 100 such ideas every year each improving performance in a task or a domain by a small amount, say by 0.1% independently on average, cumulatively they may add up to something much bigger, like 10% (and even supposing that I overestimated here both the impact of each small idea and the number of such ideas in a given year by a factor of two, which is quite possible, the cumulative improvements could still add up to a significant 2-3% each year, simply by combining ideas that have already been proposed by others that year, a non-negligible —and basically free— cumulative improvement that would be foolish to pass up).
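The back-of-the-envelope calculation above can be written out explicitly; the improvement percentages are of course just illustrative assumptions:

```python
# If each of n_ideas small ideas independently improves performance by pct,
# the compound improvement is (1 + pct)**n_ideas - 1.
def cumulative_gain(n_ideas, pct):
    return (1 + pct) ** n_ideas - 1

print(f"{cumulative_gain(100, 0.001):.1%}")   # 100 ideas at 0.1% each -> ~10.5%
print(f"{cumulative_gain(50, 0.0005):.1%}")   # half as many, half as big -> ~2.5%
```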

Of course, people need to make sure the new ideas they propose do lead to real improvements in performance as they claim (albeit small) by running proper experiments (for example, with multiple runs and with proper baselines and hyper-parameter optimizations). They also need to make it extremely easy for others to use and build upon their idea in terms of implementation and I think well-designed, easy-to-use, common frameworks like Composer might be ideal for this purpose.

Thoughts on the new scaling laws for large language models

I recently had a chance to read the new scaling laws paper from DeepMind in detail and wanted to share a few quick thoughts about it (here is another well-written piece on the new scaling laws, summarizing the main points of the paper and the implications of these new results). Briefly, the paper finds that the original scaling laws paper by Kaplan et al. significantly overestimated the optimal model size (and conversely significantly underestimated the optimal number of training tokens) for a given amount of compute (given number of FLOPs).

The following example is taken from the new scaling laws paper: suppose you decide to increase your compute budget 10-fold. The old scaling laws would tell you the optimal thing to do (in terms of final pretraining validation loss) is to increase your model size 5.5-fold and the number of training tokens 1.8-fold (so you should spend most of your budget on increasing the model size, as opposed to increasing the number of training tokens). The new scaling laws, on the other hand, say that you should increase the model size roughly 3.2-fold and the number of training tokens also roughly 3.2-fold (i.e. roughly in equal proportions). The origin of this discrepancy seems to be mainly related to hyperparameter optimization: the original scaling laws paper doesn’t tune the learning rate schedule separately for individual simulations and it uses a fixed number of training tokens (or iterations) for all simulations, which, it turns out, leads to underestimating the performance of the smaller size models in these scaling experiments.
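The two allocation rules can be summarized with approximate exponents implied by the multipliers quoted above: if optimal model size scales as N_opt ∝ C^a and optimal tokens as D_opt ∝ C^(1−a), then the 5.5x/1.8x split corresponds to a ≈ 0.73 and the equal split to a ≈ 0.5 (these exponents are my rough reconstruction, not exact values from either paper):

```python
# If N_opt scales as C**a and D_opt as C**(1 - a) (since C ~ 6 * N * D), then
# a 10x compute increase multiplies model size by 10**a and tokens by 10**(1-a).
def allocation(compute_factor, a):
    return compute_factor ** a, compute_factor ** (1 - a)

kaplan = allocation(10, 0.73)     # a ~ 0.73 roughly reproduces the 5.5x / 1.8x split
chinchilla = allocation(10, 0.5)  # a ~ 0.5 gives the equal ~3.2x / ~3.2x split
print(f"Kaplan:     model x{kaplan[0]:.1f}, tokens x{kaplan[1]:.1f}")
print(f"Chinchilla: model x{chinchilla[0]:.1f}, tokens x{chinchilla[1]:.1f}")
```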

Now, here are my quick thoughts on these results:

1) First of all, I just want to note that this was completely predictable from the GPT-3 paper. I wrote a blog post about it around that time, pointing out that their smaller models seemed to be more compute efficient than the largest 175B parameter model (other people also pointed out the same thing); it was pretty clear that they just hadn’t trained those smaller models for long enough. In fact that same figure discussed in my blog post suggests that even the new scaling laws paper might be overestimating the optimal model size for a given number of FLOPs (more on this below).

2) The new scaling laws paper hints at the possibility that the scaling law governing compute vs. optimal model size might not even be a power law, it might be a more slowly growing function. This is based on the observation that there’s possibly a bend in the scaling curve at the largest end of the range of FLOP counts tested in this paper (see below). This is potentially more bad news for big models.

FLOPs vs. optimal model size might grow more slowly than a power law.

3) This paper tunes the cosine cycle length in the learning rate schedule separately for individual runs of the scaling experiment (the individual dots above), or more precisely, based on the number of training tokens used in each run, which appears to be critical in improving the performance of the smaller size models. But the paper still doesn’t do a more complete hyperparameter search over other potentially important hyperparameters in these individual runs: for example, the maximum learning rate, or the choice of the optimizer (e.g. AdamW vs. Adam), which might actually be an important choice, as they point out elsewhere in the paper (footnote 8):

AdamW vs. Adam choice turns out to be an important choice.

and even architectural choices like how to allocate the extra parameters within the model: for example, maybe using the extra parameters for widening the model is better for smaller models but increasing the depth instead is better for larger models (or vice versa), etc. This suggests that a more completely optimized set of experiments might potentially yield qualitatively different results. It’s again possible that the smaller models might do even better when their hyperparameters are more thoroughly optimized, thus reducing the optimal model size for a given number of FLOPs even further.

4) Even if the trend uncovered in this paper (or the one in the original scaling laws paper for that matter) were perfectly accurate, the difference in final validation loss between the optimal size model and, say, a 10x smaller model might be too small to be of practical significance. I’m not really going to care about a 0.01 difference in my final validation loss, if it means I need to design a whole new hardware architecture, a brand new hardware parallelism method, or a brand new interconnect technology in order to increase my model size 10x. It’s just not worth it. Compute-optimal doesn’t mean effort-optimal! And basically this seems to be what is happening in a lot of these scaling experiments. Look at these (incomplete) isoFLOP curves below from the new scaling laws paper and see how flat they are over a wide range of model sizes:

I would happily choose the smallest model size inside the highlighted rectangular box instead of going for a slightly better, but 5x bigger model.
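To put a rough number on how flat these isoFLOP curves are, here is a quick sketch using the parametric loss fit reported in the new scaling laws paper; the constants below are the paper’s fitted values, but treat this as a back-of-the-envelope exercise, not a precise prediction:

```python
# Parametric loss fit from the new scaling laws (Chinchilla) paper:
# L(N, D) = E + A / N**alpha + B / D**beta, with compute C ~ 6 * N * D.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

C = 6 * 70e9 * 1.4e12           # roughly Chinchilla's compute budget
for n in [70e9, 7e9]:           # optimal-ish size vs. a 10x smaller model
    d = C / (6 * n)             # spend the same compute on more tokens
    print(f"N = {n:.0e}: loss = {loss(n, d):.3f}")
```

Under this fit, the 10x smaller model trained on 10x more tokens lands within a few hundredths of a nat of the compute-optimal model, which is exactly the flatness visible in the isoFLOP curves.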

5) Given how much seems to hinge on the results of these scaling experiments (e.g. the difference between having to develop novel hardware tools to train and deploy models or not), I think there’s an urgent need to do these important experiments even more carefully than the new scaling laws paper does, for example, by running even more thorough hyperparameter searches per run and perhaps also testing up to slightly larger FLOPs.

6) My hunch is that we will soon find out that even a 70B parameter model (called Chinchilla in the new scaling laws paper) is still too big for the amount of compute used for that model; my guess is that something like a ~10B parameter model will turn out to be roughly equivalent to this model (in terms of the final loss and downstream capabilities) if trained for ~7x longer. And, in hindsight, everyone will remember this episode in history as a very funny memory (“remember that time when a bunch of people got carried away and trained a 175B parameter model using bespoke hardware, when a 10B parameter model would do just fine, and then everybody tried to one-up them; those were pretty crazy times!”).

7) Be very skeptical of the model size scaling experiments you see reported in machine learning papers these days (especially if they sound magical!). Just like the original scaling laws paper, these papers usually don’t perform independent hyperparameter searches for different model sizes and also don’t control for compute (need to do more training iterations with a smaller model) and this likely leads to an underestimation of the performance and the capabilities of the smaller models reported in these papers.

Emergent problems with “emergent” capabilities

If I have two variables x and y that are linearly related, say y=x for the sake of simplicity, they look like this if I plot both of them on a linear scale:

If I now plot the x axis on a logarithmic scale on the other hand (\texttt{semilogx} in matplotlib), they look like this:

It looks exponential! It is exponential on this scale! Now instead of drawing a continuous curve, if I sample a bunch of discrete points along the x axis and only plot those (with their corresponding y values), they now look like this:

All of a sudden, it looks like something truly magical and miraculous happens in y, some special y quality (“your royal yness”) “emerges” when we cross a magical x value. But it’s all an illusion! Nothing of the sort happens. This is just an artifact of the way we’re plotting these variables. The underlying relation is still y=x, the epitome, the pinnacle, the very essence of non-emergence, boringness, and banality (if I may): you get what you give.
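The illusion is easy to reproduce numerically: sample y = x at log-spaced x values, as a semilogx plot effectively does, and the final sampled step dwarfs everything before it:

```python
import numpy as np

# y = x, sampled at log-spaced x values (as on a semilogx axis).
x = np.logspace(0, 6, 7)   # 1, 10, 100, ..., 1e6
y = x                      # the "boring" underlying relation

# Successive increments in y between the plotted points:
increments = np.diff(y)
print(increments)          # each jump is ~10x the previous one

# The final step accounts for ~90% of the total rise, which is what
# reads as an "emergent" jump at the last x value on the plot:
print(increments[-1] / (y[-1] - y[0]))
```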

Why am I feeling the need to write these clearly obvious and obviously clear facts? I’ve seen a couple of deep learning papers recently (e.g. this paper and this paper) reporting “emergent capabilities” as some seemingly magical model size threshold is passed: so, here x would be model size and y would be performance in some downstream task. But unfortunately these claims do not seem to take into account the simple plotting artifact described above.

What should they have done? What should be done instead? I would suggest the following: please just fit some simple functions to the (x, y) data that you have, tell us which ones you tried and which one fit the data best: Is it linear? Is it logarithmic? Is it some low degree polynomial? Is it exponential (highly unlikely)? Can you even distinguish between these choices given the (limited and noisy) data you have? Please show us that you can! Admittedly, this doesn’t sound as seductive or mystical as claiming “emergent capabilities”, but it’s much more descriptive and informative.
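Here is a minimal sketch of the kind of model comparison I have in mind, on synthetic data that is secretly linear. One caveat: nested models (e.g. quadratic vs. linear) always fit at least as well, so a fair comparison should also penalize complexity with something like AIC or BIC rather than looking at raw residuals alone:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 1000, 50)
y = 2.0 * x + rng.normal(0, 1.0, x.size)   # secretly just linear data

# Candidate functional forms, each fit by least squares.
designs = {
    "linear":      np.column_stack([x, np.ones_like(x)]),
    "logarithmic": np.column_stack([np.log(x), np.ones_like(x)]),
    "quadratic":   np.column_stack([x**2, x, np.ones_like(x)]),
}
rss = {}
for name, A in designs.items():
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss[name] = float(np.sum((A @ coef - y) ** 2))
    print(f"{name:12s} RSS = {rss[name]:.3g}")
```

Even this crude comparison immediately rules out a logarithmic relation here, which is more than most “emergence” plots tell us.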

I don’t deny that there may be “emergent” (or “abrupt”) phenomena in the sense that these papers intend, for example, if the underlying relation between x and y were a high degree power function or an exponential, then one could perhaps make a plausible case for “emergent” phenomena, provided, of course, one makes it mathematically clear what exactly one means by “emergent” and why that definition is justified: e.g. is quadratic good enough for “emergence” or do we need at least cubic or do we need an exponential for true “emergence” (which would show up as a double exponential in a \texttt{semilogx} plot)? Why or why not? Unfortunately, I think these papers fall short in this regard. I’m as impressed as the next person by what these new generation large deep learning models can seemingly do, but I sometimes fear that their unexpected success might be starting to make some believe in magic.

Update: Another problem with most of these model size scaling experiments is that they usually don’t optimize the hyperparameters of different sized models separately and also don’t control for the amount of compute (i.e. one needs to do more training iterations with a smaller model), which likely causes an underestimation of the pretraining performance and the downstream capabilities of the smaller sized models, as revealed by the new scaling laws paper and as discussed further in this post.

A simple plausibility argument for why scaling is probably not enough

In the original scaling laws paper by Kaplan et al., there is a set of experiments comparing the scaling behavior of transformers with that of LSTMs. The results of these experiments are summarized in Figure 7 in the paper (reproduced below). This figure shows that transformers consistently outperform LSTMs for a given number of parameters, but more importantly they also display much better scaling behavior than LSTMs (i.e. better asymptotic performance, as indicated by a steeper slope). This means that architecture can affect scaling behavior a great deal, although the difference between architectures needs to be large enough for the architectural choice to make a material difference: the same section also reports another set of experiments comparing the scaling behavior of transformers with that of universal transformers (a variation on the original transformer architecture), and the difference there is marginal at best.

Transformers display much better scaling than LSTMs (from Kaplan et al., 2020).

My plausibility argument is then simply that it’s a priori very unlikely that we’ve hit upon the architecture with the optimal scaling behavior after only a few years of serious effort by the deep learning community (the original transformer paper came out a mere five years after the AlexNet paper, the year deep learning research seriously took off). Rather, it seems a priori much more plausible that there are many more significant architectural/algorithmic innovations waiting to be discovered that will further improve the scaling behavior of deep learning models. I do think, however, that these innovations would need to target very general information processing needs (such as integrating information from larger contexts, integrating within-context information more effectively and efficiently, dealing with vanishing gradients, etc.) rather than trying to build in domain-specific priors reflecting “how we think we think”, which never really works in the long run, as I have argued before.

Update: Here is an interesting article I found that tries to estimate the rate of algorithmic progress over several decades relative to Moore’s law (rate of improvement in hardware over time) for a wide range of computational problems. The authors conclude: “Overall, we find that algorithmic progress for the median algorithm family increased substantially but by less than Moore’s law for moderate-sized problems and by more than Moore’s law for big data problems.” Obviously, computational problems in deep learning are much more likely to belong to the latter category, hinting at the relative importance of algorithmic improvements for such problems. Here is a related blog post by OpenAI from a few years ago, again trying to quantify algorithmic improvements in ImageNet models since AlexNet (spanning roughly a decade of research). The authors similarly conclude: “Our results suggest that for AI tasks with high levels of recent investment, algorithmic progress has yielded more gains than classical hardware efficiency.” It may seem like we’ve been stuck with the basic transformer architecture for quite a while now, but I do strongly believe (and the data just cited back up my belief) that significant algorithmic improvements over this basic transformer architecture will come at some point, it’s just that it’s hard to predict when exactly this will happen. It seems that right now people are more interested in scaling-up than in algorithmic improvements (pictorially, this corresponds to moving along one of the straight lines in the log-log scaling plot above, instead of trying to descend to a qualitatively better line in the same plot); this seems to be because at the moment there is likely a bigger bang for the buck for efforts invested in scaling-up, but I think this will change as we start to get diminishing returns from this approach.

Update 2: It could be argued that for practically important computational problems we might care about, scaling could get us to super human-level performance even with sub-optimal algorithms. This is certainly true. A good example of this would be AlphaGo vs. its later iterations like AlphaGo Zero or AlphaZero. Even though these later versions were algorithmically superior to AlphaGo, at large enough scales, AlphaGo itself was already good enough to achieve super human-level performance at playing Go. However, it should be kept in mind that asymptotics always wins in the long run, so algorithmic improvements are not to be left on the table lightly. It also seems plausible to suggest that at large enough scales, significant algorithmic improvements often lead to large jumps and hence surprising, qualitative improvements in model capabilities and to the emergence of completely novel capabilities, which again suggests that new algorithms might be necessary for certain capabilities.
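The “asymptotics always wins” point can be made concrete with two hypothetical power-law scaling curves: an algorithm with a worse constant but a steeper slope loses at small scales and wins at large ones (the constants and exponents below are made up for illustration):

```python
# Two hypothetical scaling curves of the form loss = a * compute**(-s).
def power_law_loss(compute, a, s):
    return a * compute ** (-s)

def old_algo(c):  # better constant, shallower slope
    return power_law_loss(c, a=1.0, s=0.05)

def new_algo(c):  # worse constant, steeper slope
    return power_law_loss(c, a=2.0, s=0.10)

# Crossover where 1.0 * c**-0.05 = 2.0 * c**-0.10, i.e. c = 2**20 ~ 1e6:
for c in [1e2, 1e6, 1e10]:
    print(f"compute {c:.0e}: old {old_algo(c):.3f}, new {new_algo(c):.3f}")
```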

Neural networks are actually not very good at memorizing

It’s often claimed that neural networks are very good at memorizing information. In a certain sense, this is definitely true: if you train a sufficiently large capacity neural network for long enough, it will happily memorize more or less anything you give to it. But in another important sense, this claim is not true: the catch here is that you have to train it for long enough. Even when the data comes from a highly structured domain (e.g. images or text), it will typically take many passes over it for the network to fully incorporate it into its parameters. Fundamentally, this seems to be because the neural network loss function we need to optimize in order to incorporate some data into the parameters of the model is usually a very complicated object and the only way we know how to optimize it is through local search, i.e. gradient descent, so we have to do it incrementally by taking many small steps, which means that we have to see the same data many times.
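As a toy illustration of the point above, even a heavily overparameterized linear model “memorizing” arbitrary targets by gradient descent needs many small steps to drive the training loss toward zero; the sizes and learning rate here are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # far more parameters than data points:
X = rng.normal(size=(n, d))           # plenty of capacity to memorize
y = rng.normal(size=n)                # arbitrary targets: pure "memorization"
w = np.zeros(d)

lr = 0.01
losses = []
for step in range(500):
    grad = X.T @ (X @ w - y) / n      # gradient of (half) mean squared error
    w -= lr * grad
    losses.append(float(np.mean((X @ w - y) ** 2)))

print(f"loss after 1 step: {losses[0]:.3f}, after 500 steps: {losses[-1]:.2e}")
```

The model has more than enough capacity to store the targets exactly, yet a single gradient step barely dents the loss; only hundreds of passes get it near zero, whereas a lookup-table style external memory would store the same data in one shot.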

Humans, on the other hand, can at least sometimes incorporate new information very fast, basically after a single exposure. There are classic experiments in psychology, for example, demonstrating that humans can sequentially go through thousands of pictures, looking at each picture once for a few seconds only and recognize them hours to days later with very high accuracy (Shepard, 1967; Standing, 1973; Brady et al., 2008). A lot of the semantic knowledge we have (e.g. factual knowledge) also seems to be of this nature: acquired single-shot, maybe after reading a book or learning it from a friend, and retrieved and used as necessary later on.

Geoff Hinton, in an interview, once expressed this fundamental difference between humans and our current generation of neural networks quite nicely: “The brain is solving a very different problem from most of our neural nets … I think the brain isn’t concerned with squeezing a lot of knowledge into a few connections, it’s concerned with extracting knowledge quickly using lots of connections.”

I’ve recently wondered how current deep learning models (learning new information in a standard way, i.e. via gradient descent) would fare in a rigorous, head-to-head, quantitative comparison with humans in such fast-learning tasks. Are they not quite as good as humans yet, but pretty darn close, or are they simply still leagues behind humans in this respect? To investigate this, I subjected Image GPT (iGPT) models to the same recognition memory experiment that humans did in Brady et al. (2008). I wrote up the full results in this preprint that I posted on arxiv a few weeks ago. The main result, summarized in the figure below, is that even the best iGPT model that I’ve tried needs something like ~10 exposures to the same study images in order to reach a recognition memory performance that humans achieve after only a single exposure:

Recognition memory accuracy in humans vs. different iGPT models as a function of the number of exposures to a set of 2500 study pictures depicting real-world objects (copied from Figure 2 in the paper).

Pretraining and bigger model sizes improve recognition memory performance, but these improvements are not noticeable after a single exposure (it usually takes at least a few exposures for them to become visible), so even in the best case the models are basically still at chance level after a single exposure. This makes me a bit skeptical that simply scaling up the pretraining data or model size would be a feasible strategy for reaching human-level recognition memory performance (an updated version of the paper will include a back-of-the-envelope calculation to drive home this point).
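To make the evaluation logic concrete: in a 2AFC recognition test like the one in Brady et al. (2008), a generative model can be scored by having it pick whichever of the two images (studied vs. novel foil) it assigns the lower loss, i.e. treats as more familiar. The sketch below simulates this decision rule with a made-up familiarity signal; it is an illustration of the test logic, not the paper's actual analysis.

```python
import numpy as np

rng = np.random.default_rng(2)

def twoafc_accuracy(n_trials, familiarity_gain, noise=1.0):
    """Accuracy of a model that, on each trial, picks whichever of the two
    images (one studied, one novel foil) has the lower loss.

    Studied images get their loss reduced by `familiarity_gain` (a made-up
    stand-in for the cumulative effect of study exposures); both losses
    carry independent trial-to-trial noise.
    """
    old_loss = rng.normal(0.0, noise, n_trials) - familiarity_gain
    foil_loss = rng.normal(0.0, noise, n_trials)
    return float(np.mean(old_loss < foil_loss))

# With no familiarity signal, 2AFC accuracy sits at chance (~50%);
# it climbs toward ceiling as the familiarity signal accumulates.
for n_exposures in [0, 1, 5, 10]:
    acc = twoafc_accuracy(100_000, familiarity_gain=0.2 * n_exposures)
    print(n_exposures, round(acc, 3))
```

The point of the simulation is that a model with a weak per-exposure familiarity signal stays near chance after one exposure and only separates from chance as exposures accumulate, which is qualitatively the pattern the iGPT models show.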

Many deep learning practitioners seem to be aware of this shortcoming of neural networks. There is an entire literature on extending neural networks with some sort of external memory to improve their fast-learning or memorization capability (among other benefits): e.g. Grave et al. (2016); Blundell et al. (2016); Pritzel et al. (2017); Orhan (2018); Khandelwal et al. (2019); Borgeaud et al. (2021); Wu et al. (2022); etc. The basic idea here is to off-load the task of fast learning or memorization onto the external memory, while the neural network focuses on learning the necessary computations on a slower time scale: a kind of separation of concerns (this idea is commonly known as the complementary learning systems hypothesis in psychology; it’s a bit of an open question to what extent this hypothesis actually holds in the brain). The recent RETRO paper from DeepMind explains this particular motivation behind these types of models quite well:

“The benefits of increasing the number of parameters come from two factors: additional computations at training and inference time, and increased memorization of the training data. In this work, we endeavor to decouple these, by exploring efficient means of augmenting language models with a massive-scale memory without significantly increasing computations.”

These models seem to work really well in practice, but their one significant (perhaps fatal) drawback is being a loser in the hardware lottery: they’re simply too cumbersome, impractical, and inefficient to implement with today’s hardware. The RETRO model, for instance, requires you to keep around (and constantly retrieve from) a datastore of ~100 TB (for their largest configuration). Since most deep learning data is stored externally (as opposed to, for example, streaming data, where you really have only a single opportunity to “see” the data), people instead usually don’t mind paying the one-time cost of training a much smaller neural network by doing multiple passes over the dataset (hence “slow” learning), obtaining a much more compressed representation of the data in the end (in the parameters of the model). Perhaps new generation wafer-scale chips will make models like RETRO more attractive from a hardware perspective, but I’m not sure they’ll be able to tip the balance any time soon in favor of such models over the more standard “slow-learning” models that practitioners today find so familiar and convenient.
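As a toy illustration of this separation of concerns, here is my own schematic sketch of one-shot storage plus nearest-neighbor readout (this is not RETRO's actual architecture, which retrieves text chunks using frozen BERT embeddings; the class and its interface here are entirely made up):

```python
import numpy as np

rng = np.random.default_rng(1)

class EpisodicMemory:
    """Toy external memory: store (key, value) pairs, retrieve by nearest key.

    Writing is a single append -- "memorization" after one exposure --
    while the encoder producing the keys can keep learning slowly.
    """
    def __init__(self, dim):
        self.keys = np.empty((0, dim))
        self.values = []

    def write(self, key, value):
        self.keys = np.vstack([self.keys, key])
        self.values.append(value)

    def read(self, query):
        # Cosine similarity of the query against all stored keys.
        sims = self.keys @ query / (
            np.linalg.norm(self.keys, axis=1) * np.linalg.norm(query) + 1e-9)
        return self.values[int(np.argmax(sims))]

# Pretend some slowly-learned encoder maps images to 32-d embeddings.
mem = EpisodicMemory(dim=32)
study_items = {f"picture_{i}": rng.standard_normal(32) for i in range(2500)}
for name, emb in study_items.items():
    mem.write(emb, name)  # one exposure per item is enough to store it

# At test time, a slightly noisy view of a studied picture still retrieves it.
probe = study_items["picture_42"] + 0.1 * rng.standard_normal(32)
print(mem.read(probe))  # retrieves the studied item, "picture_42"
```

The contrast with the previous section is the whole point: the memory component stores each of the 2500 items after a single write, no gradient steps required, while the slow-learning component (the encoder, elided here) is the part trained in the usual multi-pass way.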

A critique of “Why Greatness Cannot Be Planned”

I’m cross-posting this recent piece from my Substack here, since it’s relevant to machine learning.

I’ve recently read Kenneth Stanley and Joel Lehman’s thought-provoking book, Why Greatness Cannot be Planned, and wanted to share my thoughts about the book.

The book is an intriguing critique of objective-based thinking in science and technology and in various other aspects of our life, such as education and romance. I found a lot to sympathize with in the book, especially its strong emphasis on the importance of individual autonomy and the diversity of pursuits, i.e. letting people pursue their own unique interests, whatever they happen to find interesting and worth pursuing in life, encouraging them to be explorers and “treasure hunters”. Others can then benefit from their explorations; they can use their discoveries as “stepping stones” in their own explorations. As a libertarian, this is a philosophy of life that is personally very appealing to me.

That being said, I do believe the book’s main thesis about objectives is based on a misunderstanding (or a misdiagnosis), so it is likely incorrect in my view.

The main problem with objective-based thinking the book identifies is deception: for any ambitious goal, such as achieving human-level intelligence in AI or through biological evolution, the stepping stones to that ultimate goal are numerous and often quite dissimilar (and unrelated) to the goal. It doesn’t make much sense, for example, to try to maximize “intelligence” when you’re at the single-cell stage in the history of life on earth and hope to reach human-level intelligence at some point along the way. Instead, the stepping stones are often reached serendipitously while trying to do something completely unrelated to the ultimate ambitious goal: for example, vacuum tubes were essential in building the first computers, but they were originally invented for a completely different purpose. So, rather than explicitly trying to optimize for an ambitious objective (which may be many serendipitous stepping stones away), the authors instead recommend exploring ideas or innovations according to their novelty, interestingness, excitingness, or their potential to become a productive stepping stone, a launching pad for even newer, more exciting ideas. The hope is that we will have collected enough useful, serendipitous stepping stones along the way that at some point our ultimate ambitious objective (e.g. achieving human-level intelligence) will appear on the horizon (within striking distance), and at that point (and only then) will it make sense to directly optimize for that objective. The book’s main idea is thus a strong emphasis on exploration unhindered and unconstrained, as much as possible, by any considerations about achieving ambitious objectives or goals.

It’s a neat theory as far as it goes, but there are several issues with its main line of reasoning (in the following, I will focus mostly on reaching human-level intelligence either through AI or through biological evolution as my working example of an ambitious objective as this is the example I know most about):

(1) The authors make very strong assumptions about the nature of stepping stones and ambitious objectives without much concrete evidence. For example, is it really true that the stepping stones to an ambitious goal are always deceptive? Some recent examples from machine learning suggest that this may not be the case. Consider the highly capable machine translation, speech recognition, text-to-image generation, question answering, game playing, and protein folding prediction systems developed in recent years: they’re almost always trained with fairly standard models and training methods in one long optimization run that consistently reduces some highly generic loss function (or, equivalently, consistently improves some highly generic objective function) on a very large scale dataset. There’s really no deception along the optimization path, no point where the loss first has to increase before it can decrease. This suggests that such deception phenomena may not be as common in the optimization of ambitious objectives as the authors suggest (and super-human level Go playing, accurate protein folding prediction, and human-level machine translation are all undoubtedly very ambitious objectives).

(2) Related to the previous point, the authors also underestimate the ability of objective-based optimization to collect useful and interesting stepping stones. Again, consider models like GPT-3 or Facebook’s WMT multilingual machine translation model trained on very large scale datasets. These models collect many stepping stones along their optimization path to become highly capable language and machine translation models, respectively. Even in much simpler models, objective optimization can generate a step-by-step emergence of stepping stone capabilities, as demonstrated by Andrew Saxe’s work on the dynamics of learning in simple neural network models:

Copied from Figure 3 in Andrew Saxe’s paper on the dynamics of learning in deep neural networks.

It could be argued that these stepping stones are qualitatively similar to the end product: e.g. the model just picks up more and more linguistic capabilities along its optimization path. But this is just a consequence of the relatively narrow domains/objectives these models are trained on. There’s no reason to think that training a model in a much richer domain would not give rise to a similar emergence of diverse, qualitatively different stepping stone capabilities along its optimization path.

(3) Sometimes the seeming inability of objective optimization to get us to our most ambitious goals may simply be due to the choice of wrong objectives rather than an inherent shortcoming of objective-based thinking itself. This is nicely illustrated by the example given by the authors of trying to reach human-level intelligence from single-celled organisms through maximizing “intelligence”. The problem with this objective is that “intelligence” is an imprecise, vague, non-operational objective. Instead, we need to choose a more generic and actionable objective that can be applied to single-celled organisms and then try to reach human-level intelligence as a by-product of this optimization (rather than as the direct target of it). This is certainly how biological evolution achieved human-level intelligence: by optimizing fitness or reproductive success; human-level intelligence emerged as a by-product. A similar example given in the book is the example of pre-historic humans trying to build a computer. Of course, this doesn’t make sense because those people didn’t even have the concept of a computer, so it’s not an objective they could have acted upon. But if we instead chose a more generic and actionable objective that could be applied to pre-historic humans as well as to more modern humans, such as maximizing material outputs (i.e. something like GDP PPP), it’s conceivable that they would have invented computers at some point along the way as a by-product; and indeed something like this is roughly how we got computers in reality.

(4) Contrary to what I have claimed in my previous point, the authors argue, unconvincingly in my mind, that fitness in biological evolution is not an objective in the usual sense. For example, the authors argue that a fitness objective would require a “maximally fit” organism. But this is only true for a static fitness landscape; if the landscape changes, for example, as a result of environmental changes, there doesn’t necessarily have to be a “maximally fit” organism. Fitness is also not really different from novelty or interestingness (criteria favored by the authors) in this respect. The only thing needed for either an objective-based search or novelty search is a local gradient pointing in the direction of higher fitness or novelty in the current landscape (more/less fit or more/less novel). The authors correctly point out that for novelty search, whether x is more novel than y is not absolute, but rather depends on what the agent has already learned (the exploration history of the agent), but this is again true for fitness as an objective as well: whether x is more fit than y may depend on the current environment/ecosystem (the evolutionary history), so this is also not materially different between novelty search and fitness as an objective.

(5) This brings me to perhaps the most important objection I’d like to raise against the main thesis of the book: I think that the authors misdiagnose what makes biological evolution (and other forms of natural and human innovation) powerful. The key thing that makes biological evolution (and other mechanisms or processes of innovation) powerful is the richness of the world we live in, the existence of a huge number of parallel organisms/agents exploring, or searching, different parts/niches of this incredibly rich world, and the complicated network of interactions between these organisms/agents. There’s nothing wrong with simple generic objectives, like fitness or reproductive success or likelihood or reward (in machine learning or reinforcement learning), driving the exploration in such a rich environment. Conversely, there’s nothing magical about alternative criteria like novelty or interestingness driving the exploration. It’s rather the rich, complicated, dynamic environment we live in and the very many parallel, interacting searches going on in this environment that make creative and useful innovations possible.

There’s reason to believe that if the world were simpler, more stable, and more static, fitness maximization wouldn’t lead to such a high degree of diversity and innovation in biological evolution. In fact, this is the basic theme of Stephen Jay Gould’s famous punctuated equilibrium theory of evolution: long periods of stasis punctuated by sudden disruptions in the environment/ecosystem (e.g. a meteor impact), followed by rapid adaptation to the new conditions. This idea is circumstantially supported by the early history of life on earth, where the first couple of billion years of evolution took place in very harsh and relatively uniform environmental conditions and did not produce a lot of innovation in life forms compared to the amount of creativity and innovation that unfolded afterwards in much richer, more complex, and more favorable environmental conditions.

(6) As I mentioned earlier, the primary emphasis of the book is on free exploration unconstrained by objectives. But constraints on exploration (in one form or another) are absolutely essential to come up with anything useful or interesting. There’s one very informative hypothetical example in Chapter 10 of the book (devoted to natural evolution) that I’d like to discuss in this connection. The authors imagine a hypothetical (peaceful) world, called Gentle Earth, in which competition for survival or reproduction is not a constraint on evolution. The details are not fleshed out unfortunately, but in such a scenario, presumably any mutation, any imaginable life form would be viable, and as a result evolution would produce vastly more novel life forms than it has in the actual world (which might perhaps be called Cruel Earth). But Gentle Earth in the limit is just like Borges’ Library of Babel, where almost all books are uninteresting gibberish and only a vanishingly small proportion of books actually contain anything of interest or value to humans. So, constraints of one form or another are absolutely necessary to limit the endless possibilities to those that are actually productive, useful, or interesting. For example, depending on the details, physical/chemical limits on viability (some mutations won’t generate physically or chemically viable organisms) can provide one such set of constraints even in Gentle Earth. One can debate the relative strengths and weaknesses of different sets of constraints (e.g. interestingness vs. fitness), but at least some such set of constraints is essential.

To sum up, although I find a lot to admire in the book (e.g. its strong emphasis on the importance of individual exploration), I think Why Greatness Cannot Be Planned ultimately misdiagnoses what exactly is essential and what isn’t in artificial and natural mechanisms or processes that generate powerful and creative innovations and it overestimates the difference between objective-based search and novelty search as exploration mechanisms.

Catastrophic forgetting is yet another pet problem rendered obsolete by scale

For a while now, much of academic ML research has basically been a stubborn refusal to acknowledge the blindingly obvious, undeniable fact that scale renders most of the pet problems of this field obsolete: few-shot learning, compositionality, out-of-distribution generalization, “meta-learning”, disentanglement, etc. I wrote about these issues in several earlier posts before (e.g. this, this, and this). These so-called problems are simply artifacts of the small scales and the toy settings researchers choose to study, so researchers should just stop worrying about these non-problems already (and wasting their and other people’s energy and money) now that it’s clear they will disappear at larger scales and in more realistic settings. I was reminded of this once again after reading this beautiful anonymous ICLR submission that shows that catastrophic forgetting also belongs to this grisly junkyard. That’s right, catastrophic forgetting is not a real problem for large models trained on large, rich, realistic datasets. So, can people please stop writing pointless papers on this non-problem masquerading as a problem in meaningless toy settings now? Thank you.

How much “human-like” visual experience do current self-supervised learning algorithms need to achieve human-level object recognition?

I just posted a new paper to arxiv where I try to estimate the sample efficiency of the state-of-the-art self-supervised visual representation learning algorithms vis-a-vis humans in a complex, realistic visual object recognition task, namely ImageNet. I come up with an estimate suggesting that, compared to humans, these algorithms would need several orders of magnitude more “human-like”, natural video data in order to reach human-level performance on ImageNet. This is a very challenging estimation problem and my estimate thus comes with a lot of caveats (I discuss some of the main caveats in the paper), but it is the first serious, quantitative attempt to address this important question that I know of.

Ditching academic research in AI/ML

The news of the existence of at least one collusion ring in the AI/ML conference peer-review system has made some waves recently (here and here are two recent reddit threads on this topic). What would be the most meaningful response to this kind of explicit fraud in the system? In this post, I’d like to express some possibly unpopular and uncomfortable opinions (which is something I like to do in general apparently :)) and toy with some radical ideas/suggestions for improving the overall AI/ML research ecosystem.

First of all, it’s important to realize that people respond to incentives. Although, of course, pointing this out doesn’t absolve individual culpability, issues like this point to systemic problems that need to be addressed systemically. It is hard to imagine something like this happening, for instance, if conferences weren’t such a high-stake game in AI/ML research. So, we have to ask ourselves why the stakes are so high. Michael Littman’s article partially answers this question:

… stakes are high because acceptance rates are low (15%–25%), opportunities for publishing at any given conference are limited to once a year, and publications play a central role in building a researcher’s reputation and ultimate professional success. Academic positions are highly competitive, so each paper rejection—especially for graduate students—has a real impact on future job prospects. Some countries correlate promotion and salary decisions to the number of papers accepted at a specific set of high-profile conferences (and journals).

Why are academic positions highly competitive? It’s because there are too many candidates for too few positions. These candidates produce too many papers, too many of which are, to put it bluntly, worthless. Even when these papers are technically sound, they don’t address any interesting or important problems; they propose simplistic ideas in the context of toy problems that obviously won’t pan out for any sufficiently interesting and important large-scale realistic problem. The sad truth is that even if these papers are accepted by a conference, they won’t be read by anybody, won’t provide any benefit for any practical use, and won’t even have any tangible impact whatsoever on the field in the long run. There’s no reason for anybody to waste their time on papers like these, other than the Machiavellian reasons touched upon by Littman (basically to signal to their potential employers that they are “productive” and to chase after power, prestige, and money). There’s no good reason for the public to fund this kind of unproductive research with taxpayers’ money.

It could be argued that this situation is inevitable: most ideas will lead to dead ends, only a very small number of ideas will win out in the long run through a process of natural selection of ideas. But, this is not true: yes, some ideas will, of course, not pan out in the long run, but the current quality/quantity combination for research outputs in AI/ML is clearly not ideal. In my opinion, an alternative research landscape more or less exclusively dominated by a small number of large industry labs like OpenAI, Google Brain, FAIR, etc. as opposed to a large number of small academic labs would clearly land us at a much more favorable position in the space of quality/quantity of research outputs, so the current situation is not inevitable.

This problem, by the way, isn’t specific to AI/ML research, it afflicts most of academia, but probably becomes especially acute when a field becomes “hot.” I sometimes genuinely wonder: at what point do academics in general admit that their field is basically artificially driven by government money and by irrational incentives and rent-seeking behavior? That there are just too many people employed in their field going after too many unproductive, obviously flawed ideas, or uninteresting, insignificant questions? Perhaps the answer is never, because as Upton Sinclair once observed, “it is difficult to get a man to understand something when his salary depends on his not understanding it.” Can academics really justify that they should get this money instead of a public school, or a public hospital, or a homeless shelter, for instance?

What is my proposal then? What would a more rational system look like? First of all, I suggest that there should be a lot fewer people working professionally in AI/ML research. In recent years, most of the interesting and impactful work in this field has come from large industry labs that have the resources to run large scale experiments, so perhaps they should employ the overwhelming majority of the people working professionally in the field. This would mean basically winding down most of the low-impact academic research in AI/ML. Also, in a more rational research landscape, a lot more collective effort/resources than now would be spent on improving hardware and collecting/curating data.

For the rest, I propose a system similar to the marketplace for music production/consumption. The barriers to entry into the field aren’t very high in AI/ML research. Fortunately, large industry players generally share their tools/models publicly. Obviously, they can always do a better job in this respect, for example by making their internal large scale datasets public, by making large scale compute more affordable, more readily accessible to amateur researchers. Motivated amateurs would then produce “content” using these tools and share it publicly: if you think you built something cool, you should just put it out there: write up what you did in a report, put it on arxiv, put your models and code on github in an easily accessible format for others to use and most importantly, make demos to get people excited. If you really did something cool, people will notice it, including prospective professional employers. This would then be the motivated, talented amateur’s ticket to a professional career in AI/ML research.

As this system would eliminate most academic research in the field, there wouldn’t be any need for conferences/journals (of course, conferences could still be organized to meet with people and discuss ideas in person, but they would be a much more informal affair, perhaps more like workshops today). Peer review would be carried out publicly in the marketplace of ideas. There would probably be much less output overall, and whatever output is produced would be more likely to be interesting and impactful, because it would be produced by people genuinely driven to create something interesting and useful to others.

A good yardstick that I like to think about in this connection is OpenAI. Wikipedia says they employ over 120 people. Now, I don’t know how many of those are involved in research, but let’s say ~100. It’s probably safe to say that these are some of the smartest, most talented people in the field. Yet, if we consider their research output quantitatively, it’s not that much. Every year, they put out only a handful of extremely high-impact, high-quality papers/products, like GPT-3, DALL-E, CLIP etc. If the very same set of people were employed in academia instead, they’d probably produce at least one or two orders of magnitude more papers between them, but these papers would be much much less impactful and lower in quality, again attesting to the irrational, unproductive incentive structure of academia.

I should make it clear that I’m not advocating winding down AI/ML education in academia, just research. In fact, education could be the main legitimate purpose of academia under this system. I should also make it clear that I’m not suggesting this system as a model for research in all fields. Some fields with higher technical barriers for research (for example, molecular biology) clearly produce very useful, practical knowledge and/or make meaningful contributions to our understanding of nature (although as I mentioned above, I think the same bad incentives are at play in most places in academia to some degree, so shrinking the size of academic research in general would perhaps not be such a bad idea).

I know at least two other fields quite intimately: cogsci/psychology and neuroscience. Now, I’m going to make an extremely incendiary claim and suggest that research in neither of these fields has produced anything of much value in our understanding of how the mind/brain works and so both deserve a significant shrinkage in size in academia as well. It’s not an exaggeration to say that I have personally learned a lot more about the nature of intelligence, cognition, perception and about how our brains might be doing all these things (supposedly the main subject matter of psychology/neuroscience) from the deep learning research that came out in the last 5-10 years than from decades of simplistic, unfruitful, and sometimes frankly straight up silly psychology/neuroscience research (I’d be extremely willing to debate this issue with anybody who has a different opinion about it). I humbly but sincerely suggest that as a first small step toward improving itself, psychology/neuroscience research can start by putting an indefinite moratorium on the mind-numbingly and soul-crushingly dull and uninteresting left-right random dot motion discrimination task and all its equally uninteresting and insignificant variants. Please do it!

Pinker on why humans are smarter than rats

I’ve been reading Steven Pinker’s The Blank Slate and was struck by a passage I wanted to share. Early in the book, Pinker takes up the question of what makes humans smarter than rats, a question originally posed by Rumelhart and McClelland in the famous PDP book. Rumelhart and McClelland’s answer is to point out: (1) humans have a much bigger cortex than rats and (2) humans and rats live in very different milieus, the human milieu being much more culture-laden than the rat milieu:

Pinker finds this answer, especially the first component (that the human cortex is basically a massively scaled-up version of the rat cortex), patently wrong and even ridiculous, so much so that he goes on to mock this idea several times in later chapters.

Now, I don’t know if this hypothesis (that the human cortex is, to a good approximation, just a scaled-up version of the rat cortex) is true or false. But, it doesn’t strike me as obviously false. Pinker is clearly underestimating the computational power of the sheer scaling-up of the model size here (even without a concomitant increase in data size and diversity or an increase in training time). The human cortex has roughly three orders of magnitude more neurons than the rat cortex. Assuming a similar level of connection sparsity between the two species, this would translate into a whopping six orders of magnitude difference in the number of synapses, or “parameters” (the assumption of similar connection sparsity levels in the human and rat cortices is probably unrealistic; I expect the actual scaling factor for the number of synapses to be somewhere between three and six orders of magnitude, but I couldn’t find a reliable estimate for this). If we learned one thing from recent results in machine learning, it is that increases in model size on this scale can lead to very large, qualitative changes in model behavior. Here’s an example from the GPT-3 paper:

Note that the x-axis in this figure covers a range that is roughly three orders of magnitude in size, hence it would likely be an underestimate of the analogous human vs. rat difference. Note also that in many individual tasks (the faint lines), the model goes through what appears to be a qualitative shift in performance as the model size is increased, with the smaller models performing at near zero accuracies, while the largest one performing at much higher accuracy.
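The back-of-the-envelope scaling behind the neuron/synapse comparison above is simple enough to spell out in a couple of lines (the 10^3 neuron ratio is the rough figure from the text; equal connection sparsity, i.e. a fixed fraction of all neuron pairs being connected, is the stated assumption):

```python
# Rough figure from the text: ~3 orders of magnitude more cortical
# neurons in humans than in rats.
neuron_ratio = 1e3

# Under the equal-sparsity assumption, synapse count scales as
# s * N^2 for N neurons, so the synapse ratio is the neuron ratio squared.
synapse_ratio = neuron_ratio ** 2

print(f"synapse ratio: {synapse_ratio:.0e}")  # six orders of magnitude
```

As noted in the text, equal sparsity is probably unrealistic, so the true scaling factor for synapses likely sits somewhere between the neuron ratio (10^3) and its square (10^6).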

A seemingly innocuous but actually striking prediction of this kind of model size scaling effect is that bigger models should be broadly better than smaller ones across a diverse range of tasks. The individual tasks in the figure above, for example, represent a broad range of text-based tasks, but the same should hold across even more dissimilar tasks, possibly involving other modalities. For example, if we were to plug visual inputs into the models plotted above and train them on some visual tasks, the larger models would still outperform the smaller ones.

As I have recently learned from Robert Plomin’s excellent book, Blueprint, this prediction in fact turns out to be true even when we just consider individual differences between humans (so no need to make cross-species comparisons): people who are good at a particular cognitive-perceptual task often tend to be good at other, seemingly completely unrelated, perceptual-cognitive tasks as well, and these correlations are driven by what Plomin calls “generalist genes”, i.e. genes that have diffuse effects on a broad range of cognitive abilities.

This result is easy to explain if we assume that individual differences between the brain structure of different people relate to innate, but very generic properties, like the number of neurons or the number of connections etc., because as mentioned above a strong correlation between performance in a diverse range of tasks is exactly what you would expect under the scenario of variation in such generic properties like model size. The same result is, however, very hard to explain under a Pinkerite innate-specialized-modularist account of the human brain. I want to highlight a few relevant and important quotes from Plomin touching on this issue:

I think this example from Pinker is unfortunately not an isolated example. Psychologists often don’t have solid, reliable intuitions about the computational complexity of the perceptual and cognitive problems humans face and the importance of various factors such as model size and data size and diversity on performance in these problems. I would actually go so far as to suggest that the entire psychology literature is replete with cases where psychologists make unfounded and unjustified poverty of the stimulus claims based on their unreliable, incorrect intuitions about these computational questions. I hope to write more about this important issue some time in the near future.

AI research, wise passiveness, and negative capability

AI research needs more wise passiveness and negative capability. Wise passiveness is an idea introduced by William Wordsworth in his poem Expostulation and Reply. This poem appears as the first poem in his famous Lyrical Ballads. In the poem, Wordsworth advocates a quiet receptiveness, a passive, non-systematizing openness to the world:

The eye–it cannot choose but see;
We cannot bid the ear be still;
Our bodies feel, where’er they be,
Against or with our will.

Nor less I deem that there are Powers
Which of themselves our minds impress;
That we can feed this mind of ours
In a wise passiveness.

Think you, ‘mid all this mighty sum
Of things for ever speaking,
That nothing of itself will come,
But we must still be seeking?

Here is a longer, superb dissection of the whole poem. Wordsworth invites us to simply listen to the world as it unceasingly speaks to us; then, perhaps we wouldn’t even have to seek knowledge from extraneous, indirect sources like books and/or dead men, which can be interpreted as tradition or received wisdom more generally.

John Keats entertained a similar idea with his concept of negative capability: “… capable of being in uncertainties, mysteries, doubts, without any irritable reaching after fact and reason.” As Crichlow Goellnicht explains, this means a passive, receptive “acceptance of the world in all its diverse aspects, without having to analyze, rationalize, and categorize those aspects, without having to explain away every mystery and doubt, without having to fit everything into a neat, philosophical system.” The reason Keats called this negative capability is presumably because it involves being at peace with uncertainty, doubt, mystery, vagueness, murkiness, and ambiguity, all concepts with at least some degree of negative connotation.

Where am I going with all this? What does this have anything to do with AI or machine learning? Here’s the connection: I think there are whole subfields in AI and machine learning research centered around ideas or concepts that cease to make a whole lot of sense if we become more receptive (gently, passively receptive) to the irreducible richness and complexity of the world without trying to impose our own preconceived theories or ideas on it. I think ideas such as disentanglement, objects, part-whole hierarchies, compositionality, etc. all belong to this unfortunate genre. These are all an educated person’s folk theories about how the world works. The real world and our minds are invariably infinitely more complicated and interesting than can be adequately captured by folk theories like these.

I’d like to end this short post by recommending a few other readings that have argued for a similar non-reductionist view of the world and the mind that embraces their full richness and complexity:

The Bitter Lesson by Rich Sutton (of course :))

Reality has a surprising amount of detail by John Salvatier (h/t Eric Jang)

Science and Engineering for Learning Robots by Eric Jang

On Chomsky and the Two Cultures of Statistical Learning by Peter Norvig

The Unreasonable Effectiveness of Data by Halevy, Norvig, and Pereira

On the Origin of Objects by Brian Cantwell Smith (please be warned that this book may be a bit too philosophical, too “lyrical” 🙂 for a working scientist)

Is compositionality/systematic generalization really a problem for neural networks?

In my last post, I discussed two issues that are widely considered to be serious problems for deep learning models: generalization and few-shot learning (more specifically, meta-learning as a proposal for performing few-shot learning). I argued that these are only problems when we consider small models trained with very limited amounts of data. In this post, I’d like to give one more example of this kind of thing: compositionality or systematic generalization. I’ll again argue that this is only a problem when we consider small toy domains without a lot of structure. It’ll mostly cease to be a problem when we start thinking about the much richer structure of the world we live in, and of our bodies and minds (including our language) that inherit this richness.

There are by now probably more than a dozen benchmarks that evaluate slightly different notions of compositionality or systematic generalization: e.g., SCAN, gSCAN, CURI, COGS, PCFG SET, BabyAI, CLOSURE, SQOOP etc. to name just a few that I’m most familiar with. A common feature shared by most of these benchmarks is that they take place in simple, toy domains without a lot of “affordances”, which necessarily restricts the abundance and richness of the linguistic and semantic/conceptual structures that can be created in them. Some of these benchmarks use natural language or something close to it (e.g., COGS, CFQ), so they don’t necessarily suffer from this particular shortcoming, although they may have other potential weaknesses, like not having a large enough training set or the target task involving a somewhat arbitrary and artificial semantic form (but this is a separate discussion).

For example, a common evaluation condition in these benchmarks is to generalize from just a handful of combinations like x_1 \circ y and x_2 \circ y (e.g., eat furiously and read furiously) to a novel combination x_3 \circ y (e.g., sleep furiously), where x_3 is assumed to be learned from other contexts and x_1, x_2, x_3 are usually the only items of their kind in the domain (e.g., actions). But why do we even expect something like this to work? The world we live in, the world inside our minds (our conceptual world), and our language are nothing like this barren landscape.
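As a concrete (hypothetical) illustration of this evaluation condition, here is a minimal sketch of such a held-out-combination split; the specific word lists are my own toy choices, not taken from any of the benchmarks above:

```python
# The evaluation condition described above: the model sees "eat furiously"
# and "read furiously" (x1∘y, x2∘y) at training time and is tested on the
# held-out combination "sleep furiously" (x3∘y).
actions = ["eat", "read", "sleep"]              # x1, x2, x3
modifiers = ["quickly", "slowly", "furiously"]  # y is "furiously"

all_pairs = [(a, m) for a in actions for m in modifiers]
held_out = [("sleep", "furiously")]             # the novel combination
train_pairs = [p for p in all_pairs if p not in held_out]

print(len(train_pairs), "training combinations,", len(held_out), "held out")
```

Note that both atoms remain attested at training time ("sleep" occurs with other modifiers, "furiously" with other actions); only their combination is novel, and in a domain this barren there are only two other actions to generalize from.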

When we infer the meaning of a novel combination like sleep furiously, we don’t just have two other actions, eat and read, to rely on. Instead, we have an immensely rich, interconnected web of concepts that we bring to bear on this task. An average English speaker knows tens of thousands of words and our conceptual world is presumably much richer than this number would indicate, because there are no single words for many of our concepts and some of our concepts are altogether difficult to precisely articulate in language. But more than its sheer size, what gives this conceptual web its true richness and power is its highly interconnected and structured nature. For example, among the dizzying, almost stupefying range of things we know about sleeping is the fact that it can sometimes involve restless states, wild movements, hellish nightmares, intense dreams, loud snoring etc., which are all associated with the concept of fury, or the state of being furious, through various more or less circuitous conceptual routes, so we could easily imagine what it would be like to sleep furiously by tracing these routes, even if we heard this particular combination for the first time.

And when applied at scale, neural networks are in fact remarkably good at capturing and utilizing these kinds of associations to make sense of novel combinations. Recent large scale deep learning models like DALL-E and GPT-3 are very good demonstrations of this in my view. Look at the remarkable agility and accuracy with which DALL-E seems to make sense of novel combinations like “a store front that has the word ‘openai’ written on it” (we know that this is a novel combination, because it doesn’t exist in the real world):

Or consider this utterly mind-blowing demonstration of the compositional skills of GPT-3 (source):

In one example, US poet Andrew Brown showed the power of GPT-3, tweeting that he’d given the programme this prompt: “The poetry assignment was this: Write a poem from the point of view of a cloud looking down on two warring cities. The clever student poet turned in the following rhyming poem:”

GPT-3 responded:

“I think I’ll start to rain,

Because I don’t think I can stand the pain,

Of seeing you two,

Fighting like you do.”

And even in simpler, toy domains, which common compositionality benchmarks often focus on, there’s some recent evidence suggesting that simply scaling up the size and diversity of these domains can solve many of the splits in these benchmarks that may seem superficially challenging in smaller scale versions (e.g., Kagitha, 2020; Hill et al., 2020).

It could be argued that these models require too much data to achieve these compositional skills, and hence that they’re not nearly as sample efficient as humans. Therefore, the argument goes, the main goal of this field should be to come up with useful inductive biases that would improve the sample efficiency of the models in acquiring these compositional generalization abilities. But these kinds of comparisons with humans are a bit misleading to my mind, because of the radically different nature of the inputs that humans receive (e.g., multimodal, embodied, and embedded in a much richer world). Perhaps the seemingly greater data demands of these models are simply an illusion created by the fundamentally different nature of the inputs.

On the futility of trying to be clever (the bitter lesson redux)

The bitter lesson of history in AI is that “general methods that leverage computation are ultimately the most effective, and by a large margin.” There are various manifestations of our unfortunate unwillingness to learn this bitter lesson. Sutton focuses on one in his essay: trying to leverage human knowledge, trying to build in “how we think we think”, which “does not work in the long run”, because “the actual contents of minds are tremendously, irredeemably complex.” There are others: trying to come up with clever algorithmic ideas and hacks to eke out a small advantage in a narrow domain and in the short run. This describes the overwhelming majority of current research in machine learning and AI (including some of my own). It is an irresistible temptation with strong incentives behind it, but it is ultimately misguided and is not what leads to long-term progress and meaningful impact. In this post, I’ll give two recent examples from deep learning: domain generalization and meta-learning.

Generalization is often considered to be one of the biggest problems for deep learning. You have some data. You have a model. You train the model on the data. Fine. Then, you get some new data that’s different from the training/test data you used before but you feel that it’s similar to the previous data in some fundamental respect and that the model should be able to handle it (just to be concrete here, let’s say you trained your model on natural images and want it to generalize to drawings or paintings of the same kinds of things), because look, we humans don’t have any problem making these kinds of seemingly non-trivial generalizations! So, you try your trained model on the new data and it fails miserably. That’s, of course, disappointing. Then, researchers spend an inordinate amount of effort trying to come up with ever cleverer algorithmic or architectural schemes to make models generalize a tiny bit better to novel data/domains given the same fixed (and crucially often relatively small) training data. But, what if this whole enterprise is misguided? Why are we assuming that our training data is fixed and small? And what if there’s simply no clever algorithmic or architectural shortcut to training our models on very large, diverse datasets (if we want to have models that can generalize well)? There’s certainly strong prima facie evidence that this may well be the case.

Take invariant risk minimization (IRM), one of the more popular domain generalization methods proposed recently. IRM considers a classification problem that takes place in multiple domains or environments, e_1, e_2, …, e_E (in an image classification setting, these could be natural images, drawings, paintings, computer-rendered images etc.). We decompose the learning problem into learning a feature backbone \Phi (a featurizer), and a linear readout \beta on top of it. Intuitively, in our classifier, we only want to make use of features that are invariant across different environments (for instance, the shapes of objects in our image classification example), and not features that vary from environment to environment (for example, the local textures of objects). This is because the invariant features are more likely to generalize to a new environment. We could, of course, do the old, boring empirical risk minimization (ERM), your grandmother’s dumb method. This would simply lump the training data from all environments into one single giant training set and minimize the loss on that, with the hope that whatever features are more or less invariant across the environments will automatically emerge out of this optimization. Mathematically, ERM in this setting corresponds to solving the following well-known optimization problem (assuming the same amount of training data from each domain):

\min_{\Phi, \hat{\beta}} \frac{1}{E} \sum_e \mathfrak{R}^e(\Phi, \hat{\beta}), where \mathfrak{R}^e is the empirical risk in environment e.

IRM proposes something much more complicated instead: why don’t we learn a featurizer with the same optimal linear readout on top of it in every environment? The hope is that in this way, the featurizer will only learn the invariant features, because the non-invariant features change from environment to environment and hence can’t be decoded optimally using the same fixed readout. The IRM objective thus involves a difficult bi-level optimization problem:

\min_{\Phi, \hat{\beta}} \frac{1}{E} \sum_e \mathfrak{R}^e(\Phi, \hat{\beta}) s.t. \hat{\beta} \in \arg \min_{\beta}\mathfrak{R}^e(\Phi, \beta) for all environments e.
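To make the contrast concrete, here is a minimal numpy sketch (a toy linear-regression version, not the authors’ implementation) comparing pooled ERM with the practical “IRMv1” gradient penalty that the IRM paper uses to relax the bi-level problem above; the synthetic environments, noise levels, and readouts are all assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_env(c, n=10_000):
    """Synthetic environment: x1 is an invariant cause of y, while x2 is a
    spurious *effect* of y whose strength (the factor c) varies across
    environments."""
    x1 = rng.normal(size=n)
    y = x1 + 0.1 * rng.normal(size=n)
    x2 = c * y + 0.1 * rng.normal(size=n)
    return np.stack([x1, x2], axis=1), y

envs = [make_env(1.0), make_env(2.0)]

# ERM: ordinary least squares on the pooled data (the first objective above)
X_all = np.concatenate([X for X, _ in envs])
y_all = np.concatenate([y for _, y in envs])
beta_erm, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)

def irmv1_penalty(beta):
    """The practical 'IRMv1' relaxation of the bi-level constraint: the
    squared gradient of each environment's risk w.r.t. a scalar multiplier
    w on the readout, evaluated at w = 1 (near zero when the readout's
    overall scale is already optimal in every environment)."""
    total = 0.0
    for X, y in envs:
        pred = X @ beta
        grad_w = 2.0 * np.mean((pred - y) * pred)  # d/dw risk(w*beta) at w=1
        total += grad_w ** 2
    return total

beta_invariant = np.array([1.0, 0.0])  # uses only the invariant feature x1
beta_spurious = np.array([0.0, 0.7])   # leans on the spurious feature x2
print("ERM readout:", beta_erm)
print("penalty(invariant):", irmv1_penalty(beta_invariant))
print("penalty(spurious): ", irmv1_penalty(beta_spurious))
```

In this toy setup, the invariant readout incurs a near-zero penalty in both environments while the spurious one does not, which is exactly the behavior IRM hopes to exploit; Rosenfeld et al.’s point is that this mechanism stops identifying the invariant features in more realistic (especially non-linear) settings.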

In my view, it should always ring an alarm bell in your mind if your proposed method involves solving a gnarly optimization problem like this, because it suggests that it may not be a general, scalable method. But is it at least effective at extracting those invariant features? Or does it at least work better than your grandmother’s dumb ERM in this respect? It turns out the answer is a decisive no! IRM fails utterly and completely in this respect. In a recent ICLR paper, Rosenfeld et al. show that in the linear case, IRM fails to extract the invariant features except in some unrealistic settings where basically anything will work, and in the non-linear case, it works no better than ERM in finding the invariant classifier (please see the paper for a more precise statement of the results).

IRM assumes the existence of a featurizer \Phi for which the conditional expectation \mathbb{E}[Y \mid \Phi(X)] is invariant across environments. Inspired by IRM, even stronger constraints have been imposed in the literature, for example, demanding that the whole conditional distribution p(Y|\Phi(X)) be invariant instead. Rosenfeld et al. show that these methods also fail to work any better than ERM, for similar reasons.

Another ICLR paper this year by Gulrajani and Lopez-Paz (incidentally, two of the co-authors of the original IRM paper) reaches the same conclusion through a series of carefully conducted experiments: when compared head-to-head, no fancy, bespoke, boutique domain generalization algorithm (and they have now evaluated more than a dozen algorithms) significantly outperforms ERM. This paper also emphasizes the importance of specifying a model selection method as an integral component of domain generalization algorithms.

Of course, these results don’t prove that it is impossible to beat ERM in domain generalization (I would be eternally grateful to anybody who proves a result like this), but they do suggest to me that ERM is a very simple, general, effective method that will be hard to beat by a significant margin. So, I think it is prudent for researchers to keep this in mind when deciding how to spend their research efforts most productively.

The second example I’d like to give is meta-learning, another hot topic in machine learning replete with clever ideas. First, a word of caution: people unfortunately use the term “meta-learning” in quite different senses in machine learning. Sometimes it’s used to refer to a multi-loop optimization process (as in MAML) and sometimes it should really just be called “multi-task learning” (or how about simply “learning”), but “meta-learning” (or worse still “learning to learn”) is used presumably because it sounds more sophisticated and impressive. I just want to make it abundantly clear that here I’ll be talking about meta-learning in the first sense only, i.e. multi-loop optimization. This approach is often used for few-shot learning (another supposed shortcoming of deep learning models, which is again really just a shortcoming of small models trained with too little data), because it can directly target few-shot learning performance through inner loop optimization. The idea is that the outer loop optimizes the inner loop which directly corresponds to fast adaptation or few-shot learning performance when the inner loop is run for a small number of steps. But two recent papers, first by Raghu*, Raghu* et al. and second by Tian*, Wang* et al., show that in practice the inner loop run doesn’t really do much in these algorithms, so much so that one can safely do away with the inner loop entirely. This means that the success of these algorithms can be explained completely by standard (single-loop) learning on the entire lumped meta-training dataset. Another recent beautiful theory paper by Du et al. sheds some light on these experimental results.
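To make the multi-loop sense concrete, here is a toy numpy sketch of first-order meta-learning on scalar linear-regression tasks (a simplification: real MAML differentiates through the inner loop, while this first-order variant does not, and the task family is my own toy choice). Setting inner_steps=0 collapses the whole scheme to plain single-loop multi-task SGD, which is exactly the baseline that, per the papers above, explains most of the performance:

```python
import numpy as np

rng = np.random.default_rng(0)
slopes = [1.0, 2.0, 3.0]  # each task t is y = a_t * x (assumed toy family)

def task_batch(a, n=32):
    x = rng.normal(size=n)
    return x, a * x

def grad(w, x, y):
    # d/dw of the squared loss mean((w*x - y)^2)
    return 2.0 * np.mean((w * x - y) * x)

def train(inner_steps, inner_lr=0.1, outer_lr=0.05, iters=1000):
    w = 0.0
    for _ in range(iters):
        a = slopes[rng.integers(len(slopes))]  # sample a task
        x_s, y_s = task_batch(a)               # support set (inner loop)
        x_q, y_q = task_batch(a)               # query set (outer loop)
        w_adapted = w
        for _ in range(inner_steps):           # inner loop: fast adaptation
            w_adapted -= inner_lr * grad(w_adapted, x_s, y_s)
        # first-order outer update: apply the adapted parameters' gradient
        # directly to the meta-parameters
        w -= outer_lr * grad(w_adapted, x_q, y_q)
    return w

# inner_steps=0 is just standard multi-task SGD on the lumped task data
print(train(inner_steps=0), train(inner_steps=3))
```

Both settings converge to roughly the same meta-parameters (near the mean task slope), which mirrors the experimental finding that the inner loop can often be removed with little loss.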

Perhaps, at this point you feel that this post paints a very pessimistic (nihilistic even) picture of the machine learning/AI research landscape. Coming up with new, clever, creative algorithmic ideas is the bread and butter of computer scientists. If that’s a mostly pointless exercise, what is there left to do? First, I’d argue that there’s probably a significant difference between computer science in general and machine learning in this respect: while it is true that algorithmic innovation is central in computer science in general, by its very nature, it is supposed to be less important (although of course not totally pointless) in machine learning, because we’re off-loading a significant chunk of the burden to the machine itself! Second, a research topic is, of course, a deeply personal choice. Who am I to say what one should or should not work on? Who would even listen to me? But I do think that there are many interesting research directions consistent with the philosophy of the bitter lesson that can have more meaningful, longer-term impact than small algorithmic or architectural tweaks. I just want to wrap up this post by giving a couple of examples below:

(1) Probing the limits of model capabilities as a function of training data size: can we get to something close to human-level machine translation by leveraging everything multi-lingual on the web (I think we’ve learned that the answer to this is basically yes)? Can we get to something similar to human-level language understanding by scaling up the GPT-3 approach a couple of orders of magnitude (I think the honest answer here is that we don’t know yet)? Of course, smaller scale, less ambitious versions of these questions are also incredibly interesting and important.

(2) Finding out what we can learn from different kinds of data and how what we learn differs as a function of this: e.g. learning from raw video data vs. learning from multi-modal data received by an embodied agent interacting with the world; learning from pure text vs. learning from text + images or text + video.

(3) Coming up with new model architectures or training methods that can leverage data and compute more efficiently, e.g. more efficient transformers, residual networks, batch normalization, self-supervised learning algorithms that can scale to large (ideally unlimited) data (e.g. likelihood-based generative pre-training, contrastive learning).

An optimistic perspective on the human-AI nexus

For we know in part, and we prophesy in part.

But when that which is perfect is come, then that which is in part shall be done away.

When I was a child, I spake as a child, I understood as a child, I thought as a child: but when I became a man, I put away childish things.

For now we see through a glass, darkly; but then face to face: now I know in part; but then shall I know even as also I am known.

– 1 Corinthians 13:9-12

In the intelligence explosion scenarios, recursive self-improvement by an AI initially created by humans creates ever more intelligent progeny, making humans (and much else) “redundant” in short order by their absurdly, overwhelmingly superior intelligence. I don’t have a very definite view on how plausible or likely these scenarios are. It’s very likely that we simply don’t know enough about the nature of intelligence itself to even judge with any degree of reliability how likely these scenarios are (here is a humorous take I like that emphasizes this point); for those interested in these scenarios, David Chalmers does a good job of dissecting the argument for an intelligence explosion here.

In these scenarios, it is always assumed that it is inevitable that humans will be made redundant at some point, at least partly because of some hard constraints on our intelligence (usually something to do with our sloppy, slushy, and more or less fixed hardware, the brain). Chalmers puts it thus (p. 13): “Insofar as enhanced brains always rely on a biological core, however, there may be limitations. There are likely to be speed limitations on biological processing, and there may well be cognitive limitations imposed by brain architecture in addition. So beyond a certain point, we might expect non-brain-based systems to be faster and more intelligent than brain-based systems.”

In this post, I’d like to argue to the contrary that: just like we don’t know enough about the nature of intelligence itself to say anything useful about the possibility or the likelihood of a superintelligent AI, we also don’t know enough about the limits of our own human intelligence, especially human intelligence extended and enhanced by the non-superintelligent AI we’re creating, to claim with any degree of certainty that it will inevitably be superseded by a superintelligent AI.

The main point is that although, of course, what Chalmers says about the hardware limitations of biological processing is correct, intelligence is not just a function of hardware, but also of how that hardware is used, i.e. the software that runs on that hardware. And we humans have shown a remarkable degree of agility and adaptability in making use of our sloppy, slushy hardware since our inception as a species.

Think about this: biologically, we’re essentially the same species as our ancestors who lived some 100K years ago on this planet. In terms of material and intellectual culture, we were a much more primitive species back then. It is almost certain that we didn’t even have something that we use to define ourselves as a species today, namely a full-blown language. It is very likely that whatever language these ancestors of ours had back then was extremely primitive (something of the me Tarzan, you Jane variety). And now look how far we’ve come in 100K years! We’re now a species capable of probing the depths of the universe both at the smallest scales and at the largest scales. All with the same hardware! Even from one generation to the next, we’ve been getting more and more intelligent lately: consider the Flynn effect or consider reading a paper in your field written a few generations ago by a giant of the field at the time and see how naïve it’ll sound to you (I had this epiphany recently after reading Alan Turing’s classic 1950 paper on computers and intelligence).

I like to think of this as an algorithmic improvement process: we find ever more efficient ways of using our limited hardware through constant cultural and technological innovations and discoveries, and we simply don’t know the limits of this process, i.e. how far and how fast we can follow this cultural-technological-“algorithmic” route before we hit a true hardware “wall”.

I see the non-superintelligent AI we’re creating today as part of this cultural-technological-“algorithmic” route too. They’re the microscopes and telescopes of our age, only much more general purpose, hence much more powerful. Like the microscopes and telescopes of an earlier age, they allow us to see whole new worlds we wouldn’t have been able to see unaided.

Look at this picture:

“Adversarial examples are not bugs, they are features” (link)

Would you have guessed that there’s actually a frog in this picture? Would you have guessed that you could recognize frogs using weird features like this (probably much better than humans could)? Knowing this opens up a wonderful whole new world for us, full of patterns we hadn’t even suspected were there before. We could now probe this wonderful new world with our “microscopes” and perhaps one day we could even use it to our advantage for some practical purpose.

Or consider how expert chess players describe AlphaZero’s capabilities: “Chess is full of superhuman expert systems, yet AlphaZero discovered an uncharted space in which its self-taught insights were both startling and valuable. That uncharted space was so significant that AlphaZero was able to convincingly defeat the strongest expert system at the time of testing. Bearing that in mind, you can’t help but to be positive for the application of AlphaZero-like techniques in environments that are less well-researched than chess. Maybe soon, scientists will be echoing our cry during the World Championship: ‘AlphaZero, find us a path!’” Although not every detail of AlphaZero’s decisions will be transparent to human players, we can still glean useful high-level insights (and sometimes even lower-level, more detailed insights) from its playing style that can help improve human players. It was, for example, notable that AlphaZero seemed to place much less value on material than a human player would, preferring activity or dynamism over material instead.

Just last week, a paper came out in Nature showing that an AI system improved the yield of certain chemical reactions over that achieved by expert human chemists, by trying out less mainstream, more adventurous reagents than the human experts, who by comparison had a more conservative bias in their choices.

These are just a few simple examples among countless others of human-built AI systems opening up whole new ways of seeing and thinking for us, helping us understand our weaknesses better, and offering possible ways of improving ourselves. Undoubtedly, there will be many more (and more significant) such examples in the coming years. I’m personally particularly interested in the possibility of harnessing the help of AI in improving the design of our social, political, and economic institutions (e.g. this). These institutions are susceptible to our collective human weaknesses and also likely constitute the most significant bottleneck in our continued self-improvement as a species on this planet. In this way, I hope we will be able to continue to make ever more efficient use of our fixed, limited, seemingly meager, sloppy, slushy hardware for a long while more.

Thoughts on Image-GPT

The following are some short notes on OpenAI’s Image-GPT paper, which is in my opinion one of the most important papers that came out in recent years.

The motivating question behind this paper is this: can likelihood-based generative pre-training lead to strong transfer learning results in computer vision? This question is inspired by the success of the same technique in NLP (where it’s commonly known as language modeling). In computer vision on the other hand, successful transfer has so far been achieved mostly through other (non-generative) pre-training objectives, like supervised pre-training (on ImageNet etc.), or more recently self-supervised pre-training (MoCo, SimCLR, BYOL, etc.). This raises the interesting question of whether there might be some fundamental differences between language and vision tasks that make these different methods more appropriate for these two respective domains. The Image-GPT paper answers this question in the negative and shows for the first time that likelihood-based generative pre-training can also lead to very strong transfer learning results provided that we use the right kind of architecture (GPT) at the right scale.
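To be concrete about what “likelihood-based generative pre-training” means for images, here is a minimal sketch of the data side of the approach (the toy image size and 8-level quantization are my own simplifications; Image-GPT itself uses 32x32 inputs with a 9-bit color palette and a GPT transformer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": 4x4 pixels, intensities quantized to 8 discrete levels so
# that each pixel becomes a token from a small vocabulary
image = rng.integers(0, 8, size=(4, 4))

# Raster-scan flattening turns the image into a 1-D token sequence,
# exactly the shape a GPT-style autoregressive model consumes
tokens = image.reshape(-1)

# Next-token prediction: at step i the model conditions on tokens[:i] and
# is trained (via cross-entropy) to predict tokens[i]
contexts = [tokens[:i] for i in range(1, len(tokens))]
targets = tokens[1:]

print(len(contexts), "prediction problems from one", image.shape, "image")
```

Once the image is in this form, the training objective is literally the same next-token likelihood used in language modeling, which is what lets the identical GPT architecture be reused across the two domains.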

The other main interesting result from this paper is that the very same GPT architecture is shown to perform well in both language prediction and image prediction tasks, suggesting that these (and similar) tasks share a deep common computational core (something very general like: predict given as much context as possible) despite their many superficial differences and can be solved effectively by the same kind of computational architecture. I think this observation has important implications for the brain and evolution. To me, one of the things it suggests is that whatever inductive biases our brains (more specifically, our neocortex) may have, they’re probably not very domain-specific biases like many psychologists seem to believe (e.g. specific biases about visual objects, agents, or language). Rather, it’s likely that they’re much more generic (and less easily conceptualizable/articulable) biases, having to do with better information processing in some very general sense, like being able to integrate efficiently over a much larger context, or being able to do better deep credit assignment (i.e. dealing with vanishing/exploding gradients) etc. It is important to emphasize here that the GPT architecture itself embodies absolutely no language-specific or vision-specific inductive biases.

This idea also accords well with the sources of recent progress in machine learning. When I look at what drives significant architectural progress in machine learning today, most of the time it’s somebody proposing a solution to a very generic information processing problem: e.g. in ResNets, solving an optimization problem (vanishing/exploding gradients); in transformers, getting rid of the serial processing bottleneck of RNNs to make it feasible to integrate over a much longer context; in batch normalization, dealing with the covariate shift problem during training etc. Certainly, biological evolution doesn’t have to respect the same rules as human innovation, but at least to me this suggests that there’s maybe more bang for the buck in targeting these general information processing related problems than targeting more domain specific issues, which makes it more plausible that evolution may also be primarily targeting the same general issues.

One final interesting result in the Image-GPT paper is that even at the same validation loss in the generative pre-training task (i.e. ImageNet modeling), bigger models seem to show better transfer learning performance (Figure 3 in the paper). This is interesting in light of my criticism of the GPT-3 paper, where different-sized models were not given the same amount of compute, and it seemed likely that the smaller models would reach the same (or maybe even better) validation loss as the largest 175B-parameter model if they were given the same amount of compute. The results in the Image-GPT paper suggest that even in that case, the larger models might have had an advantage in terms of transfer performance. It would have been much better, though, if the GPT-3 paper had actually carried out this important experiment, as the Image-GPT paper did, to see whether the larger models have a transfer advantage above and beyond what can be accounted for by validation loss (or compute) alone.

I would have liked to see more analysis of the learned representations in this paper and a more detailed comparison between the visual representations learned in this likelihood-based generative way vs. those learned in discriminative settings (e.g. in contrastive self-supervised learning). One interesting hypothesis is that representations learned with likelihood-based generative objectives can handle out-of-distribution (OOD) stimuli, such as adversarial examples, better. Intuitively, this could be because likelihood-based objectives require all aspects of the data to be explained and hence reduce the possibility of taking “shortcuts”, which seems to be a common problem with discriminative objectives. Consistent with this idea, there’s some prior work suggesting that likelihood-based generative models can improve the adversarial robustness of deep neural networks.

Thoughts on GPT-3

A couple of months ago, OpenAI released a paper describing their latest language model, GPT-3. GPT-3 is distinguished from its predecessors by nothing other than its sheer scale: it’s just a bigger language model trained on a bigger dataset (~1-2 orders of magnitude bigger in both model size and training data size). So, the paper is essentially an exercise in scaling. The main novel result in the paper is an impressive demonstration of the (in-context) few-shot learning abilities of such large-scale language models (it can be argued that even this main result is not entirely novel, as it was foreshadowed in some earlier language modeling work, e.g. see this and this). The paper reminded me, once again, of Philip Anderson’s famous More Is Different paper, where Anderson argues that quantitative changes in nature can sometimes lead to qualitative changes and that people (even scientists) don’t always appreciate the consequences of this fact enough. It was also inspiring for me to see all the amazing demos people have quickly built with GPT-3 and shared with the world (here is a nice collection of such demos as a Twitter thread).

In this post, I’d like to briefly discuss a few criticisms I had of the GPT-3 paper.

Umm, yeah, did we really need that 175B-parameter model?

The first one is about the actual need for scale: i.e. whether they really needed to train a giant 175B-parameter model or not. Figure 3.1 in the paper (reproduced above) clearly shows that many of their smaller models were not trained to convergence; this figure also shows that the smaller models are actually more compute-efficient up to the total compute used for those smaller models. To me, this strongly suggests that they didn’t actually have to train a 175B-parameter model: a ~5B-parameter model would probably have performed just as well (if not better) had it been trained longer. This point was also noted by Graham Neubig on Twitter.

This renders all the figures in the paper showing model size on the x-axis and performance on the y-axis (which is most of the figures in the paper) a bit suspect in my mind, because the smaller models were not given the same amount of compute in those figures.

So why did they train a 175B-parameter model then? One possibility is just because they could; they perhaps wanted to prepare this kind of infrastructure for projects down the line that actually do require models at this scale. A more sinister interpretation is that they want to commercialize this product at some point (this would be consistent with their CEO’s expressly stated objective of “capturing the light cone of all future value in the universe”) and a giant model is more “controllable” for this purpose: a client can easily put a 5B-parameter model on a few GPUs of their own to do inference and fine-tuning as they wish, but they can’t do this with a 175B-parameter model, making them more reliant on OpenAI’s specialized hardware.

A second difficulty with the paper for me was my constant struggle to understand to what extent the model was doing abstraction (or generalization) vs. rote memorization; in other words, to what extent the impressive-looking results can be attributed to the sheer size of the training data vs. the abstraction capacity of the model. To probe this, it would have been extremely useful if, at least for a subset of the tasks and examples, the authors had shown the embedding-space nearest neighbors to a given query among the training data, but surprisingly they never do this in the paper (I don’t suppose this would be technically more challenging than running a search over the input space, which they do multiple times in the paper). If these nearest neighbors were intuitively highly similar to the query, and the model’s outputs more or less resembled the actual continuations of those neighbors (say, with simple substitutions), that would favor a dataset-size-based explanation for the model’s performance.

They do try to rule out the rote-memorization explanation in some of their experiments, but these were not entirely convincing for me. For example, in the arithmetic tasks, they search their training data for patterns of the form “<NUM1> + <NUM2> =” and “<NUM1> plus <NUM2>” to investigate whether the model is just memorizing these arithmetic equations. They find only a small number of matches and conclude that a rote-memorization strategy seems unlikely. But the problem here is that these are just two of the almost endless ways the same arithmetic equations could be encoded in the training data (note that their training data includes a snapshot of the entire world wide web, which is a really, really big place!): e.g. “<NUM1> <NUM2>”, “<NUM1> & <NUM2>”, “<NUM1> | <NUM2>”, “<NUM1> p <NUM2>”, “<NUM1> pl. <NUM2>”, “<NUM1> || <NUM2>”, etc. Here, again, it would have been much more meaningful if they showed us some nearest neighbor retrievals instead.
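As an aside, the kind of embedding-space nearest-neighbor analysis I’m asking for is cheap to sketch. Below is a minimal illustration with random stand-in arrays; `nearest_neighbors` is my own hypothetical helper (nothing from the GPT-3 codebase), and in the real analysis the embeddings would of course come from the trained model:

```python
import numpy as np

def nearest_neighbors(query_emb, train_embs, k=5):
    """Return indices of the k training embeddings closest to the query
    under cosine similarity (highest similarity first), plus the sims."""
    # Normalize to unit length so dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    T = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = T @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy demonstration with random "embeddings".
rng = np.random.default_rng(0)
train_embs = rng.normal(size=(1000, 64))
query_emb = train_embs[42] + 0.01 * rng.normal(size=64)  # near example 42
idx, sims = nearest_neighbors(query_emb, train_embs, k=5)
print(idx[0])  # the top retrieval should be example 42
```

If the top retrievals for a GPT-3 query looked this close, with continuations resembling the model’s output, that would be strong evidence for the dataset-size explanation.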

So, where do we go from here? Is training ever bigger language models on ever bigger training data the way forward for an ever more general kind of intelligence? I don’t think so. One immediate difficulty is that unlike compute, it is hard to imagine how the training data can be increased another couple of orders of magnitude. As mentioned above, their training data already includes a snapshot of the entire web (and then some). Perhaps more book datasets can be added to the training data or some improvements can be made in data quality through better cleaning up of the web data (which is, in itself, a significant challenge), but I just don’t see how these can be combined into a few orders of magnitude increase in the effective data size.

In my view, a much more promising route would be to try to add some sort of grounding to these language models, e.g. through pictures or videos from the web. I think grounding is crucial for models to have a better understanding of the world; and anecdotal evidence from human experience suggests to me that these models perhaps wouldn’t need nearly as much grounding experience as they need text data to achieve a reasonably good grounded understanding of the world. This is because it seems to me that we humans acquire most of our grounding early in our development through interactions with a fairly limited environment, and acquire pretty much all the rest of our knowledge only indirectly, through social and cultural means, for example, by learning things from other people, or by reading about them in books, articles, web pages etc. (Anthropologist Joe Henrich makes a similar point in his book The Secret of Our Success). Current language models already seem to be highly efficient at extracting information from extremely large scale text data. To complement this already super-human ability, finding good grounding objectives and grounding data for training large-scale grounded language models would be a very promising and exciting direction, I think (see this, this, and this for some recent attempts in this direction).

Update (09/04/2020): I apparently missed this earlier, but OpenAI has made its intention to make GPT-3 a commercial product very clear right from the beginning (see here). They even mention the large size of the model as an excuse not to release it:

… many of the models underlying the API are very large, taking a lot of expertise to develop and deploy and making them very expensive to run.

So, it seems like my sinister interpretation above of why OpenAI trained a much larger model than was actually warranted was not too far off the mark!

Deep learning can make more use of available data

This is just a short post on something I’ve been thinking about lately. The argument is often made that deep learning needs stronger, better priors, usually in the form of architectural improvements. I’m not necessarily against this idea; however, in this post I’d like to make the complementary case that even with the current architectures and training algorithms, deep learning can probably make more use of the available data, i.e. it can squeeze more juice out of the data it already has. Why do I think so, and how might deep learning achieve this? A few reasons make me think so:

  1. Argument from cold posteriors: in Bayesian neural networks, it has been empirically observed that the best predictive performance is obtained not with the actual posterior, but with “cold posteriors”: artificially manipulated posteriors that overcount the effect of the data and undercount the effect of the (usually generic) prior. This suggests that current techniques in deep learning may be undercounting the potential of the data, given that one has to resort to artificially boosting its effect in Bayesian neural networks.
  2. Argument from slow and brittle convergence to “natural” solutions: there is some interesting theoretical work suggesting that in some simplified problems, standard deep learning techniques will converge to what I would consider the “natural” solutions, but the convergence is painfully slow and brittle. Let me give two examples: Soudry et al. (2018) show that in logistic regression with separable data, gradient descent converges to the max margin solution (which can be considered the natural solution for this type of problem), but convergence is extremely slow, i.e. something like O(\log \log t / \log t), and is brittle in the sense that it doesn’t hold for some popular adaptive gradient descent algorithms like Adam. Ji & Telgarsky (2019) show a similar result for logistic regression problems with non-separable data, but the convergence here is again extremely slow, i.e. the rate of convergence in direction to the max margin solution is again O(\log \log t / \log t). On the other hand, it is clear that the convergence to the max margin solution in these problems can be significantly sped up with simple data-dependent initialization schemes. In a similar vein, some prior works have suggested that important generalization properties of neural networks, such as their ability to generalize compositionally, are extremely sensitive to initialization, again implying that starting from a data-agnostic, generic initialization may not be optimal.
  3. Argument from empirical Bayes: how can deep learning make more use of the available data? A straightforward idea is the one I mentioned at the end of the last paragraph, i.e. using a data-dependent initialization scheme (I gave a simple example of this kind of scheme in a previous post). This approach is reminiscent of the empirical Bayes method in Bayesian statistics, which underlies a whole host of beautiful and surprising results like the Stein phenomenon. The basic idea in empirical Bayes is to refuse to assume a non-informative, generic prior for the variables of interest (for a neural network, these could be the parameters, for instance) and to instead estimate the prior from the data. You can see that this idea accords nicely with a data-dependent initialization scheme for neural network parameters. Empirical Bayes enjoys some appealing theoretical performance guarantees compared to common alternatives like maximum likelihood, which suggests that similar improvements may hold for data-dependent initialization schemes for neural networks as well.
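To make the cold-posterior observation in item 1 concrete, here is one common way it is formalized in the literature (some works temper only the likelihood term; I’m showing the form where the whole posterior energy is tempered):

```latex
% Tempered ("cold") posterior over network parameters \theta:
p_T(\theta \mid D) \;\propto\; \exp\!\big(-U(\theta)/T\big),
\qquad
U(\theta) \;=\; -\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) \;-\; \log p(\theta).
```

Setting T = 1 recovers the exact Bayes posterior; the empirical finding is that temperatures T < 1, which sharpen the data-dependent energy relative to the fixed generic prior, give the best predictive performance. That artificial sharpening is exactly what I mean by “overcounting the effect of the data.”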

In defense of pure learning

It is often claimed that animals are more sample efficient than artificial neural networks, meaning that they can learn from many fewer examples (sometimes from a single example), and that one of the reasons for this is the useful innate inductive biases sculpted into the brains of animals through adaptations over evolutionary time scales. This claim is then usually followed by the prescriptive advice that we do something similar with artificial neural networks: namely, build more innate structure into our models (or at least include an outer optimization loop that would learn such useful inductive biases over longer time scales, as in meta-learning). Tony Zador and Gary Marcus made arguments of this sort recently.

In this post, I’d like to take issue with arguments of this sort. Actually, my objection to these kinds of arguments is already well-known. Zador has a section in his paper addressing precisely this objection (the section is titled “supervised learning or supervised evolution?”). So, what is the objection? The objection is that arguments of this sort conflate biology and simulation. They assume that the learning that happens in an artificial neural network is comparable to the learning that happens in a biological system over its individual lifespan. But there’s no good reason to think of artificial learning in this way. We should rather think of it as a combination of the learning that happens over an individual lifespan and the adaptations that take place over evolutionary time scales. When we think of artificial learning in this light, the sample efficiency argument in favor of animals falls by the wayside, because biological evolution has been running the most intense optimization algorithm in the biggest and the most detailed simulation environment ever (called “the real world”) for billions of years (so much for “one-shot” learning).

As I said, Zador is aware of this objection, so what is his response to it? As far as I can tell, he doesn’t really have a very convincing response. He correctly points out the differences between biological optimization and learning in artificial networks, but this doesn’t mean that they can’t generate functionally equivalent networks.

For example, Zador notes that biological optimization runs two nested optimization loops, the inner loop characterizing the learning processes in individual lifespans, the outer loop characterizing the adaptations over evolutionary time scales. This is similar to a learning paradigm called meta-learning in machine learning. And because of its similarity to biology, Zador is very much sympathetic to meta-learning. But in my mind the jury is still out on whether meta-learning has any significant advantages over other standard learning paradigms in machine learning. There are recent results suggesting that in practical problems one doesn’t really need the two separate optimization loops in meta-learning (one loop is all you need!). Moreover, if one trains one’s model in a sufficiently diverse range of problems (but crucially using a standard learning paradigm, such as supervised learning or reinforcement learning), “meta-learning” like effects emerge automatically without any need for two separate optimization loops (see also this beautiful new theory paper explaining some of these experimental results).

The core problem here, I think, is again conflating biology and simulation. Just because we see something in biology doesn’t mean we should emulate it blindly. Biology is constrained in many ways simulation isn’t (and vice versa). Of course it makes sense to use two separate optimization loops in biology, because individual lifespans are limited, but this isn’t true in simulation. We can run our models arbitrarily long on arbitrarily many tasks in simulation.

I think this (i.e. the mismatch between biology and simulation) is also why naive ways of emulating the brain’s innate inductive biases, like trying to directly replicate the concept of “cell types” in the brain, are usually not very effective in artificial neural networks. In my opinion, these features are essentially consequences of the brain’s suboptimal learning algorithms (over developmental time scales): the brain has to off-load a significant chunk of the optimization burden to evolution, which needs to craft these intricate cell types to compensate for the suboptimality of learning within a lifetime. Learning in artificial neural networks, on the other hand, is much more powerful; it is not constrained by all the things that biological learning is constrained by (for example, locality and limited individual lifespans), so it doesn’t need to resort to these kinds of tricks (like different innate cell types) to learn something functionally equivalent over the individual lifespan.

Does the brain have to do deep credit assignment?

In other words, does the brain have to do something like backprop? For those who are already familiar with this literature, my short answer is that no, the brain doesn’t have to do, and it probably doesn’t do, deep credit assignment. In this post, I’d like to discuss two reasons that make me think so. I should note from the outset that these are not really “knock-out” arguments, but more like plausibility arguments.

I have to first clarify what exactly I mean by “deep credit assignment” or “something like backprop”. This is still not going to be very exact, but by this I basically mean a global credit assignment scheme that propagates precise “credit signals” from elsewhere in a deep network in order to compute a local credit signal. I thus include any gradient-based method (first-, second-, or higher-order) in this category, as well as imperfect, heuristic versions of it such as feedback alignment or weight mirrors. There are some borderline cases such as decoupled neural interfaces that compute credits locally, but also learn gradient estimates over longer timescales. I’m inclined to include these in the “deep credit assignment” category as well, but I would have to think a bit more carefully about it before doing so confidently.

Now moving on to the two reasons that make me think the brain probably doesn’t do “deep credit assignment”. The first reason is this. I think it is very natural to think that the brain should be doing something like backprop, because it is what works today! Deep neural networks trained with gradient descent have been enormously successful in a variety of important and challenging tasks, like object recognition, object detection, speech recognition, machine translation etc. But it is very important to also remember that these successes depend on the current hardware technology. The methods that work well today are methods that work well on current hardware. But the current hardware technology is constrained in a variety of ways that brains aren’t.

Take the example of memory. Fast memory (for example, on-chip memory) is extremely limited in size in the current chip technology (although this may be beginning to change with novel chip designs such as Graphcore’s IPU and Cerebras’s WSE). But, there is no reason to think that the brain is limited in the same way, because the brain has a completely different computational architecture!

What is the significance of this observation? Well, even today we know that there are promising alternatives to backprop training in deep nets that are currently intractable precisely because of memory constraints: I’m thinking, in particular, of the deep nets as Gaussian processes (GPs) perspective. Amazingly, this method doesn’t require any training in the usual sense. Inference is accomplished through a single forward pass just like in backprop-trained nets. The catch is that, unlike in backprop-trained nets, this forward pass doesn’t scale well with the data size: it requires manipulating huge matrices. To my knowledge, these methods currently remain completely intractable for, say, ImageNet scale data, where the said matrices become terabyte-sized (there’s this recent paper that carries out exact GP computations on a million data points, but they use low-dimensional datasets and they don’t use deep-net kernels in their experiments; exact GPs on high-dimensional data using deep-net kernels remain highly intractable, to the best of my knowledge).
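To make concrete what “a single forward pass with huge matrices” looks like, here is a toy sketch of exact GP regression using an arc-cosine kernel (Cho & Saul’s order-1 kernel, which corresponds to an infinite-width one-hidden-layer ReLU network; I’m ignoring bias/weight-variance constants). There is no gradient-based training anywhere: all the cost is in the n-by-n kernel solve, which is precisely what blows up at ImageNet scale:

```python
import numpy as np

def arccos_kernel(X, Z):
    """Order-1 arc-cosine kernel: the covariance function of an
    infinite-width one-hidden-layer ReLU network (up to scaling)."""
    nx = np.linalg.norm(X, axis=1)
    nz = np.linalg.norm(Z, axis=1)
    cos = np.clip((X @ Z.T) / np.outer(nx, nz), -1.0, 1.0)
    theta = np.arccos(cos)
    return np.outer(nx, nz) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # tiny "training set"
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
X_test = rng.normal(size=(50, 5))

# "Training" is just building and solving the kernel system once.
K = arccos_kernel(X, X) + 1e-2 * np.eye(200)   # observation-noise jitter
alpha = np.linalg.solve(K, y)
mean = arccos_kernel(X_test, X) @ alpha        # posterior predictive mean
print(mean.shape)  # (50,)
```

With n = 200 this is instant; with n in the millions, K alone is terabytes, which is the tractability wall the post refers to.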

As a side note here, I would like to express my gut feeling that there may be more tractable versions of this idea (i.e. training-free or almost training-free deep nets) that are not being explored thoroughly enough by the community. One simple idea I have been thinking about recently is the following. Take the architecture of your favorite deep neural network and set its parameters layer by layer using only information inherent in the training data, say, a large image dataset. This would work as follows. Suppose the network has k filters in its first (convolutional) layer. We can either crop k random patches of the appropriate size from the training images, or do something more intelligent: exhaustively crop all non-overlapping patches of the appropriate size, reduce them to k representatives with something like k-means clustering, and set those to be the first-layer filters. This fixes the first-layer parameters (assuming the biases are zero). We can iterate this process layer by layer, at each layer clustering the previous layer’s activations across the entire training data and then basically “pasting” the cluster centers into the layer weights. Note that even though there is learning in this scheme (after all, we are using the training data), it is minimal and non-parametric (we’re doing only k-means, for example), and nothing like a gradient-based learning scheme. I think it would be interesting to find out how well one can do with a scheme like this, which uses minimal learning and utilizes almost exclusively the prior information inherent in the network architecture and the training data.
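The first-layer step of this scheme is easy to sketch. The snippet below is a hypothetical toy (random stand-in images, a plain hand-rolled k-means): it crops random patches and uses the cluster centers as the first-layer filter bank; a real version would iterate this over layers as described above:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute means.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers

def init_first_layer_filters(images, k=16, patch=5, n_patches=2000, seed=0):
    """Set first-layer conv filters to k-means centers of random image
    patches -- no gradients involved, only the data itself."""
    rng = np.random.default_rng(seed)
    H, W = images.shape[1:3]
    patches = []
    for _ in range(n_patches):
        i = rng.integers(len(images))
        r = rng.integers(H - patch + 1)
        c = rng.integers(W - patch + 1)
        patches.append(images[i, r:r + patch, c:c + patch].ravel())
    centers = kmeans(np.array(patches), k, seed=seed)
    return centers.reshape(k, patch, patch)   # filter bank for layer 1

images = np.random.default_rng(1).normal(size=(32, 28, 28))  # fake grayscale data
filters = init_first_layer_filters(images)
print(filters.shape)  # (16, 5, 5)
```

On real images, the centers typically come out looking like oriented edges and blobs, which is part of what makes the idea appealing.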

So, this was the first reason why I think the brain probably doesn’t do something like backprop, to wit: backprop seems to me too closely wedded to the current hardware technology. My hunch is that there are many more interesting, novel (and probably more biologically plausible) ways of building intelligent systems that don’t require anything like backprop, but we’re currently not exploring or considering these because they remain intractable with current hardware (large scale GPs being one concrete example).

The second reason is that even with the current hardware we have some recent hints that purely local learning schemes that don’t require any deep credit assignment can rival the performance of backprop training in realistic tasks (and if the brain doesn’t have to do something, the chances are it’s not going to do it!). I’d like to mention two recent papers, in particular: Greedy layerwise learning can scale to ImageNet by Belilovsky et al. and Training neural networks with local error signals by Nokland and Eidnes. These papers both introduce completely local, layerwise training schemes and show that they can work as well as end-to-end backprop in standard image recognition problems.
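To give a flavor of what “local error signals” means (this is a toy caricature of the idea, not the actual method from either paper): each block below is trained against its own throwaway linear head on the task loss, gradients never cross block boundaries, and once a block is trained it is frozen and its activations feed the next block:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def train_layer_locally(h, y, width=32, lr=0.05, steps=500, seed=0):
    """Train one ReLU layer plus a throwaway local linear head on a
    squared loss. Gradients stay entirely within this block."""
    rng = np.random.default_rng(seed)
    d, n = h.shape[1], len(h)
    W = rng.normal(size=(d, width)) / np.sqrt(d)
    V = rng.normal(size=(width, 1)) / np.sqrt(width)
    for _ in range(steps):
        z = h @ W
        a = relu(z)
        err = a @ V - y                     # local prediction error
        dV = a.T @ err / n
        dZ = (err @ V.T) * (z > 0)          # backprop within this block only
        dW = h.T @ dZ / n
        W -= lr * dW
        V -= lr * dV
    return W, float(np.mean((relu(h @ W) @ V - y) ** 2))

# Greedy stack: freeze each trained layer, feed its activations forward.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 10))
y = np.sin(X[:, :1])                        # toy regression target
h, losses = X, []
for layer in range(3):
    W, loss = train_layer_locally(h, y, seed=layer)
    h = relu(h @ W)                         # frozen features for next block
    losses.append(loss)
print(losses)
```

The real papers use convolutional blocks, auxiliary classifier heads, and image datasets, but the structural point is the same: no error signal ever travels through more than one block.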

Although these results are impressive, I still consider them initial efforts in a way. I feel pretty confident that if more collective effort is put into this field, even better local training schemes will be discovered. So, to me these results suggest that the problems we typically solve with end-to-end deep learning these days may not be hard enough to require the full force of end-to-end backprop. Furthermore, with each new clever local learning trick, we will discover these problems to be even easier than we had previously imagined, in the end coming full circle: from a time when we considered computer vision to be an easy problem, to discovering that it is in fact hard, to discovering that it isn’t that hard after all!

Update (01/16/2020): I just found out that the idea I described in this post for building a gradient descent-free image recognition model using k-means clustering was explored in some early work by Adam Coates and colleagues with promising results (for example, see this paper and this). They use relatively small datasets and simple models in these papers (early days of deep learning!), so maybe it is time to reconsider this idea again and try to scale it up.