Severely Theoretical

Machine learning, computational neuroscience, cognitive science

Catastrophic forgetting is yet another pet problem rendered obsolete by scale

For a while now, much of academic ML research has basically been a stubborn refusal to acknowledge the blindingly obvious, undeniable fact that scale renders most of the pet problems of this field obsolete: few-shot learning, compositionality, out-of-distribution generalization, “meta-learning”, disentanglement etc. I wrote about these issues in several earlier posts (e.g. this, this, and this). These so-called problems are simply artifacts of the small scales and the toy settings researchers choose to study, so researchers should just stop worrying about these non-problems already (and wasting their and other people’s energy and money) now that it’s clear they will disappear at larger scales and in more realistic settings. I was reminded of this once again after reading this beautiful anonymous ICLR submission that shows that catastrophic forgetting also belongs to this grisly junkyard. That’s right, catastrophic forgetting is not a real problem for large models trained on large, rich, realistic datasets. So, can people please stop writing pointless papers on this non-problem masquerading as a problem in meaningless toy settings now? Thank you.

How much “human-like” visual experience do current self-supervised learning algorithms need to achieve human-level object recognition?

I just posted a new paper to arxiv where I try to estimate the sample efficiency of the state-of-the-art self-supervised visual representation learning algorithms vis-a-vis humans in a complex, realistic visual object recognition task, namely ImageNet. I come up with an estimate that suggests that compared to humans these algorithms would need several orders of magnitude more “human-like”, natural video data in order to reach human-level performance in ImageNet. This is a very challenging estimation problem and my estimate thus comes with a lot of caveats (I discuss some of the main caveats in the paper), but it is the first serious, quantitative attempt to address this important question that I know of.

Ditching academic research in AI/ML

The news of the existence of at least one collusion ring in the AI/ML conference peer-review system has made some waves recently (here and here are two recent reddit threads on this topic). What would be the most meaningful response to this kind of explicit fraud in the system? In this post, I’d like to express some possibly unpopular and uncomfortable opinions (which is something I like to do in general apparently :)) and toy with some radical ideas/suggestions for improving the overall AI/ML research ecosystem.

First of all, it’s important to realize that people respond to incentives. Although, of course, pointing this out doesn’t absolve individual culpability, issues like this point to systemic problems that need to be addressed systemically. It is hard to imagine something like this happening, for instance, if conferences weren’t such a high-stakes game in AI/ML research. So, we have to ask ourselves why the stakes are so high. Michael Littman’s article partially answers this question:

… stakes are high because acceptance rates are low (15%–25%), opportunities for publishing at any given conference are limited to once a year, and publications play a central role in building a researcher’s reputation and ultimate professional success. Academic positions are highly competitive, so each paper rejection—especially for graduate students—has a real impact on future job prospects. Some countries correlate promotion and salary decisions to the number of papers accepted at a specific set of high-profile conferences (and journals).

Why are academic positions highly competitive? It’s because there are too many candidates for too few positions. These too many candidates produce too many papers, too many of which are, to put it bluntly, worthless. Even when these papers are technically sound, they don’t address any interesting or important problems; they propose simplistic ideas in the context of toy problems that obviously won’t pan out for any sufficiently interesting and important large-scale realistic problem. The sad truth is that even if these papers are accepted by a conference, they won’t be read by anybody, won’t provide any benefit for any practical use, and won’t even have any tangible impact whatsoever on the field in the long run. There’s no reason for anybody to waste their time on papers like these, other than the Machiavellian reasons touched upon by Littman (basically to signal to their potential employers that they are “productive” and to chase after power, prestige, and money). There’s no good reason for the public to fund this kind of unproductive research with taxpayers’ money.

It could be argued that this situation is inevitable: most ideas will lead to dead ends, only a very small number of ideas will win out in the long run through a process of natural selection of ideas. But, this is not true: yes, some ideas will, of course, not pan out in the long run, but the current quality/quantity combination for research outputs in AI/ML is clearly not ideal. In my opinion, an alternative research landscape more or less exclusively dominated by a small number of large industry labs like OpenAI, Google Brain, FAIR, etc. as opposed to a large number of small academic labs would clearly land us at a much more favorable position in the space of quality/quantity of research outputs, so the current situation is not inevitable.

This problem, by the way, isn’t specific to AI/ML research, it afflicts most of academia, but probably becomes especially acute when a field becomes “hot.” I sometimes genuinely wonder: at what point do academics in general admit that their field is basically artificially driven by government money and by irrational incentives and rent-seeking behavior? That there are just too many people employed in their field going after too many unproductive, obviously flawed ideas, or uninteresting, insignificant questions? Perhaps the answer is never, because as Upton Sinclair once observed, “it is difficult to get a man to understand something when his salary depends on his not understanding it.” Can academics really justify that they should get this money instead of a public school, or a public hospital, or a homeless shelter, for instance?

What is my proposal then? What would a more rational system look like? First of all, I suggest that there should be a lot fewer people working professionally in AI/ML research. In recent years, most of the interesting and impactful work in this field has come from large industry labs that have the resources to run large scale experiments, so perhaps they should employ the overwhelming majority of the people working professionally in the field. This would mean basically winding down most of the low-impact academic research in AI/ML. Also, in a more rational research landscape, a lot more collective effort/resources than now would be spent on improving hardware and collecting/curating data.

For the rest, I propose a system similar to the marketplace for music production/consumption. The barriers to entry into the field aren’t very high in AI/ML research. Fortunately, large industry players generally share their tools/models publicly. Obviously, they can always do a better job in this respect, for example by making their internal large scale datasets public, by making large scale compute more affordable, more readily accessible to amateur researchers. Motivated amateurs would then produce “content” using these tools and share it publicly: if you think you built something cool, you should just put it out there: write up what you did in a report, put it on arxiv, put your models and code on github in an easily accessible format for others to use and most importantly, make demos to get people excited. If you really did something cool, people will notice it, including prospective professional employers. This would then be the motivated, talented amateur’s ticket to a professional career in AI/ML research.

As this system would eliminate most academic research in the field, there wouldn’t be any need for conferences/journals (of course, conferences could still be organized to meet with people and discuss ideas in person, but they would be a much more informal affair, perhaps more like workshops today). Peer review would be carried out publicly in the marketplace of ideas. There would probably be much less output overall, and whatever output is produced would be more likely to be interesting and impactful, because it would be produced by people genuinely driven to create something interesting and useful to others.

A good yardstick that I like to think about in this connection is OpenAI. Wikipedia says they employ over 120 people. Now, I don’t know how many of those are involved in research, but let’s say ~100. It’s probably safe to say that these are some of the smartest, most talented people in the field. Yet, if we consider their research output quantitatively, it’s not that much. Every year, they put out only a handful of extremely high-impact, high-quality papers/products, like GPT-3, DALL-E, CLIP etc. If the very same set of people were employed in academia instead, they’d probably produce at least one or two orders of magnitude more papers between them, but these papers would be much much less impactful and lower in quality, again attesting to the irrational, unproductive incentive structure of academia.

I should make it clear that I’m not advocating winding down AI/ML education in academia, just research. In fact, education could be the main legitimate purpose of academia under this system. I should also make it clear that I’m not suggesting this system as a model for research in all fields. Some fields with higher technical barriers for research (for example, molecular biology) clearly produce very useful, practical knowledge and/or make meaningful contributions to our understanding of nature (although as I mentioned above, I think the same bad incentives are at play in most places in academia to some degree, so shrinking the size of academic research in general would perhaps not be such a bad idea).

I know at least two other fields quite intimately: cogsci/psychology and neuroscience. Now, I’m going to make an extremely incendiary claim and suggest that research in neither of these fields has produced anything of much value in our understanding of how the mind/brain works and so both deserve a significant shrinkage in size in academia as well. It’s not an exaggeration to say that I have personally learned a lot more about the nature of intelligence, cognition, perception and about how our brains might be doing all these things (supposedly the main subject matter of psychology/neuroscience) from the deep learning research that came out in the last 5-10 years than from decades of simplistic, unfruitful, and sometimes frankly straight up silly psychology/neuroscience research (I’d be extremely willing to debate this issue with anybody who has a different opinion about it). I humbly but sincerely suggest that as a first small step toward improving itself, psychology/neuroscience research can start by putting an indefinite moratorium on the mind-numbingly and soul-crushingly dull and uninteresting left-right random dot motion discrimination task and all its equally uninteresting and insignificant variants. Please do it!

Pinker on why humans are smarter than rats

I’ve been reading Steven Pinker’s The Blank Slate and was struck by a passage I wanted to share. Early in the book, Pinker takes up the question of what makes humans smarter than rats, a question originally posed by Rumelhart and McClelland in the famous PDP book. Rumelhart and McClelland’s answer is to point out: (1) humans have a much bigger cortex than rats and (2) humans and rats live in very different milieus, the human milieu being much more culture-laden than the rat milieu:

Pinker finds this answer, especially the first component (that the human cortex is basically a massively scaled-up version of the rat cortex), patently wrong and even ridiculous, so much so that he goes on to mock this idea several times in later chapters.

Now, I don’t know if this hypothesis (that the human cortex is, to a good approximation, just a scaled-up version of the rat cortex) is true or false. But, it doesn’t strike me as obviously false. Pinker is clearly underestimating the computational power of the sheer scaling-up of the model size here (even without a concomitant increase in data size and diversity or an increase in training time). The human cortex has roughly three orders of magnitude more neurons than the rat cortex. Assuming a similar level of connection sparsity between the two species, this would translate into a whopping six orders of magnitude difference in the number of synapses, or “parameters” (the assumption of similar connection sparsity levels in the human and rat cortices is probably unrealistic; I expect the actual scaling factor for the number of synapses to be somewhere between three and six orders of magnitude, but I couldn’t find a reliable estimate for this). If we learned one thing from recent results in machine learning, it is that increases in model size on this scale can lead to very large, qualitative changes in model behavior. Here’s an example from the GPT-3 paper:

Note that the x-axis in this figure covers a range that is roughly three orders of magnitude in size, hence it would likely be an underestimate of the analogous human vs. rat difference. Note also that in many individual tasks (the faint lines), the model goes through what appears to be a qualitative shift in performance as the model size is increased, with the smaller models performing at near-zero accuracy while the largest one performs at much higher accuracy.
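To make the back-of-the-envelope synapse estimate explicit, here is a minimal sketch of the arithmetic. It assumes (my simplification, not a measured fact) that synapse count scales as a power of neuron count: an exponent of 2 corresponds to the "same connection sparsity" case, and an exponent of 1 to each neuron keeping a fixed number of synapses. The function name is made up for illustration.

```python
import math

def synapse_scaling_orders(neuron_ratio: float, exponent: float) -> float:
    """Orders of magnitude by which the synapse count grows when the neuron
    count grows by `neuron_ratio`, assuming synapses ~ neurons ** exponent."""
    return exponent * math.log10(neuron_ratio)

# Rough human-vs-rat cortical neuron ratio used in the text (three orders of magnitude).
neuron_ratio = 1e3

# Exponent 2: connection density stays fixed, so synapses grow like N^2.
same_sparsity = synapse_scaling_orders(neuron_ratio, 2.0)

# Exponent 1: each neuron keeps a fixed number of synapses, so synapses grow like N.
fixed_fan_in = synapse_scaling_orders(neuron_ratio, 1.0)

print(f"same sparsity: {same_sparsity:.0f} orders of magnitude")
print(f"fixed fan-in:  {fixed_fan_in:.0f} orders of magnitude")
```

The two cases bracket the "somewhere between three and six orders of magnitude" range; where the real cortex falls within that range is exactly the part I couldn't find a reliable estimate for.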

A seemingly innocuous but actually striking prediction from this kind of model size scaling effect is that bigger models should be broadly better than smaller ones across a diverse range of tasks. The individual tasks in the figure above, for example, represent a broad range of text-based tasks, but this would be true for even more different tasks possibly involving other modalities. For example, if we were to plug in visual inputs to the models plotted above and train them on some visual tasks, the larger models would still outperform the smaller ones.

As I have recently learned from Robert Plomin’s excellent book, Blueprint, this prediction in fact turns out to be true even when we just consider individual differences between humans (so no need to make cross-species comparisons), that is, people who are good at a particular cognitive-perceptual task often tend to be good at other seemingly completely unrelated perceptual-cognitive tasks as well, and these correlations are driven by what Plomin calls “generalist genes”, i.e. genes that have diffuse effects on a broad range of cognitive abilities.
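A toy simulation can illustrate why variation in a single generic property would produce this kind of across-the-board correlation. Everything here is an assumption made for illustration: each simulated person gets one "capacity" number (a stand-in for something like model size), and each task score is that capacity plus independent task-specific noise. This is not a model of Plomin's data.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def pairwise_correlations(scores):
    """Correlations between every pair of tasks (one score list per task)."""
    return [pearson(scores[i], scores[j])
            for i in range(len(scores))
            for j in range(i + 1, len(scores))]

rng = random.Random(0)  # fixed seed so the run is repeatable
n_people, n_tasks = 2000, 4

# One generic capacity per simulated person (hypothetical stand-in for "model size").
capacity = [rng.gauss(0.0, 1.0) for _ in range(n_people)]

# Each task score = shared capacity + independent task-specific noise.
shared_scores = [[g + rng.gauss(0.0, 1.0) for g in capacity]
                 for _ in range(n_tasks)]

# Control: tasks that share nothing across people.
independent_scores = [[rng.gauss(0.0, 1.0) for _ in range(n_people)]
                      for _ in range(n_tasks)]

shared_corrs = pairwise_correlations(shared_scores)
independent_corrs = pairwise_correlations(independent_scores)

print("shared capacity: ", [round(r, 2) for r in shared_corrs])
print("no shared factor:", [round(r, 2) for r in independent_corrs])
```

With equal variance for the capacity and the noise, the expected correlation between any two tasks is 0.5, even though the tasks have nothing else in common; without the shared factor, the correlations hover around zero.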

This result is easy to explain if we assume that individual differences between the brain structure of different people relate to innate, but very generic properties, like the number of neurons or the number of connections etc., because as mentioned above a strong correlation between performance in a diverse range of tasks is exactly what you would expect under the scenario of variation in such generic properties like model size. The same result is, however, very hard to explain under a Pinkerite innate-specialized-modularist account of the human brain. I want to highlight a few relevant and important quotes from Plomin touching on this issue:

I think this example from Pinker is unfortunately not an isolated example. Psychologists often don’t have solid, reliable intuitions about the computational complexity of the perceptual and cognitive problems humans face and the importance of various factors such as model size and data size and diversity on performance in these problems. I would actually go so far as to suggest that the entire psychology literature is replete with cases where psychologists make unfounded and unjustified poverty of the stimulus claims based on their unreliable, incorrect intuitions about these computational questions. I hope to write more about this important issue some time in the near future.

AI research, wise passiveness, and negative capability

AI research needs more wise passiveness and negative capability. Wise passiveness is an idea introduced by William Wordsworth in his poem Expostulation and Reply. This poem appears as the first poem in his famous Lyrical Ballads. In the poem, Wordsworth advocates a quiet receptiveness, a passive, non-systematizing openness to the world:

The eye–it cannot choose but see;
We cannot bid the ear be still;
Our bodies feel, where’er they be,
Against or with our will.

Nor less I deem that there are Powers
Which of themselves our minds impress;
That we can feed this mind of ours
In a wise passiveness.

Think you, ’mid all this mighty sum
Of things for ever speaking,
That nothing of itself will come,
But we must still be seeking?

Here is a longer, superb dissection of the whole poem. Wordsworth invites us to simply listen to the world as it unceasingly speaks to us; then, perhaps we wouldn’t even have to seek knowledge from extraneous, indirect sources like books and/or dead men, which can be interpreted as tradition or received wisdom more generally.

John Keats entertained a similar idea with his concept of negative capability: “… capable of being in uncertainties, mysteries, doubts, without any irritable reaching after fact and reason.” As Crichlow Goellnicht explains, this means a passive, receptive “acceptance of the world in all its diverse aspects, without having to analyze, rationalize, and categorize those aspects, without having to explain away every mystery and doubt, without having to fit everything into a neat, philosophical system.” The reason Keats called this negative capability is presumably because it involves being at peace with uncertainty, doubt, mystery, vagueness, murkiness, and ambiguity, all concepts with at least some degree of negative connotation.

Where am I going with all this? What does any of this have to do with AI or machine learning? Here’s the connection: I think there are whole subfields in AI and machine learning research centered around ideas or concepts that cease to make a whole lot of sense if we become more receptive (gently, passively receptive) to the irreducible richness and complexity of the world without trying to impose our own preconceived theories or ideas on it. I think ideas such as disentanglement, objects, part-whole hierarchies, compositionality, etc. all belong to this unfortunate genre. These are all an educated person’s folk theories about how the world works. The real world and our minds are invariably infinitely more complicated and interesting than can be adequately captured by folk theories like these.

I’d like to end this short post by recommending a few other readings that have argued for a similar non-reductionist view of the world and the mind that embraces their full richness and complexity:

The Bitter Lesson by Rich Sutton (of course :))

Reality has a surprising amount of detail by John Salvatier (h/t Eric Jang)

Science and Engineering for Learning Robots by Eric Jang

On Chomsky and the Two Cultures of Statistical Learning by Peter Norvig

The Unreasonable Effectiveness of Data by Halevy, Norvig, and Pereira

On the Origin of Objects by Brian Cantwell Smith (please be warned that this book may be a bit too philosophical, too “lyrical” 🙂 for a working scientist)

Is compositionality/systematic generalization really a problem for neural networks?

In my last post, I discussed two issues that are widely considered to be serious problems for deep learning models: generalization and few-shot learning (more specifically, meta-learning as a proposal for performing few-shot learning). I argued that these are only problems when we consider small models trained with very limited amounts of data. In this post, I’d like to give one more example of this kind of thing: compositionality or systematic generalization. I’ll again argue that this is only a problem when we consider small toy domains without a lot of structure. It’ll mostly cease to be a problem when we start thinking about the much richer structure of the world we live in, and of our bodies and minds (including our language) that inherit this richness.

There are by now probably more than a dozen benchmarks that evaluate slightly different notions of compositionality or systematic generalization: e.g., SCAN, gSCAN, CURI, COGS, PCFG SET, BabyAI, CLOSURE, SQOOP etc. to name just a few that I’m most familiar with. A common feature shared by most of these benchmarks is that they take place in simple, toy domains without a lot of “affordances”, which necessarily restricts the abundance and richness of the linguistic and semantic/conceptual structures that can be created in them. Some of these benchmarks use natural language or something close to it (e.g., COGS, CFQ), so they don’t necessarily suffer from this particular shortcoming, although they may have other potential weaknesses, like not having a large enough training set or the target task involving a somewhat arbitrary and artificial semantic form (but this is a separate discussion).

For example, a common evaluation condition in these benchmarks is to generalize from just a handful of combinations like x_1 \circ y and x_2 \circ y (e.g., eat furiously and read furiously) to a novel combination x_3 \circ y (e.g., sleep furiously), where x_3 is assumed to be learned from other contexts and x_1, x_2, x_3 are usually the only items of their kind in the domain (e.g., actions). But why do we even expect something like this to work? The world we live in, the world inside our minds (our conceptual world), and our language are nothing like this barren landscape.
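To make the evaluation condition concrete, here is a minimal sketch of how such a held-out-combination split could be constructed. The vocabulary is hypothetical (the adverb "quietly" is added only so that the held-out verb appears in some other context), and no actual benchmark is claimed to be built exactly this way.

```python
from itertools import product

# Hypothetical toy vocabulary: three actions (x_1, x_2, x_3) and two modifiers.
verbs = ["eat", "read", "sleep"]
adverbs = ["furiously", "quietly"]

# The novel combination x_3 ∘ y that the model never sees during training.
held_out = ("sleep", "furiously")

all_pairs = list(product(verbs, adverbs))
train_pairs = [pair for pair in all_pairs if pair != held_out]
test_pairs = [held_out]

# Both parts of the held-out combination still occur in training,
# just never together.
verb_seen = any(verb == held_out[0] for verb, _ in train_pairs)
adverb_contexts = [verb for verb, adverb in train_pairs if adverb == held_out[1]]

print("train:", train_pairs)
print("test: ", test_pairs)
print("contexts for the held-out modifier:", adverb_contexts)
```

The entire evidence about how "furiously" composes is the two training contexts in `adverb_contexts`, which is the barrenness the paragraph above is pointing at.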

When we infer the meaning of a novel combination like sleep furiously, we don’t just have two other actions, eat and read, to rely on. Instead, we have an immensely rich, interconnected web of concepts that we bring to bear on this task. An average English speaker knows tens of thousands of words and our conceptual world is presumably much richer than this number would indicate, because there are no single words for many of our concepts and some of our concepts are altogether difficult to precisely articulate in language. But more than its sheer size, what gives this conceptual web its true richness and power is its highly interconnected and structured nature. For example, among the dizzying, almost stupefying range of things we know about sleeping is the fact that it can sometimes involve restless states, wild movements, hellish nightmares, intense dreams, loud snoring etc., which are all associated with the concept of fury, or the state of being furious, through various more or less circuitous conceptual routes, so we could easily imagine what it would be like to sleep furiously by tracing these routes, even if we heard this particular combination for the first time.

And when applied at scale, neural networks are in fact remarkably good at capturing and utilizing these kinds of associations to make sense of novel combinations. Recent large scale deep learning models like DALL-E and GPT-3 are very good demonstrations of this in my view. Look at the remarkable agility and accuracy with which DALL-E seems to make sense of novel combinations like “a store front that has the word ‘openai’ written on it” (we know that this is a novel combination, because it doesn’t exist in the real world):

Or consider this utterly mind-blowing demonstration of the compositional skills of GPT-3 (source):

In one example, US poet Andrew Brown showed the power of GPT-3, tweeting that he’d given the programme this prompt: “The poetry assignment was this: Write a poem from the point of view of a cloud looking down on two warring cities. The clever student poet turned in the following rhyming poem:”

GPT-3 responded:

“I think I’ll start to rain,

Because I don’t think I can stand the pain,

Of seeing you two,

Fighting like you do.”

And even in simpler, toy domains, which common compositionality benchmarks often focus on, there’s some recent evidence suggesting that simply scaling up the size and diversity of these domains can solve many of the splits in these benchmarks that may seem superficially challenging in smaller scale versions (e.g., Kagitha, 2020; Hill et al., 2020).

It could be argued that these models require too much data to achieve these compositional skills, hence they’re not nearly as sample efficient as humans, for instance. Therefore, the argument goes, the main goal of this field should be to come up with useful inductive biases that would improve the sample efficiency of the models in acquiring these compositional generalization abilities. But, these kinds of comparisons with humans are a bit misleading in my mind because of the radically different nature of the inputs that humans receive (e.g., multimodal, embodied, and embedded in a much richer world). Perhaps, the seemingly greater demand for data these models require is simply an illusion created by the fundamentally different nature of the inputs.

On the futility of trying to be clever (the bitter lesson redux)

The bitter lesson of history in AI is that “general methods that leverage computation are ultimately the most effective, and by a large margin.” There are various manifestations of our unfortunate unwillingness to learn this bitter lesson. Sutton focuses on one in his essay: trying to leverage human knowledge, trying to build in “how we think we think”, which “does not work in the long run”, because “the actual contents of minds are tremendously, irredeemably complex.” There are others: trying to come up with clever algorithmic ideas and hacks to eke out a small advantage in a narrow domain and in the short run. This describes the overwhelming majority of current research in machine learning and AI (including some of my own). It is an irresistible temptation with strong incentives behind it, but it is ultimately misguided and is not what leads to long-term progress and meaningful impact. In this post, I’ll give two recent examples from deep learning: domain generalization and meta-learning.

Generalization is often considered to be one of the biggest problems for deep learning. You have some data. You have a model. You train the model on the data. Fine. Then, you get some new data that’s different from the training/test data you used before but you feel that it’s similar to the previous data in some fundamental respect and that the model should be able to handle it (just to be concrete here, let’s say you trained your model on natural images and want it to generalize to drawings or paintings of the same kinds of things), because look, we humans don’t have any problem making these kinds of seemingly non-trivial generalizations! So, you try your trained model on the new data and it fails miserably. That’s, of course, disappointing. Then, researchers spend an inordinate amount of effort trying to come up with ever cleverer algorithmic or architectural schemes to make models generalize a tiny bit better to novel data/domains given the same fixed (and crucially often relatively small) training data. But, what if this whole enterprise is misguided? Why are we assuming that our training data is fixed and small? And what if there’s simply no clever algorithmic or architectural shortcut to training our models on very large, diverse datasets (if we want to have models that can generalize well)? There’s certainly strong prima facie evidence that this may well be the case.

Take invariant risk minimization (IRM), one of the more popular domain generalization methods proposed recently. IRM considers a classification problem that takes place in multiple domains or environments, e_1, e_2, …, e_E (in an image classification setting, these could be natural images, drawings, paintings, computer-rendered images etc.). We decompose the learning problem into learning a feature backbone \Phi (a featurizer), and a linear readout \beta on top of it. Intuitively, in our classifier, we only want to make use of features that are invariant across different environments (for instance, the shapes of objects in our image classification example), and not features that vary from environment to environment (for example, the local textures of objects). This is because the invariant features are more likely to generalize to a new environment. We could, of course, do the old, boring empirical risk minimization (ERM), your grandmother’s dumb method. This would simply lump the training data from all environments into one single giant training set and minimize the loss on that, with the hope that whatever features are more or less invariant across the environments will automatically emerge out of this optimization. Mathematically, ERM in this setting corresponds to solving the following well-known optimization problem (assuming the same amount of training data from each domain):

\min_{\Phi, \hat{\beta}} \frac{1}{E} \sum_e \mathfrak{R}^e(\Phi, \hat{\beta}), where \mathfrak{R}^e is the empirical risk in environment e.
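As a small sanity check on the "lump everything together" description, the sketch below evaluates the ERM objective on made-up data and confirms that, with the same amount of data in each environment, averaging the per-environment risks is the same as computing the risk on the pooled data. The squared-error loss, the scalar identity featurizer, and the data points are all assumptions chosen for illustration.

```python
def risk(data, phi, beta):
    """Mean squared error of the predictor beta * phi(x) on (x, y) pairs."""
    return sum((beta * phi(x) - y) ** 2 for x, y in data) / len(data)

# Made-up data: two environments with the same number of examples.
environments = {
    "e1": [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)],
    "e2": [(1.0, 0.8), (2.0, 2.3), (3.0, 2.9)],
}

def phi(x):
    """Hypothetical featurizer: the identity on a scalar input."""
    return x

beta = 1.0  # an arbitrary readout at which to evaluate the objective

# ERM objective as written above: the average of the per-environment risks.
erm_objective = sum(risk(d, phi, beta) for d in environments.values()) / len(environments)

# Risk on one giant training set formed by concatenating the environments.
pooled = [pair for d in environments.values() for pair in d]
pooled_risk = risk(pooled, phi, beta)

print(erm_objective, pooled_risk)
```

With unequal environment sizes the two quantities would differ by a reweighting of the environments, which is why the equal-data assumption is stated alongside the formula.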

IRM proposes something much more complicated instead: why don’t we learn a featurizer with the same optimal linear readout on top of it in every environment? The hope is that in this way, the extractor will only learn the invariant features, because the non-invariant features will change from environment to environment and can’t be decoded optimally using the same fixed readout. The IRM objective thus involves a difficult bi-level optimization problem:

\min_{\Phi, \hat{\beta}} \frac{1}{E} \sum_e \mathfrak{R}^e(\Phi, \hat{\beta}) s.t. \hat{\beta} \in \arg \min_{\beta}\mathfrak{R}^e(\Phi, \beta) for all environments e.
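The constraint in this objective can be illustrated with a toy check (not an implementation of IRM training). For a fixed scalar featurizer and squared-error loss, the optimal readout in each environment has a closed form, so we can ask whether one readout is optimal everywhere. The data generator, the two candidate featurizers, and the helper names are all hypothetical.

```python
def optimal_readout(data, phi):
    """Least-squares scalar readout for the feature phi(x) in one environment."""
    num = sum(phi(x) * y for x, y in data)
    den = sum(phi(x) ** 2 for x, _ in data)
    return num / den

def make_env(scale, ys=(1.0, 2.0, 3.0)):
    """Toy environment: x = (invariant feature, spurious feature).
    The invariant feature equals y; the spurious one is y times an
    environment-specific scale."""
    return [((y, scale * y), y) for y in ys]

def satisfies_constraint(readouts, tol=1e-9):
    """True if a single readout is optimal in every environment."""
    return max(readouts) - min(readouts) <= tol

envs = [make_env(0.5), make_env(2.0)]

def phi_invariant(x):
    return x[0]

def phi_spurious(x):
    return x[1]

readouts_invariant = [optimal_readout(env, phi_invariant) for env in envs]
readouts_spurious = [optimal_readout(env, phi_spurious) for env in envs]

print(readouts_invariant, satisfies_constraint(readouts_invariant))
print(readouts_spurious, satisfies_constraint(readouts_spurious))
```

In this noise-free toy the invariant featurizer passes the check and the spurious one fails it, which is the intuition behind the objective. It says nothing about whether the full bi-level problem finds such a featurizer in realistic settings.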

In my view, it should always ring an alarm bell in your mind if your proposed method involves solving a gnarly optimization problem, because it suggests that it may not be a general, scalable method. But is it at least effective at extracting those invariant features? Or does it at least work better than your grandmother’s dumb ERM in this respect? It turns out the answer is a decisive no! IRM fails utterly and completely in this respect. In a recent ICLR paper, Rosenfeld et al. show that in the linear case, IRM will fail to extract the invariant features except in some unrealistic settings where basically anything will work, and in the non-linear case, it won’t work any better than ERM in finding the invariant classifier (please see the paper for a more precise statement of the results).

IRM assumes the existence of a featurizer \Phi where the expectation \langle Y|\Phi(X) \rangle is invariant across environments. Inspired by IRM, even stronger constraints have been imposed in the literature, for example, demanding that the whole distribution p(Y|\Phi(X)) be invariant instead. Rosenfeld et al. show that these methods will also fail to work any better than ERM for similar reasons.

Another ICLR paper this year by Gulrajani and Lopez-Paz (incidentally, two of the co-authors of the original IRM paper) reaches the same conclusion through a series of carefully conducted experiments: when compared head-to-head, no fancy, bespoke, boutique domain generalization algorithm (and they have now evaluated more than a dozen algorithms) significantly outperforms ERM. This paper also emphasizes the importance of specifying a model selection method as an integral component of domain generalization algorithms.

Of course, these results don’t prove that it is impossible to beat ERM in domain generalization (I would be eternally grateful to anybody who proves a result like this), but they do suggest to me that ERM is a very simple, general, effective method that will be hard to beat by a significant margin. So, I think it is prudent for researchers to keep this in mind when deciding how to spend their research efforts most productively.

The second example I’d like to give is meta-learning, another hot topic in machine learning replete with clever ideas. First, a word of caution: people unfortunately use the term “meta-learning” in quite different senses in machine learning. Sometimes it’s used to refer to a multi-loop optimization process (as in MAML) and sometimes it should really just be called “multi-task learning” (or how about simply “learning”), but “meta-learning” (or worse still “learning to learn”) is used presumably because it sounds more sophisticated and impressive. I just want to make it abundantly clear that here I’ll be talking about meta-learning in the first sense only, i.e. multi-loop optimization. This approach is often used for few-shot learning (another supposed shortcoming of deep learning models, which is again really just a shortcoming of small models trained with too little data), because it can directly target few-shot learning performance through inner loop optimization. The idea is that the outer loop optimizes the inner loop which directly corresponds to fast adaptation or few-shot learning performance when the inner loop is run for a small number of steps. But two recent papers, first by Raghu*, Raghu* et al. and second by Tian*, Wang* et al., show that in practice the inner loop run doesn’t really do much in these algorithms, so much so that one can safely do away with the inner loop entirely. This means that the success of these algorithms can be explained completely by standard (single-loop) learning on the entire lumped meta-training dataset. Another recent beautiful theory paper by Du et al. sheds some light on these experimental results.
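The difference between the two senses can be sketched in a few lines. The following is a deliberately tiny illustration, assuming a scalar model y_hat = theta * x, squared-error loss, and a single inner-loop step; all names and the task distribution are hypothetical. `meta_gradient` differentiates the query loss through the inner step (the multi-loop sense), while `lumped_gradient` is the ordinary single-loop gradient on all task data pooled together.

```python
import numpy as np

def loss_and_grad(theta, x, y):
    """Squared-error loss of y_hat = theta * x, and its gradient w.r.t. theta."""
    resid = theta * x - y
    return np.mean(resid ** 2), 2.0 * np.mean(x * resid)

def adapt(theta, x_s, y_s, inner_lr):
    """Inner loop: a single gradient step on one task's support set."""
    return theta - inner_lr * loss_and_grad(theta, x_s, y_s)[1]

def meta_objective(theta, tasks, inner_lr):
    """Outer-loop loss: mean query loss after adapting to each task."""
    return np.mean([loss_and_grad(adapt(theta, xs, ys, inner_lr), xq, yq)[0]
                    for xs, ys, xq, yq in tasks])

def meta_gradient(theta, tasks, inner_lr):
    """Gradient of meta_objective, obtained by the chain rule through the inner step."""
    grads = []
    for xs, ys, xq, yq in tasks:
        adapted = adapt(theta, xs, ys, inner_lr)
        query_grad = loss_and_grad(adapted, xq, yq)[1]
        # d(adapted)/d(theta) = 1 - inner_lr * L_s''(theta), and
        # L_s''(theta) = 2 * mean(x_s ** 2) for this quadratic loss.
        grads.append(query_grad * (1.0 - inner_lr * 2.0 * np.mean(xs ** 2)))
    return np.mean(grads)

def lumped_gradient(theta, tasks):
    """Single-loop alternative: plain gradient on all task data pooled together."""
    x = np.concatenate([np.concatenate([xs, xq]) for xs, _, xq, _ in tasks])
    y = np.concatenate([np.concatenate([ys, yq]) for _, ys, _, yq in tasks])
    return loss_and_grad(theta, x, y)[1]

def make_tasks(num_tasks, n, rng):
    """Each task is a line through the origin with its own slope."""
    tasks = []
    for _ in range(num_tasks):
        slope = rng.normal(1.0, 0.5)
        x_s, x_q = rng.normal(size=n), rng.normal(size=n)
        tasks.append((x_s, slope * x_s, x_q, slope * x_q))
    return tasks
```

With real networks the chain rule through the inner loop is what makes this approach expensive; the papers cited above suggest that the cheaper pooled gradient gets you most of the way in practice.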

Perhaps, at this point you feel that this post paints a very pessimistic (nihilistic even) picture of the machine learning/AI research landscape. Coming up with new, clever, creative algorithmic ideas is the bread and butter of computer scientists. If that’s a mostly pointless exercise, what is there left to do? First, I’d argue that there’s probably a significant difference between computer science in general and machine learning in this respect: while it is true that algorithmic innovation is central in computer science in general, by its very nature, it is supposed to be less important (although of course not totally pointless) in machine learning, because we’re off-loading a significant chunk of the burden to the machine itself! Second, a research topic is, of course, a deeply personal choice. Who am I to say what one should or should not work on? Who would even listen to me? But I do think that there are many interesting research directions consistent with the philosophy of the bitter lesson that can have more meaningful, longer-term impact than small algorithmic or architectural tweaks. I just want to wrap up this post by giving a couple of examples below:

(1) Probing the limits of model capabilities as a function of training data size: can we get to something close to human-level machine translation by leveraging everything multi-lingual on the web (I think we’ve learned that the answer to this is basically yes)? Can we get to something similar to human-level language understanding by scaling up the GPT-3 approach a couple of orders of magnitude (I think the answer to this is probably we don’t know yet)? Of course, smaller scale, less ambitious versions of these questions are also incredibly interesting and important.

(2) Finding out what we can learn from different kinds of data and how what we learn differs as a function of this: e.g. learning from raw video data vs. learning from multi-modal data received by an embodied agent interacting with the world; learning from pure text vs. learning from text + images or text + video.

(3) Coming up with new model architectures or training methods that can leverage data and compute more efficiently, e.g. more efficient transformers, residual networks, batch normalization, self-supervised learning algorithms that can scale to large (ideally unlimited) data (e.g. likelihood-based generative pre-training, contrastive learning).

An optimistic perspective on the human-AI nexus

For we know in part, and we prophesy in part.

But when that which is perfect is come, then that which is in part shall be done away.

When I was a child, I spake as a child, I understood as a child, I thought as a child: but when I became a man, I put away childish things.

For now we see through a glass, darkly; but then face to face: now I know in part; but then shall I know even as also I am known.

– 1 Corinthians 13:9-12

In the intelligence explosion scenarios, recursive self-improvement by an AI initially created by humans creates ever more intelligent progeny, making humans (and much else) “redundant” in short order by their absurdly, overwhelmingly superior intelligence. I don’t have a very definite view on how plausible or likely these scenarios are. It’s very likely that we simply don’t know enough about the nature of intelligence itself to even judge with any degree of reliability how likely these scenarios are (here is a humorous take I like that emphasizes this point); for those interested in these scenarios, David Chalmers does a good job of dissecting the argument for an intelligence explosion here.

In these scenarios, it is always assumed that it is inevitable that humans will be made redundant at some point, at least partly because of some hard constraints on our intelligence (usually something to do with our sloppy, slushy, and more or less fixed hardware, the brain). Chalmers puts it thus (p. 13): “Insofar as enhanced brains always rely on a biological core, however, there may be limitations. There are likely to be speed limitations on biological processing, and there may well be cognitive limitations imposed by brain architecture in addition. So beyond a certain point, we might expect non-brain-based systems to be faster and more intelligent than brain-based systems.”

In this post, I’d like to argue to the contrary that: just like we don’t know enough about the nature of intelligence itself to say anything useful about the possibility or the likelihood of a superintelligent AI, we also don’t know enough about the limits of our own human intelligence, especially human intelligence extended and enhanced by the non-superintelligent AI we’re creating, to claim with any degree of certainty that it will inevitably be superseded by a superintelligent AI.

The main point is that although, of course, what Chalmers says about the hardware limitations of biological processing is correct, intelligence is not just a function of hardware, but also of how that hardware is used, i.e. the software that runs on that hardware. And we humans have shown a remarkable degree of agility and adaptability in making use of our sloppy, slushy hardware since our inception as a species.

Think about this: biologically, we’re essentially the same species as our ancestors who lived some 100K years ago on this planet. In terms of material and intellectual culture, we were a much more primitive species back then. It is almost certain that we didn’t even have something that we use to define ourselves as a species today, namely a full-blown language. It is very likely that whatever language these ancestors of ours had back then was extremely primitive (something of the me Tarzan, you Jane variety). And now look how far we’ve come in 100K years! We’re now a species capable of probing the depths of the universe both at the smallest scales and at the largest scales. All with the same hardware! Even from one generation to the next, we’ve been getting more and more intelligent lately: consider the Flynn effect or consider reading a paper in your field written a few generations ago by a giant of the field at the time and see how naïve it’ll sound to you (I had this epiphany recently after reading Alan Turing’s classic 1950 paper on computers and intelligence).

I like to think of this as an algorithmic improvement process: we find ever more efficient ways of using our limited hardware through our constant cultural and technological innovations and discoveries, and we simply don’t know the limits of this process, i.e. how far and how fast we can follow this cultural-technological-“algorithmic” route before we hit a true hardware “wall”.

I see the non-superintelligent AI we’re creating today as part of this cultural-technological-“algorithmic” route too. They’re the microscopes and telescopes of our age, only much more general purpose, hence much more powerful. Like the microscopes and telescopes of an earlier age, they allow us to see whole new worlds we wouldn’t have been able to see unaided.

Look at this picture:

“Adversarial examples are not bugs, they are features” (link)

Would you have guessed that there’s actually a frog in this picture? Would you have guessed that you could recognize frogs using weird features like this (probably much better than humans could)? Knowing this opens up a wonderful whole new world for us, full of patterns we hadn’t even suspected were there before. We could now probe this wonderful new world with our “microscopes” and perhaps one day we could even use it to our advantage for some practical purpose.

Or consider how expert chess players describe AlphaZero’s capabilities: “Chess is full of superhuman expert systems, yet AlphaZero discovered an uncharted space in which its self-taught insights were both startling and valuable. That uncharted space was so significant that AlphaZero was able to convincingly defeat the strongest expert system at the time of testing. Bearing that in mind, you can’t help but to be positive for the application of AlphaZero-like techniques in environments that are less well-researched than chess. Maybe soon, scientists will be echoing our cry during the World Championship: “AlphaZero, find us a path!”” Although not every detail of AlphaZero’s decisions will be transparent to human players, we can still glean useful high-level insights (and sometimes even lower-level, more detailed insights) from its playing style that can help improve human players. It was, for example, notable that AlphaZero seemed to place much less value on material than a human player would, preferring activity or dynamism over material instead.

Just last week, a paper came out in Nature showing that an AI system improved the yield from certain chemical reactions over expert human chemists by trying out less mainstream, more adventurous reagents than human chemists who, by comparison, had a more conservative bias in choosing reagents.

These are just a few simple examples among countless others of human-built AI systems opening up whole new ways of seeing and thinking for us, helping us understand our weaknesses better, and offering possible ways of improving ourselves. Undoubtedly, there will be many more (and more significant) such examples in the coming years. I’m personally particularly interested in the possibility of harnessing the help of AI in improving the design of our social, political, and economic institutions (e.g. this). These institutions are susceptible to our collective human weaknesses and also likely constitute the most significant bottleneck in our continued self-improvement as a species on this planet. In this way, I hope we will be able to continue to make ever more efficient use of our fixed, limited, seemingly meager, sloppy, slushy hardware for a long while more.

Thoughts on Image-GPT

The following are some short notes on OpenAI’s Image-GPT paper, which is in my opinion one of the most important papers that came out in recent years.

The motivating question behind this paper is this: can likelihood-based generative pre-training lead to strong transfer learning results in computer vision? This question is inspired by the success of the same technique in NLP (where it’s commonly known as language modeling). In computer vision on the other hand, successful transfer has so far been achieved mostly through other (non-generative) pre-training objectives, like supervised pre-training (on ImageNet etc.), or more recently self-supervised pre-training (MoCo, SimCLR, BYOL, etc.). This raises the interesting question of whether there might be some fundamental differences between language and vision tasks that make these different methods more appropriate for these two respective domains. The Image-GPT paper answers this question in the negative and shows for the first time that likelihood-based generative pre-training can also lead to very strong transfer learning results provided that we use the right kind of architecture (GPT) at the right scale.
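For concreteness, the likelihood-based objective referred to here can be written as the standard autoregressive factorization over a sequence of pixels in some fixed order. The notation below (theta for the model parameters, n for the sequence length, D for the training distribution) is an assumption of this sketch rather than something taken from the paper.

```latex
\log p_\theta(x) = \sum_{i=1}^{n} \log p_\theta\left(x_i \mid x_1, \ldots, x_{i-1}\right),
\qquad
\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D}}\left[\log p_\theta(x)\right]
```

Language modeling uses the same objective with tokens in place of pixels, which is what makes the comparison between the two domains so direct.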

The other main interesting result from this paper is that the very same GPT architecture is shown to perform well in both language prediction and image prediction tasks, suggesting that these (and similar) tasks share a deep common computational core (something very general like: predict given as much context as possible) despite their many superficial differences and can be solved effectively by the same kind of computational architecture. I think this observation has important implications for the brain and evolution. To me, one of the things it suggests is that whatever inductive biases our brains (more specifically, our neocortex) may have, they’re probably not very domain-specific biases like many psychologists seem to believe (e.g. specific biases about visual objects, agents, or language). Rather, it’s likely that they’re much more generic (and less easily conceptualizable/articulable) biases, having to do with better information processing in some very general sense, like being able to integrate efficiently over a much larger context, or being able to do better deep credit assignment (i.e. dealing with vanishing/exploding gradients) etc. It is important to emphasize here that the GPT architecture itself embodies absolutely no language-specific or vision-specific inductive biases.

This idea also accords well with the sources of recent progress in machine learning. When I look at what drives significant architectural progress in machine learning today, most of the time it’s somebody proposing a solution to a very generic information processing problem: e.g. in ResNets, solving an optimization problem (vanishing/exploding gradients); in transformers, getting rid of the serial processing bottleneck of RNNs to make it feasible to integrate over a much longer context; in batch normalization, dealing with the covariate shift problem during training etc. Certainly, biological evolution doesn’t have to respect the same rules as human innovation, but at least to me this suggests that there’s maybe more bang for the buck in targeting these general information processing related problems than targeting more domain specific issues, which makes it more plausible that evolution may also be primarily targeting the same general issues.

One final interesting result in the Image-GPT paper is that even for the same validation loss in the generative pre-training task (i.e. ImageNet modeling), bigger models seem to show better transfer learning performance (Figure 3 in the paper). This is interesting in light of my criticism of the GPT-3 paper, where different sized models were not given the same amount of compute and it seemed likely that the smaller models would reach the same (or maybe even better) validation loss as the largest 175B-parameter model if they were given the same amount of compute. The results in the Image-GPT paper suggest that even in that case, the larger models might have had an advantage in terms of transfer performance, but it would have been much better if, just like the Image-GPT paper, the GPT-3 paper had actually carried out this important experiment to see if the larger models have a transfer advantage above and beyond what can be accounted for by validation loss (or compute) alone.

I would have liked to see more analysis of the learned representations in this paper and a more detailed comparison between the visual representations learned in this likelihood-based generative way vs. those learned in discriminative settings (e.g. in contrastive self-supervised learning). One interesting hypothesis is that the representations learned with likelihood-based generative objectives can handle out-of-distribution (ood) stimuli better (e.g. adversarial examples). Intuitively, this could be because likelihood-based objectives require all aspects of the data to be explained and hence reduce the possibility of taking “shortcuts”, which seems to be a common problem with discriminative objectives. Consistent with this idea, there’s some prior work suggesting that likelihood-based generative models can improve the adversarial robustness of deep neural networks.

Thoughts on GPT-3

A couple of months ago, OpenAI released a paper describing their latest language model, GPT-3. GPT-3 is distinguished from its predecessors by nothing other than its sheer scale: compared to its previous incarnations, it’s just a bigger language model trained with a bigger dataset (~1-2 orders of magnitude bigger in both model size and training data size). So, the paper is essentially an exercise in scaling. The main novel result in the paper is an impressive demonstration of the (in-context) few-shot learning abilities of such large-scale language models (it can be argued that even this main result is not entirely novel, as it was foreshadowed in some earlier language modeling work, e.g. see this and this). The paper reminded me, once again, of Philip Anderson’s famous More Is Different paper, where Anderson argues that quantitative changes in nature can sometimes lead to qualitative changes and that people (even scientists) don’t always appreciate the consequences of this fact enough. It was also inspiring for me to see all the amazing demos people have quickly built with GPT-3 and shared with the world (here is a nice collection of such demos as a Twitter thread).

In this post, I’d like to briefly discuss a few criticisms I had of the GPT-3 paper.

Umm, yeah, did we really need that 175B-parameter model?

The first one is about the actual need for scale: i.e. whether they really needed to train a giant 175B-parameter model or not. Figure 3.1 in the paper (reproduced above) clearly shows that many of their smaller models were not trained to saturation; this figure also shows that the smaller models are actually more compute-efficient up to the total compute used for those smaller models. To me, this strongly suggests that they actually didn’t have to train a 175B-parameter model; a ~5B-parameter model would probably have performed just as well (if not better) if trained longer. This point was also noted by Graham Neubig on Twitter.

This renders all the figures in the paper showing model size on the x-axis and performance on the y-axis (which is most of the figures in the paper) a bit suspect in my mind, because the smaller models were not given the same amount of compute in those figures.

So why did they train a 175B-parameter model then? One possibility is just because they could; they perhaps wanted to prepare this kind of infrastructure for projects down the line that actually do require models at this scale. A more sinister interpretation is that they want to commercialize this product at some point (this would be consistent with their CEO’s expressly stated objective of “capturing the light cone of all future value in the universe”) and a giant model is more “controllable” for this purpose: a client can easily put a 5B-parameter model on a few GPUs of their own to do inference and fine-tuning as they wish, but they can’t do this with a 175B-parameter model, making them more reliant on OpenAI’s specialized hardware.

A second difficulty with the paper for me was my constant struggle to understand to what extent the model was doing abstraction (or generalization) vs. rote memorization. In other words, to what extent the impressive looking results from the model can be attributed to the sheer size of the training data vs. the abstraction capacity of the model. To understand this better, it would have been extremely useful if, for example, at least for a subset of the tasks and examples, the authors showed the embedding space nearest neighbors to a given query among the training data, but surprisingly they never do this in the paper (I don’t suppose this would be technically more challenging than running a search over the input space, which they do multiple times in the paper). If these nearest neighbors are intuitively highly similar to the query and the model’s outputs more or less resemble the actual continuations of these nearest neighbors (say, with simple substitutions), that would favor a dataset size based explanation for the performance of the model. They do try to rule out the rote memorization based explanation in some of their experiments, but these were not entirely convincing for me. For example, in the arithmetic tasks, they look for patterns of the form “<NUM1> + <NUM2> =” and “<NUM1> plus <NUM2>” in their training data to investigate if the model is just memorizing these arithmetic equations. They find only a small number of matches, concluding that a rote memorization strategy seems unlikely. But the problem here is that these are just two of the almost endless ways the same arithmetic equations could be encoded in the training data (note that their training data includes a snapshot of the entire world wide web, which is a really really big place!): e.g. “<NUM1> <NUM2>”, “<NUM1> & <NUM2>”, “<NUM1> | <NUM2>”, “<NUM1> p <NUM2>”, “<NUM1> pl. <NUM2>”, “<NUM1> || <NUM2>”, etc. 
Here, again, it would have been much more meaningful if they showed us some nearest neighbor retrievals instead.
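The kind of retrieval meant here can be sketched as follows. This is only an illustration: it assumes the query and the training examples have already been mapped to fixed-size vectors by some embedding step (for instance a hidden state of the model), which is not shown, and the function name and the choice of cosine similarity are assumptions of the sketch.

```python
import numpy as np

def nearest_neighbors(query, train_embeddings, k=5):
    """Return indices and cosine similarities of the k training items closest to the query."""
    q = query / np.linalg.norm(query)
    t = train_embeddings / np.linalg.norm(train_embeddings, axis=1, keepdims=True)
    sims = t @ q                     # cosine similarity to every training item
    top = np.argsort(-sims)[:k]      # indices of the k most similar items
    return top, sims[top]
```

One would then read off the training passages at the returned indices and compare their actual continuations with the model’s output for the query.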

So, where do we go from here? Is training ever bigger language models on ever bigger training data the way forward for an ever more general kind of intelligence? I don’t think so. One immediate difficulty is that unlike compute, it is hard to imagine how the training data can be increased another couple of orders of magnitude. As mentioned above, their training data already includes a snapshot of the entire web (and then some). Perhaps more book datasets can be added to the training data or some improvements can be made in data quality through better cleaning up of the web data (which is, in itself, a significant challenge), but I just don’t see how these can be combined into a few orders of magnitude increase in the effective data size.

In my view, a much more promising route would be to try to add some sort of grounding to these language models, e.g. through pictures or videos from the web. I think grounding is crucial for models to have a better understanding of the world; and anecdotal evidence from human experience suggests to me that these models perhaps wouldn’t need nearly as much grounding experience as they need text data to achieve a reasonably good grounded understanding of the world. This is because it seems to me that we humans acquire most of our grounding early in our development through interactions with a fairly limited environment, and acquire pretty much all the rest of our knowledge only indirectly, through social and cultural means, for example, by learning things from other people, or by reading about them in books, articles, web pages etc. (Anthropologist Joe Henrich makes a similar point in his book The Secret of Our Success). Current language models already seem to be highly efficient at extracting information from extremely large scale text data. To complement this already super-human ability, finding good grounding objectives and grounding data for training large-scale grounded language models would be a very promising and exciting direction, I think (see this, this, and this for some recent attempts in this direction).

Update (09/04/2020): I apparently missed this earlier, but OpenAI has made its intention to make GPT-3 a commercial product very clear right from the beginning (see here). They even mention the large size of the model as an excuse not to release it:

… many of the models underlying the API are very large, taking a lot of expertise to develop and deploy and making them very expensive to run.

So, it seems like my sinister interpretation above for OpenAI training a much larger model than was actually warranted was not too far off the mark!

Deep learning can make more use of available data

This is just a short post on something I’ve been thinking about lately. The argument is often made that deep learning needs stronger, better priors, usually in the form of architectural improvements. I’m not necessarily against this idea, but in this post I’d like to make the complementary case that even with the current architectures and training algorithms, deep learning can probably make more use of the available data, i.e. it can squeeze more juice out of available data. Why do I think so and how can deep learning achieve this? There are a couple of reasons that make me think so:

  1. Argument from cold posteriors: in Bayesian neural networks, it has been empirically observed that the best predictive performance is obtained not with the actual posterior, but with “cold posteriors”, which correspond to artificially manipulated posteriors that overcount the effect of the data and undercount the effect of the (usually generic) prior. Conversely, this suggests that current techniques in deep learning may be undercounting the potential of the data given that one has to resort to an artificial boosting of its effect in Bayesian neural networks.
  2. Argument from slow and brittle convergence to “natural” solutions: there is some interesting theoretical work suggesting that in some simplified problems, standard deep learning techniques will converge to what I would consider the “natural” solutions, but the convergence is painfully slow and brittle. Let me give two examples: Soudry et al. (2018) show that in logistic regression with separable data, gradient descent converges to the max margin solution (which can be considered as the natural solution for this type of problem), but convergence is extremely slow, i.e. something like O(\log \log t / \log t), and is brittle in the sense that it doesn’t hold for some popular adaptive gradient descent algorithms like Adam. Ji & Telgarsky (2019) show a similar result for logistic regression problems with non-separable data, but the convergence here is again extremely slow, i.e. the rate of convergence in direction to the max margin solution is again O(\log \log t / \log t). On the other hand, it is clear that the convergence to the max margin solution in these problems can be significantly sped up with simple data-dependent initialization schemes. In a similar vein, some prior works have suggested that important generalization properties of neural networks, such as their ability to generalize compositionally, are extremely sensitive to initialization, again implying that starting from a data-agnostic, generic initialization may not be optimal.
  3. Argument from empirical Bayes: how can deep learning make more use of the available data? A straightforward idea is the one I mentioned at the end of the last paragraph, i.e. using a data-dependent initialization scheme (I gave a simple example of this kind of scheme in a previous post). This approach is reminiscent of the empirical Bayes method in Bayesian statistics, which underlies a whole host of beautiful and surprising results like the Stein phenomenon. The basic idea in empirical Bayes is refusing to assume a non-informative, generic prior for the variables of interest (for a neural network, these could be the parameters, for instance), but estimating these priors from the data instead. You can see that this idea accords nicely with a data-dependent initialization scheme for neural network parameters. Empirical Bayes enjoys some appealing theoretical performance guarantees compared to common alternatives like maximum likelihood, which suggests that similar improvements may hold for data-dependent initialization schemes for neural networks as well.
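The slow directional convergence described in the second argument above is easy to see on a toy problem. The sketch below runs plain gradient descent on the logistic loss (no bias term) for three hand-picked separable points, chosen so that the max margin direction is known to be (1, 1) / sqrt(2). The data, the learning rate, and the checkpoints are all assumptions made for illustration.

```python
import numpy as np

def angle_to_max_margin(num_steps, checkpoints, lr=1.0):
    """Run gradient descent on the logistic loss; return the angle (in radians)
    between the iterate and the max margin direction at the given steps."""
    # Rows are z_i = y_i * x_i. Two points sit at (1, 1) (think of x = (1, 1) with
    # label +1 and x = (-1, -1) with label -1); they are the support vectors.
    # The third point, (5, 0), is far from the margin but dominates early gradients.
    z = np.array([[1.0, 1.0], [1.0, 1.0], [5.0, 0.0]])
    w_star = np.array([1.0, 1.0]) / np.sqrt(2.0)
    w = np.zeros(2)
    angles = {}
    for step in range(1, num_steps + 1):
        margins = z @ w
        # Gradient of mean(log(1 + exp(-margin))) with respect to w.
        grad = -(z / (1.0 + np.exp(margins))[:, None]).mean(axis=0)
        w -= lr * grad
        if step in checkpoints:
            cosine = w @ w_star / np.linalg.norm(w)
            angles[step] = float(np.arccos(np.clip(cosine, -1.0, 1.0)))
    return angles

steps = (10, 100, 1_000, 10_000)
for step, angle in angle_to_max_margin(10_000, set(steps)).items():
    print(step, angle)
```

Each tenfold increase in the number of steps should buy only a modest reduction in the angle, which is what a roughly 1 / log t rate looks like in practice.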

In defense of pure learning

It is often claimed that animals are more sample efficient, meaning that they can learn from many fewer examples (sometimes from single examples), than artificial neural networks and that one of the reasons for this is the useful innate inductive biases sculpted into the brains of animals through adaptations over evolutionary time scales. This claim is then usually followed by the prescriptive advice that we do something similar with artificial neural networks, namely, build in more innate structure into our models (or at least include an outer optimization loop in our models that would learn such useful inductive biases over longer time scales, as in meta-learning). Tony Zador and Gary Marcus made arguments of this sort recently.

In this post, I’d like to take issue with arguments of this sort. Actually, my objection to these kinds of arguments is already well-known. Zador has a section in his paper addressing precisely this objection (the section is titled “supervised learning or supervised evolution?”). So, what is the objection? The objection is that arguments of this sort conflate biology and simulation. They assume that the learning that happens in an artificial neural network is comparable to the learning that happens in a biological system over its individual lifespan. But there’s no good reason to think of artificial learning in this way. We should rather think of it as a combination of the learning that happens over an individual lifespan and the adaptations that take place over evolutionary time scales. When we think of artificial learning in this light, the sample efficiency argument in favor of animals falls by the wayside, because biological evolution has been running the most intense optimization algorithm in the biggest and the most detailed simulation environment ever (called “the real world”) for billions of years (so much for “one-shot” learning).

As I said, Zador is aware of this objection, so what is his response to it? As far as I can tell, he doesn’t really have a very convincing response. He correctly points out the differences between biological optimization and learning in artificial networks, but this doesn’t mean that they can’t generate functionally equivalent networks.

For example, Zador notes that biological optimization runs two nested optimization loops, the inner loop characterizing the learning processes in individual lifespans, the outer loop characterizing the adaptations over evolutionary time scales. This is similar to a learning paradigm called meta-learning in machine learning. And because of its similarity to biology, Zador is very much sympathetic to meta-learning. But in my mind the jury is still out on whether meta-learning has any significant advantages over other standard learning paradigms in machine learning. There are recent results suggesting that in practical problems one doesn’t really need the two separate optimization loops in meta-learning (one loop is all you need!). Moreover, if one trains one’s model in a sufficiently diverse range of problems (but crucially using a standard learning paradigm, such as supervised learning or reinforcement learning), “meta-learning” like effects emerge automatically without any need for two separate optimization loops (see also this beautiful new theory paper explaining some of these experimental results).

The core problem here, I think, is again conflating biology and simulation. Just because we see something in biology doesn’t mean we should emulate it blindly. Biology is constrained in many ways simulation isn’t (and vice versa). Of course it makes sense to use two separate optimization loops in biology, because individual lifespans are limited, but this isn’t true in simulation. We can run our models arbitrarily long on arbitrarily many tasks in simulation.

I think this (i.e. the mismatch between biology and simulation) is also why naive ways of emulating the brain’s innate inductive biases, like trying to directly replicate the concept of “cell types” in the brain, are usually not very effective in artificial neural networks. In my opinion, these features are essentially consequences of the brain’s suboptimal learning algorithms (over developmental time scales): the brain has to off-load a significant chunk of the optimization burden to evolution, which needs to craft these intricate cell types to compensate for the suboptimality of learning. Learning in artificial neural networks, on the other hand, is much more powerful. It is not constrained by all the things that biological learning is constrained by (for example, locality and limited individual lifespans), so it doesn’t really need to resort to these kinds of tricks (like different innate cell types) to easily learn something functionally equivalent over the individual lifespan.

Does the brain have to do deep credit assignment?

In other words, does the brain have to do something like backprop? For those who are already familiar with this literature, my short answer is that no, the brain doesn’t have to do, and it probably doesn’t do, deep credit assignment. In this post, I’d like to discuss two reasons that make me think so. I should note from the outset that these are not really “knock-out” arguments, but more like plausibility arguments.

I have to first clarify what exactly I mean by “deep credit assignment” or “something like backprop”. This is still not going to be very exact, but by this I basically mean a global credit assignment scheme that propagates precise “credit signals” from elsewhere in a deep network in order to compute a local credit signal. I thus include any gradient-based method (first-, second-, or higher-order) in this category, as well as imperfect, heuristic versions of it such as feedback alignment or weight mirrors. There are some borderline cases such as decoupled neural interfaces that compute credits locally, but also learn gradient estimates over longer timescales. I’m inclined to include these in the “deep credit assignment” category as well, but I would have to think a bit more carefully about it before doing so confidently.

Now moving on to the two reasons that make me think the brain probably doesn’t do “deep credit assignment”. The first reason is this. I think it is very natural to think that the brain should be doing something like backprop, because it is what works today! Deep neural networks trained with gradient descent have been enormously successful in a variety of important and challenging tasks, like object recognition, object detection, speech recognition, machine translation etc. But it is very important to also remember that these successes depend on the current hardware technology. The methods that work well today are methods that work well on current hardware. But the current hardware technology is constrained in a variety of ways that brains aren’t.

Take the example of memory. Fast memory (for example, on-chip memory) is extremely limited in size in the current chip technology (although this may be beginning to change with novel chip designs such as Graphcore’s IPU and Cerebras’s WSE). But, there is no reason to think that the brain is limited in the same way, because the brain has a completely different computational architecture!

What is the significance of this observation? Well, even today we know that there are promising alternatives to backprop training in deep nets that are currently intractable precisely because of memory constraints: I’m thinking, in particular, of the deep nets as Gaussian processes (GPs) perspective. Amazingly, this method doesn’t require any training in the usual sense. Inference is accomplished through a single forward pass just like in backprop-trained nets. The catch is that, unlike in backprop-trained nets, this forward pass doesn’t scale well with the data size: it requires manipulating huge matrices. To my knowledge, these methods currently remain completely intractable for, say, ImageNet scale data, where the said matrices become terabyte-sized (there’s this recent paper that carries out exact GP computations on a million data points, but they use low-dimensional datasets and they don’t use deep-net kernels in their experiments; exact GPs on high-dimensional data using deep-net kernels remain highly intractable, to the best of my knowledge).
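To make the “terabyte-sized” remark concrete, here is a back-of-the-envelope sketch. It assumes a dense N x N kernel matrix stored in single precision and uses the standard ImageNet-1k training set size (about 1.28 million images); the function name is purely illustrative.

```python
def kernel_matrix_terabytes(n_points: int, bytes_per_entry: int = 4) -> float:
    """Size of a dense n x n kernel matrix in terabytes (1 TB = 1e12 bytes)."""
    return n_points * n_points * bytes_per_entry / 1e12

# ImageNet-1k has 1,281,167 training images.
imagenet_tb = kernel_matrix_terabytes(1_281_167)
print(f"{imagenet_tb:.1f} TB")
```

Under these assumptions the kernel matrix alone takes several terabytes, before counting any workspace needed to factorize or invert it.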

As a side note here, I would like to express my gut feeling that there may be more tractable versions of this idea (i.e. training-free or almost training-free deep nets) that are not being explored thoroughly enough by the community. One simple idea that I have been thinking about recently is the following. Suppose we take the architecture of your favorite deep neural network and set the parameters of this network layer by layer using only information inherent in the training data, say, a large image dataset. This would work as follows. Suppose the network has k filters in its first (convolutional) layer. Then we can either crop k random patches of the appropriate size from the training images or maybe do something more intelligent, like exhaustively cropping all non-overlapping patches of the appropriate size and then doing something like k-means clustering to reduce the number of crops to k and then setting those to be the first layer filters. This then fixes the first layer parameters (assuming the biases are zero). We can iterate this process layer by layer, at each layer computing a number of clusters over the activations of the previous layer across the entire training data and then basically “pasting” those clusters to the layer weights. Note that in this scheme even though there is learning (after all we are using the training data), it is minimal and non-parametric (we’re doing only k-means, for example), and nothing like a gradient-based learning scheme. I think it would be interesting to find out how well one can do with a scheme like this that uses minimal learning and utilizes almost exclusively the prior information inherent in the network architecture and the training data instead.
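A minimal sketch of the first-layer step of this idea, assuming NumPy only, images in (N, H, W, C) layout, and a plain Lloyd-style k-means. The function names and the toy random “images” are illustrative, not taken from any existing implementation.

```python
import numpy as np

def extract_patches(images, patch, stride):
    """Crop patch x patch windows (non-overlapping when stride == patch) and flatten them."""
    n, h, w, _ = images.shape
    out = []
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            out.append(images[:, i:i + patch, j:j + patch, :].reshape(n, -1))
    return np.concatenate(out, axis=0)

def kmeans(x, k, n_iter=10, seed=0):
    """Plain Lloyd iterations; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = (x ** 2).sum(1, keepdims=True) - 2 * x @ centers.T + (centers ** 2).sum(1)
        assign = d.argmin(axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members) > 0:  # keep the old center if a cluster empties out
                centers[c] = members.mean(axis=0)
    return centers

def first_layer_filters(images, k, patch):
    """Set k first-layer filters to k-means centers of image patches (biases assumed zero)."""
    patches = extract_patches(images, patch, stride=patch)
    centers = kmeans(patches, k)
    return centers.reshape(k, patch, patch, images.shape[-1])

rng = np.random.default_rng(0)
images = rng.random((16, 8, 8, 3))  # toy stand-in for a real image dataset
filters = first_layer_filters(images, k=4, patch=4)
```

Deeper layers would repeat the same recipe on the activations of the previous layer; that part is left out here to keep the sketch short.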

So, this was the first reason why I think the brain probably doesn’t do something like backprop, to wit: backprop seems to me too closely wedded to the current hardware technology. My hunch is that there are many more interesting, novel (and probably more biologically plausible) ways of building intelligent systems that don’t require anything like backprop, but we’re currently not exploring or considering these because they remain intractable with current hardware (large scale GPs being one concrete example).

The second reason is that even with the current hardware we have some recent hints that purely local learning schemes that don’t require any deep credit assignment can rival the performance of backprop training in realistic tasks (and if the brain doesn’t have to do something, the chances are it’s not going to do it!). I’d like to mention two recent papers, in particular: Greedy layerwise learning can scale to ImageNet by Belilovsky et al. and Training neural networks with local error signals by Nokland and Eidnes. These papers both introduce completely local, layerwise training schemes and show that they can work as well as end-to-end backprop in standard image recognition problems.

Although these results are impressive, in a way I still consider them initial efforts. I feel pretty confident that if more collective effort is put into this field, even better local training schemes will be discovered. So, to me these results suggest that the problems we typically solve with end-to-end deep learning these days may not be hard enough to require the full force of end-to-end backprop. Furthermore, with each new clever local learning trick, we will discover these problems to be even easier than we had previously imagined. In the end, we may come full circle: from a time when we considered computer vision to be an easy problem, to discovering that it is in fact hard, to discovering that it isn’t that hard after all!

Update (01/16/2020): I just found out that the idea I described in this post for building a gradient descent-free image recognition model using k-means clustering was explored in some early work by Adam Coates and colleagues with promising results (for example, see this paper and this one). They use relatively small datasets and simple models in these papers (early days of deep learning!), so maybe it is time to revisit this idea and try to scale it up.

Google’s new paper on large-scale weakly supervised learning

Last week, a group of researchers from Google posted a paper on arxiv describing an object recognition model claimed to achieve state-of-the-art (sota) results on three benchmarks measuring out-of-sample generalization performance of object recognition models (ImageNet-A, ImageNet-C, and ImageNet-P), as well as on ImageNet itself. These claims have yet to be independently verified (the trained models have not been released yet), but the reported gains on previous sota results are staggering:

The previous sota results on ImageNet-A, ImageNet-C, and ImageNet-P were reported in a paper I posted on arxiv in July this year, and they were achieved by a large model trained by Facebook AI researchers on ~1B images from Instagram using “weak” (i.e. noisy) labels, then fine-tuned on ImageNet (these models are called ResNeXt WSL models, WSL standing for weakly supervised learning). People who have worked on these benchmarks before will appreciate how impressive these numbers are. Particularly impressive for me are the ImageNet-A results. This benchmark itself was introduced in the summer this year and given the lackluster performance of even the best ResNeXt WSL models reported in my paper, I thought it would take a while to see reasonably high accuracies on this challenging benchmark. I was spectacularly wrong!

So, how did they do it? Their method relies on the old idea of self-training. Starting from a model trained on a relatively small amount of high-quality labeled examples (in this case, ImageNet-trained models), they infer labels on a much larger unlabeled dataset (in this case, the private JFT-300M dataset). They then train a model on the combined dataset (labeled + unlabeled) using random data augmentation during training, and iterate this whole process several times.
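The iterative pseudo-labeling loop described above can be sketched on a toy problem. This is only an illustration of the loop structure: the nearest-centroid “model”, the Gaussian-noise “augmentation”, and the two-blob dataset are all stand-ins I am assuming for the sake of a runnable example, and have nothing to do with the actual models or data in the paper.

```python
import numpy as np

def fit_centroids(x, y, n_classes):
    """Toy 'training': one centroid per class."""
    return np.stack([x[y == c].mean(axis=0) for c in range(n_classes)])

def predict(centroids, x):
    """Assign each point to its nearest centroid."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def self_train(x_lab, y_lab, x_unlab, n_classes, n_rounds=3, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    teacher = fit_centroids(x_lab, y_lab, n_classes)
    for _ in range(n_rounds):
        pseudo = predict(teacher, x_unlab)            # teacher labels the unlabeled data
        x_all = np.concatenate([x_lab, x_unlab])
        y_all = np.concatenate([y_lab, pseudo])
        x_aug = x_all + noise * rng.standard_normal(x_all.shape)  # random augmentation
        teacher = fit_centroids(x_aug, y_all, n_classes)          # student becomes next teacher
    return teacher

rng = np.random.default_rng(1)
means = np.array([[-2.0, -2.0], [2.0, 2.0]])

def sample(n_per_class):
    x = np.concatenate([m + 0.5 * rng.standard_normal((n_per_class, 2)) for m in means])
    y = np.repeat(np.arange(2), n_per_class)
    return x, y

x_lab, y_lab = sample(5)        # small labeled set
x_unlab, _ = sample(200)        # larger unlabeled set
x_test, y_test = sample(100)
model = self_train(x_lab, y_lab, x_unlab, n_classes=2)
accuracy = (predict(model, x_test) == y_test).mean()
```

The labeled examples are kept in every round, so each class always has at least a few trusted points anchoring its centroid.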

In my arxiv paper posted back in July, I had confidently claimed that:

We find it unlikely that simply scaling up the standard object classification tasks and models to even more data will be sufficient to feasibly achieve genuinely human-like, general-purpose visual representations: adversarially robust, more shape-based and, in general, better able to handle out-of-sample generalization.

Although the Google paper doesn’t use a “standard” training paradigm, I would definitely consider it pretty close (after all, they simply find a much better way to make use of the large amount of unlabeled data, bootstrapping from a relatively small amount of labeled data; otherwise the setup is a pretty standard semi-supervised learning setup). So, I would happily admit that these results at least partially disprove my claim (it still remains to be seen to what extent this model behaves more “human-like”; I would love to investigate this thoroughly once the trained models are released).

This paper also highlights a conflict that I feel very often these days (also discussed in this earlier post). Whenever I feel pretty confident that standard deep learning models and methods have hit the wall and that there’s no way to make significant progress without introducing more priors, somebody comes along and shatters this idea by showing that there’s actually still a lot of room for progress by slightly improved (yet still very generic) versions of the standard models and methods (with no need for stronger priors). I guess the lesson is that we really don’t have very good intuitions about these things, so it’s best not to have very strong opinions about them. In my mind, the empirical spirit driving machine learning these days (“just try it and see how well it works”) is probably the best way forward at this point.

Another lesson from this paper is that bootstrapping, self-training type algorithms might be powerful beyond our (or at least my) paltry imagination. GANs and self-play type algorithms in RL are other examples of this class. We definitely have to better understand when and why these algorithms work as well as they seem to do.

Update (12/03/19): Another interesting paper from Google came out recently, proposing adversarial examples as a data augmentation strategy in training large scale image recognition models. Surprisingly, this doesn’t seem to lead to the usual clean accuracy drop if the BatchNorm statistics are handled separately for the adversarial examples vs. the clean examples and the perturbation size is kept small. Interestingly for me, the paper also reports non-trivial ImageNet-A results for the large baseline EfficientNet models. For example, the standard ImageNet-trained EfficientNet-B7 model has a reported top-1 accuracy of 37.7%. This is far better than the 16.6% top-1 accuracy achieved by the largest ResNeXt WSL model. These large EfficientNet models use higher resolution images as inputs, so it seems like just increasing the resolution gains us non-trivial improvements on ImageNet-A. This doesn’t diminish the impressiveness of the self-training results discussed in the main post above, but it suggests that part of the improvements there can simply be attributed to using higher resolution images.

kin2vec: learning kinship relations

I should preface this post by cautioning that it may contain some premature ideas, as I’m writing this mainly to clarify my own thoughts about the topic of this post.

In a reading group on program induction, we’ve recently discussed an interesting paper by Mollica & Piantadosi on learning kinship words, e.g. father, mother, sister, brother, uncle, aunt, wife, husband etc. In this paper, they formalize this as a probabilistic program induction problem. This approach comes with all the benefits of explicit program-like representations: compositionality, sample efficiency, systematic generalization etc. However, I’m always interested in neurally plausible ways of implementing these types of representations. The paper discusses earlier work by Paccanaro & Hinton, which proposes a vector space embedding approach to the same problem. So, I decided to check out that paper.

Paccanaro & Hinton model people as vectors and relations between people as matrices (so \mathbf{y} = \mathbf{R} \mathbf{x} might mean “\mathbf{y} is the father of \mathbf{x}“). The idea is then to learn vector representations of the people in the domain, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n, and matrix representations of the relations between them, \mathbf{R}_1, \mathbf{R}_2, \ldots, \mathbf{R}_k, such that the distance between \mathbf{x}_i and \mathbf{Rx}_j is minimized if the corresponding relation holds between \mathbf{x}_i and \mathbf{x}_j, and maximized otherwise. This is (by now) a very standard approach to learning vector space embeddings of all sorts of objects. I have discussed this same approach in several other posts on this blog (e.g. see here and here).

Paccanaro & Hinton model each relation with a separate unconstrained matrix. Unfortunately, I think this is not really the best way to approach this problem, since it ignores a whole lot of symmetry and compositionality in relationships (which very likely negatively impacts the generalization performance and the sample efficiency of the model): for example, if \mathbf{y} is the father of \mathbf{x}, then \mathbf{x} is a son or a daughter of \mathbf{y}. Most primitive relations are actually identical up to inversion and gender. Other relations can be expressed as compositions of more primitive relations as in Mollica & Piantadosi.

So, I tried to come up with a more efficient scheme than Paccanaro & Hinton. My first attempt was to use only two primitive relations, \mathbf{A} (e.g. mother of) and \mathbf{W} (e.g. wife of) and to use matrix inversion and transpose to express the symmetric and opposite-gendered versions of a relationship. Here are some examples:

\mathbf{A}: mother

\mathbf{A}^\top: father

\mathbf{A}^{-1}: daughter

\mathbf{A}^{-\top}: son

\mathbf{A} \mathbf{A}^\top: father’s mother

\mathbf{A}^\top \mathbf{A}: mother’s father

\mathbf{A}^{-\top} \mathbf{A}: mother’s son (brother)

\mathbf{A}^{-1} \mathbf{A}^\top: father’s daughter (sister)

At this point, we run into a problem: mother’s daughter and father’s son always evaluate to self (the identity matrix), and this just doesn’t quite feel right. Intuitively, we feel that extensions of these concepts should include our sisters and brothers as well, not just us. The fundamental problem here is that we want at least some of these concepts to be able to pick out a set of vectors, not just a single vector; but this is simply impossible when we’re using matrices to represent these concepts (when applied to a vector, they will give back another vector). This seems like a fairly basic deficiency in the expressive power of this type of model. If anybody reading this has any idea about how to deal with this issue in the context of vector space models, I’d be interested to hear about it.
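The collapse to the identity is easy to check numerically. The sketch below uses a random invertible matrix as a stand-in for a learned \mathbf{A}; the variable names are mine and the 4-dimensional embedding size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # stand-in for a learned "mother of" matrix

mother = A
father = A.T
daughter = np.linalg.inv(A)
son = np.linalg.inv(A).T

brother = son @ mother            # mother's son: A^{-T} A
sister = daughter @ father        # father's daughter: A^{-1} A^T

# These two compositions collapse to the identity, whatever A is.
mothers_daughter = daughter @ mother   # A^{-1} A
fathers_son = son @ father             # A^{-T} A^T
```

So under this parameterization “mother’s daughter” can never pick out a sister distinct from oneself, which is exactly the problem described above.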

Another question is: assuming something like this is a reasonably good model of kinship relations (or similar relations), how do we learn the right concepts given some relationship data, e.g. (y, \; father, \; x), (z, \; mother, \; x) etc.? If we want to build an end-to-end differentiable model, one idea is to use something like a deep sparsely gated mixture of experts model where at each “layer” we pick one of our 7 primitive relations (indexed from 0 to 6):

\mathbf{I}, \mathbf{A}, \mathbf{A}^{-1}, \mathbf{A}^\top, \mathbf{W}, \mathbf{W}^{-1}, \mathbf{W}^\top

and the specific gating chosen depends on the input and output, \mathbf{x} and \mathbf{y}, g(\mathbf{x}, \mathbf{y}).

So, to give an example, if we allow up to 5 applications of the primitives, the output of the gating function for a particular input-output pair might be something like: g(\mathbf{x}, \mathbf{y})=[3, 0, 0, 0, 0] or a suitable continuous relaxation of this. This particular gating expresses the relationship, \mathbf{y} = \mathbf{A}^\top \mathbf{x}, whereas g(\mathbf{x}, \mathbf{y})=[3, 2, 0, 0, 0] would correspond to \mathbf{y} = \mathbf{A}^{-1} \mathbf{A}^\top \mathbf{x}. If we use a suitably chosen continuous relaxation for the discrete gating function, the whole model becomes end-to-end differentiable and can be trained in the same way as in Paccanaro & Hinton. We can also add a bias favoring the identity primitive over the others in order to learn simpler mappings (as in Mollica & Piantadosi). It would be interesting to test how well this model performs compared to the probabilistic program induction model of Mollica & Piantadosi and compared to less constrained end-to-end differentiable models.
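Here is a small sketch of how a gating vector could be turned into a composed relation matrix. The hard version follows the indexing in the text (first entry applied first). The soft version, where each layer applies a convex mixture of the primitives, is just one possible continuous relaxation that I am assuming for illustration; random matrices stand in for learned \mathbf{A} and \mathbf{W}.

```python
import numpy as np

def make_primitives(A, W):
    """The 7 primitives in the order used in the text (indices 0 to 6)."""
    I = np.eye(A.shape[0])
    return np.stack([I, A, np.linalg.inv(A), A.T, W, np.linalg.inv(W), W.T])

def compose_hard(gating, primitives):
    """Discrete gating: gating[0] is the first primitive applied to x."""
    R = np.eye(primitives.shape[1])
    for idx in gating:
        R = primitives[idx] @ R
    return R

def compose_soft(probs, primitives):
    """Relaxed gating: each layer applies a convex mixture of the primitives."""
    R = np.eye(primitives.shape[1])
    for p in probs:                                   # p has shape (7,) and sums to 1
        layer = np.einsum("i,ijk->jk", p, primitives)
        R = layer @ R
    return R

rng = np.random.default_rng(0)
A, W = rng.standard_normal((2, 4, 4))                 # random stand-ins for learned matrices
prims = make_primitives(A, W)

R_hard = compose_hard([3, 2, 0, 0, 0], prims)         # corresponds to A^{-1} A^T
one_hot = np.eye(7)[[3, 2, 0, 0, 0]]
R_soft = compose_soft(one_hot, prims)                 # one-hot probabilities recover the hard case
```

In an actual model the rows of `probs` would come from a gating network g(\mathbf{x}, \mathbf{y}), e.g. via a softmax, so that gradients can flow through the choice of primitives.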

Update (10/11/19): There’s some obvious redundancy in the scheme for representing compositional relations described in the last paragraph: applications of the identity don’t have any effect on the resulting matrix, and successive applications of \mathbf{A} and \mathbf{A}^{-1} (or \mathbf{W} and \mathbf{W}^{-1}) cancel each other out. So, a leaner scheme might be to first decide on the number of non-identity primitives to be applied and generate a sequence of exactly that length using only the 6 non-identity primitives. The successive application of inverted pairs can be further eliminated by essentially hard-coding this constraint into g(\cdot, \cdot). These details may or may not turn out to be important.

The relative value of learning over memorizing

At the end of my last post, I mentioned the possibility that a large episodic memory might obviate the need for sophisticated learning algorithms. As a fun and potentially informative exercise, I decided to quantify this argument with a little experiment. Specifically, given a finite amount of data, I wanted to quantify the relative value of learning from that data (i.e. by updating the parameters of a model using that data) vs. just memorizing the data.

To do this, I compared models that employ a mixture of learning and memorizing strategies. Given a finite amount of “training” data, a k%-learner uses k% of this data for learning and memorizes the rest of the data using a simple key-value based cache memory. A 100%-learner is a pure learner that is typical in machine learning. For the learning model, I used a ResNet-32 and for the memory model, I used the cache model described in this paper. The predictions of a k%-learner are given by a linear combination of the predictions obtained from the learner (ResNet-32) and the predictions obtained from the cache memory:

prediction = w * (prediction from the learning model) + (1 - w) * (prediction from the cache memory)

where w is a hyper-parameter that is estimated separately for each k%-learner (I assume that the cost of learning a single hyper-parameter is negligible compared to the cost of learning the parameters of a model).
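As a rough sketch of this step, the mixing weight can be chosen by a simple grid search on held-out data. The grid, the helper names, and the tiny hand-made probability arrays below are all assumptions for illustration; they are not the actual procedure or data from the experiment.

```python
import numpy as np

def combine(p_learner, p_cache, w):
    """Linear combination of the learner's and the cache's class probabilities."""
    return w * p_learner + (1 - w) * p_cache

def pick_w(p_learner, p_cache, labels, grid=np.linspace(0.0, 1.0, 11)):
    """Return the first w on the grid with the highest held-out accuracy."""
    best_w, best_acc = None, -1.0
    for w in grid:
        acc = (combine(p_learner, p_cache, w).argmax(axis=1) == labels).mean()
        if acc > best_acc:
            best_w, best_acc = float(w), float(acc)
    return best_w, best_acc

# Toy held-out set: the learner is unsure on both items, the cache is confident and right.
p_learner = np.array([[0.6, 0.4], [0.6, 0.4]])
p_cache = np.array([[0.9, 0.1], [0.1, 0.9]])
labels = np.array([0, 1])
best_w, best_acc = pick_w(p_learner, p_cache, labels)
```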

Suppose I already used up k% of the data for training my ResNet-32 model and this achieves a generalization accuracy of x. Now the question is: what should I do with the rest of the data? I can either use that data to continue to train my model, which leads to a 100%-learner (say it achieves an accuracy of y); alternatively, I can just memorize the remaining data by caching (with the help of my partially trained ResNet-32 model), which leads to a k%-learner (say it achieves an accuracy of z). Then, given that I have already used k% of the data for learning, the relative value of learning the remaining data over just memorizing it is defined by:

relative_value_of_learning(k) = (y-x) / (z-x)

that is, the improvement in accuracy achieved by a 100%-learner divided by the improvement in accuracy achieved by the k%-learner. A large value here indicates that learning is much more valuable than memorizing (i.e. it pays off to learn from the remaining data rather than just memorizing it) and a value of 1 would indicate that learning and memorizing are equally valuable. In the latter case, given that learning is usually computationally much more expensive than memorizing, we would probably be inclined to memorize rather than learn.
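The definition is simple enough to write down directly. The accuracies in the example call are made-up numbers chosen only to illustrate the arithmetic; they are not results from the experiment.

```python
def relative_value_of_learning(x: float, y: float, z: float) -> float:
    """(y - x) / (z - x): gain from learning the rest vs. gain from memorizing it.

    x: accuracy after learning from k% of the data
    y: accuracy of the 100%-learner
    z: accuracy of the k%-learner (learner plus cache)
    """
    if z == x:
        raise ValueError("memorizing gives no improvement; the ratio is undefined")
    return (y - x) / (z - x)

# Hypothetical accuracies: learning adds 10 points, memorizing adds 1 point.
ratio = relative_value_of_learning(x=0.80, y=0.90, z=0.81)
```

With these hypothetical numbers the ratio comes out at about 10, i.e. learning the remaining data would be about ten times as valuable as memorizing it.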

The following figure shows the relative_value_of_learning(k) as a function of k for the CIFAR-10 benchmark.

So, by this measure learning is ~10 times as valuable as memorizing in this task. There appears to be a decreasing trend in the value of learning as k becomes larger, but the data is a bit noisy (ideally, I should have run this simulation multiple times to get more reliable estimates).

Is this result surprising? It was surprising to me! I was expecting the relative value of learning to be smaller and the curve shown above to approach 1 much more quickly. So, now I am a bit less skeptical of the growing literature on biologically plausible analogues of backpropagation after this exercise. There is definitely a lot of value in learning good representations (much more value than I had initially thought).

Some caveats: this exercise is specific to a particular task and particular learning and memorizing models. The results might be different in different setups. Given that much of the effort in machine learning is directed toward coming up with better pure learning models (rather than better memory models), I expect that the relative value of learning estimated here is an overestimate, in the sense that one can improve the performance of memorizing models by using more sophisticated memory models than the simple key-value cache model assumed in this exercise.

Finally, an analysis like this should help us perform a cost-benefit analysis for learning vs. memorizing both in natural and artificial agents. Coming up with cost estimates is probably easier in artificial agents: for example, one can estimate the FLOPs involved in learning vs. memorizing a given amount of data; or one can include memory costs as well. Depending on our exact cost function, the optimal strategy would involve a specific mix, or a specific trajectory, of learning vs. memorizing during the lifetime of the agent.

A conjecture about “how the brain works”

I put “how the brain works” in quotes in the title, because it is in fact a misleading expression. There is no single way “the brain works.” There are different mechanisms the brain uses to solve different types of problems. My conjecture is specifically about object recognition type problems that current deep learning methods arguably excel at. As is well-known, the way current deep learning methods solve these types of problems is by training a very deep network with lots of labeled examples. The success of these methods has led many to think that the brain may be solving the same problem, or similar problems, in more or less the same way (same in terms of the final mechanism, not necessarily in terms of how the brain gets there). A manifestation of this way of thinking is the ongoing search for biologically plausible variants of the backpropagation algorithm, the “workhorse” of deep learning (see a recent review here), which is biologically patently unrealistic the way it is used in current deep learning models.

To be fair, there are good reasons to think like this. Deep learning models trained in this way are currently the best models of the ventral visual cortical areas in primates and just considering their zeroth-order performance, nothing else really even comes close to achieving near human or, in some cases, even super-human performance in sufficiently challenging object recognition tasks.

Of course, when we look a bit more closely, there are also very good reasons to be skeptical of the claims that these models are adequate models of the primate visual systems in general (and human visual system in particular). Chief among those reasons is the surprising (almost shocking) sensitivity of these models to adversarial and natural perturbations, very unlike human vision. Another reason to be skeptical is that when people actually do a more fine-grained analysis of how humans vs. deep vision models perform on realistic image recognition tasks, they find significant differences between how the two behave.

In this post, I would like to add one more reason to the skeptic’s arsenal and argue that current deep learning models for object recognition behave psychologically unrealistically and that our brains don’t seem to me to be solving object recognition type problems in the same way. My argument is exceedingly simple. It’s an argument from subjective experience and it goes as follows. When I recognize an object, it usually comes with a strong sense of novelty or familiarity. When I recognize a coffee mug, for instance, I don’t just recognize it as a mug, but as this particular mug that I have seen before (maybe even as my own mug) or as a novel mug that I haven’t seen before. This sense of familiarity/novelty comes automatically, involuntarily, even when we are not explicitly trying to judge the familiarity/novelty of an object we are seeing. More controlled psychological experiments also confirm this: humans have a phenomenally good memory for familiarity with a massive capacity even in difficult one-shot settings (see e.g. this classic study by Lionel Standing or this more recent study by Tim Brady and colleagues).

In other words, our recognitions have a strong and automatic episodic component. This episodic component is mostly lacking in current deep vision models. They don’t have a natural way of telling whether an object is novel or familiar at the same time as they are performing the recognition task.

There may be indirect ways of doing this in trained networks; for example, maybe familiar and novel (i.e. training and test) objects produce different activation distributions in a trained network. I actually don’t know if this is the case or not, but my point is just that current deep vision models do not perform this computation naturally and automatically as part of the computation they perform for recognizing the objects in the first place. This appears to me to be a big difference from the way we humans seem to do similar tasks.

So, how can we add this episodic component to the current generation of deep vision models? Shameless plug: I wrote a paper on this. The solution turns out to be really simple: just cache everything you can (ideally everything you have seen so far), using sufficiently high-level features (not too low-level stuff). And use the cache while making predictions. Retrieval from the cache is essentially a form of episodic memory. This is not even a novel solution. People have been proposing similar ideas in reinforcement learning and in language modeling (in fact, my paper was directly inspired by this last paper). In my paper, I showed that this cache-based model is incredibly robust to adversarial perturbations, so much so that when using only the cache memory to make predictions, I wasn’t able to generate any convincing adversarial examples, even with very strong attack methods (similar robustness results have been demonstrated in other papers as well). I strongly believe such cache-based models will also be much more adequate models of the human (and primate) visual system.
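To give a flavor of what “cache everything and use it while predicting” can look like, here is a minimal key-value cache sketch. It assumes feature vectors are already available (in practice they would come from a high-level layer of a trained network), and it uses an exponential kernel on cosine similarity with a sharpness parameter `theta`; the class name, the kernel choice, and the max-similarity familiarity score are my assumptions for illustration rather than a description of any particular paper’s implementation.

```python
import numpy as np

class CacheMemory:
    """Stores (feature, label) pairs and predicts by similarity-weighted voting."""

    def __init__(self, keys, labels, n_classes, theta=10.0):
        self.keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        self.values = np.eye(n_classes)[labels]       # one-hot labels
        self.theta = theta                            # sharpness of the similarity kernel

    def _similarity(self, queries):
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        return q @ self.keys.T                        # cosine similarities

    def predict(self, queries):
        sim = self._similarity(queries)
        # Subtract the row-wise max before exponentiating for numerical stability.
        weights = np.exp(self.theta * (sim - sim.max(axis=1, keepdims=True)))
        scores = weights @ self.values
        return scores / scores.sum(axis=1, keepdims=True)

    def familiarity(self, queries):
        """Similarity to the closest stored item: a crude novelty/familiarity signal."""
        return self._similarity(queries).max(axis=1)

keys = np.array([[1.0, 0.0], [0.0, 1.0]])             # toy stored features
labels = np.array([0, 1])
cache = CacheMemory(keys, labels, n_classes=2)
probs = cache.predict(np.array([[0.9, 0.1]]))
```

Note that the same similarity computation that drives the prediction also yields a familiarity signal for free, which is the episodic component discussed above.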

In a recent interview, Geoff Hinton said something quite similar to what I have tried to argue in this post about the difference between the current generation of deep learning models and the brain (if I interpret it correctly):

The brain is solving a very different problem from most of our neural nets… I think the brain isn’t concerned with squeezing a lot of knowledge into a few connections, it’s concerned with extracting knowledge quickly using lots of connections.

I think Hinton is fundamentally right here and I think a massive episodic memory is one of the basic mechanisms the brain uses to “extract knowledge quickly using lots of connections.” Among other things, I think one of the important implications of this point of view is that the current emphasis in some circles on trying to find sophisticated and powerful learning algorithms in the brain, which I alluded to above, may be misplaced. I actually think that backpropagation is probably much more sophisticated and powerful than anything we will find in the brain. Any learning algorithm in the brain is restricted in various ways machine learning algorithms don’t have to be (e.g. locality, respecting the rules governing different cell types etc.). On the other hand, in terms of the sheer number of neurons and the sheer number of connections, the human brain is massive compared to any model we have ever trained. It seems to me that we will soon find out that the algorithms relevant for the kind of machine the brain is are very different from the machine learning algorithms relevant for today’s piddling models (piddling relative to the size of the human brain, of course). For example, I have always thought that hashing algorithms, essential for performing similarity search over very large sets of high-dimensional objects, should be at least as important and as relevant as backpropagation (and probably more) in our quest to understand the brain. And I have at least some corroborating evidence from the fly brain, of all places!
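For readers unfamiliar with hashing for similarity search, here is a generic random-hyperplane (sign-based) locality-sensitive hashing sketch. This is the textbook construction, not the specific scheme reported in the fly brain; the function names and sizes are illustrative assumptions.

```python
import numpy as np

def make_hasher(dim, n_bits, seed=0):
    """Each bit records which side of a random hyperplane the vector falls on."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, dim))

    def hash_fn(x):
        return (planes @ x > 0).astype(np.uint8)

    return hash_fn

def hamming(a, b):
    """Number of differing bits between two codes."""
    return int((a != b).sum())

hash_fn = make_hasher(dim=64, n_bits=16)
rng = np.random.default_rng(1)
v = rng.standard_normal(64)
near = v + 0.01 * rng.standard_normal(64)   # small perturbation of v
far = rng.standard_normal(64)               # unrelated vector

d_near = hamming(hash_fn(v), hash_fn(near))
d_far = hamming(hash_fn(v), hash_fn(far))
```

Vectors separated by a small angle tend to agree on most bits, so short binary codes can be used to shortlist candidate neighbors before any exact comparison is made.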

Surprising things you can learn from lavish data with no supervision

There is a well-known argument in psychology of language put forward by Noam Chomsky called the poverty of the stimulus argument. Roughly speaking, this is the claim that the linguistic data a child receives during language acquisition are vastly incomplete and insufficient to converge on the correct grammar. This claim is then used to bolster nativist arguments to the effect that large portions of grammar must be innate, already present in the child’s brain from birth in the form of a universal grammar.

There are several problematic aspects of this line of argument. The first and probably the most obvious one is that when people make a claim like this, they rarely, if ever, quantify how much linguistic data a child actually receives during the first few years of its life. Secondly, even if one does go ahead and quantify exactly the kind and amount of data a child receives during language acquisition, one still has to do the hard work and show that convergence to the correct grammar cannot happen (or is very unlikely to happen) with relatively weak, generic biases, but instead requires strong language-specific biases (i.e. that the biases have to be in the form of some kind of universal grammar). This can be tested either with architecture-agnostic methods such as Bayesian learners or with specific learning architectures like neural networks. Perfors et al., for example, show, through the Bayesian route, that linguistic input contains enough information to favor a hierarchical (as opposed to linear or flat) grammar with no prior bias favoring hierarchical grammars, directly refuting the often-made Chomskyan claim that language learners must have a strong innate prior bias in favor of hierarchical grammars. This just demonstrates how error-prone our intuitions can be regarding the learnability or otherwise of certain structures from data without strong priors and the importance of actually checking what one can or cannot learn from given data.

As we are increasingly able to train very large models on very large datasets, I think we are beginning to grapple with a fundamental question about the nature of human-level or super-human intelligence: how far can we go with fairly generic, but very large architectures trained on very large datasets optimizing very generic objectives like prediction or curiosity? Is it possible to get all the way to human-level or super-human perceptual and cognitive abilities in this way, or alternatively is it necessary to incorporate strong inductive biases into the network architecture and we simply don’t know how to do this yet, both because we don’t know what the right inductive biases are and also because we don’t know how to implement them in our models? Personally, I would easily rate this as one of the most important outstanding questions in cognitive sciences and AI today.

My own thinking on this question has been evolving in the direction of the first possibility lately, i.e. that we can learn a lot more than we might naively imagine using fairly generic architectures and fairly generic unsupervised objectives. Part of the reason for this shift is a whole slew of recent work demonstrating that one can indeed learn highly non-trivial, surprising things even from relatively modest amounts of data using very generic network architectures and generic training objectives. In this post, I’d like to highlight a few of these recent results.

Building on earlier work demonstrating the power of transfer learning in language-related tasks, there has been a lot of progress this year in unsupervised pre-training of language models with large amounts of unlabeled data. For example, Radford et al. first pre-trained a large Transformer model on a language modeling task (i.e. given some prior context, predict the next token) with a relatively large dataset (see also this earlier paper by Howard & Ruder that implements essentially the same idea, but with a different model and different datasets). They then fine-tuned this pre-trained model on a variety of downstream supervised classification tasks and observed large gains in most downstream tasks over state of the art models specifically trained on those tasks. The dataset that they pre-trained the model on was a corpus of ~7000 books. Although this may seem like a big dataset (and it is big compared to the datasets typically used in NLP research), it is in fact a minuscule dataset relative to how large it could potentially be. For example, as of October 2015, Google Books contained ~25 million books, i.e. \sim O(10^7), which is more than 3 orders of magnitude larger than the corpus used in this study. I’m not sure about the number of parameters in the Transformer model used in this paper, but my rough estimate would be that it must be \sim O(10^8). By comparison, the human brain has O(10^{15}) synapses. We’ve never even come close to running models this big. Now try to imagine the capabilities of a system with O(10^{15}) parameters trained on a corpus of O(10^{7}) or more books. It’s almost certain that such a system would shatter all state of the art results on pretty much any NLP benchmark that exists today. It would almost definitely lead to qualitatively recognizable improvements in natural language understanding and common-sense reasoning skills, just as today’s neural machine translation systems are recognizably better than earlier machine translation systems, due in large part to much bigger models trained on much bigger datasets.

Another conceptually similar model that has been shown to work even better is the more recent BERT model by Devlin et al. The two major innovations in this paper over the Radford et al. paper are (i) the use of a bidirectional attention model, instead of the unidirectional –strictly left-to-right– attention model used in Radford et al.; and (ii) the use of two novel unsupervised pre-training objectives. Specifically, they use a masked token prediction task, where the goal is to predict some masked word or words in a sequence, rather than the more standard left-to-right prediction task used in Radford et al. and other language modeling papers. This objective allows the bidirectional attention model to make use of both the left and the right context in order to predict the masked words. In addition, they also use a novel unsupervised next sentence prediction task, where the objective is to simply predict whether two given input sentences actually follow each other or not. Training examples for this objective can be easily generated from the corpus. The motivation behind this second objective is to force the model to learn the relationships between sentences, rather than relationships between lower-level units such as words. This second objective turns out to be crucial for significantly improved performance in question answering and natural language inference tasks. The datasets used for pre-training the model amount to the equivalent of some ~30000 books by my estimation. This is significantly bigger than the dataset used by Radford et al., however it’s still a few orders of magnitude smaller than the number of books that were available on Google Books as of October 2015.
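To make concrete how cheaply training examples for these two objectives can be generated from raw text, here is a minimal Python sketch. The function names, the toy corpus and the masking rate are my own assumptions for illustration. The actual BERT recipe has further details that are omitted here (for instance, a selected token is not always replaced by the mask symbol).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace a random subset of tokens with MASK; return (inputs, targets).

    targets[i] holds the original token where inputs[i] was masked, else None.
    """
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets

def next_sentence_pairs(sentences, rng):
    """Return (sentence_a, sentence_b, is_next) triples from an ordered corpus.

    With probability 0.5 sentence_b is the true successor of sentence_a;
    otherwise it is drawn at random from the other sentences.
    """
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Exclude the sentence itself and its true successor.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(candidates)], False))
    return pairs

# Toy corpus (hypothetical); each sentence is kept as a plain string.
corpus = [
    "the cat sat on the mat",
    "then it fell asleep",
    "the dog barked outside",
    "nobody woke up",
]
rng = random.Random(0)
masked_inputs, masked_targets = mask_tokens(corpus[0].split(), rng, mask_prob=0.3)
pairs = next_sentence_pairs(corpus, rng)
```

Neither function needs any labels: the supervision signal comes entirely from the text itself, which is what makes these objectives usable on arbitrarily large corpora.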

The bidirectional BERT model significantly outperforms the Radford et al. model on the GLUE benchmark even after controlling for the model size. This suggests that although both the model architecture and the pre-training objectives in the paper are still quite generic, not all generic architectures and objectives are the same, and finding the “right” architectures and objectives for unsupervised pre-training requires careful thinking and ingenuity (not to mention a lot of trial and error).

Large-scale study of curiosity-driven learning is another paper that came out this year demonstrating the power of unsupervised learning in reinforcement learning problems. In this quite remarkable paper, the authors show that an agent receiving absolutely no extrinsic reward from the environment (not even the minimal “game over” type terminal reward signal) and instead learning entirely based on an internally generated prediction error signal can learn useful skills in a variety of highly complex environments. The prediction error signal here is the error of an internal model that predicts a representation of the next state of the environment given the current observation and the action taken by the agent. As the internal model is updated over training to minimize the prediction error, the agent takes actions that lead to more unpredictable or uncertain states. One of the important messages of this paper is, again, that not all prediction error signals, hence not all training objectives, are equal. For example, trying to predict the pixels or, in general, some low-level representation of the environment doesn’t really work. The representations have to be sufficiently high-level (i.e. compact or low-dimensional). This is consistent with the crucial importance of the high-level next sentence prediction task in the BERT paper reviewed above.
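The prediction error signal can be illustrated with a toy Python sketch. Everything here is a simplifying assumption of mine and not the setup used in the paper: the forward model is a running mean per action (it ignores the current observation), the environment is a hypothetical two-action world, and the features are hand-made 2-dimensional vectors. The point is only to show that the intrinsic reward vanishes for predictable transitions and stays high for unpredictable ones.

```python
import random

def squared_error(pred, target):
    """Intrinsic reward: squared prediction error in feature space."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

class RunningMeanModel:
    """Toy forward model: predicts next-state features as the running mean
    of what was previously observed after each action."""

    def __init__(self, dim):
        self.dim = dim
        self.mean = {}
        self.count = {}

    def predict(self, action):
        return self.mean.get(action, [0.0] * self.dim)

    def update(self, action, next_features):
        n = self.count.get(action, 0) + 1
        old = self.predict(action)
        self.mean[action] = [m + (x - m) / n for m, x in zip(old, next_features)]
        self.count[action] = n

def step(action, rng):
    """Hypothetical environment: 'walk' is deterministic, 'tv' is pure noise."""
    if action == "walk":
        return [1.0, 0.0]
    return [rng.uniform(-1, 1), rng.uniform(-1, 1)]

rng = random.Random(0)
model = RunningMeanModel(dim=2)
rewards = {"walk": [], "tv": []}
for _ in range(200):
    for action in ("walk", "tv"):
        next_features = step(action, rng)
        rewards[action].append(squared_error(model.predict(action), next_features))
        model.update(action, next_features)

# Average intrinsic reward over the last 50 visits of each action.
late_walk = sum(rewards["walk"][-50:]) / 50
late_tv = sum(rewards["tv"][-50:]) / 50
```

After the first visit the deterministic action yields no intrinsic reward at all, whereas the noisy action keeps paying out indefinitely, so a purely curiosity-driven agent would prefer it.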

As the authors note, however, this kind of prediction error objective can suffer from a severe pathology, sometimes called the noisy TV problem (in the context of this paper, this problem can be more appropriately called a “pathological gambling” problem): if the agent itself is a source of stochasticity in the environment, it may choose to exploit this to always choose actions that lead to high-entropy “chancy” states. This strategy may in turn lead to pathological behaviors completely divorced from any external goals or objectives relevant to the task or tasks at hand. The authors illustrate this kind of behavior by introducing a “noisy TV” in one of their tasks and allowing the agent to change the channel on the TV. Predictably, the agent learns to just keep changing the channel, without making any progress in the actual external task, because this strategy produces high-entropy states that can be used to keep updating its internal model, i.e. an endless stream of “interesting”, unpredictable states (incidentally, this kind of pathological behavior seems to be common in humans as well).

Once more, this highlights the importance of choosing the right kind of unsupervised learning objective that would be less prone to such pathologies. One simple way to reduce this kind of pathology might be to yoke the intrinsic reward of prediction error to whatever extrinsic reward is available in the environment: for example, one may value the intrinsic reward only to the extent that it leads to an increase in the extrinsic reward after some number of actions.

To summarize the main points I’ve tried to make in this post and to conclude with a few final thoughts:

  • Unsupervised learning with generic architectures and generic training objectives can be much more powerful than we might naively think. This is why we should refrain from making a priori judgments about the learnability or otherwise of certain structures from given data without hard empirical evidence.
  • I predict that as we apply these approaches to ever larger models and datasets, the capabilities of the resulting systems will continue to surprise us.
  • Although fairly generic architectures and training objectives have so far worked quite well, not all generic training objectives (and architectures) are the same. Some work demonstrably better than others. Finding the right objectives (and architectures) requires careful thinking and a lot of trial and error.
  • One general principle, however, seems to be that one should choose objectives that force the model to learn high-level features or variables in the environment and the relationships between them. Understanding more rigorously why this is the case is an important question in my opinion: are low-level objectives fundamentally incapable of learning the kinds of things learnable through high-level objectives or is it more of a sample efficiency problem?
  • In addition to the examples given above, another great example of the importance of this general principle is the generative query network (GQN) paper by DeepMind, where the authors demonstrate the power of a novel objective that forces the model to learn the high-level latent variables in a visual scene and relationships between those variables. More specifically, the objective proposed in this paper is to predict what a scene would look like from different viewpoints given its appearance from a single viewpoint. This is a powerful objective, since it requires the model to figure out the 3D geometry of the scene, properties of the objects in the scene and their spatial relationships with each other etc. from a single image. Coming up with similar objectives in other domains (e.g. in language) is, I think, a very interesting problem.
  • Probing the capabilities of the resulting trained systems in detail to understand exactly what they can or cannot do is another important problem, I think. For example, do pre-trained language models like BERT display compositionality? Are they more or less compositional than the standard seq2seq models? Etc.

Update: Here’s an accessible NY Times article on the recent progress in unsupervised pre-training of language models.

Update (01/07/19): Yoav Goldberg posted an interesting paper evaluating the syntactic abilities of the pre-trained BERT model discussed in this post on a variety of English syntactic phenomena such as subject-verb agreement and reflexive anaphora resolution, concluding that “BERT model performs remarkably well on all cases.”

Common initialization schemes for recurrent neural networks are likely suboptimal

Training of recurrent neural networks (RNNs) suffers from the same kind of degeneracy problem faced by deep feedforward networks. In fact, the degeneracy problem is likely compounded in RNNs, because empirically the spectral radius of W^k tends to be much larger than the spectral radius of W_k W_{k-1} \ldots W_1, where W, W_1, \ldots, W_k are random matrices drawn from the same ensemble (e.g. random Gaussian). I don’t know of a rigorous proof of this claim for random matrices, although, heuristically, it is easy to see that something like this should be true for random scalars: \sum_{i=1}^{k} w_i \sim \mathcal{N}(0, k), but kw \sim \mathcal{N}(0, k^2) for independent w, w_1, \ldots, w_k \sim \mathcal{N}(0,1). This is essentially the difference between a true random walk and a biased random walk (I thank Xaq Pitkow for pointing this out to me). Exponentiating both sides, we can then see that the k-th power of a random scalar should be exponentially larger than the product of k random scalars. This empirical observation would explain why training linear RNNs would be harder than training deep feedforward networks, and one can reasonably expect something like this to hold approximately in the nonlinear case as well.
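The scalar heuristic is easy to check numerically. The following Python sketch (with a sample size, a seed and a value of k that I picked arbitrarily) compares the variance of a sum of k independent standard normals with the variance of k times a single standard normal. It only illustrates the scalar case and says nothing rigorous about random matrices.

```python
import random
import statistics

rng = random.Random(0)
k, n_samples = 10, 20000

# Sum of k independent standard normals: variance k.
sums = [sum(rng.gauss(0, 1) for _ in range(k)) for _ in range(n_samples)]

# k times a single standard normal: variance k**2.
scaled = [k * rng.gauss(0, 1) for _ in range(n_samples)]

var_sum = statistics.pvariance(sums)
var_scaled = statistics.pvariance(scaled)
```

Since exp(w_1 + ... + w_k) is the product of the k scalars exp(w_i), while exp(kw) is the k-th power of the single scalar exp(w), the log of the k-th power typically has magnitude of order k, whereas the log of the product has magnitude of order sqrt(k).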

Researchers have developed methods to deal with this degeneracy problem, hence to overcome training difficulties in RNNs. One of the best-known of these methods is the identity initialization for the recurrent weight matrix. Others proposed constraining the weight matrix to always be orthogonal, instead of orthogonalizing it at initialization only. The logic behind both of these methods is that since orthogonal transformations are isometries of the Euclidean space, applying a bunch of these transformations in a cascade does not lead to a degeneration of the metric (by “degeneration” here, I mean the collapse of the metric along the overwhelming majority of the directions in the input space and the astronomical expansion of the metric along a very small number of remaining directions). This is guaranteed in the linear case and, again, one hopes and expects (with some justification) that things are not all that different in the nonlinear case either. So, in other words, a sequence of orthogonal transformations propagates vectors in Euclidean space without distortion, i.e. without changing their norms or the distances between them.

This is all well and good. However, this analysis ignores a crucial factor that is relevant in training neural networks, namely the effect of noise. Noise comes in both through the stochasticity of SGD and sometimes through direct noise injection (as in Dropout) for regularization purposes. It is a bit hard to precisely characterize the noise that arises due to SGD, but let us assume for the sake of simplicity that the noise is additive so that what we propagate in the end is some kind of “signal + noise”. Now, although it is true that orthogonal transformations propagate the signal without distortion, they propagate the noise without distortion as well. But, ultimately, we probably want a transformation that maximizes something like the signal-to-noise ratio (SNR) of the propagated signal + noise. Then, it is no longer obvious that orthogonal transformations are optimal for this purpose, because one can, for example, imagine transformations that would amplify the signal more than they would amplify the noise (hence distorting both the signal and the noise), thus yielding a better SNR than an orthogonal transformation.

And indeed it turns out that for linear systems with additive Gaussian noise, one can mathematically show that optimal transformations (in the sense of maximizing the total SNR of the propagated signal + noise) are not orthogonal. In fact, one can say something even stronger: any optimal transformation has to be non-normal (a normal matrix is a unitarily diagonalizable matrix; all orthogonal matrices are normal, but the reverse is not true). This is the main result of this beautiful and insightful paper by Surya Ganguli and colleagues. Perhaps the simplest example of an optimal transformation in this sense is a feedforward chain: W_{ij} = \alpha \delta_{i,j-1}, where \delta is the Kronecker delta function. This particular example maximizes the total SNR through a mechanism known as transient amplification: it exponentially amplifies the norm of its input transiently before the norm eventually decays to zero.
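Transient amplification in the feedforward chain can be seen directly in a small example. The Python sketch below builds the chain matrix W_{ij} = \alpha \delta_{i,j-1} for a toy size and follows the norm of a unit input as it is propagated. The size n = 4, the value \alpha = 1.5 and the helper names are my own choices for illustration.

```python
import math

def chain_matrix(n, alpha):
    """W[i][j] = alpha if j == i + 1, else 0 (a feedforward chain)."""
    return [[alpha if j == i + 1 else 0.0 for j in range(n)] for i in range(n)]

def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

n, alpha = 4, 1.5
W = chain_matrix(n, alpha)

# Inject a unit input at the start of the chain (the last coordinate).
x = [0.0] * (n - 1) + [1.0]

norms = [norm(x)]
for _ in range(n):
    x = matvec(W, x)
    norms.append(norm(x))
```

Here norms comes out as 1, 1.5, 2.25, 3.375 and then 0: the input is amplified by a factor of \alpha at every step while it travels down the chain, and it disappears once it falls off the end. All eigenvalues of this matrix are zero, so the growth is entirely due to its non-normality.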

This brings me to the main message of this post: that the commonly used orthogonal initializations for recurrent neural networks are likely suboptimal because of the often neglected effect of noise. Further evidence for this claim comes from looking at the trained recurrent connectivity matrices in tasks that require memory. In this work (currently under review), we have shown that the trained recurrent connectivity matrices in such tasks always end up non-normal, with a feedforward structure hidden in the recurrent connectivity, even when they are initialized with an approximately normal matrix. How non-normal the trained matrices end up depends on a wide range of factors and investigating those factors was the main motivation for our paper. So, initializing RNNs with a non-normal matrix would potentially be a useful inductive bias for these networks.

In ongoing work, I have been investigating the merits of various non-normal initialization schemes for non-linear RNNs. One particular non-normal initialization scheme that seems to work quite well (and that is very easy to implement) is combining an identity matrix (or a scaled identity matrix) with a chain structure (which was shown by Ganguli et al. to be optimal in the case of a linear model with additive Gaussian noise). More details on these results will be forthcoming in the following weeks, I hope. Another open question at this point is whether non-normal initialization schemes are also useful for the more commonly used gated recurrent architectures like LSTMs or GRUs. These often behave very differently from vanilla recurrent networks, so I am not sure whether non-normal dynamics in these architectures will be as useful as they are in vanilla RNNs.
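As a rough sketch of what such an initialization might look like, the Python snippet below puts a constant on the diagonal (the scaled identity) and another constant on the superdiagonal (the chain). The parameter names alpha and beta, and the values used, are placeholders of mine; appropriate values would have to be tuned for a given task. The normality check simply tests whether W W^T equals W^T W.

```python
def identity_plus_chain(n, alpha, beta):
    """alpha on the diagonal, beta on the superdiagonal, zeros elsewhere."""
    return [
        [alpha if j == i else beta if j == i + 1 else 0.0 for j in range(n)]
        for i in range(n)
    ]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def is_normal(W, tol=1e-12):
    """A real matrix is normal if and only if W W^T == W^T W."""
    left, right = matmul(W, transpose(W)), matmul(transpose(W), W)
    return all(
        abs(l - r) <= tol
        for l_row, r_row in zip(left, right)
        for l, r in zip(l_row, r_row)
    )

W_chain = identity_plus_chain(3, alpha=1.0, beta=0.5)  # identity + chain
W_identity = identity_plus_chain(3, alpha=1.0, beta=0.0)  # plain identity
```

With beta = 0 this reduces to the usual (scaled) identity initialization, which is normal; any nonzero beta makes the matrix non-normal.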

Update (06/15/19): Our work on a new non-normal initialization scheme for RNNs described in this post is now on arxiv. The accompanying code for reproducing some of the results reported in the paper is available in this public repository.

Simple inductive biases to make neural networks train faster and generalize better: two case studies

Perhaps the most important factor determining how quickly a neural network trains and how well it generalizes beyond the range of data it receives during training is the inductive biases inherent in its architecture. If the inductive biases embodied in the architecture match the kind of data the network receives, that can enable it to both train much faster and generalize much better. A well-known example in this regard is the convolutional architecture of the modern neural network models for vision tasks. The convolutional layers in these models implement the assumption (or the expectation) that the task that the model attempts to solve is more or less translation invariant (i.e. a given feature, of any complexity, can appear anywhere in the image). A more recent example is the relational inductive biases implemented in relational neural networks. Mechanistically, this is usually implemented with an inner-product like mechanism (sometimes also called attention) that computes an inner-product like measure between different parts of the input (e.g. as in this paper) or with a more complex MLP-like module with shared parameters (e.g. as in this paper). This inductive bias expresses the expectation that interactions between features (of any complexity) are likely to be important in solving the task that the model is being applied to. This is clearly the case for obviously relational VQA tasks such as CLEVR, but may be true even in less obvious cases such as the standard ImageNet classification task (see the results in this paper).

Coming up with the right inductive biases for a particular type of task (or types of tasks) is not always straightforward and it is, in my mind, one of the things that make machine learning a creative enterprise. Here, by the “right inductive biases”, I mean inductive biases that (i) only exploit the structure in the problem (or problems) we are interested in and nothing more or less, but (ii) are also flexible enough that if the same model is applied to a problem that doesn’t display the relevant structure exactly, the model doesn’t break down disastrously (some “symbol”-based neural machines may suffer from such fragility).

In this post, I’d like to briefly highlight two really nice recent papers that introduce very simple inductive biases that enable neural networks to train faster and generalize better in particular types of problems.

The first one is from Uber AI: An intriguing failing of convolutional neural networks and the CoordConv solution. In this paper, the authors first observe that state of the art convolutional networks fail quite badly in tasks that require spatial coordinate transformations, for example, changing from Cartesian coordinates to image-based coordinates or vice versa (e.g. given the Cartesian coordinates (x,y), draw a square of a certain size centered at (x,y)). This may not be too surprising, since convolutional networks are explicitly designed to be translation-invariant, hence to ignore any spatial information, but the authors correctly note that ignoring spatial information completely (being rigidly translation-invariant) may not always be advisable (this may lead to failures of the type mentioned in (ii) above). It is much better to provide the model with the spatial information and let it figure out for itself how translation-invariant it needs to be in any particular task. This is exactly what the authors do. Specifically, they provide the spatial information in an explicit format through additional (fixed) channels that represent the Cartesian coordinates of each “pixel”. For image-based tasks, one thus needs only two additional channels, representing the x and y coordinates of each pixel. Their scheme, which they call CoordConv, is illustrated in Figure 3 of the paper.


That’s basically it. If the task at hand is highly translation-invariant, the model can learn to set the weights coming from those two Cartesian coordinate channels to small values; if the task at hand requires precise spatial information, on the other hand, the model can learn to utilize those channels appropriately. NLP people may recognize the conceptual similarity of this scheme to the positional encodings of items in sequence-based tasks. For the NLP people, we may thus summarize their contribution by saying that they extend the positional encoding idea from the temporal domain (in sequence-based tasks) to the spatial domain (in image-based tasks). It’s always a good idea to think about such exchanges between different domains!
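The coordinate channels are simple enough to write down in a few lines. Here is a minimal Python sketch using nested lists for a channels-first image. The function names are mine, and the scaling of the coordinates to the range [-1, 1] is an assumption of this sketch; an ordinary convolution would then be applied to the augmented input.

```python
def coord_channels(height, width):
    """Return (x_channel, y_channel), each height x width, scaled to [-1, 1]."""
    def scale(i, size):
        return 0.0 if size == 1 else 2.0 * i / (size - 1) - 1.0

    x_channel = [[scale(c, width) for c in range(width)] for _ in range(height)]
    y_channel = [[scale(r, height) for _ in range(width)] for r in range(height)]
    return x_channel, y_channel

def add_coords(image):
    """Append fixed x and y coordinate channels to a channels-first image.

    image is a list of channels, each a height x width nested list.
    """
    height, width = len(image[0]), len(image[0][0])
    x_channel, y_channel = coord_channels(height, width)
    return image + [x_channel, y_channel]

# A toy single-channel 2 x 3 image.
image = [[[0.0] * 3 for _ in range(2)]]
augmented = add_coords(image)
```

The two extra channels are fixed, not learned; only the convolution weights that read from them are learned, which is what lets the model decide how much to rely on position.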

The authors then go on to demonstrate that introducing a few of these CoordConv layers in standard architectures improves performance in a diverse range of tasks (but not in all tasks), including object detection, GAN training and Atari playing.

The second paper I’d like to highlight, called Neural Arithmetic Logic Units, starts from the observation that generic neural network architectures cannot generalize well in numerical tasks requiring arithmetic operations such as addition, multiplication etc., even when they may successfully fit any given training data in such tasks (and sometimes they cannot even achieve that). The authors of this paper introduce very simple, elegant and easy-to-implement inductive biases that enable generic models (LSTMs and MLPs) to extrapolate from training data much better in such tasks. The basic idea is to “nudge” standard neural network operations (linear combination, pointwise nonlinearity etc.) to behave like arithmetic operators. For instance, for addition, they parametrize a dense weight matrix as:

\mathbf{W} = \tanh(\mathbf{V}) \circ \sigma(\mathbf{M})

where \circ denotes elementwise multiplication, and \sigma(\cdot) is the sigmoid nonlinearity. In the saturated regime, this parametrization encourages \mathbf{W} to have entries close to -1, 0, or 1, and so a linear combination using this kind of \mathbf{W}, i.e. \mathbf{W}\mathbf{x}, tends to behave like an addition or subtraction of its inputs (without scaling). In light of the preceding discussion, it is important to note here again that the model does not force this kind of behavior, but rather it merely facilitates it.
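A small numerical example may help to see what the saturated regime looks like. In the Python sketch below, the parameter values in V and M are hand-picked by me to be strongly saturated (in practice they would be learned), and the helper names are my own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nac_weights(V, M):
    """Elementwise W = tanh(V) * sigmoid(M)."""
    return [
        [math.tanh(v) * sigmoid(m) for v, m in zip(v_row, m_row)]
        for v_row, m_row in zip(V, M)
    ]

def linear(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

# Large |v| pushes tanh towards +1 or -1; large |m| pushes sigmoid towards 0 or 1.
V = [[10.0, -10.0, 10.0]]
M = [[10.0, 10.0, -10.0]]

W = nac_weights(V, M)          # close to [[1, -1, 0]]
y = linear(W, [5.0, 3.0, 7.0])  # close to 5 - 3 = 2; the third input is ignored
```

Because the weights are pinned near -1, 0 and 1 rather than arbitrary real values, the same unit keeps computing (approximately) x_1 - x_2 for inputs far outside any training range.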

As an inductive bias for multiplication, they use the exponentiated sum of logs formulation:

\exp\left(\mathbf{W} \log(\mathbf{x} + \epsilon)\right)

using the same matrix \mathbf{W} as above. This (approximately) expresses the multiplication of the elements in \mathbf{x}. A linear combination of these addition and multiplication operations gated by a sigmoid unit (called a NALU in the paper) then can function as either an addition or a multiplication operation (which can be learned as appropriate). One can then stack these operations to express, in principle, arbitrarily complex arithmetic operations.
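The log-space trick and the gating can be sketched in a few lines of Python. For simplicity, the weight matrix here is fixed by hand to exact 0/1/-1 entries and the gate value is passed in directly instead of being computed by a sigmoid unit; both are simplifications of mine, as are the function names and the value of \epsilon. The inputs are assumed to be positive so that the logarithm is defined.

```python
import math

def linear(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def log_space_product(W, x, eps=1e-7):
    """exp(W log(x + eps)): weights of +1 multiply, weights of -1 divide."""
    logs = [math.log(xj + eps) for xj in x]
    return [math.exp(z) for z in linear(W, logs)]

def gated_unit(W, x, gate):
    """Blend the additive and multiplicative paths.

    gate lies in [0, 1]: 1 selects the additive path, 0 the multiplicative one.
    """
    add = linear(W, x)
    mul = log_space_product(W, x)
    return [gate * a + (1.0 - gate) * m for a, m in zip(add, mul)]

x = [3.0, 4.0]
added = gated_unit([[1.0, 1.0]], x, gate=1.0)        # 3 + 4
multiplied = gated_unit([[1.0, 1.0]], x, gate=0.0)   # approximately 3 * 4
divided = gated_unit([[1.0, -1.0]], x, gate=0.0)     # approximately 3 / 4
```

Note that the same weights serve both paths: a row of ones means "add these inputs" on the additive path and "multiply these inputs" on the multiplicative path, and the gate picks between the two readings.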

This beautiful, simple idea apparently works fantastically well! I was quite impressed by the results in the paper. However, I would have liked to see (i) some results with more complex arithmetic operations than they report in the paper and also (ii) some results with tasks that do not have a strong arithmetic component to gauge how strong the introduced arithmetic inductive biases are. Again, the idea is to see whether, or how badly, the model fails when faced with a task without a strong arithmetic component. Ideally, we would hope that the model does not fail too badly in such cases.

Note: I will collect, and report here, examples of inductive biases, like the ones I discussed in this post, that I encounter in the literature, with brief descriptions of the bias introduced, how it is supposed to work and what kinds of problem it is intended to be applied to. To facilitate this, I tagged this post with the tag inductive biases and I will file similar posts under the same tag in the future.