In my last post, I discussed two issues that are widely considered to be serious problems for deep learning models: generalization and few-shot learning (more specifically, meta-learning as a proposal for performing few-shot learning). I argued that these are only problems when we consider small models trained with very limited amounts of data. In this post, I’d like to give one more example of this kind of thing: compositionality or systematic generalization. I’ll again argue that this is only a problem when we consider small toy domains without a lot of structure. It’ll mostly cease to be a problem when we start thinking about the much richer structure of the world we live in, and of our bodies and minds (including our language) that inherit this richness.
There are by now probably more than a dozen benchmarks that evaluate slightly different notions of compositionality or systematic generalization: e.g., SCAN, gSCAN, CURI, COGS, PCFG SET, BabyAI, CLOSURE, SQOOP, etc., to name just a few that I’m most familiar with. A common feature shared by most of these benchmarks is that they take place in simple, toy domains without a lot of “affordances”, which necessarily restricts the abundance and richness of the linguistic and semantic/conceptual structures that can be created in them. Some of these benchmarks use natural language or something close to it (e.g., COGS, CFQ), so they don’t necessarily suffer from this particular shortcoming, although they may have other potential weaknesses, like not having a large enough training set or the target task involving a somewhat arbitrary and artificial semantic form (but this is a separate discussion).
For example, a common evaluation condition in these benchmarks is to generalize from just a handful of verb-modifier combinations (e.g., eat furiously and read furiously) to a novel combination (e.g., sleep furiously), where the novel verb (sleep) is assumed to be learned from other contexts and these few verbs are usually the only items of their kind in the domain (e.g., the only actions). But why do we even expect something like this to work? The world we live in, the world inside our minds (our conceptual world), and our language are nothing like this barren landscape.
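To make this setup concrete, here is a minimal sketch of the kind of held-out split these benchmarks use. The vocabulary here is a hypothetical stand-in, not code from any actual benchmark:

```python
# A minimal SCAN-style compositional split: every verb-modifier pair
# appears in training except one, which is held out for testing.
from itertools import product

verbs = ["eat", "read", "sleep"]      # the toy domain's only actions
modifiers = ["furiously", "quietly"]

held_out = ("sleep", "furiously")     # the novel combination

train = [f"{v} {m}" for v, m in product(verbs, modifiers)
         if (v, m) != held_out]
test = [" ".join(held_out)]

print(train)  # sleep appears in training, but never with furiously
print(test)   # ['sleep furiously']
```

A model that succeeds on the test item has to compose a verb and a modifier it has never seen together, with only two other verbs to generalize from.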
When we infer the meaning of a novel combination like sleep furiously, we don’t just have two other actions, eat and read, to rely on. Instead, we have an immensely rich, interconnected web of concepts that we bring to bear on this task. An average English speaker knows tens of thousands of words, and our conceptual world is presumably much richer than this number would indicate, because there are no single words for many of our concepts and some of our concepts are altogether difficult to articulate precisely in language. But more than its sheer size, what gives this conceptual web its true richness and power is its highly interconnected and structured nature. For example, among the dizzying, almost stupefying range of things we know about sleeping is the fact that it can sometimes involve restless states, wild movements, hellish nightmares, intense dreams, loud snoring, etc., which are all associated with the concept of fury, or the state of being furious, through various more or less circuitous conceptual routes, so we could easily imagine what it would be like to sleep furiously by tracing these routes, even if we were hearing this particular combination for the first time.
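These associative routes are exactly the kind of structure that distributed representations pick up from data. As a rough illustration (this assumes gensim is installed and downloads pretrained GloVe vectors; the bridge words are just examples, not a claim about any particular model):

```python
# Sketch: check whether "bridge" concepts linking sleep to fury sit
# close to both words in a pretrained word embedding space.
import gensim.downloader as api

vecs = api.load("glove-wiki-gigaword-50")  # downloads ~66 MB of vectors

for bridge in ["nightmare", "restless", "snoring", "dream"]:
    if bridge in vecs:  # guard against out-of-vocabulary words
        print(f"{bridge}: sim(sleep)={vecs.similarity('sleep', bridge):.2f}, "
              f"sim(fury)={vecs.similarity('fury', bridge):.2f}")
```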
And when applied at scale, neural networks are in fact remarkably good at capturing and utilizing these kinds of associations to make sense of novel combinations. Recent large-scale deep learning models like DALL-E and GPT-3 are very good demonstrations of this, in my view. Look at the remarkable agility and accuracy with which DALL-E seems to make sense of novel combinations like “a store front that has the word ‘openai’ written on it” (we know that this is a novel combination because it doesn’t exist in the real world):
Or consider this utterly mind-blowing demonstration of the compositional skills of GPT-3 (source):
In one example, US poet Andrew Brown showed the power of GPT-3, tweeting that he’d given the programme this prompt: “The poetry assignment was this: Write a poem from the point of view of a cloud looking down on two warring cities. The clever student poet turned in the following rhyming poem:”
“I think I’ll start to rain,
Because I don’t think I can stand the pain,
Of seeing you two,
Fighting like you do.”
And even in the simpler, toy domains that common compositionality benchmarks often focus on, there’s some recent evidence suggesting that simply scaling up the size and diversity of these domains can solve many of the splits that may seem superficially challenging in smaller-scale versions (e.g., Kagitha, 2020; Hill et al., 2020).
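A rough sketch of what such a scaling experiment looks like: vary the number of distinct primitives in the toy domain and measure generalization to held-out combinations. Here train_and_eval is a hypothetical stand-in for whatever learner is being tested, not an actual function from these papers:

```python
# Generate compositional splits of increasing size/diversity to test
# whether held-out combinations get easier as the domain grows.
import random

def make_split(n_verbs, modifier="furiously", n_held_out=1, seed=0):
    rng = random.Random(seed)
    verbs = [f"verb{i}" for i in range(n_verbs)]   # synthetic vocabulary
    held_out = set(rng.sample(verbs, n_held_out))
    train = [f"{v} {modifier}" for v in verbs if v not in held_out]
    test = [f"{v} {modifier}" for v in held_out]
    return train, test

for n in [3, 10, 100, 1000]:
    train, test = make_split(n)
    # accuracy = train_and_eval(train, test)  # hypothetical learner
    print(f"{n} verbs -> {len(train)} training items, {len(test)} test items")
```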
It could be argued that these models require too much data to achieve these compositional skills, hence they’re not nearly as sample-efficient as humans, for instance. Therefore, the argument goes, the main goal of this field should be to come up with useful inductive biases that would improve the sample efficiency of models in acquiring these compositional generalization abilities. But these kinds of comparisons with humans are a bit misleading, to my mind, because of the radically different nature of the inputs that humans receive (e.g., multimodal, embodied, and embedded in a much richer world). Perhaps the seemingly greater data demands of these models are simply an illusion created by the fundamentally different nature of their inputs.