A couple of months ago, OpenAI released a paper describing their latest language model, GPT-3. GPT-3 is distinguished from its predecessors by nothing other than its sheer scale: compared to its previous incarnations, it’s just a bigger language model trained with a bigger dataset (~1-2 orders of magnitude bigger in both model size and training data size). So, the paper is essentially an exercise in scaling. The main novel result in the paper is an impressive demonstration of the (in-context) few-shot learning abilities of such large-scale language models (it can be argued that even this main result is not entirely novel, as it was foreshadowed in some earlier language modeling work, e.g. see this and this). The paper reminded me, once again, of Philip Anderson’s famous More Is Different paper, where Anderson argues that quantitative changes in nature can sometimes lead to qualitative changes and that people (even scientists) don’t always appreciate the consequences of this fact enough. It was also inspiring for me to see all the amazing demos people have quickly built with GPT-3 and shared with the world (here is a nice collection of such demos as a Twitter thread).
In this post, I’d like to briefly discuss a few criticisms I had of the GPT-3 paper.
The first one is about the actual need for scale: i.e. whether they really needed to train a giant 175B-parameter model or not. Figure 3.1 in the paper (reproduced above) clearly shows that many of their smaller models were not trained to saturation; this figure also shows that the smaller models are actually more compute-efficient up to the total compute used for those smaller models. To me, this strongly suggests that they actually didn’t have to train a 175B-parameter model: a ~5B-parameter model would probably have performed just as well (if not better) if trained longer. This point was also noted by Graham Neubig on Twitter.
This renders all the figures in the paper showing model size on the x-axis and performance on the y-axis (which is most of the figures in the paper) a bit suspect in my mind, because the smaller models were not given the same amount of compute in those figures.
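To give a rough sense of what "the same amount of compute" would mean for a smaller model, here is a back-of-the-envelope sketch. It assumes the common rule of thumb that training compute is roughly 6 × (parameters) × (training tokens), and it assumes a training run of about 300B tokens for the large model; both numbers are assumptions for illustration, not figures taken from this post.

```python
# Back-of-the-envelope sketch: how many tokens could a ~5B-parameter model
# be trained on with the compute budget of a 175B-parameter model?
# Assumption: training FLOPs ~ 6 * N * D (a common rule of thumb).
# Assumption: the large model sees ~300B training tokens.

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6.0 * params * tokens

def tokens_for_budget(params: float, flops: float) -> float:
    """Number of training tokens a model of this size can see for a given budget."""
    return flops / (6.0 * params)

big_params = 175e9       # the 175B-parameter model
small_params = 5e9       # the hypothetical ~5B-parameter alternative
assumed_tokens = 300e9   # assumed token count for the large model

budget = training_flops(big_params, assumed_tokens)
small_tokens = tokens_for_budget(small_params, budget)
ratio = small_tokens / assumed_tokens

print(f"compute budget: {budget:.2e} FLOPs")
print(f"tokens for the small model at equal compute: {small_tokens:.2e}")
print(f"that is {ratio:.0f}x more training tokens than the large model saw")
```

Under these assumptions the ratio is simply 175/5 = 35, whatever the constant in the rule of thumb is. The sketch says nothing about whether the smaller model would actually keep improving over that many tokens; it only quantifies how unequal the compute budgets are in the size-vs-performance plots.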
So why did they train a 175B-parameter model then? One possibility is just because they could; they perhaps wanted to prepare this kind of infrastructure for projects down the line that actually do require models at this scale. A more sinister interpretation is that they want to commercialize this product at some point (this would be consistent with their CEO’s expressly stated objective of “capturing the light cone of all future value in the universe”) and a giant model is more “controllable” for this purpose: a client can easily put a 5B-parameter model on a few GPUs of their own to do inference and fine-tuning as they wish, but they can’t do this with a 175B-parameter model, making them more reliant on OpenAI’s specialized hardware.
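The hardware side of this argument can be made concrete with a small calculation. The sketch below counts only the memory needed to hold the weights, assuming 2 bytes per parameter (half precision) and a hypothetical 32 GB of memory per GPU; activations, optimizer state, and any fine-tuning overhead are ignored, so the real requirements would be higher.

```python
import math

# Assumptions (for illustration only):
BYTES_PER_PARAM = 2      # half-precision weights
GPU_MEMORY_GB = 32       # hypothetical per-GPU memory

def weights_gb(params: float) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM / 1e9

def gpus_needed(params: float) -> int:
    """Minimum number of GPUs needed just to hold the weights."""
    return math.ceil(weights_gb(params) / GPU_MEMORY_GB)

for name, params in [("5B", 5e9), ("175B", 175e9)]:
    print(f"{name}: {weights_gb(params):.0f} GB of weights, "
          f"at least {gpus_needed(params)} GPU(s)")
```

With these numbers, the 5B-parameter model's weights fit on a single such GPU, while the 175B-parameter model needs at least eleven of them before any computation happens.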
A second difficulty with the paper for me was my constant struggle to understand to what extent the model was doing abstraction (or generalization) vs. rote memorization. In other words, to what extent can the impressive-looking results from the model be attributed to the sheer size of the training data vs. the abstraction capacity of the model? To understand this better, it would have been extremely useful if, at least for a subset of the tasks and examples, the authors had shown the embedding-space nearest neighbors of a given query among the training data, but surprisingly they never do this in the paper (I don’t suppose this would be technically more challenging than running a search over the input space, which they do multiple times in the paper). If these nearest neighbors are intuitively highly similar to the query and the model’s outputs more or less resemble the actual continuations of these nearest neighbors (say, with simple substitutions), that would favor a dataset-size-based explanation for the performance of the model.
They do try to rule out the rote-memorization-based explanation in some of their experiments, but these were not entirely convincing for me. For example, in the arithmetic tasks, they look for patterns of the form “<NUM1> + <NUM2> =” and “<NUM1> plus <NUM2>” in their training data to investigate if the model is just memorizing these arithmetic equations. They find only a small number of matches, concluding that a rote memorization strategy seems unlikely. But the problem here is that these are just two of the almost endless ways the same arithmetic equations could be encoded in the training data (note that their training data includes a snapshot of the entire world wide web, which is a really, really big place!): e.g. “<NUM1> <NUM2>”, “<NUM1> & <NUM2>”, “<NUM1> | <NUM2>”, “<NUM1> p <NUM2>”, “<NUM1> pl. <NUM2>”, “<NUM1> || <NUM2>”, etc.
Here, again, it would have been much more meaningful if they showed us some nearest neighbor retrievals instead.
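The difference between the two kinds of search can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the six-line corpus is invented, the regular expressions are my own reading of the two surface forms mentioned above, and the "embedding" is just a bag of character trigrams standing in for real model embeddings (which is what one would actually want to use).

```python
import math
import re
from collections import Counter

# Invented toy corpus; a real check would run over the training data.
corpus = [
    "48 + 76 = 124",
    "48 plus 76 makes 124",
    "48 & 76 -> 124",
    "48 pl. 76: 124",
    "The weather in Paris was mild in 1976.",
    "She bought 48 apples at the market.",
]

def literal_matches(docs, a, b):
    """Find docs containing the two surface forms 'a + b =' or 'a plus b'."""
    patterns = [
        re.compile(rf"\b{a}\s*\+\s*{b}\s*="),
        re.compile(rf"\b{a}\s+plus\s+{b}\b"),
    ]
    return [d for d in docs if any(p.search(d) for p in patterns)]

def embed(text, n=3):
    """Toy embedding: character n-gram counts (stand-in for model embeddings)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbors(query, docs, k=3):
    """Return the k docs most similar to the query under the toy embedding."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

exact = literal_matches(corpus, 48, 76)
neighbors = nearest_neighbors("48 + 76 =", corpus, k=3)

print("literal pattern matches:", exact)
print("nearest neighbors:      ", neighbors)
```

On this toy corpus the literal patterns find only the two lines written in exactly those forms, while the similarity search also surfaces a differently encoded version of the same equation. With real model embeddings and the real training data, inspecting such neighbors (and their continuations) is the kind of evidence that would help separate memorization from abstraction.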
So, where do we go from here? Is training ever bigger language models on ever bigger training data the way forward for an ever more general kind of intelligence? I don’t think so. One immediate difficulty is that, unlike compute, it is hard to imagine how the training data can be increased by another couple of orders of magnitude. As mentioned above, their training data already includes a snapshot of the entire web (and then some). Perhaps more book datasets can be added to the training data, or some improvements can be made in data quality through better cleaning of the web data (which is, in itself, a significant challenge), but I just don’t see how these could add up to an increase of a few orders of magnitude in the effective data size.
In my view, a much more promising route would be to try to add some sort of grounding to these language models, e.g. through pictures or videos from the web. I think grounding is crucial for models to have a better understanding of the world, and anecdotal evidence from human experience suggests to me that these models perhaps wouldn’t need nearly as much grounding experience as they need text data to achieve a reasonably good grounded understanding of the world. This is because it seems to me that we humans acquire most of our grounding early in our development through interactions with a fairly limited environment, and acquire pretty much all the rest of our knowledge only indirectly, through social and cultural means, for example, by learning things from other people, or by reading about them in books, articles, web pages etc. (Anthropologist Joe Henrich makes a similar point in his book The Secret of Our Success). Current language models already seem to be highly efficient at extracting information from extremely large-scale text data. To complement this already super-human ability, finding good grounding objectives and grounding data for training large-scale grounded language models would be a very promising and exciting direction, I think (see this, this, and this for some recent attempts in this direction).
Update (09/04/2020): I apparently missed this earlier, but OpenAI has made its intention to make GPT-3 a commercial product very clear right from the beginning (see here). They even mention the large size of the model as an excuse not to release it:
… many of the models underlying the API are very large, taking a lot of expertise to develop and deploy and making them very expensive to run.
So, it seems like my sinister interpretation above of why OpenAI trained a much larger model than was actually warranted was not too far off the mark!