The following are some short notes on OpenAI’s Image-GPT paper, which is, in my opinion, one of the most important papers to have come out in recent years.
The motivating question behind this paper is this: can likelihood-based generative pre-training lead to strong transfer learning results in computer vision? This question is inspired by the success of the same technique in NLP (where it’s commonly known as language modeling). In computer vision, on the other hand, successful transfer has so far been achieved mostly through other (non-generative) pre-training objectives, such as supervised pre-training (on ImageNet etc.) or, more recently, self-supervised pre-training (MoCo, SimCLR, BYOL, etc.). This raises the interesting question of whether there might be some fundamental difference between language and vision tasks that makes these different methods more appropriate for their respective domains. The Image-GPT paper answers this question in the negative, showing for the first time that likelihood-based generative pre-training can also lead to very strong transfer learning results in vision, provided that we use the right kind of architecture (GPT) at the right scale.
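To make “likelihood-based generative pre-training” concrete, here is a minimal sketch (my own illustration, not the paper’s implementation) of the autoregressive objective: an image is flattened into a 1-D sequence of discrete tokens, and the model is trained to minimize the negative log-likelihood of each token given all preceding ones. The toy token values and random logits below are hypothetical stand-ins for a real model’s outputs.

```python
import numpy as np

def autoregressive_nll(logits, tokens):
    """Negative log-likelihood of a token sequence under a model's
    next-token predictions.

    logits: (T, V) array; logits[t] are the model's scores for position t
            given tokens[:t] (i.e., predictions are already shifted).
    tokens: (T,) array of integer token ids in [0, V).
    Returns the mean cross-entropy over positions, following the
    factorization log p(x) = sum_t log p(x_t | x_<t).
    """
    # log-softmax, computed stably by subtracting the row-wise max
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(tokens)), tokens].mean()

# Toy example: a 2x2 "image" quantized to a 4-color palette and
# flattened in raster order (values are illustrative only).
tokens = np.array([3, 0, 1, 2])
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4))  # stand-in for a GPT's output logits
loss = autoregressive_nll(logits, tokens)
```

The same loss, applied to text tokens instead of pixel tokens, is exactly the standard language modeling objective, which is the point of the paper’s transfer question.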
The other main interesting result in this paper is that the very same GPT architecture is shown to perform well on both language prediction and image prediction tasks. This suggests that these (and similar) tasks, despite their many superficial differences, may share a deep common computational core (something very general, like: predict given as much context as possible) and can be solved effectively by the same kind of computational architecture. I think this observation has important implications for the brain and evolution. To me, one of the things it suggests is that whatever inductive biases our brains (more specifically, our neocortex) might have, they’re probably not the very domain-specific biases many psychologists seem to believe in (e.g. specific biases about visual objects, agents, or language). Rather, they’re probably much more generic biases, having to do with better information processing in some very general sense: being able to integrate over a much longer context, or being able to do better deep credit assignment (i.e. dealing with vanishing/exploding gradients), etc. It is important to emphasize here that the GPT architecture itself embodies no language-specific or vision-specific inductive biases whatsoever.
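The domain-agnosticism can be made concrete: from the model’s perspective, an image and a sentence are both just sequences of integer ids. Below is a rough sketch of Image-GPT-style preprocessing (the paper downsamples images and quantizes colors to a small palette; the 2-color palette here is a toy stand-in, not the paper’s actual palette). The resulting sequence has the same interface a text tokenizer presents to a GPT.

```python
import numpy as np

def image_to_sequence(image, palette):
    """Quantize an RGB image to its nearest palette colors and flatten
    it in raster order, yielding a 1-D sequence of integer token ids.

    image:   (H, W, 3) float array of RGB values
    palette: (K, 3) float array of reference colors
    """
    pixels = image.reshape(-1, 3)  # (H*W, 3), raster order
    # squared distance from every pixel to every palette color
    d = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)  # (H*W,) token ids

# Toy example with a 2-color palette (black and white).
palette = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
img = np.array([[[0.1, 0.0, 0.1], [0.9, 1.0, 0.8]],
                [[1.0, 0.9, 1.0], [0.0, 0.1, 0.0]]])
seq = image_to_sequence(img, palette)  # array([0, 1, 1, 0])
```

Once images look like this, the downstream model code is literally identical to the language case, which is what makes the shared-architecture result possible in the first place.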
This idea also accords well with the sources of recent progress in machine learning. When I look at what drives significant architectural progress in machine learning today, most of the time it’s somebody proposing a solution to a very generic information processing problem: in ResNets, solving an optimization problem (vanishing/exploding gradients); in transformers, getting rid of the serial processing bottleneck of RNNs, making it feasible to integrate over a much longer context; in batch normalization, dealing with the covariate shift problem during training; etc. Certainly, biological evolution doesn’t have to respect the same rules as human innovation, but to me at least this suggests that there’s more bang for the buck in targeting these general information-processing problems than in targeting more domain-specific issues, which in turn makes it more plausibleible that evolution may also be primarily targeting the same general issues.
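The ResNet example can be illustrated with a toy calculation (mine, not from any of these papers). With many stacked layers, the chain rule multiplies each layer’s local derivative, so small slopes shrink the gradient geometrically; with a residual connection x → x + f(x), each factor becomes 1 + (small term), and the identity path keeps the gradient alive.

```python
def plain_grad(depth, layer_slope):
    """d(output)/d(input) for y = f_depth(...f_1(x)...) when every
    layer has local derivative layer_slope (chain rule: a product)."""
    return layer_slope ** depth

def residual_grad(depth, layer_slope):
    """Same, but each layer is x -> x + f(x): each chain-rule factor
    is 1 + layer_slope, the 1 coming from the identity skip path."""
    return (1.0 + layer_slope) ** depth

# With 50 layers of slope 0.1, the plain gradient vanishes (0.1^50 = 1e-50),
# while the residual gradient stays well above zero.
g_plain = plain_grad(50, 0.1)
g_res = residual_grad(50, 0.1)
```

This is of course a caricature (real Jacobians are matrices, not scalars), but it captures why the skip connection is a fix for a generic optimization problem rather than anything vision-specific.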
One final interesting result in the Image-GPT paper is that, even at the same validation loss on the generative pre-training task (i.e. ImageNet modeling), bigger models show better transfer learning performance (Figure 3 in the paper). This is interesting in light of my criticism of the GPT-3 paper, where differently sized models were not given the same amount of compute, and it seemed likely that the smaller models would reach the same (or maybe even better) validation loss as the largest 175B-parameter model if they were. The results in the Image-GPT paper suggest that even in that case the larger models might have retained an advantage in transfer performance. But it would have been much better if, just like the Image-GPT paper, the GPT-3 paper had actually carried out this important experiment, to see whether the larger models have a transfer advantage above and beyond what can be accounted for by validation loss (or compute) alone.