FLOPs are all you need: a conjecture about what really makes deep learning work

One of the most interesting papers I have read recently is “Scaling MLPs: A Tale of Inductive Bias” by a group of researchers at ETH Zurich. This paper made me realize, or at least brought to the foreground of my consciousness, something I had previously been aware of only vaguely: pure MLPs are severely compute-starved (i.e. FLOPs-starved) compared to today’s more “successful” deep learning architectures (e.g. convnets, transformers, MLP-mixers, or gMLPs).

This comes about because these “successful” architectures do a hefty dose of parameter sharing. Given a fixed amount of GPU memory, there is a trade-off between memory for parameters and memory for intermediate computations (i.e. activations). Parameter sharing is a great way to reduce the memory footprint of the parameters, which opens up a lot more room for intermediate computations. MLPs, on the other hand, don’t do any parameter sharing, which limits the amount of intermediate computation they can perform within a fixed memory budget. I wasn’t quite aware of the scale of this problem for MLPs before reading this paper. Table 5 in the Appendix of the paper, in particular, has a very informative comparison of MLPs vs. “successful” modern deep learning architectures in this respect:
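To make the “FLOPs per parameter” point concrete, here is a minimal back-of-the-envelope sketch (the layer sizes are made up for illustration and are not taken from the paper): a dense layer uses each weight exactly once per forward pass, while a convolutional layer reuses each weight at every spatial position.

```python
# Back-of-the-envelope FLOPs and parameter counts (made-up layer sizes, not from the paper).

def dense_stats(d_in, d_out):
    params = d_in * d_out                     # one weight per (input, output) pair
    flops = 2 * d_in * d_out                  # each weight is used once: one multiply-add
    return params, flops

def conv_stats(h, w, c_in, c_out, k):
    params = c_in * c_out * k * k             # the kernel is shared across spatial positions...
    flops = 2 * h * w * c_in * c_out * k * k  # ...but applied at every one of the h*w positions
    return params, flops

p_mlp, f_mlp = dense_stats(64 * 64 * 3, 1024)   # dense layer on a flattened 64x64x3 image
p_cnn, f_cnn = conv_stats(64, 64, 256, 512, 3)  # 3x3 conv on a 64x64 feature map

print(f"dense: {p_mlp/1e6:.1f}M params, {f_mlp/1e6:.0f} MFLOPs/input, {f_mlp/p_mlp:.0f} FLOPs per param")
print(f"conv : {p_cnn/1e6:.1f}M params, {f_cnn/1e9:.1f} GFLOPs/input, {f_cnn/p_cnn:.0f} FLOPs per param")
```

The dense layer gets 2 FLOPs out of each parameter; the conv layer gets 2 × 64 × 64 ≈ 8,000. At a fixed parameter (and hence parameter-memory) budget, the shared-weight layer therefore gets to spend thousands of times more FLOPs on each input.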

The first row here is a pure MLP model and the other rows are “successful” modern deep learning architectures (a convnet, a transformer, and an MLP-mixer, respectively), all with roughly the same number of parameters. Unlike the pure MLP, these three “successful” architectures do a lot of parameter sharing. Now look at the FLOPs (i.e. the number of intermediate computations) performed by these models, shown in the last column. The MLP does over two orders of magnitude fewer FLOPs on the input than the other models! That’s a massive difference.

My conjecture, then, is that the number you see in the last column is basically the only thing that determines whether an architecture will be successful or not (i.e. whether it will be highly performant across a range of domains). You just need to be able to perform a certain number of computations on the input; all other low-level details, inductive biases, etc. are basically irrelevant. I conjecture that this is also the reason why pure MLPs are not successful at sufficiently large-scale problems today: they fail to perform enough computations on the input.

This is a very bold conjecture (bold because it’s quite likely to be too strong, hence likely to be wrong as stated), but if it is roughly correct, I make two main predictions:

1) The particular way parameters are shared in a model should be basically irrelevant. The main point of parameter sharing is rather to reduce the memory footprint of the parameters on the GPU; exactly how you do it shouldn’t matter much. Some modern architectures like MLP-mixers and gMLPs already share parameters in very “weird” ways, but if I’m right, even weirder ways of sharing parameters (e.g. sharing parameters more randomly) should work more or less OK as well, provided that enough sharing is done to allow a certain minimum number of FLOPs on the input (see the toy sketch after this list).

2) The current failure of pure MLPs to be performant at large-scale problems is only temporary. There will come a time when GPUs have enough memory to allow MLPs to perform the minimum requisite number of FLOPs on the input even without any parameter sharing (this may already be possible with parallelism at industrial scales of compute, but it’s not something I can test at the piddling academic scales of compute I have access to at the moment). I predict that MLPs may even turn out to be more compute-efficient than current parameter-sharing architectures (there’s already a strong hint of this in the scaling experiments in section 4.4 of the “Scaling MLPs” paper): because they don’t perform the same type of computation over and over again at different places, MLP FLOPs may be fundamentally less redundant than transformer FLOPs or convnet FLOPs. More concretely, this would mean that in the table above MLPs might not have to go all the way up to ~10 GFLOPs like the other architectures; perhaps they would already become competitive at ~1 GFLOP or so, whereas the current <100 MFLOPs per input just doesn’t seem to be enough.
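To make prediction 1 a bit more tangible, here is a toy sketch of what an intentionally “weird” sharing scheme could look like: a linear layer whose full weight matrix is filled in by randomly indexing into a much smaller bank of trainable parameters. This is purely illustrative (the class name, sizes, and the PyTorch framing are mine, not the paper’s); the point is only that the sharing pattern can be essentially arbitrary while the trainable-parameter footprint stays small.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomlySharedLinear(nn.Module):
    """A d_out x d_in linear layer whose weights are drawn, via a fixed random
    index map, from a much smaller bank of trainable parameters: only the bank
    carries gradients and optimizer state."""

    def __init__(self, d_in, d_out, bank_size):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(bank_size) * d_in ** -0.5)
        # Fixed (non-trainable) assignment of each weight slot to a bank entry.
        self.register_buffer("idx", torch.randint(bank_size, (d_out, d_in)))
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        # Materialize the full weight matrix on the fly from the shared bank
        # (transient memory cost, but no extra trainable parameters).
        weight = self.bank[self.idx]           # shape: (d_out, d_in)
        return F.linear(x, weight, self.bias)

layer = RandomlySharedLinear(d_in=12288, d_out=1024, bank_size=100_000)
y = layer(torch.randn(8, 12288))               # batch of 8 flattened 64x64x3 images
print(y.shape)                                  # torch.Size([8, 1024])
```

A real implementation would presumably derive the index map from a hash function rather than storing it explicitly (storing int64 indices would defeat the purpose memory-wise), in the spirit of the hashing-trick / HashedNets line of work; but if the conjecture is right, the specific choice of sharing scheme shouldn’t matter much.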

Update (08/15/23): It could be argued that in order to roughly match the FLOP count of MLPs to that of currently successful deep learning architectures, we might have to blow up their parameter count, which could lead to overfitting. This is certainly possible; however, I don’t really expect overfitting to be a major problem for sufficiently large-scale problems. It’s a bit unclear how large “sufficiently large-scale” has to be, but my guess is that the largest public datasets available today, such as LAION-5B, should be large enough for this purpose.
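For a rough sense of the parameter counts involved (my own back-of-the-envelope, ignoring biases and the cost of nonlinearities): in a pure MLP every weight participates in exactly one multiply-add per forward pass, so FLOPs per input is roughly twice the parameter count.

```python
# Back-of-the-envelope (my own, not from the paper): for a pure MLP,
# FLOPs per input ~ 2 * number_of_parameters, since each weight is used once.
def mlp_params_for(target_flops):
    return target_flops / 2

for target in (1e8, 1e9, 1e10):  # ~100 MFLOPs, ~1 GFLOP, ~10 GFLOPs per input
    print(f"{target/1e9:>5.1f} GFLOPs per input -> ~{mlp_params_for(target)/1e6:,.0f}M parameters")
```

So matching the ~10 GFLOPs regime of the table above would take an MLP with on the order of 5B parameters (and the ~1 GFLOP regime of prediction 2 roughly 500M), which is exactly why the overfitting concern comes up.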
