Meta AI recently released a new language model called LLaMA. And by “released a model”, I mean “didn’t really release a model”. They released a really, really nice form instead, which you can fill out, and then Meta will get back to you after snooping on you just to make sure you haven’t been naughty recently (did I mention the form is really nice, and that it’s public: EVERYBODY can fill out the form). Presumably, no weights for you (or just random weights for you) if they find out you have been a bit too naughty for their liking.
Anyway. So, these LLaMAs come in four different sizes: from 6.7B parameters (smol) to 65.2B parameters (chonky). The largest two models are trained for 1.4T tokens, whereas the smaller ones are trained for 1T tokens (not really sure why). This works out to roughly one epoch (effectively) over the training data. The largest model roughly follows the Chinchilla compute-optimal recipe. There’s nothing the least bit remarkable about the models or the training setup; it’s just a standard GPT-type model trained in the standard way. The training data is said to be all public, although I didn’t check this carefully for myself (one hopes that it’s not public in the Meta sense of public. Just kidding, but not really).
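As a quick aside, here’s what “roughly Chinchilla compute-optimal” means concretely for the 65B model. This is just a back-of-envelope sketch using the commonly cited ~20 training tokens per parameter rule of thumb attributed to the Chinchilla paper; it’s a heuristic I’m assuming here, not a number taken from the LLaMA paper itself:

```python
# Back-of-envelope check of the "roughly Chinchilla compute-optimal" claim for
# the largest model, using the commonly cited ~20 training tokens per parameter
# rule of thumb (an assumption, not a figure from the LLaMA paper).
params = 65.2e9          # LLaMA-65B parameter count
tokens_trained = 1.4e12  # tokens the 65B model was actually trained on

chinchilla_optimal = 20 * params  # ~20 tokens per parameter
print(f"Chinchilla-optimal tokens: ~{chinchilla_optimal / 1e12:.2f}T")  # ~1.30T
print(f"Actually trained on:        {tokens_trained / 1e12:.2f}T")      # 1.40T
```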
The money figure in the LLaMA paper (for me) is the one below, showing the training loss curves for the four models (Figure 1):
As you can see, there’s no apparent saturation for the 7B and 13B parameter models. In fact, the training loss seems to be decreasing at roughly the same rate for all four models after around 300B tokens. Seeing this figure, one is immediately overcome by a sense of déjà vu: this is the GPT-3 paper all over again, with its severely (criminally!) undertrained small models.
From the above figure, it looks distinctly possible (and indeed, I would say, quite likely) that had the smallest two models been given the same amount of compute as the 65B parameter model, they would probably have matched or even surpassed it. Giving them the same amount of compute would mean training the 7B parameter model ~12.5x longer and the 13B parameter model ~7.6x longer (I calculated these numbers from the corresponding GPU-hours reported in Table 15 of the paper). Here’s what the training loss curves might have looked like in that scenario (you can click on the image for an enlarged view):
See just how much longer you would have to train the small models to match the compute given to the largest model? Now, you may laugh at my dumbass hand-drawn training loss curves, but I would submit to you that these dumbass hand-drawn curves are in fact much more rigorous than the dumbass “scaling laws” some really smart people came up with. My dumbass hand-drawn curves are also completely harmless, unlike the dumbass “scaling laws”, which had the overall effect of wasting a huge amount of resources and making these models much less accessible than they could have been.
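Incidentally, you don’t even need to dig the GPU-hours out of Table 15 to ballpark those ratios. The standard C ≈ 6ND approximation for training FLOPs gets you numbers in the same neighborhood; here’s a minimal sketch (it comes out at ~13.6x and ~7.0x rather than the GPU-hour-based ~12.5x and ~7.6x, presumably because GPU utilization isn’t constant across model sizes, but the point stands):

```python
# Rough sanity check of the ~12.5x / ~7.6x figures, using the standard
# C ≈ 6 * N * D approximation for training FLOPs (N = parameters, D = tokens)
# instead of the GPU-hours reported in Table 15. Approximate by construction.

def train_flops(params, tokens):
    return 6 * params * tokens

c_65b = train_flops(65.2e9, 1.4e12)  # compute spent on the 65B model

for name, params, tokens in [("7B", 6.7e9, 1.0e12), ("13B", 13.0e9, 1.0e12)]:
    ratio = c_65b / train_flops(params, tokens)
    print(f"{name}: ~{ratio:.1f}x longer training to match the 65B compute "
          f"(i.e. ~{ratio * tokens / 1e12:.1f}T tokens)")
```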
Anyway. So, I’m trying to find a non-cynical explanation for this almost bizarre, persistent unwillingness to train small models for longer, but I can’t really find a convincing one. Training a humongous model for only a single epoch over your training data is a practice that, to my knowledge, doesn’t really exist anywhere else in machine learning. Take this CoCa paper, for instance (which is ~sota on ImageNet as of this writing): it trains a ~2.1B parameter model on a billion-scale image-text dataset (~5B examples in total) for ~7 epochs (effectively).
Of course, I don’t believe for a second that the people training these giant language models are actually dumb or ignorant, although from my experience in academia, I could probably justifiably claim that they might be a bit too credulous: you can make a surprisingly large number of people in these circles believe some really dumb shit if it’s said or done by a sufficiently high-prestige individual or individuals (just look at the insane “superintelligence” stuff, to give an example).
Anyway. So, my cynical interpretation? As I argued here before, trying to make these models less easily accessible and less easily controllable by others might be a feature, not a bug. I don’t believe, for instance, that OpenAI is really using a 175B parameter model for ChatGPT or for their other language products (here is an interesting analysis I saw recently that makes the same point, with some caveats), but they have an incentive to make people believe that they’re using a 175B parameter model and that it’s actually critical to use a giant model like that.
Last but not least, one final life lesson from all this, folks: whenever a theoretical physicist starts to talk about power laws, just completely ignore them (and I really mean completely) and immediately run away in the opposite direction. It is my contention that nothing good has ever come out of a physicist blabbering about power laws.
Maybe the loss curves would level out significantly above those of the giant models. Maybe. But the fact that they didn’t even try is extremely suspicious. This goes far beyond the usual anti-curiosity that has infected academia over the past few decades.