Does Sora understand physics? A few simple observations

I’m a bit late to the fray as usual, but I wanted to write a short post about Sora. Sora is OpenAI’s new video generation model. As of this writing, it’s still not open to the public, so all we’ve got so far is some high-level information about the model and some generated samples shared by OpenAI in a blog post. The samples look impressive in their visual quality and their apparent realism; however, most of the videos seem to contain pretty glaring physical inaccuracies that are easy to detect when one looks at the details a bit more carefully (e.g. objects merging into each other and then unmerging, objects spontaneously disintegrating or disappearing, objects spontaneously changing their features, etc.). This prompted some to question whether (or to what extent) Sora really understands physics and, even further, whether it’s possible to understand physics at all by, effectively, just learning to predict pixels over video clips (which is, at a high level, what Sora does). As my humble contribution to this assize, I’d like to make a few very simple observations. Most of these are probably obvious to anybody who knows anything about anything (or to somebody who knows something about something), but I happen to belong to that rarefied species that finds prodigious value in stating the obvious from time to time, so here we go:

1) There’s an important distinction between “understanding physics” and being able to generate physically accurate videos. Although the model might struggle with generating physically highly accurate videos, it might still be able to reliably recognize that there’s something “weird” going on in physically inaccurate videos. This is roughly the difference between recognition and generation (or the difference between recognition and recall in memory retrieval). The latter is generally harder. So, a potentially more sensitive way to test the model’s understanding of physics would be to run carefully controlled recognition tests, as is typically done in intuitive physics benchmarks, for instance (I sketch what such a test might look like right after these observations).

2) People’s understanding of physics seems to be mostly of this “recognition” variety too (rather than the “generation” variety). People don’t really have a very accurate physics engine inside their heads that they can use to simulate physically highly accurate scenarios (cf. Marcus & Davis, 2013; Davis & Marcus, 2015; Ludwin-Peery et al., 2021). This is why this capability is often properly described as intuitive physics as opposed to actual physics (or similar).

3) People can also generate fictitious, physically highly implausible or even impossible scenarios in their imagination with remarkable ease and ingenuity (and they have been doing this since time immemorial). Cartoons, fairy tales, fantasies, legends, etc. are full of such examples: levitating creatures, objects passing through solid walls, objects melting or disintegrating into pieces and then regrouping again, etc.

Casper the Friendly Ghost

4) For related reasons, you also do NOT want a video generation model that only generates physically highly accurate videos. You want something that can bend or break physics, ideally in a precisely controllable way (possibly based on the textual prompt, for instance).

5) We know nothing about the distribution of the videos Sora was trained on. Almost certainly, a subset of its training data consists of CGI, digitally edited, or animated videos depicting physically implausible or impossible scenarios (we don’t know how large this subset is). So, part of the reason why Sora sometimes generates physically implausible or inaccurate videos may be traced back to this subset of its training data.

6) Even granting the previous point, however, some of the generated samples seem to show clear signs of gross errors or inaccuracies in whatever physics engine Sora has managed to learn by watching videos. Consider this generated video of wolf pups frolicking, for example. Why do inaccuracies like this arise in the first place and how might they be remedied or ameliorated? At the risk of sounding like a man with a hammer seeing nails everywhere, I will suggest that many of the inaccuracies like this particular one are “granularity problems” that will be fixed when Sora can model videos at a sufficiently fine granularity (both spatially and temporally). For example, this particular scene with wolf pups frolicking is a highly complex, dynamic scene and accurately generating a scene like this requires very fine-grained individuation and tracking of multiple objects. In the absence of this level of granularity, the model instead generates something more coarse-grained, freely merging and unmerging objects in physical proximity without regard to correctness in details, but capturing the overall gist, the gestalt (or “texture”) of the action in the scene, somewhat analogous to how we see things in our visual periphery.
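Coming back to observation 1: here is a minimal sketch of what such a recognition test could look like, in the spirit of intuitive physics benchmarks that pair matched plausible and implausible stimuli. Everything here is hypothetical (Sora exposes no public scoring interface), so `score_video` is just a stand-in for whatever scalar plausibility signal one could extract from a video model (a log-likelihood, a reconstruction error, a learned critic, etc.):

```python
def recognition_test(model, paired_videos, score_video):
    """Two-alternative forced choice over (plausible, implausible) video pairs.

    paired_videos: list of (plausible_clip, implausible_clip) tuples, where the two
    clips in each pair are matched except for a physical violation (e.g. an object
    passing through a solid wall). score_video(model, clip) is a hypothetical
    function returning a scalar plausibility score for a clip.
    """
    correct = 0
    for plausible, implausible in paired_videos:
        # The model "passes" a trial if it scores the physically plausible clip
        # higher than its physically implausible counterpart.
        if score_video(model, plausible) > score_video(model, implausible):
            correct += 1
    return correct / len(paired_videos)  # chance level is 0.5
```

A model could do quite well on a test like this while still producing glaring artifacts when asked to generate videos itself, which is exactly the recognition/generation gap that observation 1 is pointing at.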

Update: After writing this post, I saw this thoughtful and much more detailed post on Sora by Raphaël Millière, which I recommend as well.

Intelligence is a granularity problem (or: reality has a surprising amount of detail, and so must intelligence)

One of the recurring themes in Hans Moravec’s prescient book, Robot: Mere Machine to Transcendent Mind (first published in 1999), is how practically important problems (e.g. agile robot navigation in the real world) become tractable more or less automatically once the amount of widely accessible compute reaches a soft threshold. Before this threshold is reached, people try to come up with all sorts of ingenious ideas and clever tricks to squeeze the last bit of performance out of the available compute. In the long run, this almost always proves totally unproductive, basically a complete waste of time: the most straightforward, simplest, "brute-force", "dumb" method to solve the problem turns out to work just fine once the available compute reaches the requisite threshold, whereas the "ingenious" tricks almost invariably do not scale nearly as well with compute. This is, of course, another version of Rich Sutton’s famous Bitter Lesson.

The main reason problems become tractable only at particular compute scales is that their solution requires a minimum level of granularity or detail to be modeled. And most of the fundamental, practically important computational problems we face in the real world need a very high degree of granularity for their solution. The main reason for this, in turn, is that reality has a surprising amount of detail and these details are often very important.

Here’s an illustrative example from the book showing 3D maps of two similar visual scenes generated by essentially the same “dumb” (but scalable) mapping algorithm 18 years apart:

18 years of steady increase in the amount of widely available compute finally made real-world robot navigation a reality.

With the drastic increase in the available compute over those 18 years, it became possible to extract many more features from the scene and to estimate their locations at a much higher resolution. This much finer granularity in 3D mapping is what finally enabled acceptably good robot navigation in the real world.

Here are some other examples of this phenomenon:

  • Visual object recognition: You can’t do fine-grained real-world object recognition with 8×8 images (nor even with 28×28 images). This is just too small to resolve the important details of many real-world objects. If the compute available to you only allows for the processing of such small images, I’m afraid you’re just going to have to wait until the compute catches up (much better to work on increasing the compute than to churn out cute little tricks that only work with 8×8 images!).
  • Chess: You can’t beat the world champion at chess if you can search the game tree only up to depth 3 or so. Beating the world champion at chess requires being able to search the game tree much more extensively, at sufficiently large depths and breadths. And in fact, "dumb" brute-force search combined with sufficient compute was basically how a computer program defeated a world champion at chess for the first time, although there have been some important developments in making search more efficient since then (e.g. MCTS).
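To put some very rough numbers on the chess example: with an average branching factor of about 35 legal moves per position (a commonly quoted figure, used here purely for illustration), the number of positions a full-width search has to examine grows exponentially with depth, so the depth you can afford is essentially a function of the compute you have:

```python
# Rough growth of a full-width game-tree search in chess. The branching factor of 35
# is a commonly quoted average; the exact value matters much less than the exponential
# growth of the search with depth.
BRANCHING_FACTOR = 35

for depth in (3, 6, 9, 12):
    nodes = BRANCHING_FACTOR ** depth
    print(f"depth {depth:2d}: ~{nodes:.1e} positions")

# depth  3: ~4.3e+04 positions
# depth  6: ~1.8e+09 positions
# depth  9: ~7.9e+13 positions
# depth 12: ~3.4e+18 positions
```

Depth 3 is within reach of almost any machine; the much deeper searches that world-championship-level play requires only became affordable once the available compute caught up.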

I believe this granularity problem also fundamentally underlies most cases where current AI methods don’t yet do well in a given domain, and that it will ultimately be overcome once the widely available compute allows for modeling at the requisite level of granularity in that domain, even without any fundamental improvements in the algorithms, just like in Moravec’s 3D mapping example above. To give a few further examples:

  • Robotics: I believe this is why robotics is still hard for AI. For example, fine-grained, dexterous control of robotic hands in the real world requires being able to learn high-dimensional, high-precision, complex temporal patterns (with lots of high-frequency components, for example, due to contacts), which, in turn, requires sufficiently big models trained with a sufficiently large amount of data. This fine-grained, high-dimensional, high-precision control problem is, in fact, presumably so hard that the sensory and motor cortices in the human brain allocate a disproportionately large amount of cortical space to the representation of hands, as illustrated by these cartoonishly grotesque figurines of cortical homunculi (as a side note, it seems to be generally accepted among evolutionary biologists that the evolution of upright posture and the subsequent freeing of the hands for the manufacture and manipulation of objects was indeed one of the main drivers of the rapid expansion of brain size in the genus Homo):
A disproportionately large amount of cortical real estate is allocated to the representation and control of hands in the human brain (source).
  • Data efficiency: I believe that this granularity problem is also (at least partly) behind the apparent data efficiency gap between current deep learning algorithms and humans. To give an example from the visual domain, the human retina contains something like 6M color-sensitive cone receptors very tightly concentrated within a few degrees around the fovea. By moving our eyes, we can resolve different objects or surfaces in a scene to a very high degree of precision. The most commonly used image size in computer vision today, on the other hand, is something like 310×256 pixels (for the entire image), which is about 0.08M pixels, or roughly two orders of magnitude lower resolution than the human retina (directly comparing the number of pixels in an image and the number of photoreceptors in the retina is a bit tricky, but I think it does make sense under fairly reasonable assumptions). My own recent work suggests that the apparent data efficiency gap in the visual domain between current deep learning algorithms and humans might be closed once we start to work with sufficiently large natural images, closer in size to the photoreceptor array in the human retina (~6MP), instead of using much smaller images, which is currently the norm.
  • Long-form video modeling: The granularity problem is the reason why long-form video modeling (long-form video understanding and generation) is still not there yet. Representing even very short clips without too much information loss requires a large number of visual "tokens". From my own work, for example, I know that even 1-second-long natural video clips require at the very least something like 4×16×16 discrete tokens (i.e. 4 tokens in the temporal dimension, 16×16 tokens in the spatial dimensions) in order to represent them faithfully enough. That is roughly 1K tokens. Scaling this up to a 1-hour-long video would require roughly 4M tokens (a quick back-of-the-envelope calculation for this and a couple of the other points here is spelled out after this list). It is not possible to train a large GPT model with a 4M token context length at the moment (not even for big industry labs), but as surely as the sun will rise tomorrow, this will be eminently feasible in the not too distant future, and at that point AI models will be able to understand and generate long-form videos (e.g. films) at least as well as humans, but orders of magnitude faster (it will be a very wild world when AI models can generate entire films in a matter of minutes or seconds).
  • Text, hands, faces: The granularity problem is the reason why generative vision models had problems with creating realistic text, hands, or faces in images, until very recently. These categories of objects all involve a large amount of fine-grained visual detail that needs to be represented and modeled in order to generate and recognize them accurately.
  • Developing and understanding large, complex software projects: Such projects often involve large codebases and their corresponding documentation (perhaps also including auxiliary information such as issues, pull requests, etc.). Similar to the case of long-form video modeling above, it is not yet feasible to train large GPT models with a large enough context size to cover all of the relevant pieces of code and documentation contained in a complex, realistic software project.
  • Long-form text modeling: The granularity problem is also the reason why AI models can’t write a convincing novel yet (nor read and understand a novel as well as humans do). The length of a good-sized novel like Anna Karenina is roughly on the order of 1M tokens (give or take a factor of 2). Again, it is currently not feasible to train a large GPT model with a context size this long, but it will surely become feasible in the not too distant future, and at that point AI models will be able to write and comprehend novels (and other types of long-form text) at least as well as humans do. But, you may ask, do all those 1M tokens really matter for writing or comprehending a good novel? Yes, absolutely! It takes a lot of detail to build convincing characters; it takes a lot of detail to build rich internal and external lives for the characters in a novel. And we are exquisitely sensitive to these details. Human life is rich and complex; we go through a lot as our lives unfold over the years and, as a result, we are very sensitive to these vicissitudes, these twists and turns of life. Let me also take this opportunity to wax lyrical about one of my favorite writers and one of my favorite novels: this is precisely why Tolstoy was one of the greatest writers and Anna Karenina is one of the greatest novels ever written. Tolstoy is particularly adept at creating, expressing, and conveying these rich details of both the inner and outer lives of the characters in Anna Karenina, so much so that when you read Anna Karenina, you say "this could be real"; nothing in the novel really sticks out as strained, implausible, or unconvincing.
One of the greatest novels ever written (source).
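Since several of the points above lean on back-of-the-envelope arithmetic, here it is spelled out in one place. The cone count, image size, and per-second token count are the rough figures quoted above; the novel word count and the tokens-per-word ratio are rough ballpark assumptions I’m adding here purely for illustration:

```python
# Back-of-the-envelope arithmetic for the data efficiency, long-form video, and
# long-form text examples above.

# Data efficiency: retina vs. typical input image resolution
cone_receptors = 6_000_000                # ~6M color-sensitive cones in the human retina
image_pixels = 310 * 256                  # the "something like 310x256" image size quoted above
print(f"retina / image ratio: ~{cone_receptors / image_pixels:.0f}x")   # ~76x, i.e. roughly two orders of magnitude

# Long-form video: tokens needed at a modest granularity
tokens_per_second = 4 * 16 * 16           # 4 temporal x 16x16 spatial discrete tokens
tokens_per_hour = tokens_per_second * 3600
print(f"tokens per second of video: {tokens_per_second}")               # 1024, i.e. ~1K
print(f"tokens per 1-hour video: ~{tokens_per_hour / 1e6:.1f}M")        # ~3.7M

# Long-form text: a good-sized novel (rough assumed figures, for illustration only)
words_in_anna_karenina = 350_000          # commonly quoted ballpark word count
tokens_per_word = 1.3                     # rough rule of thumb for subword tokenizers
novel_tokens = words_in_anna_karenina * tokens_per_word
print(f"novel length: ~{novel_tokens / 1e6:.2f}M tokens")               # ~0.46M, i.e. order 1M give or take a factor of ~2
```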

Are there any problems that cannot be regarded as pure granularity problems for current AI methods, i.e. problems caused by our temporary inability to apply these methods at sufficiently fine granularities? My current working hypothesis is that reaching human-level AI will prove to be nothing but a granularity problem (or a series of granularity problems). I think we will once again be surprised when we find out we can actually solve many of the currently intractable looking problems with increased granularity. But, how about reasoning or planning, for example? Are they also just a granularity problem? First of all, I don’t think that humans really do reasoning or planning in the sense in which these terms are often used, as evidenced by the fact that models that can actually do reasoning and planning wipe the floor with even top human players in board games. What seems like reasoning in humans is most often just the use of shortcuts afforded by abstraction tools, for example, we write and use computer programs to do our reasoning for us. And writing code, as we found out recently, seems eminently amenable to reasoning-free, “pattern recognition” type learning strategies. Otherwise, again, my current hypothesis is that for human-level AI, it is going to be “pattern recognition” all the way down, but at increasingly finer granularities (perhaps with the sole addition of just a little bit of supervised finetuning applied on top).

Self-supervised learning may solve all the major problems of cogsci

One thing I really wish cogsci people appreciated more is the power of self-supervised learning.

Please read the update below before reading the following paragraph about the famous cake argument.

The main reason self-supervised learning is absolutely essential for an intelligent system is nicely illustrated by Yann LeCun’s cake metaphor:

LeCake (source)

I should note that I think this slide is slightly misleading, since it’s not the amount of raw information that really matters, as the slide suggests, but the amount of useful or relevant information (for example, semantic information or information about the latent variables we may care about), so the differences in information content between the three learning paradigms may be less dramatic than this slide suggests. That being said, this likely doesn’t change the fact that self-supervised learning is probably still more sample efficient than supervised learning (it leverages more information per sample), which is, in turn, probably more sample efficient than reinforcement learning (although I have to say I’m not aware of any rigorous formalization and experimental verification of these claims, hence the hedging word probably).

Update (4/18): On second thought, I’m actually not sure about the validity of this famous cake argument any longer. The main problem, as hinted at in the paragraph above, is that simply comparing the information content of the target signal in each case is not really meaningful, because these target signals are at different levels of abstraction (e.g. pixels vs. semantic labels); they do not represent the same kinds of things, so one bit of information in the case of supervised learning is not equivalent to one bit of information in the case of self-supervised learning. Maybe some version of this argument might still be resuscitated, I’m not quite sure, but it needs to be formalized and thought through much more carefully. In the meantime, I think the main argument for the importance of self-supervised learning is probably the relative scarcity of explicit, high-level supervision signals (labels, annotations, rewards, etc.).

My impression is that when cogsci people think about data efficiency, most of the time they have something like supervised learning in mind, but this can be very misleading. Self-supervised learning often enormously reduces the amount of explicit supervision (e.g. labeled examples) necessary to achieve a certain capability and it can be very difficult to know a priori exactly how much can be learned from a given amount and type of data using self-supervised learning (the only way to know is usually to just do it).

In this post, I want to give two examples related to word learning and learning a basic aspect of theory of mind, respectively. Maybe these are not the best examples to illustrate my point, but they’re examples I’ve been thinking about recently, so please bear with me.

Fast mapping: Children are often said to learn the meanings of words very efficiently. For example, in their second year, children are claimed to be learning a few words a day on average. This suggests that they are probably able to learn most of their words from just a handful of exposures (perhaps often from a single exposure only), a phenomenon also known as fast mapping. This apparent efficiency has greatly impressed developmental psychologists, who have historically come up with an equally impressive array of supposedly innate inductive biases or constraints to allegedly explain this alleged efficiency (unfortunately, these alleged innate inductive biases are almost always couched in informal verbal descriptions, so it’s impossible to know how exactly they’re supposed to work within the context of a concrete computational model, or to know quantitatively how much of the data they would actually be able to explain; in other words, these are, by and large, garbage theories 🚮). But should we really be impressed by this performance in the first place? To suggest that maybe we shouldn’t, or at least to make it plausible that we shouldn’t, I give you the example of this self-supervised ImageNet model finetuned with just 1% of the ImageNet training data (i.e. 12-13 labeled examples from each class) achieving a pretty impressive 75% top-1 accuracy:

Self-supervised learning unlocks impressive few-shot learning capabilities (source)

So, with only a dozen labeled examples from each class, this model achieves effectively human-level accuracy on ImageNet and comes close to matching the performance of a supervised model trained on the full training data, which is 100x larger (it’s possible to push the accuracy up to 80% top-1 in this example with a slightly more sophisticated finetuning pipeline, but this is not really important for my purposes here). This is a pretty impressive display of labeled-data efficiency! Of course, this model did its self-supervised learning on ImageNet itself, and it’s unclear if it would be able to achieve the same feat with self-supervised learning from more human-like data instead. My own work suggests that we’re probably not there yet, but I’m hopeful that we may soon be able to get there with a few relatively straightforward tricks (I will have an important progress update on this in a couple of weeks).
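For concreteness, here is a minimal sketch of the general recipe behind results like this (self-supervised pretraining of a large encoder on unlabeled images, followed by finetuning with a classifier head on a ~1% labeled subset), written in PyTorch. This is not the specific model from the figure; the encoder below is just a toy, randomly initialized stand-in, whereas in the actual result it would be a large vision backbone whose weights come from self-supervised pretraining on ImageNet itself:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy stand-in for a self-supervised pretrained encoder. In practice this would be a
# large backbone (e.g. a ResNet or ViT) whose weights were learned with a
# self-supervised objective (contrastive, masked-image modeling, etc.) on unlabeled images.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())

# Finetuning: attach a randomly initialized classifier head on top of the encoder.
num_classes = 1000
model = nn.Sequential(encoder, nn.Linear(512, num_classes))

# Toy labeled data standing in for ImageNet; Subset picks out the ~1% labeled fraction.
images = torch.randn(1000, 3, 32, 32)
labels = torch.randint(0, num_classes, (1000,))
one_percent = Subset(TensorDataset(images, labels), range(10))   # 10 of 1000 examples = 1%
loader = DataLoader(one_percent, batch_size=5, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small lr: in the real recipe, the pretrained features are mostly preserved
loss_fn = nn.CrossEntropyLoss()

for _ in range(5):  # a few epochs over the tiny labeled subset
    for x, y in loader:
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The point of the sketch is just how little happens at the finetuning stage: all of the heavy lifting is assumed to have already been done during self-supervised pretraining.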

Understanding interlocutor intent: The second example that I wanted to mention comes from the supervised finetuning of large language models (LLMs). It’s well known that untuned language models often don’t understand user intent very well. When you ask an LLM to do something by giving it a prompt, it’s not uncommon for the model to give you back variations on your prompt instead of doing what you asked for. The GPT-4 blog post, for example, interestingly notes that the untuned "base model requires prompt engineering to even know that it should answer the questions."

A model that can’t understand user intent well is obviously not very useful, but it’s very easy to drastically change the behavior of the model with a relatively small amount of supervised finetuning (optionally with a small amount of additional reinforcement learning as well, known as RLHF). In the seminal InstructGPT paper, for example, they were able to achieve this with just 13k annotated examples (and this number can probably be reduced with a combination of supervised tuning + RLHF, instead of doing only finetuning). This is a tiny tiny fraction of the amount of self-supervised data the model was trained on. The figure below shows that the outputs of the finetuned model were preferred to the outputs of the base model by human annotators roughly 80% of the time.

Finetuning with a few thousand supervised examples is enough to make an LLM pretty good at recognizing user intent (source). In the example indicated by the green arrow, the finetuned model is preferred to the untuned base model roughly 80% of the time.

The finetuned model here also displayed impressive generalization capabilities: for example, even though the finetuning data was entirely in English, the model’s learned instruction following behavior automatically generalized to other languages like French.
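In case it helps to see what this supervised finetuning step amounts to mechanically, here is a minimal sketch of one common way to implement it (not necessarily the exact InstructGPT setup): the model is trained with ordinary next-token prediction on (prompt, human-written demonstration) pairs, with the loss computed only on the demonstration tokens. `tokenize` and `language_model` are hypothetical stand-ins:

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # label value that cross_entropy ignores

def sft_loss(language_model, tokenize, prompt: str, demonstration: str) -> torch.Tensor:
    """Next-token prediction loss computed on the demonstration tokens only."""
    prompt_ids = tokenize(prompt)            # e.g. "Explain the moon landing to a 6 year old."
    demo_ids = tokenize(demonstration)       # the human-written answer
    input_ids = torch.tensor([prompt_ids + demo_ids])

    logits = language_model(input_ids)       # (1, seq_len, vocab_size)

    targets = input_ids.clone()
    targets[:, :len(prompt_ids)] = IGNORE_INDEX  # don't train on predicting the prompt itself

    # Standard causal-LM shift: predict token t+1 from tokens up to t.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets[:, 1:].reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```

In InstructGPT, running an ordinary supervised loss of this kind over roughly 13k such pairs was enough to produce the preference gap shown in the figure above.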

This second example was again meant to illustrate the idea that once you have internalized a rich fount of knowledge through self-supervised learning, it becomes surprisingly easy to achieve very impressive capabilities (in this case, acquiring a basic component of theory of mind in the form of recognizing user intent) through a very small amount (a smidgen) of supervised learning applied on top of it. Geoff Hinton recently expressed this idea very nicely in connection with RLHF, but you can make the same point about supervised finetuning as well:

With this second example too, it’s a bit unclear at the moment how much one can learn through self-supervised learning from more human-like language data as opposed to orders of magnitude larger amounts of digital text (more human-like both in terms of content and in terms of amount), but it seems plausible to think that it would have a qualitatively similar effect, namely significantly boosting the effectiveness of supervised finetuning.