Does Sora understand physics? A few simple observations

I’m a bit late to the fray as usual, but I wanted to write a short post about Sora. Sora is OpenAI’s new video generation model. As of this writing, it’s still not open to the public, so all we’ve got so far is some high-level information about the model and some generated samples shared by OpenAI in a blog post. The samples look impressive in their visual quality and their apparent realism; however, most of the videos seem to contain pretty glaring physical inaccuracies that are easy to detect when one looks at the details a bit more carefully (e.g. objects merging into each other and then unmerging, objects spontaneously disintegrating or disappearing, objects spontaneously changing their features, etc.). This prompted some to question whether (or to what extent) Sora really understands physics and, even further, whether it’s possible to understand physics at all by, effectively, just learning to predict pixels over video clips (which is, at a high level, what Sora does). As my humble contribution to this assize, I’d like to make a few very simple observations. Most of these are probably obvious to anybody who knows anything about anything (or to somebody who knows something about something), but I happen to belong to that rarefied species that finds prodigious value in stating the obvious from time to time, so here we go:

1) There’s an important distinction between “understanding physics” and being able to generate physically accurate videos. Although the model might struggle with generating physically highly accurate videos, it might still be able to reliably recognize that there’s something “weird” going on in physically inaccurate videos. This is roughly the difference between recognition and generation (or the difference between recognition and recall in memory retrieval). The latter is generally harder. So, a potentially more sensitive way to test the model’s understanding of physics would be to run carefully controlled recognition tests, as is typically done in intuitive physics benchmarks, for instance.
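To make the idea of a recognition test slightly more concrete, here is a minimal sketch of what a paired-comparison version of such a test could look like, in the spirit of standard intuitive physics benchmarks. This is purely an illustration under assumptions: `score_fn` is a hypothetical stand-in for whatever scalar plausibility signal one could extract from a video model (a likelihood, a prediction error with the sign flipped, etc.); nothing here corresponds to an actual Sora API.

```python
# A minimal, hypothetical sketch of a recognition-style intuitive-physics test.
# `score_fn` is a placeholder for whatever scalar "plausibility" signal one can
# extract from a video model; it is NOT an actual Sora/OpenAI API.

from typing import Any, Callable, Sequence, Tuple

Video = Any  # placeholder type for a video clip (e.g., an array of frames)


def recognition_accuracy(
    score_fn: Callable[[Video], float],
    paired_clips: Sequence[Tuple[Video, Video]],
) -> float:
    """Fraction of matched (plausible, implausible) pairs where the model
    assigns a higher score to the physically plausible clip.

    Each pair should be identical except for a single physical violation
    (objects teleporting, merging, vanishing, etc.), so that any score
    difference can be attributed to the violation itself.
    """
    correct = sum(
        score_fn(plausible) > score_fn(implausible)
        for plausible, implausible in paired_clips
    )
    return correct / len(paired_clips)
```

Chance performance here is 50%, so anything reliably above that would indicate at least some recognition-level sensitivity to physical violations, even if the model’s generations remain imperfect.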

2) People’s understanding of physics seems to be mostly of this “recognition” variety too (rather than the “generation” variety). People don’t really have a very accurate physics engine inside their heads that they can use to simulate physically accurate scenarios (cf. Marcus & Davis, 2013; Davis & Marcus, 2015; Ludwin-Peery et al., 2021). This is why this capability is properly described as intuitive physics, as opposed to actual physics or something similar.

3) People can also generate fictitious, physically highly implausible or even impossible scenarios in their imagination with remarkable ease and ingenuity (and they have been doing this since time immemorial). Cartoons, fairy tales, fantasies, legends, etc. are full of such examples: levitating creatures, objects passing through solid walls, objects melting or disintegrating into pieces and then regrouping again, etc.

[Image: Casper the Friendly Ghost]

4) For related reasons, you definitely do NOT want a video generation model that only generates physically highly accurate videos. You want something that can bend or break physics, ideally in a precisely controllable way (possibly based on the textual prompt, for instance).

5) We know nothing about the distribution of the videos Sora was trained on. Almost certainly, a subset of its training data consists of CGI, digitally edited, or animated videos depicting physically implausible or impossible scenarios (we don’t know how large this subset is). So, part of the reason why Sora sometimes generates physically implausible or inaccurate videos may be traced back to this subset of its training data.

6) Even granting the previous point, however, some of the generated samples seem to show clear signs of gross errors or inaccuracies in whatever physics engine Sora has managed to learn by watching videos. Consider this generated video of wolf pups frolicking, for example. Why do inaccuracies like this arise in the first place, and how might they be remedied or ameliorated? At the risk of sounding like a man with a hammer seeing nails everywhere, I will suggest that many inaccuracies like this one are “granularity problems” that will be fixed when Sora can model videos at a sufficiently fine granularity (both spatially and temporally). This scene of frolicking wolf pups, for example, is a highly complex, dynamic scene, and generating it accurately requires very fine-grained individuation and tracking of multiple objects. In the absence of this level of granularity, the model instead generates something more coarse-grained, freely merging and unmerging objects in physical proximity without regard to the correctness of the details, but capturing the overall gist, the gestalt (or “texture”) of the action in the scene, somewhat analogous to how we see things in our visual periphery.

Update: After writing this post, I saw this thoughtful and much more detailed post on Sora by Raphaël Millière, which I recommend as well.
