"Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI."
This also helps explain why the model is so good since it is trained to simulate the real world, as opposed to imitate the pixels.
More importantly, its capabilities suggest AGI and general robotics could be closer than many think (even though some key weaknesses remain and further improvements are necessary before the goal is reached).
EDIT: I just saw this relevant comment by an expert at Nvidia:
“If you think OpenAI Sora is a creative toy like DALLE, ... think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, "intuitive" physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths.
I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be!
Let's breakdown the following video. Prompt: "Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee." ….”
I was impressed with their video of a drone race on Mars during a sunset. In part of the video, the sun is in view, but then the camera turns so it’s out of view. When the camera turns back, the sun is where it’s supposed to be.
There's mention of memory in the post: the model can remember where it put objects for a short while, so if it pans away and pans back it should keep that object "permanence".
Well, the video in the weaknesses section with the archeologists makes me think it's not just predicting pixels. The fact that a second chair spawns out of nothing looks like the typical AI uncanny valley mistake you'd expect, but then it starts hovering, which looks more like a video game physics glitch than an incorrect interpretation of pixels on screen.
I think it's just inherent to the problem space. Obviously it understands something about the world to be able to generate convincing depictions of it.
Just having a better or bigger model? Better training data, better feedback process, etc.
Seems more likely than "it can simulate reality".
Also I take anecdotal reviews like that with a grain of salt. I follow numerous AI groups on Reddit and elsewhere and many users seem to have strong opinions that their tool of choice is the best. These reviews are highly biased.
Not to say I'm not impressed, but it's just been released.
Others have provided explanations for things like object persistence, for example keeping a memory of the rendering outside of the frame.
The comment from the expert is definitely interesting and compelling, but clearly still speculation based on the following comment.
> I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be!
I like the speculation though, the comments provide some convincing explanations for how this might work. For example, the idea that it is trained using synthetic 3-dimensional data from something like UE5 seems like a brilliant idea. I love it.
Also, in his example video the physics look very wrong to me. The movement of the coffee waves is realistic-ish at best. The boat motion also looks wrong and doesn't match up with the liquid much of the time.
I think you are reading too far into this. The title of the technical paper is “Video generation models as world simulators”.
This is “just” a transformer that takes in a sequence of noisy image (video frame) tokens + prompt, and produces a sequence of less noisy video tokens. Repeat until the noise is gone.
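To make that loop concrete, here is a toy sketch in Python. It is not Sora's code and nothing like its scale: `toy_denoiser` is a hypothetical stand-in for the transformer, and `prompt_target` is an assumed shortcut that hands the "denoiser" the clean answer the prompt stands for, which a real model would have to learn from data. The only point is the shape of the loop: start from noise, repeatedly produce a slightly less noisy version.

```python
import random


def toy_denoiser(tokens, prompt_target, strength=0.5):
    """Hypothetical stand-in for the transformer.

    Takes noisy tokens plus prompt conditioning and returns less noisy
    tokens by moving each one part of the way toward the clean value.
    A real model does not get to see the clean value; it predicts it.
    """
    return [t + strength * (p - t) for t, p in zip(tokens, prompt_target)]


def generate(prompt_target, steps=20, seed=0):
    """Start from pure noise and denoise repeatedly."""
    rng = random.Random(seed)
    # One Gaussian noise value per "video token".
    tokens = [rng.gauss(0.0, 1.0) for _ in prompt_target]
    for _ in range(steps):
        tokens = toy_denoiser(tokens, prompt_target)
    return tokens


if __name__ == "__main__":
    clean = [1.0, -2.0, 0.5]  # made-up "clean video tokens" for the demo
    print(generate(clean))
```

Each pass halves the remaining noise here, so after 20 steps the output is very close to `clean`. In an actual diffusion model the step size follows a noise schedule and the denoiser is a large learned network, so treat this purely as an illustration of "repeat until the noise is gone".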
The point they’re making, which is totally valid, is that in order for such a model to produce videos with realistic physics, the underlying model is forced to learn a model of physics (a “world simulation”).
AlphaGo and AlphaZero were able to achieve superhuman performance due to the availability of perfect simulators for the game of Go. There is no such simulator for the real world we live in. (Although pure LLMs sorta learn a rough, abstract representation of the world as perceived by humans.) Sora is an attempt to build such a simulator using deep learning.
This actually affirms my comment above.
“Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.”
`since it is trained to simulate the real world, as opposed to imitate the pixels.`
It's not that it's learning a model of the world instead of imitating pixels - the world model is just a necessary emergent phenomenon from the pixel imitation. It's still really impressive and very useful, but it's still 'pixel imitation'.
What I want is an AI trained to simulate the human body, allowing scientists to perform artificial human trials on all kinds of medicines, cutting trial times from years to months.
Movie making is going to become fine-tuning these foundational video models. For example, if you want Brad Pitt in your movie you'll need to use his data to fine-tune his character.
Pretty sure many latent spaces are not trained to represent 3D motions and some detailed physics of the real world. Those in pure text LLMs, for example.
EDIT: I just saw this relevant comment by an expert at Nvidia:
https://twitter.com/DrJimFan/status/1758210245799920123