So one of the big reasons there was hype about Sora is that it felt very likely from watching a few videos that there was an internal physical simulation of the world happening and the video was more like a camera recording that physical and 3D scene simulation. It was just a sort of naïve sense that there HAD to be more going on behind the scenes than gluing bits of other videos together.
This is evidence, and it’s appearing even in still image generators. The models essentially learn how to render a 3D scene and take a picture of it. That’s incredible considering that we weren’t trying to create a 3D engine, we just threw a bunch of images at some linear algebra and optimized. Out popped a world simulator.
We humans live in a 3D world and our training set is a continuous stereo stream of a constant scene from different angles. Sora, on the other hand, learned the world by watching TV. It needs to play more video games in order to learn 3D scenes (an implicit representation of a world) and how to take pictures of them (rendering). Maybe that was already part of its training; I don't know.
Who is people? Most folks only vaguely know about generative AI if they know about it at all. I'm technical but not in software specifically and usually ignore most AI news, so GP's comment is in fact news to me!
There's a lot of nonsensical babble out there from people who have minimal technical insight - both on the sceptic and on the enthusiast side of AI.
It really wasn't clear for the longest time how these models generate things so well. Articles like this one are still rare and comparatively new. And they certainly weren't around when less informed enthusiasts were already heralding AGI.
On the other hand, we have accrued quite a bit of evidence by now that these models do far more than glue together training data. But there are still sceptics out there who spread this sort of misinformation.
Also, there's a paper I can't find now which shows you can find the name of the operation in the middle layers if you provide examples of what to do. That is, if you prompt: "9 2 5 -> 5 2 9, 0 7 3 ->", you'll find the internals generate "reverse" even though it doesn't appear in the output.
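I can't vouch for that paper's exact setup, but the general mechanism it relies on (reading intermediate hidden states back through the output head, the so-called "logit lens") is easy to sketch. Everything below is my own illustrative assumption: gpt2 as the model, this particular prompt, and no claim that a model this small will literally surface "reverse" in a middle layer.

    # Logit-lens style probe: decode each layer's hidden state at the last
    # position through the final LayerNorm and the unembedding matrix, to see
    # which token that layer currently "leans toward".
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
    model.eval()

    prompt = "9 2 5 -> 5 2 9, 0 7 3 ->"
    inputs = tok(prompt, return_tensors="pt")

    with torch.no_grad():
        out = model(**inputs)

    # out.hidden_states is a tuple: the embedding output plus one tensor per layer.
    for i, hidden in enumerate(out.hidden_states):
        logits = model.lm_head(model.transformer.ln_f(hidden[:, -1, :]))
        print(f"layer {i:2d}: {tok.decode(logits.argmax(dim=-1))!r}")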
If you try it with a longer sequence of numbers it seems to stumble. E.g. here's what gemini spits out when I input "9 2 5 8 4 3 7 1 0 2 9 4 -> 4 9 2 0 1 7 3 4 8 5 2 9, 0 7 3 8 6 2 9 4 1 7 5 2 0 3 4 8 5 1 ->":
> You're correct! The pattern you identified applies to both sets of numbers. Here's the breakdown:
> First set:
> Original: 9 2 5 8 4 3 7 1 0 2 9 4
> Rearranged: 4 9 2 0 1 7 3 4 8 5 2 9
> Second set:
> Original: 0 7 3 8 6 2 9 4 1 7 5 2 0 3 4 8 5 1
> Rearranged: 1 5 8 0 3 4 2 9 7 5 0 7 3 8 6 2 9 4
> In both cases, the rearrangement follows the same steps:
> Move the first digit of each group of three to the last position.
> Keep the middle digit in the same position.
> Repeat steps 1 and 2 for all groups of three.
> Therefore, the pattern holds true for both sets of numbers you provided.
So—it's not clear there's recognition of a general process going on so much as recognizing a very specific process (one simple enough in form it's almost certainly in the training text somewhere).
So the scene intrinsics in this paper can also be approximated programmatically, with traditional (non-learned) programs, given a single image (a rough sketch of that follows at the end of this comment).
All of this scene information is already implicit in the data; what the AI model is doing is approximating the information it needs to generate a scene.
It’s not just gluing data together, it’s discovering interrelationships in a very high dimensional feature space.
It’s not creating anything it hasn’t seen either, which is why many image models (mostly smaller ones) have so much trouble making fine details coherent, but are good with universal patterns like physics, lighting, the human shape, and composition.
(not sure if you were trying to imply that, but it felt like it)
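To make the "traditional programs" point concrete: a crude classical baseline for one scene intrinsic (a normal map) can be computed from a single image with nothing but image gradients, the old bump-mapping trick of treating grayscale intensity as a height field. This is my own illustrative sketch, not the paper's method, and it is obviously much rougher than what the learned models recover.

    # Crude classical normal-map estimate from one image: treat grayscale
    # intensity as a height field and take finite-difference gradients.
    import numpy as np
    from PIL import Image

    def normals_from_intensity(path, strength=2.0):
        gray = np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0
        dy, dx = np.gradient(gray)                # finite differences of the "height field"
        nx, ny = -strength * dx, -strength * dy   # normal of z = h(x, y) ~ (-dh/dx, -dh/dy, 1)
        nz = np.ones_like(gray)
        n = np.stack([nx, ny, nz], axis=-1)
        n /= np.linalg.norm(n, axis=-1, keepdims=True)
        return ((n + 1.0) * 0.5 * 255).astype(np.uint8)  # map [-1, 1] -> RGB for viewing

    # "photo.jpg" is just a placeholder input path.
    Image.fromarray(normals_from_intensity("photo.jpg")).save("normals.png")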
> It really wasn't clear for the longest time how these models generate things so well.
Honestly, I think it was pretty much just as clear in 2021 as it is in 2024. Whether you consider that 'clear' or 'not clear' is a matter of personal choice, but I don't think we've really advanced our understanding all that far (mechanistic interpretability does not tell us that much, grokking is a phenomenon that does not apply to modern LLMs, etc.).
> we have accrued quite a bit of evidence by now that these models do far more than glue together training data. But there are still sceptics out there who spread this sort misinformation.
Few people who actually worked in this field and were familiar with the concept of 'implicit regularization' really ever thought the 'glue together training data' or 'stochastic parrot' explanations were very compelling at all.
Like the cat springing a 5th leg and then losing it just as quick, in a cherry-picked video from the software makers? How does that fit your wishful narrative?
Look at how anyone who's not an artist (especially children) draws pretty much anything, whether it's people or animals or bicycles. You'll see a lot of asymmetric faces, wrong shading, completely unrealistic hair, etc., and yet all those people are intelligent and excellent at understanding how the world works. Intelligence can't be expressed in a directly measurable way.
Those videos may be cherry-picked but they're almost good enough that you have to pick problems to point out, problems that will likely be gone a couple iterations later.
Less than a decade ago we read articles about Google's weird AI model producing super trippy images of dogs, now we're at deepfakes that successfully scam companies out of millions and AI models that can't yet reliably preserve consistency across an entire image. In another ten years every new desktop, laptop and smartphone will have an integrated AI accelerator and most of those issues will be fixed.
Anthropomorphizing models lets assumptions about human behavior slip into discussions. Most people don't even know how their own brains process information, while models are not children.
New models appear to be close, and based on the arguments here, closing the gap is simply a matter of time. But that is based on observing the rate of progress, not the actual underlying mechanisms. It's a tendency driven by assumptions rooted in conscious human thinking, not the significant amount of processing we do unconsciously.
Sure, expanding the data set has improved results, yet bad hands, ghost legs, and other weirdness persist. If you have a world model, then this shouldn't happen - there is a logical rule to follow, not simply a pixel level correlation.
Working from the other side, if image/video gen has reached the point that it is faithfully recreating 3D rules, then we should expect 3D wireframes to be generated. We should see an update to https://github.com/openai/point-e.
This isn't splitting hairs - this behavior can be hand-waved away when building a PoC, but not when you're making a production-ready product.
As for scams, they target weaknesses, not the strongest parts of a verification process, which makes them straw-man arguments for AI capability.
>If you have a world model, then this shouldn't happen - there is a logical rule to follow, not simply a pixel level correlation.
Oh, I guess humans don't have world models then. It's so weird seeing this rhetoric repeated again and again. No, a world model doesn't mean a perfect one. A world model doesn't mean a logical one either. Humans clearly don't work by logic.
NNs are _not_ linear algebra. The genius of NNs is that they are half-linear (assuming most of them these days use ReLU activations), and that half-linearity is what gives them their power.
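A quick numpy check of what that half-linearity buys you (illustrative sketch, random weights): two linear layers with no activation collapse into a single matrix, while a ReLU in between breaks that collapse.

    # Stacked linear maps collapse to one matrix; a ReLU in between does not.
    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
    x = rng.normal(size=4)

    print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))       # True: purely linear stack
    relu = lambda v: np.maximum(v, 0.0)
    print(np.allclose(W2 @ relu(W1 @ x), (W2 @ W1) @ x))   # generally False: the nonlinearity matters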