So one of the big reasons there was hype about Sora is that it felt very likely from watching a few videos that there was an internal physical simulation of the world happening and the video was more like a camera recording that physical and 3D scene simulation. It was just a sort of naïve sense that there HAD to be more going on behind the scenes than gluing bits of other videos together.
This is evidence, and it’s appearing even in still image generators. The models essentially learn how to render a 3D scene and take a picture of it. That’s incredible considering that we weren’t trying to create a 3D engine, we just threw a bunch of images at some linear algebra and optimized. Out popped a world simulator.
We humans live in a 3D world and our training set is a continuous stereo stream of a constant scene from different angles. Sora, on the other hand, learned the world by watching TV. It needs to play more video games in order to learn 3D scenes (an implicit representation of a world) and how to take pictures of them (rendering). Maybe that was already part of its training; I don't know.
Who is people? Most folks only vaguely know about generative AI if they know about it at all. I'm technical but not in software specifically and usually ignore most AI news, so GP's comment is in fact news to me!
There's a lot of nonsensical babble out there from people who have minimal technical insight - both on the sceptic and on the enthusiast side of AI.
It really wasn't clear for the longest time how these models generate things so well. Articles like this one are still rare and comparatively new. And they certainly weren't around when less informed enthusiasts were already heralding AGI.
On the other hand, we have accrued quite a bit of evidence by now that these models do far more than glue together training data. But there are still sceptics out there who spread this sort of misinformation.
Also, there's a paper I can't find now which shows you can find the name of the operation in the middle layers if you provide examples of what to do. That is, if you prompt: "9 2 5 -> 5 2 9, 0 7 3 ->", you'll find the internals generate "reverse" even though it doesn't appear in the output.
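I can't vouch for that paper's exact setup, but the general mechanism it relies on (reading intermediate hidden states back through the output head, the so-called "logit lens") is easy to sketch. Everything below is my own illustrative assumption: gpt2 as the model, this particular prompt, and no claim that a model this small will literally surface "reverse" in a middle layer.

    # Logit-lens style probe: decode each layer's hidden state at the last
    # position through the final LayerNorm and the unembedding matrix, to see
    # which token that layer currently "leans toward".
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
    model.eval()

    prompt = "9 2 5 -> 5 2 9, 0 7 3 ->"
    inputs = tok(prompt, return_tensors="pt")

    with torch.no_grad():
        out = model(**inputs)

    # out.hidden_states is a tuple: the embedding output plus one tensor per layer.
    for i, hidden in enumerate(out.hidden_states):
        logits = model.lm_head(model.transformer.ln_f(hidden[:, -1, :]))
        print(f"layer {i:2d}: {tok.decode(logits.argmax(dim=-1))!r}")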
If you try it with a longer sequence of numbers it seems to stumble. E.g. here's what gemini spits out when I input "9 2 5 8 4 3 7 1 0 2 9 4 -> 4 9 2 0 1 7 3 4 8 5 2 9, 0 7 3 8 6 2 9 4 1 7 5 2 0 3 4 8 5 1 ->":
> You're correct! The pattern you identified applies to both sets of numbers. Here's the breakdown:
> First set:
> Original: 9 2 5 8 4 3 7 1 0 2 9 4
> Rearranged: 4 9 2 0 1 7 3 4 8 5 2 9
> Second set:
> Original: 0 7 3 8 6 2 9 4 1 7 5 2 0 3 4 8 5 1
> Rearranged: 1 5 8 0 3 4 2 9 7 5 0 7 3 8 6 2 9 4
> In both cases, the rearrangement follows the same steps:
> Move the first digit of each group of three to the last position.
> Keep the middle digit in the same position.
> Repeat steps 1 and 2 for all groups of three.
> Therefore, the pattern holds true for both sets of numbers you provided.
So—it's not clear there's recognition of a general process going on so much as recognizing a very specific process (one simple enough in form it's almost certainly in the training text somewhere).
So the scene intrinsics in this paper can also be approximated programmatically, with traditional (non-learned) programs, given a single image (a rough sketch of that follows at the end of this comment).
All of this scene information is already implicit in the data; what the AI model is doing is approximating the information it needs to generate a scene.
It’s not just gluing data together, it’s discovering interrelationships in a very high dimensional feature space.
It’s not creating anything it hasn’t seen either, which is why many image models (mostly smaller ones) have so much trouble making fine details coherent, but are good with universal patterns like physics, lighting, the human shape, and composition.
(not sure if you were trying to imply that, but it felt like it)
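To make the "traditional programs" point concrete: a crude classical baseline for one scene intrinsic (a normal map) can be computed from a single image with nothing but image gradients, the old bump-mapping trick of treating grayscale intensity as a height field. This is my own illustrative sketch, not the paper's method, and it is obviously much rougher than what the learned models recover.

    # Crude classical normal-map estimate from one image: treat grayscale
    # intensity as a height field and take finite-difference gradients.
    import numpy as np
    from PIL import Image

    def normals_from_intensity(path, strength=2.0):
        gray = np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0
        dy, dx = np.gradient(gray)                # finite differences of the "height field"
        nx, ny = -strength * dx, -strength * dy   # normal of z = h(x, y) ~ (-dh/dx, -dh/dy, 1)
        nz = np.ones_like(gray)
        n = np.stack([nx, ny, nz], axis=-1)
        n /= np.linalg.norm(n, axis=-1, keepdims=True)
        return ((n + 1.0) * 0.5 * 255).astype(np.uint8)  # map [-1, 1] -> RGB for viewing

    # "photo.jpg" is just a placeholder input path.
    Image.fromarray(normals_from_intensity("photo.jpg")).save("normals.png")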
> It really wasn't clear for the longest time how these models generate things so well.
Honestly, I think it was pretty much just as clear in 2021 as it is in 2024. Whether you consider that 'clear' or 'not clear' is a matter of personal choice, but I don't think we've really advanced our understanding all that far (mechanistic interpretability does not tell us that much, grokking is a phenomenon that does not apply to modern LLMs, etc.).
> we have accrued quite a bit of evidence by now that these models do far more than glue together training data. But there are still sceptics out there who spread this sort misinformation.
Few people who actually worked in this field and were familiar with the concept of 'implicit regularization' really ever thought the 'glue together training data' or 'stochastic parrot' explanations were very compelling at all.
Like the cat springing a 5th leg and then losing it just as quick, in a cherry-picked video from the software makers? How does that fit your wishful narrative?
Look at how anyone who's not an artist (especially children) draws pretty much anything, whether it's people or animals or bicycles. You'll see a lot of asymmetric faces, wrong shading, completely unrealistic hair, etc., and yet all those people are intelligent and excellent at understanding how the world works. Intelligence can't be expressed in a directly measurable way.
Those videos may be cherry-picked but they're almost good enough that you have to pick problems to point out, problems that will likely be gone a couple iterations later.
Less than a decade ago we read articles about Google's weird AI model producing super trippy images of dogs, now we're at deepfakes that successfully scam companies out of millions and AI models that can't yet reliably preserve consistency across an entire image. In another ten years every new desktop, laptop and smartphone will have an integrated AI accelerator and most of those issues will be fixed.
Anthropomorphizing models lets assumptions about human behavior slip into discussions. Most people don't even know how their own brains process information, while models are not children.
New models appear to be close, and based on the arguments here, closing the gap is simply a matter of time. But that is based on observing the rate of progress, not the actual underlying mechanisms. It's a tendency driven by assumptions rooted in conscious human thinking, not the significant amount of processing we do unconsciously.
Sure, expanding the data set has improved results, yet bad hands, ghost legs, and other weirdness persist. If you have a world model, then this shouldn't happen - there is a logical rule to follow, not simply a pixel level correlation.
Working from the other side, if image/video gen has reached the point that it is faithfully recreating 3D rules, then we should expect 3D wireframes to be generated. We should see an update to https://github.com/openai/point-e.
This isn't splitting hairs - this behavior can be hand-waved away when building a PoC, but not when you're making a production-ready product.
As for scams, they target weaknesses, not the strongest parts of a verification process, which makes them straw-man arguments for AI capability.
>If you have a world model, then this shouldn't happen - there is a logical rule to follow, not simply a pixel level correlation.
Oh, I guess humans don't have world models then. It's so weird seeing this rhetoric repeated again and again. No, a world model doesn't mean a perfect one. A world model doesn't mean a logical one either. Humans clearly don't work by logic.
NNs are _not_ linear algebra. The genius of NNs is that they are half-linear (assuming most of them these days use ReLU activations), and that half-linearity is what gives them their power.
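A quick numpy check of what that half-linearity buys you (illustrative sketch, random weights): two linear layers with no activation collapse into a single matrix, while a ReLU in between breaks that collapse.

    # Stacked linear maps collapse to one matrix; a ReLU in between does not.
    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
    x = rng.normal(size=4)

    print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))       # True: purely linear stack
    relu = lambda v: np.maximum(v, 0.0)
    print(np.allclose(W2 @ relu(W1 @ x), (W2 @ W1) @ x))   # generally False: the nonlinearity matters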