> Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object states
Regardless of whether all the frames are generated at once or one by one, you can see in their examples that it's still just pixel-based. See the first example with the dog in the blue hat: a blue object suddenly spawns into the woman's hand because her hand passed over another blue area of the image.
I'm not denying that there are obvious limitations. However, attributing them to being "pixel-based" seems misguided. First off, the model acts in latent space, not directly on pixels. Secondly, there is no fundamental limitation here. The model has already acquired limited-yet-impressive ability to understand movement, texture, social behavior, etc., just from watching videos.
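To make the latent-space point concrete, here is a toy sketch in Python. The decoder matrix and dimensions are made up for illustration and have nothing to do with Sora's actual architecture; the point is only that the model's working representation is a compact vector, and pixels appear only when a decoder maps that vector out.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8   # hypothetical compressed representation size
PIXELS = 64      # hypothetical flattened frame size

# Stand-in for a learned decoder; in a real model this is trained.
decoder = rng.normal(size=(PIXELS, LATENT_DIM))

def generate_frame(z):
    """Map a latent vector to pixel space; all 'reasoning' happened on z."""
    return decoder @ z

z = rng.normal(size=LATENT_DIM)  # the model operates here, not on pixels
frame = generate_frame(z)
print(frame.shape)  # (64,)
```

The "blue object spawning" failure mode would then be a decoding artifact, not evidence that the model reasons pixel-by-pixel.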
I learned to understand reality by interpreting photons and various sensory inputs. Does that make my model of reality fundamentally flawed? In the sense that I only have a partial intuitive understanding of it, yes. But I don't need to know Maxwell's equations to get a sense of what happens when I open the blinds or turn on my phone.
I think many of the limitations we are seeing here - poor glass physics, flawed object permanence - will be overcome given enough training data and compute.
We will most likely need to incorporate exploration, but we can get really far with astute observation.
Actually, your comment gives me hope that we will never have an AI singularity, since how the brain works is flawed and we're trying to copy it.
Heck, a super AI might not even be possible; what if we're peak intelligence, with our millions of years of evolution?
Just adding compute speed will not help much. Say the goal of an intelligence is to win a war: if you're tasked with it, then it doesn't matter whether you have a month or a decade (assume that time is frozen while you do your research); it's too complex a problem and simply cannot be solved, and the same goes for an AI.
Or it will be like with chess engines: machines will be more intelligent than us simply because they can load much more context into their "working memory" to solve a problem.
> Actually, your comment gives me hope that we will never have an AI singularity, since how the brain works is flawed and we're trying to copy it.
As someone working in the field: the vast majority of AI research isn't concerned with copying the brain, simply with building solutions that work better than what came before. Biomimicry is actually quite limited in practice.
The idea of observing the world in motion in order to internalize some of its properties is a very general one. There are countless ways to concretize it; child development is but one of them.
> If you're tasked with it, then it doesn't matter whether you have a month or a decade (assume that time is frozen while you do your research); it's too complex a problem and simply cannot be solved, and the same goes for an AI.
I strongly disagree.
Let's assume a superintelligent AI can break down a problem into subproblems recursively, find patterns and loopholes in absurd amounts of data, run simulations of the potential consequences of its actions while estimating the likelihood of various scenarios, and do so much faster than humans ever could.
To take your example of winning a war, the task is clearly not unsolvable. In some capacity, military commanders are tasked with it on a regular basis (with varying degrees of success).
With the capabilities described above, why couldn't the AI find and exploit weaknesses in the enemy's key infrastructure (digital and real-world) and people? Why couldn't it strategically sow dissent, confuse, corrupt, and efficiently acquire intelligence to update its model of the situation minute-by-minute?
I don't think it's reasonable to think of a would-be superintelligence as an oracle that gives you perfect solutions. It will still be bound by the constraints of reality, but it might be able to work within them with incredible efficiency.
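One of the capabilities listed above, estimating the likelihood of scenarios by running many cheap simulations, can be sketched in a few lines of Python. The "plan" here is a toy stand-in (each step independently succeeds with some probability), not a real planning model:

```python
import random

def simulate_once(p_step, steps, rng):
    """One rollout: the plan succeeds only if every step succeeds."""
    return all(rng.random() < p_step for _ in range(steps))

def estimate_success(p_step, steps, trials=100_000, seed=0):
    """Monte Carlo estimate of overall success probability."""
    rng = random.Random(seed)
    wins = sum(simulate_once(p_step, steps, rng) for _ in range(trials))
    return wins / trials

# A 5-step plan with 90%-reliable steps succeeds about 0.9**5 ≈ 59%
# of the time; the estimate converges on that analytical value.
print(estimate_success(0.9, 5))
```

An agent that can run vast numbers of such rollouts over a decent world model doesn't need to be an oracle; it just needs to rank actions by estimated outcome better and faster than its opponents.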
This is an excellent comparison and I agree with you.
Unfortunately, we are flawed. We understand physics intuitively and can predict it to a degree, but not perfectly. We can imagine how a ball will move, but the image is blurry and the trajectory only partially correct. This is why we invented mathematics and physics: to accurately calculate, predict, and reproduce those events.
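The ball example is a nice illustration of the gap: intuition gives a blurry arc, while a one-line formula gives the exact answer. A minimal sketch, using the standard projectile-range equation on flat ground with no air resistance:

```python
import math

def projectile_range(v0, theta_deg, g=9.81):
    """Range R = v0^2 * sin(2*theta) / g for a projectile on flat ground,
    ignoring air resistance."""
    return v0**2 * math.sin(math.radians(2 * theta_deg)) / g

# A ball thrown at 10 m/s at 45 degrees lands about 10.2 m away.
print(round(projectile_range(10.0, 45.0), 1))  # 10.2
```

No amount of intuitive "video watching" gives you that third decimal place; that's what the quantitative methods are for.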
We are far from creating something as efficient as the human brain. It will take insane amounts of compute to simply match our basic, inaccurate brains; imagine how much will be needed to create something that is factually accurate.
Indeed. But a point that is often omitted from comparisons with organic brains is how much "compute equivalent" we spent through evolution. The brain is not a blank slate; it has clear prior structure that is genetically encoded. You can see this as a form of pretraining through a RL process wherein reward ~= surviving and procreating. If you see things this way, data-efficiency comparisons are more appropriate in the context of learning a new task or piece of information, and foundation models tend to do this quite well.
Additionally, most of the energy cost comes from pretraining, but once we have the resulting weights, downstream fine-tuning or inference are comparatively quite cheap. So even if the energy cost is high, it may be worth it if we get powerful generalist models that we can specialize in many different ways.
> This is why we invented mathematics and physics: to accurately calculate, predict, and reproduce those events.
We won't be able to do without those, but an intuitive understanding of the world can go a long way toward knowing when and how to use precise quantitative methods.