Once you try the demos, the animated image at the top feels misleading. Each segment cuts at just the right point to make you think you’d be able to continue exploring these vast worlds, but in practice you can only walk a couple of steps before hitting an invisible wall, which becomes more frustrating than not being able to move at all. It feels like being trapped in a box. My reaction went from impressed to disappointed fast.
I get these are early steps, but they oversold it.
You can bypass the "Out of bound" message by setting a JavaScript breakpoint after
`let t = JSON.parse(d[e].config_str)`
and then run
`Object.values(t.camera.presets).forEach(o => o.max_distance = 50)`
in the console.
As expected, though, it breaks down pretty quickly once you get outside the default bounds.
I wonder how much of the remaining work boils down to generating a new scene based on the camera's POV when the player hits one of the bounds, and keeping these generated scenes in a tree structure, joining scenes at boundaries.
Yes, and you wouldn't even need to do it in realtime as a user walks around.
Generate incrementally using a pathfinding system for a bot to move around and "create the world" as it goes, as if a Google Street View car followed the philosophy of George Berkeley.
Like a bizarro cousin of loop closure in SLAM, which is recognizing when you've found a different path to a place you've been before.
Except this time there is no underlying consistent world, so it would be up to the algorithm to use dead reckoning or some kind of coordinate system to recognize that you're approaching a place you've "been" before, and incorporate whatever you found there into the new scenes it produces.
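Something like this rough sketch is what I'm picturing, with every name made up (generateScene() stands in for whatever model actually produces a scene; none of this is from World Labs): quantize the dead-reckoned position, reuse a scene if you've "been" there before, otherwise generate one conditioned on its already-generated neighbours.

```javascript
// Hypothetical sketch, not anything World Labs has described: dead-reckoned
// "loop closure" over generated scenes. generateScene() is a stand-in for
// whatever model actually produces a scene from a viewpoint.
const SNAP = 10; // snap radius in world units (arbitrary)
const scenes = new Map(); // quantized coordinate -> previously generated scene

const keyFor = (pos) => `${Math.round(pos.x / SNAP)},${Math.round(pos.z / SNAP)}`;

function sceneAt(pos, cameraPov, generateScene) {
  const key = keyFor(pos);
  if (scenes.has(key)) {
    // Dead reckoning says we've "been" here before, so reuse the scene we
    // invented last time instead of contradicting it.
    return scenes.get(key);
  }
  // Condition the new scene on whatever was already generated nearby, so the
  // boundaries have a chance of lining up.
  const neighbors = [...scenes.values()].filter(
    (s) => Math.hypot(s.pos.x - pos.x, s.pos.z - pos.z) < 2 * SNAP
  );
  const scene = generateScene(cameraPov, neighbors);
  scene.pos = { x: pos.x, z: pos.z };
  scenes.set(key, scene);
  return scene;
}
```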
I was imagining a few limitations to help with consistency: all scenes have the same number of edges (say, 10), ensuring there's a limited set of scenes you can navigate to from the current one and that previously generated scenes can get reused; and no flying, so that we only have to worry about generating prism-shaped rooms with a single ceiling and a single floor edge.
I suppose this is the easy part, actually; for me the real trouble might be collision based on the non-deterministic thing that was generated, i.e. how to decide which scene edges the player should be able to travel through, interact with, be stopped by, burned by, etc.
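Concretely, something like the sketch below is what I have in mind (all of it made up for illustration): each scene is a fixed-edge prism, and each edge carries a tag saying whether you can pass through it, get stopped, or get hurt, plus a link to a previously generated neighbour when one gets reused.

```javascript
// Made-up sketch of the constrained scene graph described above; none of this
// is from the article.
const WALL_EDGES = 10; // every scene gets exactly 10 wall edges, plus floor and ceiling

// What the player can do at a given edge; deciding this for non-deterministic
// generated content is the hard part.
const EdgeKind = { PASSABLE: "passable", SOLID: "solid", HAZARD: "hazard" };

function makeScene(id) {
  return {
    id,
    floor: { kind: EdgeKind.SOLID },
    ceiling: { kind: EdgeKind.SOLID }, // no flying, so this never links anywhere
    walls: Array.from({ length: WALL_EDGES }, () => ({
      kind: EdgeKind.SOLID,
      neighbor: null, // set when a previously generated scene gets reused here
    })),
  };
}

// Join two scenes at a shared wall edge so already-generated scenes are reused
// instead of regenerated.
function join(a, edgeA, b, edgeB) {
  a.walls[edgeA] = { kind: EdgeKind.PASSABLE, neighbor: b.id };
  b.walls[edgeB] = { kind: EdgeKind.PASSABLE, neighbor: a.id };
}
```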
I know you didn’t mean it like this, but this is kind of an insult to the insane amounts of work that go into crafting just the RNG systems behind roguelikes.
Or pair something like this with SLAM to track the motion and constrain its generation: feed it the localisation/particle/Kalman filter state (or whatever map representation) as additional context, and it should be able to form consensus fairly quickly?
(Half-baked thoughts)
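(Roughly this, with every name invented and nothing resembling a real library: the tracker's pose estimate just becomes more conditioning for the generator.)

```javascript
// Invented sketch of the half-baked thought above: a SLAM-style pose estimate
// becomes extra conditioning for the generator. kalman.update() and
// generateFrame() are placeholders, not any real API.
function step(kalman, controls, observation, generateFrame) {
  const pose = kalman.update(controls, observation); // localisation estimate
  // The generator is told where the tracker thinks the camera is and what has
  // been mapped so far, and is asked for a frame consistent with both.
  return generateFrame({ pose, map: kalman.map, previous: observation });
}
```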
I was a bit irritated by this at first as well, but then the game Myst came to mind.
So I'm willing to accept the limitation, and at this point we know that this can only get better. Next I thought about the likelihood of Nvidia releasing an AI game engine, or rather a fully AI-based renderer. It should be happening within the next 10 years.
Imagine creating a game by describing scenes, like the ones in the article, with a good morphing technology between scenes, so that the transitions between them are like auto-generated scenes which are just as playable.
The effects shown in the article were very interesting, like the ripple, sonar or wave. The wave made me think about how trippy games could get in the future, more extreme versions of the Subnautica video [0] which was released last month.
We could generate video games which would periodically slip into hallucinations, a thing that is barely doable today, akin to shader effects in Far Cry or other games when the player gets poisoned.
It's "old news" I guess at this point, but the AI Minecraft demo (every frame generated from the previous frame, no traditional "engine") is still the most impressive thing to me in this space https://oasis.us.decart.ai/welcome There are some interesting "speed runs" people have been doing like https://www.youtube.com/watch?v=3UaVQ5_euw8
We might all be dead in 10 years, but with big tech companies making their plays, all the VC money flowing into new startups, and nuclear plants being brought online to power the next base model training runs, there's room for a little mild entertainment like these sorts of gimmicks in the next 3 years or so. I doubt anything that comes of it will top even my top 15 video games list though.
> We might all be dead in 10 years, but with big tech companies making their plays, all the VC money flowing into new startups, and nuclear plants being brought online to power the next base model training runs, there's room for a little mild entertainment like these sorts of gimmicks in the next 3 years or so. I doubt anything that comes of it will top even my top 15 video games list though.
That’s a contender for the most depressing tradeoff ever. “Yeah, we’ll all die in agony way before our time, but at least we got to play with a neat but ultimately underwhelming tool for a bit”.
You’re describing a pie in the sky. A vision. Not reality. We have been burned many times already, nothing in this field is a given.
> at this point we know that this can only get better.
We don’t know that. It will probably get better, but will it be better enough? No one knows.
> It should be happening within the next 10 years.
Every revolution in tech is always ten years away. By now that’s a meme. Saying something is ten years away is about as valuable as saying one has no idea how doable it is.
> Imagine
Yes, I understand the goal. Everyone does, it’s not complicated. We can all imagine Star Trek technology, we all know where the compass is pointed, that doesn’t make it a given.
In fact, the one thing we can say for sure about imagining how everything will be great in ten years is that we routinely fail to predict the bad parts. We don’t live in fantasy land; advancements in tech are routinely put to detrimental uses.
> Imagine creating a game by describing scenes, like the ones in the article, with a good morphing technology between scenes, so that the transitions between them are like auto-generated scenes which are just as playable.
Why do you think such a game would be good? I'm not a game maker, but the visual layer is not the reason people like or enjoy a game (e.g. Nintendo). There are teams of professionals making games today that range from awful to great. I get that there are indie games made by a single person that will benefit from generated graphics, but asset creation seems to be a really small part of it.
“We are hard at work improving the size and fidelity of our generated worlds”
I imagine the further you move from the input image, the more the model has to make up information and the harder to keep it consistent. Similar problem with video generation.
> I imagine the further you move from the input image, the more the model has to make up information and the harder to keep it consistent. Similar problem with video generation.
Which is the same thing as saying this may turn out to be a dud, like so many other things in tech and the current crop of what we’re calling AI.
Like I said, I get this is an early demo, but don’t oversell it. They could’ve started by being honest and clarifying that they’re generating scenes (or whatever you want to call them, but they’re definitely not “worlds”), letting you play a bit, then explaining the potential and progress. As it is, it just sounds like they want to immediately wow people with a fantasy, and that detracts from what they do have.
Maybe they think it's a good deal, producing some oversold tech demos in exchange for a decade's worth of funding and not having to produce anything more than an "Our Incredible Journey" letter at the end. The prospect of replacing all human labor has made it easier than ever to run the grift on investors in this time of peak FOMO.
Fair criticism. I’m also not a fan of hyperbole. I still find World Labs’ stuff super intriguing, and I’m optimistic they’ll be able to fulfill the vision.
In general, it depends on how much the model ends up "understanding" the input. (I use "understand" here in the sense some would claim SOTA LLMs do.)
You can imagine this as a spectrum. On one end you have models that, at each output pixel, try to predict pixels that are locally similar to ones in the previous frame; on the other end, you could imagine models that "parse" the initial input image to understand the scene, i.e. the objects (buildings, doors, people, etc.) and their relationships, and, separately, the style with which they're painted, and use that to extrapolate further frames[0]. The latter would obviously fare better, remaining stylistically consistent for longer.
(This model claims to be of the second kind.)
The way I see it: a human could do it[1], so there's no reason an ML model wouldn't be able to.
--
[0] - Brute-force approach: 1) "style-untransfer" the input, i.e. style-transfer it to some common style, e.g. photorealistic or sketch, 2) extrapolate the style-untransferred image, and 3) style-transfer the result back using the original input as the style reference. Feels like it should work somewhat okay-ish; wonder if anyone tried that. (Rough sketch after these footnotes.)
[1] - And the hard part wouldn't be extrapolating the scene, but rather keeping the style.
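Here's the brute-force approach from [0] as a sketch; styleTransfer() and outpaint() are stand-ins for whatever models you'd plug in, not real APIs:

```javascript
// Rough sketch of footnote [0]; the two model functions are passed in as
// placeholders, nothing here is an existing library call.
async function extrapolateKeepingStyle(input, styleTransfer, outpaint) {
  // 1) "Style-untransfer": re-render the input in a common, neutral style.
  const neutral = await styleTransfer(input, { styleRef: "photorealistic" });
  // 2) Extrapolate/outpaint the neutral image, where consistency is easiest.
  const extended = await outpaint(neutral);
  // 3) Style-transfer the result back, using the original input as the style reference.
  return styleTransfer(extended, { styleRef: input });
}
```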
This indeed looks more like photogrammetry than a diffusion model predicting the next frame. There's 3D information extracted from the input image, and likely additional generated poses that allow reconstructing the scene with Gaussian splats. Not sure how much segmentation (understanding of each part of the scene) is going on. Probably not much, if I had to guess.
Models are really great at making stuff up though. And video models already have very good consistency over thousands of frames. It seems like larger worlds shouldn't be a huge hurdle. I wonder why they launched without that, as this doesn't seem much better than previous work.
As someone completely not involved in this project, I would predict that increasing the scene size while remaining halfway consistent isn't that difficult.
Let me elaborate by using cat-4d.github.io, one of their competitors in this field of research: If you look at the "How it works" section you can see that the first step is to take an input video and then create artificial viewpoints of the same action being observed by other cameras. And then in the 2nd step, those viewpoints are merged into one 4D gaussian splatting scene. That 2nd step is pretty similar to 4D NeRF training, BTW, just with a different data format.
Now if you need a small scene, you generate a few camera locations that are nearby. But there's nothing stopping you from generating other camera locations, or even from reusing previously generated camera locations and moving the camera again, thereby propagating details that the AI invented outwards. So you could imagine it like this: you start with something "real" at the center of the map, then you create AI fakes from different camera positions in a circle around the real stuff, then the next circle around the 1st-gen fakes, and the next circle, and so on. This process is mostly limited by two things. The first is the ability of your AI model to capture a global theme; World Labs has demonstrated that they can invent details consistent with a theme in these demos, so I would assume they've solved this already. The other is computing time: a world box 2x in each direction is 8x the voxel data, and I wouldn't be surprised if you need something like 16x to 32x the number of input images to fit the GSplats/NeRF.
So most likely, the box limit is purely because the AI model is slow and execution is expensive and they didn't want to spend 10,000x the resources for making the box 10x larger.
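A toy version of that circle-by-circle expansion, with every name invented (generateView() stands in for the model; this is not CAT4D's or World Labs' actual pipeline):

```javascript
// Invented sketch of ring-by-ring expansion: generate camera poses in growing
// circles, conditioning each new view on everything produced so far.
function expandWorld(realViews, generateView, rings = 3, camerasPerRing = 8, radiusStep = 5) {
  const known = [...realViews]; // start with the "real" views at the center
  for (let r = 1; r <= rings; r++) {
    const radius = r * radiusStep;
    const count = camerasPerRing * r; // wider rings get more cameras
    for (let i = 0; i < count; i++) {
      const angle = (2 * Math.PI * i) / count;
      const pose = { x: radius * Math.cos(angle), z: radius * Math.sin(angle), yaw: angle };
      // Each fake view is conditioned on everything generated so far, which is
      // how details invented near the center propagate outward.
      known.push(generateView(pose, known));
    }
  }
  return known; // these would then be fed into the splat/NeRF fitting step
}
```

The ring count and the number of cameras per ring are exactly where the cost blows up, which fits the guess that the box size is mostly a compute budget.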
I mean, it's marketing hype for their product. It's a pretty good starting step though, assuming they can build on it and expand that world space as opposed to just converting an image to 3D.
Certainly has some value to it: marketing, hiring, fundraising (assuming it's a private company).
My take is that it's a good start, and 3-4 years from now it will have a lot of potential value in world creation if they can take the next steps.
It's definitely a balancing act. World Labs was in stealth for a bit. Without a brand, a stated mission, or examples/demos of what you are capable of, it's harder to hire, fundraise, or get the attention and mind-share you need once you are ready to ship product.
The risk is setting expectations that can't be fulfilled.
I'm in the 3D space and I'm optimistic about World Labs.
Obviously, the generation has to stop at some point and obviously from any key image you could continue generating if you had unlimited GPU, which I’m sorry they didn’t provide for you.
I am not sure it's obvious that you could continue generating from any key image and it wouldn't deteriorate into mush. If you take that museum scene and look at the vase-like display piece while walking around it as much as you can, it already becomes fuzzy and has the beginnings of weird artifacts growing out of it.
I was also disappointed by a still image showing a vast sky, but in motion you see the model interpreted it as just a painting on a short ceiling.
I get these are early steps, but they oversold it.