I'm not convinced. We have "hyper" and "lightning" diffusion models that run in 1-4 steps and are pretty quick on consumer hardware. I honestly have no idea which approach would end up quicker with some optimizations and hardware tailored to the use case.
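For reference, here's roughly what a few-step img2img pass looks like with the diffusers LCM-LoRA workflow. Just a sketch: the model ids, prompt, and settings are illustrative placeholders, not a claim about what a game would actually ship.

```python
# Sketch only: a ~4-step img2img pass using an LCM LoRA with Hugging Face diffusers.
# Model ids and parameters are illustrative; a game-oriented setup would use a
# distilled model trained for the purpose, not an off-the-shelf SD checkpoint.
import torch
from diffusers import AutoPipelineForImage2Image, LCMScheduler
from PIL import Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

frame = Image.open("raw_frame.png").convert("RGB")  # hypothetical input frame
out = pipe(
    prompt="photorealistic lighting",
    image=frame,
    num_inference_steps=4,  # the "1-4 step" regime being discussed
    strength=0.5,           # keep most of the underlying frame
    guidance_scale=1.0,     # LCM-style models want little or no CFG
).images[0]
out.save("relit_frame.png")
```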
The hard part is keeping everything coherent over time in a dynamic scene with a dynamic camera. Hallucinating vaguely plausible lighting may be adequate for a still image, but not so much in a game if you hallucinate shadows or reflections of off-screen objects that aren't really there, or "forget" that off-screen objects exist, or invent light sources that make no sense in context.
The main benefit of raytracing in games is that it has accurate global knowledge of the scene beyond what's directly in front of the camera, as opposed to earlier approximations which tried to work with only what the camera sees. Img2img diffusion is the ultimate form of the latter approach in that it tries to infer everything from what the camera sees, and guesses the rest.
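To make that distinction concrete, here's a deliberately silly toy in Python. None of this is any engine's API; it just shows that a screen-space lookup structurally cannot see objects that were never rasterized, while a query against the full scene can.

```python
# Toy illustration of the screen-space vs. global-scene distinction. Everything
# here is made up for the sake of the example; it's not any engine's API.
from dataclasses import dataclass

@dataclass
class Sphere:
    x: float
    y: float
    color: str

# Camera at the origin looking along +y with a 90-degree FOV.
scene = [
    Sphere(0.0, 5.0, "red"),    # in view
    Sphere(0.0, -5.0, "blue"),  # behind the camera, never rasterized
]

def on_screen(s):
    return s.y > 0 and abs(s.x) <= s.y

def screen_space_lookup(ray_y):
    # Screen-space techniques can only hit what already made it into the frame.
    for s in (s for s in scene if on_screen(s)):
        if (ray_y > 0) == (s.y > 0):
            return s.color
    return None  # off-screen object: no information, so you guess (or hallucinate)

def scene_ray_query(ray_y):
    # A real ray query against the full scene still knows about off-screen geometry.
    for s in scene:
        if (ray_y > 0) == (s.y > 0):
            return s.color
    return None

print(screen_space_lookup(-1.0))  # None: the blue sphere doesn't exist as far as the screen knows
print(scene_ray_query(-1.0))      # 'blue': global knowledge of the scene
```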
Right, but I'm not actually suggesting we use diffusion. At least, not the same models we're using now. We need to incorporate a few sample rays at least so that it 'knows' what's actually off-screen, and then we just give it lots of training data of partially rendered images and fully rendered images so that it learns how to fill in the gaps. It shouldn't hallucinate very much if we do that. I don't know how to solve for temporal coherence though -- I guess we might want to train on videos instead of still images.
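Concretely, I'm picturing a training setup roughly like this. Pure sketch under my own assumptions about the inputs (a sparse low-sample render plus some G-buffer channels as conditioning, the converged render as the target); the dataset and the tiny network are placeholders for whatever you'd actually use.

```python
# Rough sketch of the proposed training setup, assuming pairs of
# (sparse low-sample render + G-buffer features) -> (fully converged render).
# PartialRenderDataset and GapFiller are placeholders, not a real pipeline.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

class PartialRenderDataset(Dataset):
    """Hypothetical dataset yielding (conditioning, target) image pairs."""
    def __init__(self, n: int = 256):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        cond = torch.rand(9, 128, 128)    # e.g. 1-spp color + albedo + normals
        target = torch.rand(3, 128, 128)  # converged reference render
        return cond, target

class GapFiller(nn.Module):
    """Stand-in for whatever image-to-image backbone you'd actually use."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(9, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

model = GapFiller()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = DataLoader(PartialRenderDataset(), batch_size=8)

for cond, target in loader:
    opt.zero_grad()
    loss = nn.functional.l1_loss(model(cond), target)  # learn to fill in the gaps
    loss.backward()
    opt.step()
```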
Also, that new Google paper that generates entire games from a single image has up to 60 seconds of 'memory', I think they said, so I don't think the "forgetting" is actually that big of a problem, since we can refresh the memory with a properly rendered image at least that often.
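The refresh idea would be something like the loop below. Pure sketch: the 60-second figure is from my recollection of that paper, and Renderer/Generator are stubs, not real APIs.

```python
# Sketch of the "re-ground it periodically" idea: interleave cheap generated frames
# with an occasional properly rendered frame so drift can't accumulate past the
# model's memory horizon. Renderer and Generator are stubs, not real APIs.
class Renderer:
    def full_render(self, t):
        return f"ground-truth frame {t}"  # stand-in for a real fully rendered frame

class Generator:
    def predict(self, context, t):
        return f"generated frame {t} (conditioned on: {context})"

REFRESH_EVERY = 60 * 60  # frames; roughly once a minute at 60 fps

renderer, generator = Renderer(), Generator()
context = renderer.full_render(0)
for t in range(1, 60 * 60 * 2):           # two minutes of frames at 60 fps
    if t % REFRESH_EVERY == 0:
        context = renderer.full_render(t)  # expensive frame resets the model's memory
    frame = generator.predict(context, t)  # cheap hallucinated frames in between
```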
I'm just spitballing here though -- for all I know, Unreal 5.4 or 5.5 has already put this into practice with their new lighting system.
> We need to incorporate a few sample rays at least so that it 'knows' what's actually off-screen, and then we just give it lots of training data of partially rendered images and fully rendered images so that it learns how to fill in the gaps.
That's already a thing: there are ML-driven denoisers that take a rough raytraced image and do their best to infer what the fully converged image would look like based on their training data. For example, in the offline rendering world there's Nvidia's OptiX denoiser and Intel's OIDN, and in the realtime world there's Nvidia's DLSS Ray Reconstruction, which uses an ML model to do upscaling and denoising at the same time.
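To give a feel for the general shape of these: they typically feed the noisy radiance plus auxiliary buffers like albedo and normals into a network trained against converged renders. A minimal sketch of that input/output contract is below; it's definitely not the actual OptiX, OIDN, or Ray Reconstruction architecture, just the idea.

```python
# Minimal sketch of the ML-denoiser idea: noisy low-sample radiance plus albedo
# and normal buffers in, denoised color out. Purely illustrative placeholder,
# not any vendor's real network.
import torch
import torch.nn as nn

class AuxBufferDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 (noisy color) + 3 (albedo) + 3 (normal) = 9 input channels
        self.net = nn.Sequential(
            nn.Conv2d(9, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )
    def forward(self, noisy, albedo, normal):
        return self.net(torch.cat([noisy, albedo, normal], dim=1))

# Dummy buffers standing in for a renderer's output:
denoiser = AuxBufferDenoiser()
noisy = torch.rand(1, 3, 256, 256)
albedo = torch.rand(1, 3, 256, 256)
normal = torch.rand(1, 3, 256, 256)
clean = denoiser(noisy, albedo, normal)  # would be trained against converged renders
```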