This does look more like photogrammetry than a diffusion model predicting the next frame. There's 3D information extracted from the input image, and likely additional generated poses, which together allow the scene to be reconstructed with Gaussian splats. I'm not sure how much segmentation (understanding of each part of the scene) is going on. Probably not much, if I had to guess.