If I understand things correctly, this relies extremely heavily on learned prior shapes, meaning it'll guess depth from a monocular image and then go from there. In line with that, it uses a Vision Transformer like MiDaS (Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer).
But what that also means is that this is closer to generative AI than to objective measurements. If the image-to-depth estimation goes very wrong, it might hallucinate shapes that aren't there.
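If you want a feel for how far a learned prior can go from a single frame, here's a minimal sketch of MiDaS-style monocular depth estimation via torch.hub. The model variant, the image file, and the resize-back step are my choices for illustration, not anything taken from DUSt3R itself:

```python
# Minimal sketch: relative depth from one image with MiDaS via torch.hub.
# Assumes torch + opencv are installed; "room.jpg" is a hypothetical file.
import cv2
import torch

model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
model.eval()

transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform  # matches the MiDaS_small variant

img = cv2.cvtColor(cv2.imread("room.jpg"), cv2.COLOR_BGR2RGB)
batch = transform(img)  # resize + normalize, shape (1, 3, H', W')

with torch.no_grad():
    prediction = model(batch)  # relative inverse depth, shape (1, H', W')
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()  # back to the original image resolution

print(depth.shape, depth.min().item(), depth.max().item())
```

The output is relative (inverse) depth, i.e. exactly the kind of prior-driven guess described above: plausible, but not a calibrated measurement.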
> But what that also means is that this is closer to generative AI than to objective measurements. If the image-to-depth estimation goes very wrong, it might hallucinate shapes that aren't there.
But people do that all the time too. Relying on priors is fine for many practical applications and sometimes there's no way around it.
It seems like we're getting more and more generalist approaches that are less task-specific and combine a lot of what used to be individual steps and techniques. In doing so they not only become conceptually simpler but, surprisingly, more accurate as well. Possibly because a unified approach is more integrated and thus better at filling the gaps in one sub-problem with information from other sub-problems.
As someone with relatively little knowledge of machine learning, only a rough overview of MVS systems, and a desire to understand what these people have done, does anyone have suggestions on where to start learning so that I could play with these ideas within one lifetime? Would I be wasting my time on the basics of machine learning if stereo vision is my only interest?
I played around with DUSt3R last night using an iPhone with different lenses (whether you consider this a different camera or not, I defer to you). It worked well. Note that the camera intrinsics aren't used here, so it makes sense that it would tolerate different lenses or cameras. I did not test wildly divergent lens types (e.g., a fisheye lens).
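For context on what "intrinsics" means here: a classical pipeline needs the camera matrix below to relate 3D points to pixels, and it changes when you swap lenses. This is just a textbook pinhole-projection sketch with made-up focal length and principal point, not anything measured from an iPhone or taken from the DUSt3R pipeline:

```python
import numpy as np

# Hypothetical pinhole intrinsics: fx, fy in pixels, (cx, cy) principal point.
# Switching lenses changes these numbers, which is what a calibrated pipeline
# would have to account for and DUSt3R apparently does not need as input.
K = np.array([
    [1400.0,    0.0, 960.0],
    [   0.0, 1400.0, 540.0],
    [   0.0,    0.0,   1.0],
])

def project(points_cam: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project Nx3 camera-frame points to Nx2 pixel coordinates."""
    uvw = points_cam @ K.T           # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide by depth

pts = np.array([[0.1, -0.2, 2.0], [0.5, 0.3, 3.5]])
print(project(pts, K))
```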
Great! Now I need an ELI5 tutorial on taking pix of a property and generating a 3D model for 3D printing. Then the kid can request buildings for the model train layout!
Different poses? I don't think so, at least not with the current setup.
For example, I tried this with a dog (walking around the seated dog, taking photos as I did so). The dog turned her head while I was taking photos. The portion of the head that moved was not represented in the final output.
Pretty impressive. I wonder, though, why it was necessary to put the point map of the second image into the coordinate frame of the first. Isn't it all the same from the point of view of the neural net?
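One practical reason (my reading, not necessarily the authors' stated one): with both pointmaps expressed in the first camera's frame, the two predictions live in a single coordinate system, so the relative camera pose and a merged point cloud come out essentially for free. If each map were in its own camera frame, you'd have to recover that pose yourself, e.g. with a rigid Procrustes/Kabsch alignment over corresponding points, roughly like this sketch (the function and the toy data are mine, not from the paper):

```python
import numpy as np

def kabsch(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid transform (R, t) such that R @ src_i + t ~= dst_i.
    src, dst: Nx3 arrays of corresponding 3D points."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t

# Toy example: the same points expressed in camera 2's frame and in camera 1's frame.
rng = np.random.default_rng(0)
pts_cam2 = rng.normal(size=(100, 3))
R_true = np.array([[0.0, -1.0, 0.0],
                   [1.0,  0.0, 0.0],
                   [0.0,  0.0, 1.0]])  # 90-degree rotation about z
t_true = np.array([0.5, -0.1, 2.0])
pts_cam1 = pts_cam2 @ R_true.T + t_true

R_est, t_est = kabsch(pts_cam2, pts_cam1)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

Predicting both maps in a shared frame lets the network absorb this alignment implicitly, which is presumably simpler and more robust than bolting it on afterwards.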
Getting a 3D view from a few pictures of an apartment listing: https://twitter.com/JeromeRevaud/status/1764035510236758096
Two pictures of a kitchen: https://x.com/janusch_patas/status/1764025964915302400
Two pictures of an office without any overlap: https://x.com/JeromeRevaud/status/1763495315389165963