DUSt3R: Geometric 3D Vision Made Easy (naverlabs.com)
145 points by smusamashah 10 months ago | 16 comments



People have been posting some really interesting and useful use cases of this tech

Getting a 3D view from a few pictures of an apartment listing: https://twitter.com/JeromeRevaud/status/1764035510236758096

Two pictures of a kitchen: https://x.com/janusch_patas/status/1764025964915302400

Two pictures of an office without any overlap: https://x.com/JeromeRevaud/status/1763495315389165963


If I understand things correctly, this relies extremely heavily on learned prior shapes, meaning it'll guess depth from a monocular image and then go from there. In line with that, it uses a Vision Transformer like MiDaS (Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer).

That's why it can still reconstruct a scene even if the images do not overlap at all: https://twitter.com/JeromeRevaud/status/1763495315389165963

But what that also means is that this is closer to generative AI than to objective measurements. If the image-to-depth estimation goes very wrong, it might hallucinate shapes that aren't there.
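To get a feel for the monocular-prior part, here's a rough sketch of running a MiDaS DPT model through torch.hub, following the usage I remember from the MiDaS repo (the model choice and the input image "kitchen.jpg" are just placeholders):

  import cv2
  import torch

  # load a MiDaS depth model and its matching input transform from torch.hub
  midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
  midas.eval()
  midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
  transform = midas_transforms.dpt_transform

  img = cv2.cvtColor(cv2.imread("kitchen.jpg"), cv2.COLOR_BGR2RGB)
  with torch.no_grad():
      prediction = midas(transform(img))
      # upsample the low-resolution prediction back to the input size
      depth = torch.nn.functional.interpolate(
          prediction.unsqueeze(1), size=img.shape[:2],
          mode="bicubic", align_corners=False).squeeze()

The output is relative inverse depth, i.e. a learned guess rather than a metric measurement, which is exactly why anything built on such a prior inherits its failure modes.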


> But what that also means is that this is closer to generative AI than to objective measurements. If the image-to-depth estimation goes very wrong, it might hallucinate shapes that aren't there.

But people do that all the time too. Relying on priors is fine for many practical applications and sometimes there's no way around it.


This is awesome. Kudos. You have way more respect in my eyes since not only did you post your paper, you also posted the source.

Too many times I’ve read claims without source code, so no one can reproduce and verify the results. Now I can, and have, verified the results. Top notch.


Agreed. And not just the source, but a fully functional local demo! Runs great on my M1 Pro using:

  PYTORCH_ENABLE_MPS_FALLBACK=1 python3.10 demo.py --weights checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth --device 'mps'
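If you'd rather script it than click through the Gradio demo, the repo also exposes a Python API. From memory of the README it looks roughly like the sketch below; the module paths, checkpoint name and aligner options are as I recall them and may differ in your checkout:

  from dust3r.inference import inference
  from dust3r.model import AsymmetricCroCo3DStereo
  from dust3r.utils.image import load_images
  from dust3r.image_pairs import make_pairs
  from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

  device = "mps"  # or "cuda" / "cpu"
  model = AsymmetricCroCo3DStereo.from_pretrained(
      "checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth").to(device)
  images = load_images(["img1.jpg", "img2.jpg"], size=512)  # your own photos
  pairs = make_pairs(images, scene_graph="complete", prefilter=None, symmetrize=True)
  output = inference(pairs, model, device, batch_size=1)

  # fuse the pairwise pointmaps into one common coordinate frame
  scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
  scene.compute_global_alignment(init="mst", niter=300, schedule="cosine", lr=0.01)
  pts3d = scene.get_pts3d()     # per-image pointmaps in the common frame
  poses = scene.get_im_poses()  # recovered camera poses
  focals = scene.get_focals()   # recovered focal lengths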


Am I imagining things or is there a trend here?

Seems like we get more and more generalist approaches which are less specific and combine a lot of what used to be individual steps and techniques. In doing so they not only become conceptually simpler but surprisingly more accurate as well. Possibly because a unified approach is more integrated and thus better at filling the gaps in one sub-problem with information from other sub-problems.


Somewhat related to The Bitter Lesson (though perhaps you’ve already read it):

> One thing that should be learned from the bitter lesson is the great power of general purpose methods

http://www.incompleteideas.net/IncIdeas/BitterLesson.html


That's the trend since 2012 basically, when deep learning took over from hand-tuned feature extraction for image classification.

The fiddly, brittle and multi-step nature of 3D vision endured longer but is going through the same transformation.


As someone with relatively little knowledge of machine learning, an overview-level understanding of MVS (multi-view stereo) systems, and a desire to understand what these people have done: does anyone have suggestions on where to start learning so that I could play with these ideas within one lifetime? Would I be wasting my time on the basics of machine learning if stereo vision is my only interest?


Does this method also work when different cameras (or different zoom and other camera settings) are used for every input image?


I played around with DUSt3R last night using an iPhone with different lenses (whether you consider this a different camera or not, I defer to you). It worked well. Note that the camera intrinsics aren't used here, so it makes sense that it would tolerate different lenses or cameras. I did not test wildly divergent lens types (e.g., a fisheye lens).
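For anyone wondering how it gets away without intrinsics: the network predicts a 3D point per pixel in the first camera's frame, so a focal length can be fitted afterwards. A minimal numpy sketch of the least-squares idea, assuming the principal point is the image centre (the paper, as I recall, uses a more robust iterative fit):

  import numpy as np

  def estimate_focal(pointmap):
      """Fit one focal length to an (H, W, 3) pointmap expressed in the
      camera frame, assuming the principal point is the image centre."""
      H, W, _ = pointmap.shape
      v, u = np.mgrid[0:H, 0:W]
      u = u - (W - 1) / 2.0  # pixel coordinates relative to the principal point
      v = v - (H - 1) / 2.0
      x = pointmap[..., 0] / pointmap[..., 2]  # normalised camera coordinates
      y = pointmap[..., 1] / pointmap[..., 2]
      # least-squares f minimising sum ||(u, v) - f * (x, y)||^2
      return float((u * x + v * y).sum() / (x * x + y * y).sum())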


Great! Now I need an ELI5 tutorial on taking pix of a property and generating a 3D model for 3D printing. Then the kid can request buildings for the model train layout!
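Not quite ELI5, but once you have a point cloud out of it, Open3D will take you most of the way to a printable mesh. A rough sketch, assuming a hypothetical points.ply export and some patience tuning the Poisson depth and cleanup:

  import open3d as o3d

  pcd = o3d.io.read_point_cloud("points.ply")    # hypothetical export of the reconstruction
  pcd = pcd.voxel_down_sample(voxel_size=0.01)   # thin out redundant points
  pcd.estimate_normals()                         # Poisson reconstruction needs normals
  pcd.orient_normals_consistent_tangent_plane(30)

  # Poisson surface reconstruction -> triangle mesh
  mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
  mesh = mesh.simplify_quadric_decimation(100_000)  # keep the STL manageable
  mesh.compute_vertex_normals()                     # STL export wants normals
  o3d.io.write_triangle_mesh("building.stl", mesh)  # slice and print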


Can this be used for body measurement, e.g. 4 shots in different poses combined together? What kind of accuracy might you get, if so? Just curious.


Different poses? I don't think so, at least not with the current setup.

For example, I tried this with a dog (walking around the seated dog, taking photos as I did so). The dog turned her head while I was taking photos. The portion of the head that moved was not represented in the final output.


Could this somehow be used to generate a Zillow-style tour of a physical space?


Pretty impressive. I wonder though why it was necessary to put the pointmap of the second image into the coordinate frame of the first. Isn’t it all the same from the point of view of the neural net?



