User fishbotics already answers a lot of these questions downstream, but just confirming it here as an author of the project/paper:
> - How does it know what objects are? Does it use some sort of realtime object classifier neural net? What limitations are there here?
We use Lang-SAM (https://github.com/luca-medeiros/lang-segment-anything) to do most of this, with CLIP embeddings (https://openai.com/research/clip) doing most of the heavy lifting of connecting image and text. One of the nice properties of using CLIP-like models is that you don't have to specify in advance the classes you may want to query later; you can just come up with them at runtime.
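To make that concrete, here's a minimal sketch of open-vocabulary scoring with CLIP via the Hugging Face transformers API. This is not our exact pipeline (which goes through Lang-SAM for segmentation), and the checkpoint name, image path, and query strings below are just placeholders:

```python
# Minimal sketch of open-vocabulary querying with CLIP: queries are plain
# strings invented at runtime, no fixed class list is needed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene_crop.jpg")  # e.g. a mask crop from a segmenter like Lang-SAM
queries = ["a coffee mug", "a kettlebell", "a cardboard box"]  # made up at runtime

inputs = processor(text=queries, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Cosine-similarity-based logits between the image and each text query.
probs = outputs.logits_per_image.softmax(dim=-1)
best = queries[probs.argmax().item()]
print(f"best match: {best} ({probs.max().item():.2f})")
```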
> - Does the robot know when it can't perform a request? I.e. if you ask it to move a large box or very heavy kettlebell?
Nope! As it is right now, the models are very simple and they don't try to do anything fancy. However, that's why we opened up our code: so the community can build smarter robots on top of this project that can use even more visual cues about the environment.
> - How well does it do if the object is hidden or obscured? Does it go looking for it? What if it must move another object to get access to the requested one?
It fails when the object is hidden or obscured in the initial scan, but once again we think it could be a great starting point for further research :) One of the nice things, however, is that we take full 3D information into consideration, so even if some object is visible from only some of the angles, the robot has a chance to find it.
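In code, the idea looks roughly like the sketch below (hypothetical, not our actual implementation): once per-view image features have been fused into a shared set of voxels, a runtime text query just looks for the most similar voxel, so an object seen from only a few viewpoints can still be localized.

```python
import torch
import torch.nn.functional as F

def localize(voxel_xyz, voxel_feats, text_feat):
    """Return the 3D voxel center that best matches a CLIP text feature.

    voxel_xyz:   (N, 3) voxel centers in world coordinates
    voxel_feats: (N, D) image features fused from whichever views saw each voxel
    text_feat:   (D,)   CLIP text feature for the runtime query
    """
    sim = F.normalize(voxel_feats, dim=-1) @ F.normalize(text_feat, dim=-1)
    return voxel_xyz[sim.argmax()]  # 3D point to navigate/grasp toward
```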
A large motivation behind this line of home-robot work for me is thinking about the elderly, people with disabilities, or busy parents who simply don't have enough time to do it all. I am personally hopeful that we can teach AI to take the jobs that no one wants rather than the jobs that everyone wants :)
It's always a trade-off! You can have more accurate sensors and motors that are more expensive, or cheaper motors with no sensors and higher accumulated error. Since this is more of a research project than a product, we went for a cheap robot with the slower-but-more-accurate approach.
Encoders are not that expensive, and they don't have to be integrated into the motor. I've done this stuff before; it's not so costly, and it really improves the entire system.
The title has multiple meanings: some credit definitely should go to OK Computer/Radiohead, but also to "OK Google" for controlling a home assistant, open-knowledge (OK) models, etc.
No, although it has some of the same people on the team (aka I'm the first author there, and my advisor is advising both projects :) )
The primary difference is that this is zero-shot (meaning the robot needs 0 (zero!) new data in a new home) but has only two skills (pick and drop), whereas Dobb-E can have many skills but will need you to give some demonstrations in a new home.
The mapping process can be done with any RGB-D camera; we use an iPhone Pro, but any Apple device with ARKit should work. Once we have a sequence of RGB-D images with associated camera poses, we can just back-project the pixels (and any associated information, like CLIP embeddings) into voxels using the depth.
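For anyone curious what that back-projection step looks like, here's a rough sketch under assumed conventions (pinhole intrinsics K, 4x4 camera-to-world pose T from ARKit, depth in meters, an assumed voxel size); the real pipeline also carries per-pixel features such as CLIP embeddings along into the voxels:

```python
import numpy as np

def backproject_to_voxels(depth: np.ndarray, K: np.ndarray, T: np.ndarray,
                          voxel_size: float = 0.05) -> np.ndarray:
    """Back-project one depth frame into world-space voxel indices."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0
    u, v, z = u.reshape(-1)[valid], v.reshape(-1)[valid], z[valid]

    # Pixel -> camera coordinates using the pinhole model.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # (N, 4) homogeneous

    # Camera -> world coordinates with the per-frame camera pose.
    pts_world = (T @ pts_cam.T).T[:, :3]

    # Quantize into voxel indices; duplicate points collapse into one voxel.
    return np.unique(np.floor(pts_world / voxel_size).astype(int), axis=0)
```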