That's very cool. I have almost no experience with robotics, so excuse the silly questions:
- How does it know what objects are? Does it use some sort of realtime object classifier neural net? What limitations are there here?
- Does the robot know when it can't perform a request? I.e. if you ask it to move a large box or very heavy kettlebell?
- How well does it do if the object is hidden or obscured? Does it go looking for it? What if it must move another object to get access to the requested one?
Disclaimer: I'm not one of the authors, but I work in this area.
You basically hit the nail on the head with these questions. This work is super cool, but you named a lot of the limitations with contemporary robot learning systems.
1. It's using an object classifier. It's described here (https://github.com/ok-robot/ok-robot/tree/main/ok-robot-navi...), but if I understand it correctly, they use a ViT model (basically an image classification model) to label images and project those labels onto a voxel grid, then pair language with the voxel grid using embeddings from CLIP (a rough sketch of this idea is at the end of this comment). The limitation is that if they want this to run on the robot, they can't use the super huge versions of these models. While they could use a huge model on the cloud, that would introduce a lot of latency.
2. It almost certainly cannot identify invalid requests. There may be requests that aren't covered by its language embeddings, in which case the robot would probably just do nothing. But the system doesn't appear to have any knowledge of physics beyond the hardware limits of the physical controller.
3. Hidden? Almost certainly wouldn't work. The voxel labeling relies on visual input, so without seeing the object it can't label the corresponding voxels. As far as I can tell, it also doesn't do much higher-order reasoning of the sort "a fork is in a drawer, which is in a kitchen, which is often at the back of the house." Partially obscured? That's subject to the limitations of the visual classifier, so it might work. ViT is very good, but it probably depends on how obscured the object is.
The cool thing is that there are solutions to all of these problems, if the more basic problems can be solved more reliably to prove the underlying technology works.
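To make point 1 concrete, here's a minimal sketch of the general idea (not the authors' actual code): a voxel map whose per-voxel features live in CLIP space, queried with a CLIP text embedding. The voxel features and positions are random placeholders standing in for what a real scan (RGB-D frames plus an image/segmentation model) would produce.

```python
# Minimal sketch, not OK-Robot's actual pipeline: query a CLIP-space voxel map
# with free-form text. Voxel features/positions are random placeholders here.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder semantic map: N voxels, each with a 512-d CLIP-space feature
# (normally projected from camera images) and an (x, y, z) position in meters.
num_voxels = 10_000
voxel_features = torch.nn.functional.normalize(torch.randn(num_voxels, 512), dim=-1)
voxel_xyz = torch.rand(num_voxels, 3) * 5.0

def locate(query: str) -> torch.Tensor:
    """Return the position of the voxel whose feature best matches the text query."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = model.get_text_features(**inputs)
    text_feat = torch.nn.functional.normalize(text_feat, dim=-1)
    scores = voxel_features @ text_feat.squeeze(0)  # cosine similarity per voxel
    return voxel_xyz[scores.argmax()]

print(locate("a red coffee mug"))  # -> a 3D goal for the navigation stack
```

Because the query and the map share the CLIP embedding space, the text never has to come from a fixed label set.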
> While they could use a huge model on the cloud, that would introduce a lot of latency.
With all the recent work to make generative AI faster (see Groq for LLMs and fal.ai for Stable Diffusion), I wonder if the latency will become low enough to make this a non-issue, or at least good enough.
If AI/ML home systems become common for consumers before the onboard technology is capable, I could see home caching appliances for LLMs.
Like something that sits next to your router (or more likely, routers that come stock with it).
Does a robot that moves things in a home need this? The challenging decisions are (off the top of my head):
1. What am I picking up? This can be AI in the cloud, as it does not need to be real time.
2. How do I pick it up? Also doable in the cloud; the robot can take its time picking the object up.
3. After pickup, where do I put the object? Localization while moving probably needs to be done locally, but identifying where to put the object down can be done via the cloud; again, no rush.
4. How do I put the object down? Again, the robot can take its time.
You can see in the video that the robot pauses before performing the actions after finding the object in its POV, so real time isn't a hard requirement for a lot of these steps (a rough sketch of such a split is below).
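Here's that split sketched out; all the cloud endpoints and message formats are hypothetical. The only point is the division of labor: the slow decisions tolerate seconds of latency, while the control loops stay on the robot.

```python
# Hypothetical local/cloud split: slow decisions (what to pick, how to grasp,
# where to place) go to a remote model; realtime navigation/control stays local.
import requests

CLOUD = "https://example.invalid/robot-brain"  # hypothetical service

def plan_in_cloud(step: str, payload: dict) -> dict:
    """Non-realtime decision: seconds of round-trip latency is acceptable here."""
    return requests.post(f"{CLOUD}/{step}", json=payload, timeout=30).json()

# Stand-ins for the onboard, realtime parts of the stack.
def drive_to(position):   print(f"[local] navigating to {position}")
def execute_grasp(grasp): print(f"[local] executing grasp {grasp}")
def execute_place(place): print(f"[local] placing object: {place}")

def fetch_and_place(request_text: str, scene_scan: dict) -> None:
    # 1. What am I picking up?   (cloud, not latency-critical)
    target = plan_in_cloud("identify", {"text": request_text, "scan": scene_scan})
    # 2. How do I pick it up?    (cloud; the robot can pause while it waits)
    grasp = plan_in_cloud("grasp", {"object": target})
    drive_to(target["position"])   # local, realtime
    execute_grasp(grasp)           # local, realtime
    # 3./4. Where and how do I put it down? (cloud again, no rush)
    place = plan_in_cloud("place", {"object": target, "request": request_text})
    drive_to(place["position"])
    execute_place(place)
```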
User fishbotics has already answered a lot of these questions elsewhere in the thread, but just confirming here as an author of the project/paper:
> - How does it know what objects are? Does it use some sort of realtime object classifier neural net? What limitations are there here?
We use Lang-SAM (https://github.com/luca-medeiros/lang-segment-anything) to do most of this, with CLIP embeddings (https://openai.com/research/clip) doing most of the heavy lifting of connecting image and text. One of the nice properties of CLIP-like models is that you don't have to specify the classes you want to query ahead of time; you can come up with them at runtime.
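For illustration (this isn't code from our repo, and the exact predict() signature varies between versions of that library), a runtime query looks roughly like this; the prompt string is whatever you type, not a class from a fixed list:

```python
# Illustrative only; check lang-segment-anything's README for the current API,
# since predict()'s signature has changed across releases.
from PIL import Image
from lang_sam import LangSAM  # wraps an open-vocabulary detector + SAM

model = LangSAM()
image = Image.open("scene.jpg").convert("RGB")

# The "class" is invented at query time rather than baked into the model.
masks, boxes, phrases, logits = model.predict(image, "yellow kettlebell")
print(phrases, boxes)  # segmentation masks for grasping come back in `masks`
```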
> - Does the robot know when it can't perform a request? I.e. if you ask it to move a large box or very heavy kettlebell?
Nope! As it is right now, the models are very simple and don't try to do anything fancy. That's part of why we open up our code: so the community can build smarter robots on top of this project that use even more visual cues about the environment.
> - How well does it do if the object is hidden or obscured? Does it go looking for it? What if it must move another object to get access to the requested one?
It fails when the object is hidden or obscured in the initial scan, but once again we think it could be a great starting point for further research :) One nice thing, however, is that we take full 3D information into consideration, so even if an object is visible from only some angles, the robot has a chance to find it.