As far as I have seen, most AI solutions focus on object detection on a single frame. Would temporal memory, or video detection, increase the confidence a lot? I have not seen any solutions that would understand larger context over multiple seconds timespans.
The real kicker would be for it to integrate a 3D model of what it's looking at. But that would require some heuristics of the world which would probably require some other kind of training data than just a bunch of images. Maybe if/when 3D scans and the corresponding 2D images can be acquired en masse together, or if it could be done in a simulated environment with virtual cameras in them?