I'm serious. If you have to rely on monocular, single-image input, then yeah, an ImageNet-trained classifier is going to do better. But it will also mistake every picture of a Coke can for the real thing, and it will be horrifically sensitive to malicious inputs. Much better would be to use two calibrated lenses and do 3D reconstruction, even if you're only using the reconstruction as a sanity check to weed out the NN's false positives.
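
To make the sanity-check idea concrete, here's a rough sketch of how it could look. It assumes you already have a rectified disparity map from the stereo pair (from whatever matcher you like) plus the focal length and baseline from calibration. The function names, the plane-fit test, and the 2 mm tolerance are my own placeholders for illustration, not a tuned or standard recipe: depth comes from Z = f * B / d, and a detection whose depth patch is almost perfectly planar gets flagged as a probable photo rather than a real can.

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Rectified-stereo depth in metres: Z = f * B / d.
    Non-positive disparities (no match) become NaN."""
    d = np.asarray(disparity_px, dtype=float)
    depth = np.full(d.shape, np.nan)
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

def plane_fit_rms(depth_patch):
    """RMS residual (metres) of a least-squares plane z = a*x + b*y + c
    fitted to the finite pixels of a depth patch."""
    h, w = depth_patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.isfinite(depth_patch)
    A = np.column_stack([xs[mask], ys[mask], np.ones(mask.sum())])
    z = depth_patch[mask]
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return float(np.sqrt(np.mean((A @ coeffs - z) ** 2)))

def looks_flat(depth_patch, tol_m=0.002):
    """True if the patch is planar to within tol_m (hypothetical threshold),
    i.e. the NN's detection is probably a picture of the object."""
    return plane_fit_rms(depth_patch) < tol_m

# Synthetic check: a poster 1 m away vs. a can-like cylinder (radius 3.3 cm).
poster = depth_from_disparity(np.full((20, 20), 70.0), focal_px=700.0, baseline_m=0.1)
x = np.linspace(-0.03, 0.03, 20)
can = np.tile(1.0 - np.sqrt(0.033**2 - x**2), (20, 1))
print(looks_flat(poster), looks_flat(can))
```

You'd run the NN first, then call `looks_flat` on the depth patch inside each bounding box and drop the detections that come back planar. A real setup would need a threshold matched to the depth noise at that range, and a tilted poster still counts as flat here because the fit allows a slanted plane.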