Outside of research, transformers are rarely used for computer vision problems and CNNs remain the go-to architecture. You actually need some hacks to get transformers to work with computer vision at a meaningful scale (splitting images into patches and convolving the patches to produce features to feed into the transformer).
> some hacks to get transformers to work with computer vision at a meaningful scale (splitting images into patches and convolving the patches to produce features to feed into the transformer).
Yeah. Even modern CV methods are hacky insofar as they come down to picking the “right” way to apply linear algebra. Convolution layers are hacked-up matrix multiplications that are “inspired” by human vision. Of course, the real reason for the hacks is that the form works in practice.
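For concreteness, the patch “hack” usually boils down to a single strided convolution: cutting the image into non-overlapping patches and projecting each one with a shared linear layer is the same thing as a Conv2d whose stride equals its kernel size. A rough PyTorch sketch (sizes and names are just placeholders, not any particular paper’s code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches and project each patch to an embedding.

    A Conv2d with stride == kernel_size is mathematically equivalent to
    chopping the image into non-overlapping patches and applying one
    shared linear projection to each patch.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, 768) -- a token sequence
        return x

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

That token sequence is what the transformer itself actually sees; everything before it is the hacky part.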
I’ve looked into transformers for semantic segmentation, but the patching aspect seems to make it hard too. Do you have some sources that describe these hacks in detail?
You could do a code search on GitHub. I’m pretty lazy when it comes to coding; I always seem to find a repo that has already implemented an MVP of what I had in mind. There are some gold nuggets on GitHub, like Google’s DDSP implementation, which they published anonymously for the academic paper.
It will be interesting when Tesla and Waymo move to a transformer architecture, but as you wrote, my guess is that it's not yet in production for vision tasks.
Tesla did, as mentioned at their AI Day. It's not a full transformer (i.e., a ViT); they use a transformer decoder to fuse data from the different cameras and decode 3D coordinates directly (à la DETR).
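This is just my reading of the AI Day talk, but the DETR-style idea looks roughly like this: a set of learned object queries cross-attends to the concatenated per-camera features and each query regresses a 3D coordinate directly. A toy sketch (not Tesla's code, and all sizes are made up):

```python
import torch
import torch.nn as nn

class MultiCamDETRHead(nn.Module):
    """Toy DETR-style head: learned object queries cross-attend to features
    from several cameras, and each query regresses a 3D coordinate directly."""
    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.coord_head = nn.Linear(d_model, 3)   # (x, y, z) in the ego frame

    def forward(self, cam_feats):
        # cam_feats: (B, num_cameras, tokens_per_cam, d_model),
        # e.g. the output of per-camera CNN backbones
        B = cam_feats.shape[0]
        memory = cam_feats.flatten(1, 2)           # merge all cameras into one token set
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)
        decoded = self.decoder(queries, memory)    # cross-attention fuses the cameras
        return self.coord_head(decoded)            # (B, num_queries, 3)

# e.g. 8 cameras with 300 feature tokens each
coords = MultiCamDETRHead()(torch.randn(2, 8, 300, 256))
print(coords.shape)  # torch.Size([2, 100, 3])
```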
I’m not sure they will, at least not with the research in its present state. Researchers are interested in vision transformers because they’re competitive with CNNs if you give them enough training data; they don’t drastically outperform them.
Right now switching over to them would require a ton of code changes, relearning intuitions, debugging, profiling, etc. for not a ton of benefit.