Outside of research, transformers are rarely used for computer vision problems and CNNs remain the go-to architecture. You actually need some hacks to get transformers to work with computer vision at a meaningful scale (splitting images into patches and convolving the patches to produce features to feed into the transformer).
> some hacks to get transformers to work with computer vision at a meaningful scale (splitting images into patches and convolving the patches to produce features to feed into the transformer).
Yeah. Even modern CV methods are hacky insofar as they come down to picking the “right” way to apply linear algebra. Convolution layers are hacked-up matrix multiplications that are “inspired” by human vision. Of course, the real reason for the hacks is that the form works in practice.
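For concreteness, the patch “hack” usually boils down to a single strided convolution: cutting the image into non-overlapping patches and projecting each one with a shared linear layer is the same thing as a Conv2d whose stride equals its kernel size. A rough PyTorch sketch (sizes and names are just placeholders, not any particular paper’s code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches and project each patch to an embedding.

    A Conv2d with stride == kernel_size is mathematically equivalent to
    chopping the image into non-overlapping patches and applying one
    shared linear projection to each patch.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, 768) -- a token sequence
        return x

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

That token sequence is what the transformer itself actually sees; everything before it is the hacky part.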
I’ve looked into transformers for semantic segmentation, but the patching aspect seems to make it hard too. Do you have some sources that describe these hacks in detail?
You could do a code search on GitHub. I’m pretty lazy when it comes to coding; I always seem to find a repo that has already implemented an MVP of what I had in mind. There are some gold nuggets on GitHub, like Google’s DDSP implementation, which they published anonymously for the academic paper.
It will be interesting when Tesla and Waymo move to a transformer architecture, but as you wrote, my guess is that it's not yet in production for vision tasks.
Tesla did, as mentioned at their AI Day. It's not a full transformer (i.e., a ViT); they use a transformer decoder to fuse data from the different cameras and decode 3D coordinates directly (à la DETR).
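This is just my reading of the AI Day talk, but the DETR-style idea looks roughly like this: a set of learned object queries cross-attends to the concatenated per-camera features and each query regresses a 3D coordinate directly. A toy sketch (not Tesla's code, and all sizes are made up):

```python
import torch
import torch.nn as nn

class MultiCamDETRHead(nn.Module):
    """Toy DETR-style head: learned object queries cross-attend to features
    from several cameras, and each query regresses a 3D coordinate directly."""
    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.coord_head = nn.Linear(d_model, 3)   # (x, y, z) in the ego frame

    def forward(self, cam_feats):
        # cam_feats: (B, num_cameras, tokens_per_cam, d_model),
        # e.g. the output of per-camera CNN backbones
        B = cam_feats.shape[0]
        memory = cam_feats.flatten(1, 2)           # merge all cameras into one token set
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)
        decoded = self.decoder(queries, memory)    # cross-attention fuses the cameras
        return self.coord_head(decoded)            # (B, num_queries, 3)

# e.g. 8 cameras with 300 feature tokens each
coords = MultiCamDETRHead()(torch.randn(2, 8, 300, 256))
print(coords.shape)  # torch.Size([2, 100, 3])
```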
I’m not sure they will, at least not with the research in its present state. Researchers are interested in vision transformers because they’re competitive with CNNs if you give them enough training data; they don’t drastically outperform them.
Right now switching over to them would require a ton of code changes, relearning intuitions, debugging, profiling, etc. for not a ton of benefit.