I’m not sure they will, at least not with the research in its present state. Researchers are interested in vision transformers because they’re competitive with CNNs if you give them enough training data, not because they drastically outperform them.
Right now, switching over to them would require a ton of code changes, relearning intuitions, debugging, profiling, etc., for not a ton of benefit.