What we have seen in the past 10 years of machine learning is that it's extremely hard to know which technique will turn out to be practical across a vast array of problems. We had CNNs, batch norm, LSTMs, transformers, self-supervised learning, reinforcement learning and a few other techniques that still need to be perfected, and thousands of ideas to build on top of them, but nobody knows the next big thing that will work on real-life problems.
That's how I felt at first, but getting deeper into the Swin Transformer paper it actually makes a fair bit of sense: convolutions can be likened to self-attention ops that can only attend to local neighborhoods around each pixel. That's a fairly sensible assumption for image data, but it also makes sense that more general attention would better capture complex spatial relationships, if you can find a way to make it computationally feasible. Swin Transformers certainly go through some contortions to get there, and I bet we'll see cleaner hierarchical architectures in the future, but the results speak for themselves.
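Not the actual Swin code, just a minimal PyTorch sketch of that point: self-attention computed only inside fixed local windows, so each pixel attends to its neighborhood (conceptually close to what a convolution assumes) while keeping the cost linear in image size.

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping local windows (illustrative only)."""
    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C), with H and W divisible by window_size
        B, H, W, C = x.shape
        w = self.window_size
        # partition the feature map into non-overlapping w x w windows
        x = x.view(B, H // w, w, W // w, w, C)
        windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
        # full self-attention, but only among pixels of the same window
        out, _ = self.attn(windows, windows, windows)
        # reverse the window partition back to a feature map
        out = out.reshape(B, H // w, W // w, w, w, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# usage: a 56x56 feature map with 96 channels, roughly Swin-T's first stage
feat = torch.randn(2, 56, 56, 96)
print(WindowSelfAttention(96, window_size=7)(feat).shape)  # torch.Size([2, 56, 56, 96])
```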
The Transformer in Transformer (TnT) model looks promising - you can set up multiple overlapping domains of attention at arbitrary scales over the input.
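A rough sketch of that nested/multi-scale idea (not the actual TnT code; the names and sizes here are made up for illustration): an inner transformer attends over sub-patches inside each patch, while an outer transformer attends across the patches, giving two attention domains at different scales over the same input.

```python
import torch
import torch.nn as nn

class NestedAttentionBlock(nn.Module):
    def __init__(self, inner_dim=48, outer_dim=192, subs_per_patch=4, heads=4):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(inner_dim, heads, batch_first=True)
        self.outer = nn.TransformerEncoderLayer(outer_dim, heads, batch_first=True)
        # folds a patch's sub-patch tokens back into that patch's token
        self.proj = nn.Linear(inner_dim * subs_per_patch, outer_dim)

    def forward(self, sub_tokens, patch_tokens):
        # sub_tokens:   (batch * num_patches, subs_per_patch, inner_dim)
        # patch_tokens: (batch, num_patches, outer_dim)
        sub_tokens = self.inner(sub_tokens)              # fine-scale attention
        B, P, _ = patch_tokens.shape
        folded = sub_tokens.reshape(B, P, -1)            # concat sub-patch features per patch
        patch_tokens = patch_tokens + self.proj(folded)  # inject fine detail into patch tokens
        return sub_tokens, self.outer(patch_tokens)      # coarse-scale attention

# usage: 196 patches per image, each split into 4 sub-patches
block = NestedAttentionBlock()
subs, patches = block(torch.randn(2 * 196, 4, 48), torch.randn(2, 196, 192))
```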
Not as much as you'd think. The original paper sets up its models so that Swin-T ~ ResNet-50 and Swin-S ~ ResNet-101 in compute and memory usage. They're still a bit heavier in my experience, but I can also do drop-in replacements for ResNets and get better results on the same tasks and datasets, even when the datasets aren't huge.
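To make the drop-in point concrete, here's an illustration assuming the timm library (not necessarily how anyone's specific setup looks): swapping the backbone is a one-line change and the rest of the training loop stays the same.

```python
import timm
import torch

# before: a standard ResNet-50 classifier head with 10 classes
backbone = timm.create_model("resnet50", pretrained=True, num_classes=10)

# after: Swin-T at a roughly comparable compute/memory budget
backbone = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True, num_classes=10)

images = torch.randn(4, 3, 224, 224)
logits = backbone(images)  # shape (4, 10) either way; loss, optimizer, etc. are unchanged
```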
For me it was quite the opposite feeling: after the "Attention Is All You Need" paper I thought convolutions would become obsolete quite fast. AFAIK that still hasn't fully happened; something is still missing in unifying the two approaches.