What we have seen in the past 10 years of machine learning is that it's extremely hard to know which technique will turn out to be practical across a vast array of problems. We had CNNs, batch norm, LSTMs, transformers, self-supervised learning, reinforcement learning and a few other techniques that still need to be perfected, and thousands of ideas to build on top of them, but nobody knows the next big thing that will work on real-life problems.
That's how I felt at first, but getting deeper into the Swin Transformer paper it actually makes a fair bit of sense: convolutions can be likened to self-attention ops that can only attend to local neighborhoods around each pixel. That's a fairly sensible assumption for image data, but it also makes sense that more general attention would better capture complex spatial relationships, if you can find a way to make it computationally feasible. Swin Transformers certainly go through some contortions to get there, and I bet we'll see cleaner hierarchical architectures in the future, but the results speak for themselves.
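Not the actual Swin code, just a minimal PyTorch sketch of that point: self-attention computed only inside fixed local windows, so each pixel attends to its neighborhood (conceptually close to what a convolution assumes) while keeping the cost linear in image size.

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping local windows (illustrative only)."""
    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C), with H and W divisible by window_size
        B, H, W, C = x.shape
        w = self.window_size
        # partition the feature map into non-overlapping w x w windows
        x = x.view(B, H // w, w, W // w, w, C)
        windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
        # full self-attention, but only among pixels of the same window
        out, _ = self.attn(windows, windows, windows)
        # reverse the window partition back to a feature map
        out = out.reshape(B, H // w, W // w, w, w, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# usage: a 56x56 feature map with 96 channels, roughly Swin-T's first stage
feat = torch.randn(2, 56, 56, 96)
print(WindowSelfAttention(96, window_size=7)(feat).shape)  # torch.Size([2, 56, 56, 96])
```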
The Transformer in Transformer (TnT) model looks promising - you can set up multiple overlapping domains of attention at arbitrary scales over the input.
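A rough sketch of that nested/multi-scale idea (not the actual TnT code; the names and sizes here are made up for illustration): an inner transformer attends over sub-patches inside each patch, while an outer transformer attends across the patches, giving two attention domains at different scales over the same input.

```python
import torch
import torch.nn as nn

class NestedAttentionBlock(nn.Module):
    def __init__(self, inner_dim=48, outer_dim=192, subs_per_patch=4, heads=4):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(inner_dim, heads, batch_first=True)
        self.outer = nn.TransformerEncoderLayer(outer_dim, heads, batch_first=True)
        # folds a patch's sub-patch tokens back into that patch's token
        self.proj = nn.Linear(inner_dim * subs_per_patch, outer_dim)

    def forward(self, sub_tokens, patch_tokens):
        # sub_tokens:   (batch * num_patches, subs_per_patch, inner_dim)
        # patch_tokens: (batch, num_patches, outer_dim)
        sub_tokens = self.inner(sub_tokens)              # fine-scale attention
        B, P, _ = patch_tokens.shape
        folded = sub_tokens.reshape(B, P, -1)            # concat sub-patch features per patch
        patch_tokens = patch_tokens + self.proj(folded)  # inject fine detail into patch tokens
        return sub_tokens, self.outer(patch_tokens)      # coarse-scale attention

# usage: 196 patches per image, each split into 4 sub-patches
block = NestedAttentionBlock()
subs, patches = block(torch.randn(2 * 196, 4, 48), torch.randn(2, 196, 192))
```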
Not as much as you'd think. The original paper sets up its models so that Swin-T ~ ResNet-50 and Swin-S ~ ResNet-101 in compute and memory usage. They're still a bit heavier in my experience, but I can also do drop-in replacements for ResNets and get better results on the same tasks and datasets, even when the datasets aren't huge.
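To make the drop-in point concrete, here's an illustration assuming the timm library (not necessarily how anyone's specific setup looks): swapping the backbone is a one-line change and the rest of the training loop stays the same.

```python
import timm
import torch

# before: a standard ResNet-50 classifier head with 10 classes
backbone = timm.create_model("resnet50", pretrained=True, num_classes=10)

# after: Swin-T at a roughly comparable compute/memory budget
backbone = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True, num_classes=10)

images = torch.randn(4, 3, 224, 224)
logits = backbone(images)  # shape (4, 10) either way; loss, optimizer, etc. are unchanged
```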
For me it was quite the opposite feeling: after the "Attention Is All You Need" paper I thought convolutions would become obsolete quite fast. AFAIK that still hasn't fully happened; something is still missing in unifying the two approaches.