That's how I felt at first, but getting deeper into the Swin Transformer paper, it actually makes a fair bit of sense: convolutions can be likened to self-attention ops that can only attend to a local neighborhood around each pixel. That's a sensible assumption for image data, but it also makes sense that more general attention would better capture complex spatial relationships, if you can find a way to make it computationally feasible. Swin Transformers certainly go through some contortions to get there, and I bet we'll see cleaner hierarchical architectures in the future, but the results speak for themselves.
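
To see the analogy concretely, here's a rough PyTorch sketch (my own simplification, not the paper's actual code) of self-attention restricted to non-overlapping local windows, which is the basic trick Swin uses to keep the cost linear in image size:

    import torch
    import torch.nn as nn

    class WindowAttention(nn.Module):
        """Self-attention restricted to non-overlapping local windows.

        Like a convolution, each output position only sees a local
        neighborhood; unlike a convolution, the mixing weights are
        computed dynamically from the content via attention.
        """
        def __init__(self, dim, window_size=7, num_heads=4):
            super().__init__()
            self.window_size = window_size
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, x):
            # x: (B, H, W, C), with H and W divisible by window_size
            B, H, W, C = x.shape
            ws = self.window_size
            # Partition into (ws x ws) windows and flatten each window.
            x = x.view(B, H // ws, ws, W // ws, ws, C)
            x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
            # Full self-attention, but only among the ws*ws tokens of one
            # window, so total cost grows linearly with the number of windows.
            x, _ = self.attn(x, x, x)
            # Reverse the window partition back to the image layout.
            x = x.view(B, H // ws, W // ws, ws, ws, C)
            x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
            return x

    # e.g. a 56x56 feature map with 96 channels, as in a Swin-T-like stage:
    # out = WindowAttention(96)(torch.randn(1, 56, 56, 96))

The real thing also shifts the windows between consecutive blocks so information can flow across window boundaries, but the partition/attend/unpartition loop above is the core of it.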


The Transformer-in-Transformer (TNT) model looks promising: you can set up multiple overlapping domains of attention at arbitrary scales over the input.
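
Roughly, the idea (my own toy PyTorch sketch with made-up dimensions, not the TNT authors' code) is an inner transformer attending within each patch at a fine scale and an outer transformer attending across patches, with the inner output folded back into the patch tokens:

    import torch.nn as nn

    class TinyTNTBlock(nn.Module):
        """Toy nested-attention block in the spirit of Transformer-in-Transformer."""
        def __init__(self, inner_dim=24, outer_dim=192, inner_len=16):
            super().__init__()
            self.inner = nn.TransformerEncoderLayer(inner_dim, nhead=4, batch_first=True)
            self.outer = nn.TransformerEncoderLayer(outer_dim, nhead=4, batch_first=True)
            # Project the flattened inner tokens into the outer embedding space.
            self.proj = nn.Linear(inner_len * inner_dim, outer_dim)

        def forward(self, inner_tokens, outer_tokens):
            # inner_tokens: (B * num_patches, inner_len, inner_dim)  fine-grained tokens
            # outer_tokens: (B, num_patches, outer_dim)              coarse patch tokens
            B, P, _ = outer_tokens.shape
            inner_tokens = self.inner(inner_tokens)
            # Fold the fine-grained information back into the patch tokens.
            outer_tokens = outer_tokens + self.proj(inner_tokens.reshape(B, P, -1))
            outer_tokens = self.outer(outer_tokens)
            return inner_tokens, outer_tokens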


But you pay a price for losing the inductive bias of CNNs.

Swin transformers are still CPU/memory (and data) intensive compared to CNNs, right?


Not as much as you'd think. The original paper sets up its models so that Swin-T ~ ResNet-50 and Swin-S ~ ResNet-101 in compute and memory usage. They're still a bit higher in my experience, but I can also do drop-in replacements for ResNets and get better results on the same tasks and datasets, even when the datasets aren't huge.
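
As a concrete illustration, assuming the timm library (the exact model names there may differ between versions), the swap is often a one-liner:

    import timm

    # ResNet-50 backbone
    backbone = timm.create_model("resnet50", pretrained=True, num_classes=10)

    # Roughly comparable Swin-T backbone (the pairing the Swin paper uses
    # when matching FLOPs and parameter counts against ResNet-50)
    backbone = timm.create_model("swin_tiny_patch4_window7_224",
                                 pretrained=True, num_classes=10)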
