What we have seen in the past 10 years of machine learning is that it's extremely hard to know which technique will next turn out to be practical across a vast array of problems. We had CNNs, batch norm, LSTMs, transformers, self-supervised learning, reinforcement learning, and a few other techniques that still need to be perfected, plus thousands of ideas to build upon, but nobody knows the next big thing that will work on real-life problems.
That's how I felt at first, but getting deeper into the Swin Transformer paper it actually makes a fair bit of sense - convolutions can be likened to self-attention ops that can only attend to local neighborhoods around pixels. That's a fairly sensible assumption for image data, but it also makes sense that more general attention would better capture complex spatial relationships if you can find a way to make it computationally feasible. Swin transformers certainly go through some contortions to get there, and I bet we'll see cleaner hierarchical architectures in the future, but the results speak for themselves.
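For anyone who hasn't read the paper, the analogy can be made fairly concrete. This is just a rough PyTorch sketch (the shapes and window size are arbitrary, and a real Swin block adds shifted windows, relative position bias, projections, etc.), but it shows that a conv and window-limited attention touch the same local neighborhood - one with static weights, one with content-dependent weights:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 96, 56, 56)  # (batch, channels, H, W) feature map

    # Convolution: mixes a fixed 7x7 neighborhood with static learned weights
    conv = nn.Conv2d(96, 96, kernel_size=7, padding=3)
    y_conv = conv(x)  # (1, 96, 56, 56)

    # Windowed self-attention: mixes the same kind of neighborhood, but the
    # mixing weights are computed from the content of each 7x7 window
    win = 7
    b, c, h, w = x.shape
    windows = (x.unfold(2, win, win).unfold(3, win, win)   # (b, c, h//7, w//7, 7, 7)
                 .permute(0, 2, 3, 4, 5, 1)
                 .reshape(-1, win * win, c))               # (num_windows, 49, c)
    attn = nn.MultiheadAttention(embed_dim=c, num_heads=3, batch_first=True)
    y_attn, _ = attn(windows, windows, windows)            # attention never leaves its window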
The transformer in transformer (TnT) model looks promising - you can set up multiple overlapping domains of attention, at arbitrary scales over the input.
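Schematically it's something like this (just how I understand the idea, not the paper's code, and the dimensions below are made up): an inner transformer attends among the pixels inside each patch, and an outer transformer attends across the patch-level embeddings, so you get attention at two scales over the same input.

    import torch
    import torch.nn as nn

    B, num_patches, pix_per_patch, pix_dim, patch_dim = 2, 196, 16, 24, 384

    pixel_tokens = torch.randn(B * num_patches, pix_per_patch, pix_dim)
    patch_tokens = torch.randn(B, num_patches, patch_dim)

    inner = nn.TransformerEncoderLayer(d_model=pix_dim, nhead=4, batch_first=True)
    outer = nn.TransformerEncoderLayer(d_model=patch_dim, nhead=6, batch_first=True)
    proj  = nn.Linear(pix_per_patch * pix_dim, patch_dim)

    pixel_tokens = inner(pixel_tokens)                   # fine scale: attention inside each patch
    patch_tokens = patch_tokens + proj(                  # fold the fine-scale info into patch tokens
        pixel_tokens.reshape(B, num_patches, -1))
    patch_tokens = outer(patch_tokens)                   # coarse scale: attention across patches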
Not as much as you'd think. The original paper sets up its models so that Swin-T ~ ResNet-50 and Swin-S ~ ResNet-101 in compute and memory usage. They're still a bit higher in my experience, but I can also do drop-in replacements for ResNets and get better results on the same tasks and datasets, even when the datasets aren't huge.
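To be concrete about "drop-in": with the timm library it's basically a one-line swap of the backbone (the model names below are timm's; everything else in the training setup can stay the same as a first pass).

    import timm

    model_resnet = timm.create_model("resnet50", pretrained=True, num_classes=10)
    model_swin   = timm.create_model("swin_tiny_patch4_window7_224",
                                     pretrained=True, num_classes=10)
    # Same dataloaders, loss, and roughly the same training recipe; in my
    # experience Swin-T wants a bit more memory, but the swap itself is trivial.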
For me it was quite the opposite feeling: after the "Attention Is All You Need" paper I thought convolutions would become obsolete quite fast. AFAIK that still hasn't fully happened; something is still missing in unifying the two approaches.
> In biological construction, there is no clearly defined blueprint that shows the final structure. Instead, our genes contain the information to make the structure by controlling a sequence of events during morphogenesis.
This seems obvious now, but it never occurred to me. It inspires me to try to stay more focused on good processes and less on desired outcomes.
There's an article like this once a week because everyone is trying to stake their claim to "I predicted it" so that in 10-15 years they can retire on the conference circuit (à la Yann LeCun and Bengio and all those others).
Here's the truth: the future of AI is completely unknown and not at all certain - plenty of revolutionary technologies have fallen by the wayside over the years.
So, to my fairly-educated-in-this-area mind, we should stop listening to wannabe thought leaders (and even minted thought leaders) and just stay focused on our own small parts.
This take is both overly cynical and downplays the contributions people like LeCun and Bengio have made to the field of machine learning, both through their own work and their academic progeny. I see you work broadly in the area, but if you earnestly feel that way, I doubt your research (knowingly) touches on anything they've affected.
This article's focus doesn't strike me as especially aligned with current problems in applied AI (how will self-organizing systems relate to prediction problems in NLP, tabular data, or computer vision?), but the connections to robotics are plausible. In any case, the tone doesn't come off as a wannabe thought leader trying to stake a claim; it reads more like an excited learner who is genuinely interested in some new ideas, in a niche that may end up narrower than they currently hope. I won't be assembling object detectors this way any time soon, but it was still a really pleasant read.
An under-appreciated aspect of molecular biological systems and structures is that they are dominated by local interactions that direct self-assembly without any knowledge of the global state. They are unplanned in terms of both structure and function. They persist because they provide a selective advantage to the system, and their instructions are encoded in the reproductive information store of their parent system.
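A toy way to see the point (nothing to do with real molecular machinery, just a 1-D majority-vote automaton): every cell updates from its immediate neighbours only, with no access to the global pattern, yet coherent global structure still emerges.

    import random

    N = 60
    cells = [random.randint(0, 1) for _ in range(N)]
    for _ in range(20):
        # each cell looks only at itself and its two neighbours (periodic boundary)
        cells = [
            1 if cells[(i - 1) % N] + cells[i] + cells[(i + 1) % N] >= 2 else 0
            for i in range(N)
        ]
        print("".join("#" if c else "." for c in cells))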