AFAIK Mamba is a continuation of the SSM (state space model) line of research, which is basically built around something called long convolution.
Instead of computing quadratic attention (how much each token attends to every other token), you "somehow" compute a long convolution kernel (the same length as the input) and then apply it as a conv1d.
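Here's a minimal toy sketch of what I mean by that sequence-long convolution. This is not Mamba's actual code: the kernel here is just random, whereas in SSM-style models it would be derived from learned state-space parameters.

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 1024, 64
x = torch.randn(seq_len, d_model)   # (L, D) token embeddings

# In SSM-style models this kernel comes from learned state-space
# parameters; here it's random purely for illustration.
k = torch.randn(seq_len, d_model)   # one length-L kernel per channel

# Causal depthwise conv1d: pad on the left so position t only sees positions <= t.
x_ = x.t().unsqueeze(0)             # (1, D, L)
k_ = k.t().unsqueeze(1)             # (D, 1, L) depthwise kernels
y = F.conv1d(F.pad(x_, (seq_len - 1, 0)), k_, groups=d_model)
print(y.shape)                      # torch.Size([1, 64, 1024])
```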
Again, from my limited understanding, it's a bit like applying an FFT, doing some matmuls, and then an IFFT back. We know that this works, but it's slow. There are many ways to compute an FFT, though, and one of them uses something called butterfly matrices. I think it's just an approximation, but it's good enough and it's very fast/efficient on current hardware.
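And a rough sketch of the FFT route: pad, FFT along the sequence, pointwise multiply, IFFT. This is just the textbook O(L log L) convolution trick, not the butterfly/approximate variant, and the names and shapes are mine, not the paper's.

```python
import torch

L, D = 1024, 64
x = torch.randn(L, D)
k = torch.randn(L, D)

n = 2 * L  # zero-pad so the circular convolution behaves like a causal/linear one
y = torch.fft.irfft(
    torch.fft.rfft(x, n=n, dim=0) * torch.fft.rfft(k, n=n, dim=0),
    n=n, dim=0,
)[:L]      # (L, D): the same channel-wise long convolution, but in O(L log L)
```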
To put this in context: quadratic sounds bad, but in practice subquadratic algorithms are often slower because of hardware limitations. So while there was a lot of excitement about SSMs, it's not so easy to say that Llama is over now. Also, we don't know if Mamba will scale up, and the only way to find out is to actually pay a few million for a training run. But I am optimistic.
Another interesting model from the subquadratic family is RWKV. Worth checking out, but I think you had a podcast about it :)
BTW: I am self-taught and I only skimmed the paper some time ago, so I might be very wrong.
BTW2: Another thing with attention is that there's usually a KV cache, which helps a lot with inference performance, and I think you can't do the same thing with Mamba.
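For reference, here's a toy single-head version of what a KV cache does during decoding (not any particular library's API): each new token computes one query and reuses the cached keys/values instead of recomputing attention over the whole prefix.

```python
import torch
import torch.nn.functional as F

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []  # grows by one (key, value) pair per generated token

def decode_step(x_t):
    """x_t: (d,) embedding of the newest token."""
    q = x_t @ Wq
    k_cache.append(x_t @ Wk)        # reuse, don't recompute, K/V of old tokens
    v_cache.append(x_t @ Wv)
    K, V = torch.stack(k_cache), torch.stack(v_cache)   # (t, d) each
    attn = F.softmax(q @ K.t() / d ** 0.5, dim=-1)      # (t,)
    return attn @ V                                      # (d,) output for this token

for _ in range(5):
    out = decode_step(torch.randn(d))
```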