How long does it generally take for a newly proposed architecture like Mamba to make its way into SotA mega models like GPT or Gemini? IIUC, Mamba's state-space design scales linearly with sequence length rather than quadratically like attention, so it would essentially remove hard limits on context length, which would be awesome to see in the super-mega high-performance models.