
I haven’t gone through the paper in detail yet, but maybe someone can answer: if you remove the hidden state from an RNN, as they say they’ve done, what’s left? An MLP predicting from a single token?



They didn't remove the hidden state entirely; they just removed it from the input, forget, and update gates. I haven't digested the paper either, but I think that in the GRU case this means the hidden-state update gating (z_t and r_t in the paper's formulas) depends only on the new input, not on the input plus the prior hidden state.
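
For reference, here's a rough sketch of what I think the simplified GRU-style step looks like. The weight names W_z and W_h are mine, and I'm going off the equations as I understand them, so take it with a grain of salt:

    import jax

    def min_gru_step(h_prev, x_t, W_z, W_h):
        # Gate and candidate state are computed from x_t alone;
        # h_prev never goes through a gate or a nonlinearity.
        z_t = jax.nn.sigmoid(x_t @ W_z)   # update gate, from the input only
        h_tilde = x_t @ W_h               # candidate hidden state, from the input only
        # The recurrence is still there, but it is linear in h_prev.
        return (1.0 - z_t) * h_prev + z_t * h_tilde

So the hidden state is still carried forward step to step; it just doesn't feed back into how the gates are computed.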


It doesn't completely remove it; it removes certain dependencies on it so that the recurrence can be computed with a parallel scan. There is still a hidden state. It bears some similarity to what was done with Mamba.


I only had a quick look, but it looks like they tweaked the state update so the model can be run with a parallel scan instead of having to compute the states sequentially.


The trick is to make sure the recursive dependency stays linear in the hidden state; that's what enables parallel training.
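
Right. Once the update has the form h_t = a_t * h_{t-1} + b_t, with a_t and b_t depending only on the inputs, all the h_t can be computed at once with an associative scan. A minimal sketch with jax.lax.associative_scan (variable names are mine, and I'm assuming h_0 = 0):

    import jax

    def combine(left, right):
        # Compose two affine maps h -> a*h + b ("right" is applied after "left").
        a_l, b_l = left
        a_r, b_r = right
        return a_r * a_l, a_r * b_l + b_r

    def scan_linear_recurrence(a, b):
        # Computes h_t = a_t * h_{t-1} + b_t for all t in O(log T) parallel depth,
        # assuming h_0 = 0. a and b have shape (T, hidden_dim).
        _, h = jax.lax.associative_scan(combine, (a, b))
        return h

    # e.g. for a GRU-style update: a_t = 1 - z_t, b_t = z_t * h_tilde_t
    T, d = 8, 4
    z = jax.nn.sigmoid(jax.random.normal(jax.random.PRNGKey(0), (T, d)))
    h_tilde = jax.random.normal(jax.random.PRNGKey(1), (T, d))
    h = scan_linear_recurrence(1.0 - z, z * h_tilde)

The whole sequence of states comes out of one scan rather than a T-step loop, which is the same general idea used in Mamba-style state-space models.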



