From what I know about RWKV, it's mostly a one-man effort and doesn't have the same data pipeline or resources as the major labs. It's a bit unfortunate, but I'm curious how it would perform given the same training corpus as OpenAI's GPTs. Maybe some labs have tried it internally but haven't released results? On the other hand, it makes sense to invest more money into transformer training runs, since those have been proven to work.
They really burst onto the scene and brought RNNs back into the world of transformers. The claim that RWKV isn't parallelizable during training also seems to be refuted in their README. My guess is the gap is in generalizable performance, since there's a difference between doing well on benchmarks and being usable. Personally, I tried running the weights a long time ago when they were first released and the results weren't usable, but I'm sure there has been considerable progress since then.
> The claim that RWKV isn't parallelizable during training also seems to be refuted in their README.
RNNs are trivially parallelizable (I've done it myself), as long as you train them on multiple documents in parallel and have enough memory for each document's state. You just train them one token at a time across N documents, instead of the transformer-like N tokens at a time across one document.
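Roughly, the setup looks like this (a minimal PyTorch sketch; the GRU stand-in model, sizes, and next_tokens loader are made up purely to illustrate the one-token-at-a-time batching across documents):

```python
# Sketch: train a recurrent model one token at a time across N documents in
# parallel, carrying a per-document hidden state between steps.
import torch
import torch.nn as nn

N, vocab, hidden = 32, 50257, 512                    # N documents trained simultaneously
model = nn.GRU(hidden, hidden, batch_first=True)     # stand-in for any recurrent cell
embed = nn.Embedding(vocab, hidden)
head = nn.Linear(hidden, vocab)
opt = torch.optim.Adam(
    list(model.parameters()) + list(embed.parameters()) + list(head.parameters())
)

state = torch.zeros(1, N, hidden)                    # one recurrent state per document

def next_tokens(step):
    # Hypothetical loader: would return the current and next token of each
    # of the N documents; random tokens here just to keep the sketch runnable.
    return torch.randint(0, vocab, (N,)), torch.randint(0, vocab, (N,))

for step in range(1000):
    tok, target = next_tokens(step)                  # shape (N,): one token per document
    x = embed(tok).unsqueeze(1)                      # (N, 1, hidden): a single timestep
    out, state = model(x, state)
    loss = nn.functional.cross_entropy(head(out.squeeze(1)), target)
    opt.zero_grad()
    loss.backward()
    state = state.detach()                           # truncate backprop; the state itself carries on indefinitely
    opt.step()
```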
RWKV is parallel at the sequence level, like a transformer. Its formulation allows each timestep t to be computed in parallel, except for a single serial scan at the end for aggregation, which they implement with a custom CUDA kernel.
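Schematically it looks something like the toy NumPy sketch below: this is a simplified decayed linear recurrence, not the actual RWKV WKV kernel (which, as I understand it, uses per-channel learned decay, a bonus term for the current token, and more careful numerics), but it shows how the per-timestep work is parallel while only the aggregation is a scan over time.

```python
# Toy sketch of the idea: elementwise per-timestep work is done for all t at
# once; only the final decayed aggregation is a (serial) scan over time.
import numpy as np

T, D = 128, 64
k = np.random.randn(T, D)          # "keys" for every timestep, computed in parallel
v = np.random.randn(T, D)          # "values" for every timestep, computed in parallel
w = 0.9                            # decay (per-channel and learned in real RWKV; a scalar here)

# Parallel part: whole-sequence elementwise operations.
ek = np.exp(k - k.max())           # stabilized exp(k), shape (T, D)
num_terms = ek * v                 # each timestep's contribution to the numerator
den_terms = ek                     # each timestep's contribution to the denominator

# Serial part: one scan over time accumulates the decayed sums.
out = np.empty((T, D))
num = np.zeros(D)
den = np.zeros(D)
for t in range(T):
    num = w * num + num_terms[t]
    den = w * den + den_terms[t]
    out[t] = num / (den + 1e-8)    # weighted average over the past, WKV-style
```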
I know. I've trained RWKV myself both ways: like a transformer and like an RNN.
Ultimately it probably doesn't matter that you can train it like a transformer, because you can just train it in parallel on multiple documents simultaneously, one token at a time, and at least in my experience this worked just as well, if not better.
Plus, doing it this way is more general, because you don't need any custom kernels, and it also helps the model learn to deal with an "infinite" context: if you train it like a transformer, its performance regresses once you evaluate it beyond the context window you trained on, at least from what I've seen in my training runs.
I played around with RWKV some time ago (maybe early 2023?) with similarly disappointing results, but my suspicion was that this was a dataset/training issue, not an architectural one. Leaderboard performance has improved a lot since then, and anecdotally, I've seen/heard some quite decent RWKV TTS experiments, so I'm bullish.
Also, the team has incorporated and raised money from investors (recursal.ai), so it's no longer a one-man effort.