Hey, it seems that UTF-8 support is broken on the page.
A test phrase could be something like "Жизнь прекрасна и удивительна" ("Life is beautiful and amazing" in Russian).
My assumption is that it's the implementation on the page that is broken, not the actual tokenizer. The reason: Russian works perfectly in GPT-3, which I suspect wouldn't be the case if tokenization actually worked as presented on the page.
Author here, and you are correct! The issue is that a single user-perceived character can span multiple tokens. This should be fixed now.
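For anyone curious, here is a minimal sketch of the failure mode, assuming the page uses a GPT-2-style byte-level BPE (illustrated with the tiktoken library; the library and the "gpt2" encoding name are my assumptions, not necessarily what the page runs). Byte-level BPE operates on UTF-8 bytes, so the two bytes of a Cyrillic character can land in different tokens, and decoding each token separately produces replacement characters:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Жизнь прекрасна и удивительна"
tokens = enc.encode(text)

# Naive rendering: decode each token's bytes independently. Any multi-byte
# character whose bytes were split across tokens shows up as U+FFFD ('�').
per_token = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
             for t in tokens]
print(per_token)  # fragments like '�' appear

# Correct rendering: join the raw bytes of all tokens first, then decode
# the full sequence as UTF-8 so multi-byte characters are reassembled.
raw = b"".join(enc.decode_single_token_bytes(t) for t in tokens)
print(raw.decode("utf-8"))  # Жизнь прекрасна и удивительна
```

Presumably the page-side fix amounts to the same idea: buffer each token's raw bytes and only decode once a complete UTF-8 sequence is available, rather than decoding token by token.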