
Hey, it seems that UTF-8 support is broken on the page.

A test phrase could be something like "Жизнь прекрасна и удивительна" ("Life is great" in Russian).

I'm assuming it's the implementation on the page that is broken, not the actual tokenizer. The reason: Russian works perfectly in GPT-3, which I guess wouldn't be the case if it were tokenized the way the page shows.




Author here, you are correct! The issue is that a single user-perceived character might span multiple tokens. This should be fixed now.
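
To illustrate, here is a minimal Python sketch of how that can garble the display. It assumes a byte-level tokenizer whose tokens are slices of the UTF-8 byte stream; the token boundaries below are made up for the example and are not taken from any real vocabulary or from the page's code.

```python
# Hypothetical example: the split points are invented, not real token boundaries.
text = "Жизнь"
data = text.encode("utf-8")  # each Cyrillic letter here takes 2 bytes in UTF-8

# Pretend the tokenizer cut the byte stream in the middle of a letter.
tokens = [data[:1], data[1:3], data[3:]]

# Decoding each token on its own breaks the characters cut at a boundary;
# errors="replace" substitutes U+FFFD for the invalid byte sequences.
per_token = [t.decode("utf-8", errors="replace") for t in tokens]

# Joining the bytes first and decoding once recovers the original text.
joined = b"".join(tokens).decode("utf-8")

print(per_token)
print(joined)
```

So a page that renders tokens one by one would need to group them until the bytes form complete characters, or show the raw bytes for the partial ones.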


Hey, thank you! Has the fix not been deployed yet, though? The page still shows broken UTF-8.

> a single user-perceived character might span multiple tokens

Is this how it works by design, or is it a bug?



