
Most of these 1+ million words are almost never used, so 200k is plenty for English. Optimistically, we hope that rarer words would be longer and to some degree compositional (optim-ism, optim-istic, etc.), but unfortunately this is not what tokenisers arrive at (and you are more likely to get "opt-i-mis-m" or something like that). People have tried to optimise tokenisation and the main part of LLM training jointly, which leads to more sensible results, but this is unworkable for larger models, so we are stuck with inflated basic vocabularies.
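You can see this for yourself with a production tokeniser. A minimal sketch using tiktoken's cl100k_base encoding (an assumption about which vocabulary you'd inspect; the exact splits vary by tokeniser), showing that segmentation follows corpus frequency rather than morphology:

  # pip install tiktoken
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")  # ~100k-token BPE vocabulary

  for word in ["optimism", "optimistic", "optimise", "optimally"]:
      pieces = [enc.decode([t]) for t in enc.encode(word)]
      print(word, "->", pieces)

  # Splits are frequency-driven, so you often get something like
  # ["opt", "im", "ism"] rather than the tidy ["optim", "ism"].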

It is also probably possible now to go for even larger vocabularies, in the 1-2 million range (by factorising the embedding matrix, for example), but AFAIK this does not lead to noticeable improvements in performance.
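For concreteness, here is a minimal sketch of one way to factorise the embedding matrix (an ALBERT-style low-rank decomposition in PyTorch; the sizes are illustrative assumptions, not measurements): instead of a full V x d table, you look up a small k-dimensional vector per token and project it up to the model width.

  import torch
  import torch.nn as nn

  class FactorisedEmbedding(nn.Module):
      def __init__(self, vocab_size: int, bottleneck: int, d_model: int):
          super().__init__()
          self.lookup = nn.Embedding(vocab_size, bottleneck)          # V x k parameters
          self.project = nn.Linear(bottleneck, d_model, bias=False)   # k x d parameters

      def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
          return self.project(self.lookup(token_ids))

  # 2M-token vocabulary at d_model=4096:
  #   full-rank embedding: 2M x 4096 ~= 8.2B parameters
  #   factorised:          2M x 128 + 128 x 4096 ~= 0.26B parameters
  emb = FactorisedEmbedding(vocab_size=2_000_000, bottleneck=128, d_model=4096)
  print(emb(torch.tensor([[1, 42, 1999]])).shape)  # torch.Size([1, 3, 4096])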



Performance on constrained text tasks would be massively improved. That alone makes expanding the vocabulary size worth it.



