
Trigrams are a known sweet spot. It's a practical heuristic. I've got no proofs to link, but I wrote a Wikipedia semantic parser and witnessed similar results in practice.

Going above 3 adds very, very little precision while guzzling space. Using 2 shows a drop in precision.
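
To make the space side of that tradeoff concrete, here is a minimal sketch that counts how many distinct n-grams a token stream produces for each n. The corpus string and the helper names (`ngrams`, `distinct_ngram_counts`) are made up for illustration, not taken from the parser mentioned above. On a large natural-language corpus the distinct count tends to climb quickly with n, which is where the space goes; a toy sentence like this one only shows the mechanics. It says nothing about precision.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_ngram_counts(tokens, max_n=5):
    """Map each n in 1..max_n to the number of distinct n-grams seen."""
    return {n: len(set(ngrams(tokens, n))) for n in range(1, max_n + 1)}


# Hypothetical toy corpus; swap in real tokenized text to see realistic growth.
text = "the cat sat on the mat and the cat ate the rat on the mat"
tokens = text.split()

for n, size in distinct_ngram_counts(tokens).items():
    print(f"n={n}: {size} distinct n-grams")
```

Each distinct n-gram is one more key in whatever table or index you keep, so the printed sizes are a rough proxy for storage cost.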

Probably has to do with how human-built systems are built (we make them in our image, and we seem to have a thing for 3s: map(subject, verb, object) -> (origin, data, destination), etc.)

Addendum: if you know what PCA is, I'd wager that added n-gram dimensions share a linear dependency with the lower dimensions, so they share a statistical resemblance (covar(A,B) -> 0) that adds very little to the data's variability once you start adding dims above 3.
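
One narrow, exact sense in which n-gram dimensions are linearly tied together: bigram counts can be rebuilt by summing trigram counts over the third token, plus a single edge term for the last bigram in the sequence. The sketch below checks that on a toy token list. The function names are hypothetical, and this only shows redundancy between adjacent orders; it is not a PCA and does not prove the wager above.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bigrams_from_trigrams(tokens):
    """Rebuild bigram counts as a linear function of trigram counts."""
    rebuilt = Counter()
    # Every bigram except the last one is the prefix of exactly one trigram.
    for (a, b, _c), count in ngram_counts(tokens, 3).items():
        rebuilt[(a, b)] += count
    # Edge term: the final bigram has no trigram that starts with it.
    if len(tokens) >= 2:
        rebuilt[tuple(tokens[-2:])] += 1
    return rebuilt


# Hypothetical toy input
tokens = "the cat sat on the mat and the cat ate the rat".split()
print(bigrams_from_trigrams(tokens) == ngram_counts(tokens, 2))
```

The same marginalization works between any n and n+1, so each added order carries the lower-order counts inside it.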



Why the downvotes without correction?



