Trigrams are a known sweet spot. It's a practical heuristic, and I have no proofs to link, but I wrote a Wikipedia semantic parser and saw similar results in practice.
Going above 3 adds very, very little precision while guzzling space; using 2 shows a clear drop in precision.
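To put a rough number on the space side, here's a quick and dirty Python sketch (my own illustration, nothing rigorous; "sample.txt" is a placeholder for any large-ish plain-text corpus):

    # Count distinct character n-grams in a chunk of text and watch the
    # index vocabulary grow with n.
    def distinct_ngrams(text, n):
        # Slide a window of length n over the text, collect unique substrings.
        return len({text[i:i + n] for i in range(len(text) - n + 1)})

    with open("sample.txt", encoding="utf-8") as f:
        text = f.read().lower()

    for n in range(2, 6):
        print(n, distinct_ngrams(text, n))
    # Typical outcome: a big jump from 2 to 3, then the distinct-gram count
    # keeps ballooning for n >= 4 while matches barely get more precise.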
The sweet spot at 3 probably has to do with how human-built systems are designed (we make them in our image, and we seem to have a thing for 3s: map(subject, verb, object) -> (origin, data, destination), etc.).
Addendum: if you know what PCA is, I'd wager that added n-gram dimensions share a linear dependency with the lower dimensions - a statistical resemblance (high covariance with the lower-order dimensions, so the extra variance they contribute tends toward 0) that adds very little to the data's variability once you start adding dims above 3.
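A hand-wavy way to poke at that hunch (again just a sketch, not a proof; same placeholder corpus, and the ridge regression and sklearn feature setup are my own choices):

    # Regress each line's 4-gram counts on its 3-gram counts; a high R^2
    # means the "new" 4-gram dimensions are mostly linear combinations of
    # what the 3-grams already encode.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import Ridge

    with open("sample.txt", encoding="utf-8") as f:
        # Treat each line as a document; you want far more lines than
        # 3-gram features for the fit to mean much.
        docs = [line.strip().lower() for line in f if line.strip()]

    def char_ngram_counts(docs, n):
        vec = CountVectorizer(analyzer="char", ngram_range=(n, n), min_df=5)
        return vec.fit_transform(docs).toarray()

    X3 = char_ngram_counts(docs, 3)  # lower-order dimensions
    X4 = char_ngram_counts(docs, 4)  # the added dimensions

    # Ridge keeps the fit sane when features outnumber documents.
    r2 = Ridge(alpha=1.0).fit(X3, X4).score(X3, X4)
    print(f"4-gram variance explained by 3-grams: {r2:.2f}")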