Like others have said, you might be overfitting your training data here: your model is just memorising the examples you give it and would fail if somebody slightly varied a payload (e.g. by inserting some whitespace).
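A quick way to sanity-check that: take a known-malicious payload and feed the model trivially obfuscated variants of it. A rough sketch (the `whitespace_variants` helper and the payload are made up for illustration; the classifier call is left as a comment since I don't know your API):

```python
import re

def whitespace_variants(payload):
    """Generate a few trivially obfuscated versions of a payload."""
    yield payload.replace(" ", "  ")      # doubled spaces
    yield payload.replace(" ", "\t")      # tabs instead of spaces
    yield re.sub(r"\s+", " ", payload)    # collapsed whitespace
    yield payload.replace(" ", "/**/")    # SQL comment as whitespace

payload = "' OR 1=1 --"
for variant in whitespace_variants(payload):
    print(repr(variant))
    # If model.predict([variant]) disagrees with model.predict([payload]),
    # the model has likely memorised surface forms rather than structure.
```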
Another thing to keep in mind is that an accuracy of 99% doesn't mean much in an unbalanced problem like yours (many more clean queries than malicious ones).
What you should report instead is precision (of the queries labeled malicious, how many are actually malicious?) and recall (of the malicious queries in the dataset, how many did your model label as malicious?).
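To make that concrete, here's a toy example with sklearn (the 990:10 split is made up): a degenerate model that labels everything "clean" still scores 99% accuracy, while its precision and recall on the malicious class are both zero.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0] * 990 + [1] * 10   # 1 = malicious, 0 = clean
y_pred = [0] * 1000             # degenerate model: always predicts "clean"

print(accuracy_score(y_true, y_pred))                    # 0.99
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0
```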
Trigrams are a known sweet spot. It's a practical heuristic; I've got no proofs to link, but I wrote a Wikipedia semantic parser and saw similar results in practice.
Going above 3 adds very, very little precision while guzzling space; using 2 shows a drop in precision.
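To see the space cost, here's a toy count of distinct character n-grams in a single query (my own sketch, nothing from the original post). The number of distinct n-grams, and hence feature dimensions, grows quickly with n, while each extra character of context adds less and less:

```python
from collections import Counter

def char_ngrams(text, n):
    """Count overlapping character n-grams of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

query = "SELECT * FROM users WHERE id = 1 OR 1=1"
for n in (2, 3, 4):
    grams = char_ngrams(query, n)
    print(n, len(grams), grams.most_common(3))
```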
Probably has to do with how human-built systems are designed (we make them in our image, and we seem to have a thing for 3s: (subject, verb, object) -> (origin, data, destination), etc.)
Addendum: if you know what PCA is, I'd wager that added n-gram dimensions are nearly linearly dependent on the lower-order ones - i.e. strongly correlated with them (corr(A, B) -> 1) - so they add very little to the data's variability once you start adding dims above 3.
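A rough way to test that wager, assuming sklearn and a toy query set (all names and queries here are mine): vectorise the same queries with 1-3-grams and with 1-5-grams, run PCA on each, and compare how many components you need to cover most of the variance. If the extra 4/5-gram dimensions were largely linear combinations of lower-order ones, both settings would flatten out at a similar component count despite very different raw dimensionality.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

queries = [
    "SELECT * FROM users WHERE id = 1",
    "SELECT name FROM users WHERE id = 2",
    "' OR 1=1 --",
    "'; DROP TABLE users; --",
    "SELECT * FROM orders WHERE total > 100",
    "admin'--",
]

for ngram_range in [(1, 3), (1, 5)]:
    # Character n-gram counts as features.
    X = CountVectorizer(analyzer="char", ngram_range=ngram_range) \
        .fit_transform(queries).toarray()
    pca = PCA().fit(X)
    cum = np.cumsum(pca.explained_variance_ratio_)
    # First component count whose cumulative variance reaches 95%.
    k = int(np.searchsorted(cum, 0.95)) + 1
    print(ngram_range, X.shape[1], "dims;", k, "components for 95% variance")
```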