
You could use clustering, dimensionality reduction, or feature selection.

However, the way I have seen the hashing trick used is not to compress the feature space. For most problems it would be a bad idea to just lump your most discriminative features together with some other random ones. Instead, people just choose a very large feature space, which makes collisions unlikely. For model implementations using sparse matrices, it doesn't matter if the feature space is very large. The main advantage of this is that you don't have to keep an expensive hash map of your vocabulary in memory (hence my suggestion to use a trie).
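To make that concrete, here is a minimal sketch of the hashing trick over token counts. The 2**22 dimensionality and the use of an MD5-based hash are illustrative choices of mine, not something from the comment above; the point is just that the hash function replaces the vocabulary lookup, so no vocabulary map is held in memory.

```python
import hashlib
from collections import Counter

N_FEATURES = 2 ** 22  # deliberately large, fixed feature space; collisions are rare

def stable_hash(token: str) -> int:
    # A stable hash (the built-in hash() is salted per process, so its
    # indices wouldn't be reproducible across runs).
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")

def hash_features(tokens):
    """Map tokens to a sparse {index: count} dict; no vocabulary is stored."""
    return dict(Counter(stable_hash(tok) % N_FEATURES for tok in tokens))

print(hash_features("the cat sat on the mat".split()))
```

Since only the nonzero entries are materialized, the huge nominal dimensionality costs nothing for sparse-matrix model implementations.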



