Hacker News

The hashing trick (a la sklearn's FeatureHasher) can make a huge difference.
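For readers unfamiliar with it, a minimal sketch of the hashing trick using scikit-learn's FeatureHasher (the tiny n_features=16 is just for illustration; real uses pick something much larger):

```python
# The hashing trick: map feature names to column indices via a hash
# function, so no vocabulary dict ever has to be built or stored.
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=16, input_type="dict")
X = hasher.transform([{"cat": 1, "dog": 2}, {"dog": 1, "bird": 1}])
print(X.shape)  # (2, 16): fixed-width sparse output, no fitting step
```

Because the mapping is stateless, transform works on streaming data without a prior pass over the corpus.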

I'm not a huge fan of that. Hash collisions can lead to unexpected behaviors in production and make feature attribution for debugging harder.

It's slightly more effort to implement, but with a trie data structure you can store even the biggest feature mapping in memory.




Does the fact that sklearn reverses signs for features that collide change your opinion?
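For context, this refers to FeatureHasher's alternate_sign parameter (True by default): a second bit of the hash flips the sign of each value, so colliding features tend to cancel rather than always add up. A small sketch, with n_features deliberately tiny to force collisions:

```python
# With alternate_sign=True (the default), each hashed value is multiplied
# by +1 or -1 derived from the hash, so collisions cancel in expectation
# instead of systematically inflating the shared bucket.
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=4, input_type="dict", alternate_sign=True)
row = h.transform([{"a": 1, "b": 1, "c": 1, "d": 1, "e": 1}]).toarray()[0]
print(row)  # five unit values folded into four signed buckets
```

The cancellation keeps the hashed representation an unbiased estimate of the original inner products, though it doesn't help with attribution: a nonzero bucket still can't tell you which feature produced it.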


How are you compressing the feature space in that case, by truncating the trie?


You could use clustering, dimensionality reduction, or feature selection.

However, the way I have seen the hashing trick used is not to compress the feature space. For most problems it would be a bad idea to lump your most discriminative features together with random other ones. Instead, people choose a very large feature space, which makes collisions unlikely. For model implementations that use sparse matrices, it doesn't matter if the feature space is very large. The main advantage is that you don't have to keep an expensive hash map of your vocabulary in memory (hence my suggestion to use a trie).
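To illustrate the "very large space, sparse storage" point, a sketch with scikit-learn's HashingVectorizer: the output has about a million columns, but the sparse matrix only stores the handful of nonzeros per row, and no vocabulary is ever fitted.

```python
# A 2**20-column output space makes collisions rare, and because the
# result is a scipy sparse matrix, the width is essentially free.
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2**20)
X = vec.transform(["the quick brown fox", "jumps over the lazy dog"])
print(X.shape)   # (2, 1048576)
print(X.nnz)     # only the few nonzero entries are actually stored
print(hasattr(vec, "vocabulary_"))  # False: stateless, nothing to keep in RAM
```

Compare CountVectorizer, which must fit and hold a vocabulary_ dict whose size grows with the corpus; the hashing variant trades exact feature names for constant memory.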



