That's very neat. Would you mind posting a link to your version from the comments on my article? Thanks.
I learnt Haskell before I tried using Clojure and really like it as a language. It doesn't surprise me that it runs faster. I'm guessing the feature value look-ups don't involve the extra unboxing that Java's maps do.
Values in Haskell are boxed as well, though it's possible to get around this with some GHC-specific hackery; I haven't done it.
But the biggest problem is the parsing code -- around 70% of the runtime is spent parsing the input! Switching to regular expressions might help, but I gave up trying to find good documentation for Text.Regex.Posix.
"The reported accuracy is simply the cumulative total number of errors divided by the number of steps."
Use a moving average.
I use:
m <- m - (2/t) * (m - x_t)
when estimating the current training error.
1/t would be the exact historical average. 2/t gives more weight to recent events, which is good when your distribution is non-stationary (as is the case when your model is changing). With a constant learning rate (independent of t) you get an exponential moving average.
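That update rule can be sketched as follows (Python for illustration; the function and variable names here are my own, not from the comment above):

```python
def running_estimate(xs, rate):
    """Running estimate updated as m <- m - rate(t) * (m - x_t).

    rate(t) = 1/t      -> exact historical average
    rate(t) = 2/t      -> weights recent observations more heavily
    rate(t) = constant -> exponential moving average
    """
    m = 0.0
    for t, x in enumerate(xs, start=1):
        m = m - rate(t) * (m - x)
    return m

# With rate 1/t this recovers the plain mean of the sequence:
running_estimate([1.0, 2.0, 3.0, 4.0], lambda t: 1.0 / t)  # -> 2.5
```

With a constant rate c the same loop computes the usual exponential moving average, since each step blends m toward x_t by a fixed fraction c.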
I'm the author of the blog post, and to answer your question: yes, I know about Incanter. In fact, I found out about Parallel Colt via Incanter. I've played around with it a little and it looks very cool.
I was initially going to build my algorithm using its libraries as a base but I thought a simpler first step would be to write it without pulling in too many extra dependencies.
http://gist.github.com/147988