PG wrote an influential essay, "A Plan for Spam", some years ago (early 2000s; it was already old when I entered the industry) describing spam filtering with a naive Bayesian filter, a technique which was not totally unknown but which was subsequently widely adopted. It works pretty well, especially since you can chain it with other things, probably the biggest being IP-based reputation. One popular implementation, POPFile, was by fellow HNer John Graham-Cumming. For years I had an unfortunate hash collision and thought they were the same person.
Feel free to Google it and post your own synopsis, but insofar as you can do lossless compression to three words, I think those are the right three words.
If I'm reading his essay correctly, it pretty directly proposes the kind of Bayesian inference called "naive Bayes", i.e. one that assumes the features (in this case, the words) are independent and computes the overall probability of an email being spam by simply multiplying the per-feature probabilities.
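Roughly, the combination step looks something like this (a Python sketch, not PG's original code, which was in Lisp; the probability values are made up for illustration):

    from math import prod

    def combined_spam_probability(probs):
        # Treat each token's spam probability as independent evidence
        # (the "naive" assumption) and combine by multiplication.
        p = prod(probs)                 # product of per-token P(spam)
        q = prod(1 - x for x in probs)  # product of per-token P(ham)
        return p / (p + q)

    # Three tokens that each look fairly spammy on their own:
    print(combined_spam_probability([0.90, 0.80, 0.95]))  # ~0.9985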
I was thinking more of his weird pre- and post-processing, like >>> When new mail arrives, it is scanned into tokens, and the most interesting fifteen tokens, where interesting is measured by how far their spam probability is from a neutral .5, are used to calculate the probability that the mail is spam. <<<
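That selection step is simple enough to sketch (again Python rather than the essay's Lisp; the token/probability table here is invented), with the selected probabilities then fed into the combination step above:

    def most_interesting(token_probs, n=15):
        # "Interesting" = spam probability farthest from a neutral 0.5.
        return sorted(token_probs, key=lambda tp: abs(tp[1] - 0.5), reverse=True)[:n]

    # token_probs pairs each token with an estimated spam probability,
    # e.g. derived from per-token counts in the spam and ham corpora.
    tokens = [("viagra", 0.99), ("meeting", 0.20), ("free", 0.85), ("the", 0.50)]
    print(most_interesting(tokens, n=2))  # [('viagra', 0.99), ('free', 0.85)]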
Yeah, I agree he has some interesting pragmatic tweaks on top of it. I suppose he was proposing "[naive Bayesian] filtering" but not necessarily "naive [Bayesian filtering]".