
Perhaps it's somewhat off-topic, but I've built a spam detector similar to the article's withOUT using "direct" AI, but rather via a key-word or key-phrase "ranker". A simplified example is given below.

The advantage over other techniques is that one can easily trace the exact math of a conclusion, and tune it as needed. The disadvantage is that one probably has to manually tune it all rather than let the machine "learn". However, a hybrid approach could be used whereby "pure" AI suggests words and phrases to encode.

     rule.addList("nigerian, prince", rank=7);
     rule.addPhrase("great opportunity", rank=5);
     rule.addPhrase("lisa smith", rank=-4); // probably good

Here a "list" means that the word order doesn't matter, but with a "phrase" it does matter. A negative value means it's less likely to be spam, usually because the term is specific to your business or task. Actually I had multiple categories rather than just "spam" versus "non-spam", but that would complicate the example. I also used a database. One could perhaps call it a "weighted" version of MS-Outlook's rule engine. Somebody had a similar idea: http://dergipark.gov.tr/download/article-file/45302
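
For anyone who wants to try the idea, here is a minimal Python sketch of such a ranker. It is not the original implementation: the `RuleSet` class, the `add_list`/`add_phrase`/`score` names, and the tokenizer are all made up for illustration, and it assumes a "list" rule fires only when every listed word is present somewhere in the message.

```python
import re
from dataclasses import dataclass, field


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


@dataclass
class RuleSet:
    # Each entry is (kind, words, rank); kind is "list" or "phrase".
    rules: list[tuple[str, tuple[str, ...], int]] = field(default_factory=list)

    def add_list(self, words: str, rank: int) -> None:
        """Words that must all appear, in any order (assumed semantics)."""
        self.rules.append(("list", tuple(tokenize(words)), rank))

    def add_phrase(self, phrase: str, rank: int) -> None:
        """Words that must appear adjacent and in this exact order."""
        self.rules.append(("phrase", tuple(tokenize(phrase)), rank))

    def score(self, message: str) -> int:
        """Sum the ranks of every rule that matches the message."""
        tokens = tokenize(message)
        present = set(tokens)
        total = 0
        for kind, words, rank in self.rules:
            if kind == "list":
                hit = all(w in present for w in words)
            else:
                n = len(words)
                hit = any(tuple(tokens[i:i + n]) == words
                          for i in range(len(tokens) - n + 1))
            if hit:
                total += rank
        return total


rule = RuleSet()
rule.add_list("nigerian, prince", rank=7)
rule.add_phrase("great opportunity", rank=5)
rule.add_phrase("lisa smith", rank=-4)  # probably good

print(rule.score("A prince of Nigerian descent has a great opportunity for you"))
print(rule.score("Lisa Smith asked about a great opportunity"))
```

The first message matches the list rule and one phrase rule; the second matches both phrase rules, so the negative rank pulls its total down. Comparing the total against a threshold is left to the caller.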



You're essentially doing a rough manual version of Bayesian classification on n-grams (which is still very explicable): http://www.paulgraham.com/spam.html


The idea of my approach was that a "power user" could add the rules and scores without having to understand something that may take a while to explain. A scoring sheet can be displayed for a given message that would make sense to just about anybody with an associate degree. Example scoring sheet for a given message:

     Category: Spam
       Rule-ID    Score
       ----------------
       NgrPrnc1       7
       bPills         5
       knownPeople   -3      
         Total:       9 Threshold Exceeded!

     Category: Tech Support
       knownWidgets   3
       offer1        -2
         Total:       1 Insufficient total

     Category: Etc...
One could click on the rule-ID as a hyperlink to see specifics of a given rule (if details don't fit on screen).
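
A rough Python sketch of how such a sheet could be produced follows. Everything in it beyond the rule IDs and scores shown above is assumed for illustration: the `Rule` class, the `score_sheet` function, the trigger phrases, the sample message, and the threshold of 5 per category. Matching is a plain substring test to keep it short.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    rule_id: str
    phrase: str  # hypothetical trigger, matched as a lowercase substring
    score: int


def score_sheet(message: str, categories: dict[str, tuple[int, list[Rule]]]) -> str:
    """Return a plain-text sheet listing each rule that fired, per category."""
    text = message.lower()
    lines = []
    for name, (threshold, rules) in categories.items():
        fired = [r for r in rules if r.phrase in text]
        total = sum(r.score for r in fired)
        lines.append(f"Category: {name}")
        for r in fired:
            lines.append(f"  {r.rule_id:<12}{r.score:>4}")
        verdict = "Threshold Exceeded!" if total > threshold else "Insufficient total"
        lines.append(f"    Total:    {total:>4} {verdict}")
        lines.append("")
    return "\n".join(lines)


# Thresholds and trigger phrases below are invented for this example.
categories = {
    "Spam": (5, [
        Rule("NgrPrnc1", "nigerian prince", 7),
        Rule("bPills", "pills", 5),
        Rule("knownPeople", "lisa smith", -3),
    ]),
    "Tech Support": (5, [
        Rule("knownWidgets", "widget", 3),
        Rule("offer1", "offer", -2),
    ]),
}

message = "Nigerian prince offer: cheap pills for your widget, says Lisa Smith"
sheet = score_sheet(message, categories)
print(sheet)
```

Because each line of the sheet is just a rule ID and its score, the total can be checked by hand, which is the point of the approach.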


This is how people did things back in the day: "expert systems" with hand-crafted rules, built by "experts".

From the past, we learned that these systems are brittle and break continuously. For example, what happens when spammers start using different words, or send legitimate-looking emails that are actually spam? Do you think you can build rules to catch 70%, 80%, 90% or 99.99% of spam?

If your goal is simply to show the rules being applied, you can still learn the rules with ML but display them in this way (for example, GP suggested looking at Naive Bayes, which was the most common method used to fight spam; I'd also point you to decision trees, which are easy to visualize).
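
To make that concrete, here is a small Python sketch of learned per-word weights that can be listed like a score sheet. It is a generic multinomial Naive Bayes log-odds calculation with add-one smoothing, not the exact method from the linked essay. The function names and the four training messages are made up, and the class prior is left out because the toy set has equal numbers of spam and non-spam messages.

```python
import math
from collections import Counter


def train_log_odds(spam_docs, ham_docs):
    """Per-word weight: log P(word|spam) - log P(word|ham), add-one smoothed."""
    spam_counts = Counter(w for d in spam_docs for w in d.lower().split())
    ham_counts = Counter(w for d in ham_docs for w in d.lower().split())
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    return {
        w: math.log((spam_counts[w] + 1) / spam_total)
        - math.log((ham_counts[w] + 1) / ham_total)
        for w in vocab
    }


def explain(message, weights):
    """Return [(word, weight), ...] for known words, plus their total."""
    hits = [(w, weights[w]) for w in message.lower().split() if w in weights]
    return hits, sum(wt for _, wt in hits)


# Toy training data, invented for this example.
spam = ["nigerian prince great opportunity", "great opportunity cheap pills"]
ham = ["lisa smith widget order", "widget support question from lisa"]
weights = train_log_odds(spam, ham)

hits, total = explain("great opportunity from a prince", weights)
for word, weight in hits:
    print(f"{word:<12}{weight:>7.2f}")
print(f"{'Total:':<12}{total:>7.2f}")  # positive total leans spam
```

Words never seen in training are simply skipped here, and a positive total leans toward spam while a negative one leans away from it.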


As stated, it wasn't intended for an "expert", but a power user. Somebody has to decide what is spam versus non-spam anyhow in order to make a training set for "learning" based AI. These days you can purchase spam detection systems/services, so training such systems in-house is usually not worth it. They can use rejected messages from thousands of orgs to train their system.

But what I described had additional purposes such as sub-routing to various departments. It was a multi-purpose email categorizer in the early days of spam. Each approach has trade-offs. I'm not sure how you'd apply a "decision tree" using weights in a way that makes sense to a power user. A non-weighted decision tree seems too blunt an instrument. One generally needs multiple "clues" (factors) voting in tandem.



