Perhaps it's somewhat off-topic, but I've built a spam detector similar to the a...

webmaven · on July 24, 2018

You're essentially doing a rough manual version of Bayesian classification on n-grams (which is still very explicable): http://www.paulgraham.com/spam.html

tabtab · on July 24, 2018

The idea of my approach was that a "power user" could add the rules and scores without having to understand something that may take a while to explain. A scoring sheet can be displayed for a given message that would make sense to just about anybody with an associate degree. Example scoring sheet for a given message:

     Category: Spam
       Rule-ID    Score
       ----------------
       NgrPrnc1       7
       bPills         5
       knownPeople   -3      
         Total:       9 Threshold Exceeded!

     Category: Tech Support
       knownWidgets   3
       offer1        -2
         Total:       1 Insufficient total

     Category: Etc...

One could click on the rule-ID as a hyperlink to see specifics of a given rule (if details don't fit on screen).
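A minimal sketch of how a scoring sheet like the one above could be computed, in Python. Only the rule IDs and scores come from the example; the match predicates, the thresholds (8 for Spam, 3 for Tech Support), and the names `Rule` and `score_message` are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    rule_id: str
    category: str
    score: int
    matches: Callable[[str], bool]  # predicate over the lowercased message


# Hypothetical rules: IDs and scores mirror the example sheet,
# but the match conditions are invented for this sketch.
RULES = [
    Rule("NgrPrnc1", "Spam", 7, lambda m: "nigerian prince" in m),
    Rule("bPills", "Spam", 5, lambda m: "pills" in m),
    Rule("knownPeople", "Spam", -3, lambda m: "alice" in m),
    Rule("knownWidgets", "Tech Support", 3, lambda m: "widget" in m),
    Rule("offer1", "Tech Support", -2, lambda m: "offer" in m),
]

# Assumed per-category thresholds (not given in the example).
THRESHOLDS = {"Spam": 8, "Tech Support": 3}


def score_message(message: str) -> dict:
    """Return, per category, the rules that fired, their total, and the verdict."""
    text = message.lower()
    sheet = {}
    for category, threshold in THRESHOLDS.items():
        fired = [
            (r.rule_id, r.score)
            for r in RULES
            if r.category == category and r.matches(text)
        ]
        total = sum(score for _, score in fired)
        sheet[category] = {
            "fired": fired,
            "total": total,
            "exceeded": total >= threshold,
        }
    return sheet


sheet = score_message("Alice, a Nigerian prince has pills and a widget offer")
for category, result in sheet.items():
    print(category, result["fired"], result["total"], result["exceeded"])
```

Because every category keeps its own list of fired rules, the same structure can drive both the spam verdict and routing to a department, and each line of the sheet traces back to exactly one rule.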

halflings · on July 24, 2018

This is how people did things back in the day: "expert systems" with hand-crafted rules, built by "experts".

Experience has shown that these systems are brittle and break constantly. For example, what happens when spammers start using different words, or send legitimate-looking emails that are actually spam? Do you think you can build rules to catch 70%, 80%, 90% or 99.99% of spam?

If your goal is simply to show the rules being applied, you can still learn the rules with ML but display them in this way (for example, GP suggested looking at Naive Bayes, which was the most common method used to fight spam; I'd also point you to decision trees, which are easy to visualize).
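As a rough illustration of "learn the rules, then display them as scores": the sketch below derives a per-token score as a smoothed log-odds ratio from labeled messages, which is the quantity a Naive Bayes filter adds up. It is a simplification, not a full classifier: it ignores class priors and the contribution of absent tokens, and the function names and toy training data are invented for this example.

```python
import math
from collections import Counter


def learn_token_scores(spam_msgs, ham_msgs):
    """Score each token by log(P(token | spam) / P(token | ham)).

    Uses per-message presence counts with add-one (Laplace) smoothing.
    Positive scores point toward spam, negative toward non-spam.
    """
    spam_counts = Counter(t for m in spam_msgs for t in set(m.lower().split()))
    ham_counts = Counter(t for m in ham_msgs for t in set(m.lower().split()))
    scores = {}
    for token in set(spam_counts) | set(ham_counts):
        p_spam = (spam_counts[token] + 1) / (len(spam_msgs) + 2)
        p_ham = (ham_counts[token] + 1) / (len(ham_msgs) + 2)
        scores[token] = math.log(p_spam / p_ham)
    return scores


def explain(message, scores):
    """Return the learned 'rules' that fired, strongest first, and their total."""
    tokens = set(message.lower().split())
    fired = sorted(
        ((t, scores[t]) for t in tokens if t in scores),
        key=lambda pair: -abs(pair[1]),
    )
    total = sum(score for _, score in fired)
    return fired, total


# Toy training data, purely illustrative.
spam = ["cheap pills now", "pills offer"]
ham = ["meeting notes", "lunch meeting now"]

scores = learn_token_scores(spam, ham)
fired, total = explain("cheap pills", scores)
for token, score in fired:
    print(f"{token:10s} {score:+.2f}")
print(f"{'Total:':10s} {total:+.2f}")
```

The output has the same shape as a hand-written scoring sheet (one line per clue, plus a total to compare against a threshold), except that the weights come from the training messages rather than from a person.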

tabtab · on July 24, 2018

As stated, it wasn't intended for an "expert" but for a power user. Somebody has to decide spam versus non-spam anyhow in order to build a training set for "learning"-based AI. These days you can purchase spam detection systems or services, so training such a system in-house is usually not worth it. Vendors can use rejected messages from thousands of orgs to train their systems.

But what I described had additional purposes such as sub-routing to various departments. It was a multi-purpose email categorizer in the early days of spam. Each approach has trade-offs. I'm not sure how you'd apply a "decision tree" using weights in a way that makes sense to a power user. A non-weighted decision tree seems too blunt an instrument. One generally needs multiple "clues" (factors) voting in tandem.