Oh my yes! Machine learning is some of the hardest programming there is. You only ever get an indirect measure of whether it is working correctly, which makes it hard to debug. My algorithm gets it right 80% of the time: have I made an implementation mistake? Who knows?
My general strategy is to invest in training set curation and evaluation. I also use quick scatter plots to check that I can separate the training sets into classes by eye. If it's not easy to do by eye, then the machine is not magic and probably can't do it either. If I can't, it's time to rethink the representation.
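Something like this is all it takes (a minimal sketch; make_classification is just stand-in data for your real X and y):

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification

    # Stand-in data; in practice X and y come from your curated training set.
    X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)

    # Colour each point by its class: if the clusters don't separate
    # by eye, the classifier probably won't separate them either.
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", alpha=0.6)
    plt.xlabel("feature 1")
    plt.ylabel("feature 2")
    plt.show()

For more than two features, plot a few promising pairs, or project to 2D first.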
The author correctly underlines the importance of the training set, but it is equally critical to have the right representation (the features). If you project your data into the right space, then pretty much any ML algorithm will be able to learn on it. In other words, it's more about what data you put in than about the processor. k-means and decision trees FTW.
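A toy sketch of what "the right space" buys you (not the author's example, just an illustration with sklearn):

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.linear_model import LogisticRegression

    # Two classes in concentric circles: hopeless in raw (x, y) space.
    X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
    print(LogisticRegression().fit(X, y).score(X, y))  # roughly chance, ~0.5

    # Re-represent each point as its distance from the origin: now the
    # classes are linearly separable and almost anything can learn them.
    r = np.linalg.norm(X, axis=1).reshape(-1, 1)
    print(LogisticRegression().fit(r, y).score(r, y))  # ~1.0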
EDIT:
Oh, and maybe very relevant: the importance of data normalization. Naive Bayes classifiers assume the features are conditionally independent, so you should PCA or ICA your data first (or both); otherwise correlated features get "counted" twice, e.g. every wedding-related word counting separately toward spam categorization. PCA identifies which variables are highly correlated and projects them onto a common axis, a single measure of "weddingness". Very easy with sklearn: decomposition.PCA with whitening turned on (whiten=True).
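A minimal sketch of that pipeline with current sklearn names (the dataset is synthetic stand-in data):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline

    # Stand-in data with deliberately redundant (correlated) features.
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                               n_redundant=10, random_state=0)

    # Whitened PCA rotates the data onto decorrelated axes with unit
    # variance, so the naive independence assumption is less badly violated.
    model = make_pipeline(PCA(whiten=True), GaussianNB())
    model.fit(X, y)
    print(model.score(X, y))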
Agreed about features and Bayesian filters. Words are just not very good features for filtering spam, but all the numeric data he could easily feed to the Bayesian filter by dividing the data ranges into compartments (like very-many-links-per-word, or over-100-links-per-word).
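A sketch of that bucketing with sklearn's KBinsDiscretizer; the links-per-word numbers are invented for illustration:

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    # Hypothetical numeric feature: links-per-word for each message.
    links_per_word = np.array([[0.01], [0.05], [0.3], [2.0], [150.0]])

    # Carve the continuous range into ordinal compartments; each bin label
    # can then be fed to a Bayesian filter as if it were a token.
    binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
    bins = binner.fit_transform(links_per_word)
    print(bins.ravel())  # e.g. [0. 0. 1. 2. 2.] -> few / moderate / very many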