A Tour of the Top Algorithms for Machine Learning Newbies

polkapolka · on Oct 30, 2018

The image for logistic regression is hilariously wrong. It shows the sigmoid as a decision boundary.

Also dont get hung up about no free lunch theorem. That is a great result in computer science theory with little practical impact: just pick neural networks for unstructured and GBDT for structured data. For the vast majority of real-life problems (not all possible problems) these are the single best algorithms.

bunderbunder · on Oct 30, 2018

Logistic regression isn't sexy, but it can still achieve near state-of-the-art results, is reasonably resistant to bias^H^H^H^H variance, and generates parameters that you can easily explain to someone with no background in math.

There's a lot of value in all that. Especially if your deliverable is something that a business is going to use, and not just a Kaggle entry.

claytonjy · on Oct 30, 2018

> generates parameters that you can easily explain to someone with no background in math

I know it _seems_ that way, but there's a surprising amount of nuance there and I think we're both fooling and limiting ourselves by letting this idea fester.

For one, unlike linear regression, logistic regression estimates aren't collapsible, so you can NOT interpret them as "changing this input by X changes the output by Y". That's only true if your set of covariates is perfect, which is never true, though in practice this interpretation might not be _that_ far off.

Another issue I see is practitioners not being aware of scaled/unscaled estimates; I've seen real papers from AI groups use logistic regression estimates like feature importance rankings, but using estimates in the scale of the original features, and not understanding the distinction when confronted about it.

From a practical sense, I think practitioners are much better served using random forests as their initial exploratory models. Less effort for results that are in practice at least as good as a well-prepped logit. Plenty issues with feature importance there, but not any worse than with logistic regression.

bunderbunder · on Oct 30, 2018

I don't think that's such a big deal in practice. See http://jakewestfall.org/blog/index.php/2018/03/12/logistic-r..., for example.

tl;dr: The upshot is that non-collapsibility means that I can't use LR coefficients for things that I don't really need to use them for, anyway. That doesn't feel like a crippling limitation to me.

(Well, also, I have to occasionally pause to cross my fingers and say, "ceteris paribus," under my breath, which does admittedly make people think I'm some sort of weird Harry Potter nut. Which is OK. They're not wrong, they're just right for the wrong reason.)

Nor does it render its coefficients less interpretable than those of most other models. "Less interpretable than OLS" can still be pretty darn interpretable.

claytonjy · on Oct 31, 2018

I had exactly that post in mind, it really raised my awareness of these issues.

I agree with Jake's interpretation of the conditional interpretation of the estimates, but the practical issue is that virtually nobody not well-educated in statistics will do that correctly. In particular, people tend to do exactly what Jake concedes rarely makes any sense, which is comparing estimates across different model specifications.

You and I might interpret these betas just fine, but if we show them to a less stats-y audience, will they?

bunderbunder · on Oct 31, 2018

I guess it depends. I have the luxury of working in a very "this is machine learning, which is not to be confused with statistical inference" problem domain. It doesn't really even really make sense to interpret most the models I build as describing any sort of causal relationship, and when people are looking at the parameter estimates, they're really just trying to figure out, "What does this model think is important?"

claytonjy · on Oct 31, 2018

That sounds nice!

Feature ranking seems like a clearly safe interpretation of betas, though I've been bitten too often by letting glm (in R) scale my predictors, giving me back estimates on the original scales, and thus incomparable, and seen it happen to others even more. Easy to miss when your original scales aren't all that different.

thanatropism · on Oct 31, 2018

It's not that difficult to compute true marginal effects from logistic regression using something like the bootstrap (if you have a distribution for your coefficients) or explicit differentiation. Every traditional stats app (Stata, SAS, etc) has this.

claytonjy · on Oct 31, 2018

What do you mean by "true" marginal effect? Are you suggesting a post-hoc procedure can correct estimates such that they are close enough to the estimate that would have been produced with a more complete model specification?

joe_the_user · on Oct 30, 2018

Logistic regression isn't sexy, but it can still achieve near state-of-the-art results, is reasonably resistant to bias^H^H^H^H variance, and generates parameters that you can easily explain to someone with no background in math.

As far as I can tell, the GP raised no objection to logistic regression, they simply noted that the illustration didn't actually illustrate logistic regression but something else.

imh · on Oct 30, 2018

Off topic-ish, but what do you mean by ^H^H^H^H? I feel like it's a joke I don't get.

csl · on Oct 30, 2018

https://en.m.wikipedia.org/wiki/Backspace#^H

polkapolka · on Oct 30, 2018

Yeah, it is a good first benchmark. But view interpretability as separate from accuracy. You can explain black box algorithms just fine these days.

Logistic regression is high bias low variance. If you were talking about fairness bias, then resistance to bias comes from logreg being too dumb to recognize complex non-linear patterns. Not necessarily a pro.

bunderbunder · on Oct 30, 2018

Sorry, I misspoke - will edit. Was talking about resistance to overfitting. Which largely comes from logistic regression's assumption of a linear decision boundary. It's true surprisingly often in classification tasks, and, when it's not, you can usually model it just fine with interaction variables.

With an ANN, your easiest defense against overfitting is to have great big heaping piles of training data. That's something that's hard to come by in many interesting situations.

polkapolka · on Oct 30, 2018

Agreed. Logistic regression with poly kernel or good engineering interactions can equal or beat more complex models for a fraction of the budget.

All the more power to you if a solid simple logreg model (or even no ML at all) is your first deliverable.

tomrod · on Oct 30, 2018

Would you mind talking to how black box interpretability is becoming well known? I've seen Shapley values used for feature interpretation, but not sure what else is being done.

polkapolka · on Oct 30, 2018

For an accessible recent overview see: https://christophm.github.io/interpretable-ml-book/

jonathankoren · on Oct 30, 2018

Logistic regression is almost interpretable. It certainly looks interpretable, and it certainly good enough if you’re just trying to make PM feel better or hang some explanatory chrome in a UI, but it’s not truly interpretable.

They maybe directionally interpretable, but that’s about it.

thanatropism · on Oct 31, 2018

This isn't rocket science.

https://www.stata.com/support/faqs/statistics/marginal-effec...

jonathankoren · on Oct 31, 2018

Knowing there is a a probabilistic relationship expressed by the coefficients and saying that “do x and y will happen” isn’t the same thing.

thanatropism · on Oct 31, 2018

Logistic regression essentially gives a conditional probability function, much like linear regression gives a conditional expectation function. You can compute log odds from logistic regression -- say, conditional to all other factors being left-handed makes you twice as likely to some binary effect. People were complaining that this isn't trivially done by staring at the coefficients, but people who can't think in partial derivatives shouldn't be in this business.

OTOH: if you assume an iid framework, the probabilistic marginal effects aren't even needed to go from something like "non-bottle blondes have probability p of being haired, bottle blondes have probability q" to "painting the hair of 1000 women will generate 1000*(q-p) jobs on average". Or you can parameterize a Poisson process for rare events and report exponential/Erlang waiting times. And so on.

jonathankoren · on Nov 1, 2018

That's the problem. Log odds are not intuitive. Hell, even probabilities aren't intuitive, and that's much easier to think about. Look at all the people start crying that "it was wrong" when the less likely event happens when the prediction said 90% probability.

This isn't a crazy minority position "logistic regression is not interpretable" is truism from basic ML courses, and blog posts all over the internet.

thanatropism · on Nov 1, 2018

It's odd that "ML theory" (really data science theory) as proposed by blog posts would supersede established statistics.

Something is rotten in the kingdom of Denmark.

jonathankoren · on Nov 1, 2018

I think you’re just trolling now. More to the point, I think you fundamentally misunderstand what interpretability means.

Logistic regression as not being interpretable was drilled into me by one of the creators of AdaBoost in grad school. As I said, this is widely held position.

Pseudomanifold · on Oct 31, 2018

I agree that the image is wrong, but I find your suggestion a little bit too plain: _what_ kind of neural networks? How do I choose the training strategy, the learning rate, the architecture, etc.

This opens up a can of worms that can be overwhelming for beginners.

The list is not too bad actually, but the phrasing can be improved. For example, Naive Bayes should rather be introduced before LDA, as this will make LDA much more understandable.

Also, LVQ seems a little bit odd---I would rather discuss better strategies for neighbourhood enumeration (approximate kd-trees or something like that).

abhgh · on Oct 30, 2018

Ha! Surprisingly this is not the first time I have seen someone describe this to be the boundary for logistic regression.

(Btw agree with your other comment.)

utopcell · on Oct 31, 2018

Exactly. I stopped reading as soon as I saw that image.

1024core · on Oct 30, 2018

> just pick neural networks for unstructured data

... except that NNs require a large amount of training data.

yorwba · on Oct 30, 2018

General-purpose machine learning algorithms require a large amount of training data. There's no magic that can make accurate predictions without any data to base them on. If you don't have enough data, the information needs to come from somewhere else.

If you have a large amount of data on a similar problem, you can try transfer learning to learn shared properties and only fine-tune the domain-specific stuff on a smaller data set.

If you have no quantitative data, but know domain experts, you can build a custom model based on their advice, with fewer parameters that need to be fit to the data you do have.

But if you have so little data that you can't train a neural network, you can't be getting new data very frequently. It might be cheaper to just pay a human to look at it.

If you don't even have enough data for humans to work with, fancy machine learning isn't going to help you.

polkapolka · on Oct 30, 2018

Sure there are constraint and pros and cons, but still: pick a neural network for unstructured data. Can always unsupervised pretrain and fine tune on a tiny dataset.

swsieber · on Oct 30, 2018

What qualifies as unstructured data? Would you consider text content to be structured or unstructured? (e.g. for classification of documents)

polkapolka · on Oct 30, 2018

Text data traditionally seen as unstructured. Try a simple MLP.

PLenz · on Oct 30, 2018

The algo is almost besides the point. The real work is in getting the data, cleaning it, and then explaining WTF is going on to lay people. Im most DS jobs how you get there is irrelevant.

tw1010 · on Oct 30, 2018

If you can't explain something in a different way from how you learned it, you probably don't understand it adequately.

Wiretrip · on Oct 30, 2018

Nice to see a roundup of ML that doesn't just go straight to Deep Learning for a change :-)

ur-whale · on Oct 30, 2018

Or at all, for that matter.

Wiretrip · on Nov 2, 2018

Indeed!

echelon · on Oct 30, 2018

I'm a machine learning newbie, and I'd really like to speak with someone in the field to figure out what I'm wanting to learn.

The problem I have is that I have a large labeled data set of spoken words and phonemes from a single speaker. I'd like to train a model and generate new phonemes (of various pitches, speeds, and intonations) with which to build a concatenative speech engine.

What algorithms and models would I be looking to use? What are the primary techniques? Is this something I could quickly begin to see results in, or would it take months or years of tweaking?

To clarify what I'm doing, I own/built trumped.com, and I'm trying to improve the speech synthesis quality by generating better fitting units of speech.

PeterisP · on Oct 30, 2018

If you've got a model that can generate new phonemes (of various pitches, speeds, and intonations), then you have a parametric speech synthesis engine and can use it directly as-is instead of strapping a concatenative engine on top.

For the techniques, Wavenet and Tacotron (e.g. https://google.github.io/tacotron/publications/tacotron2/) seem to be the state of art, but they are reportedly hard to replicate.

probably_wrong · on Oct 30, 2018

The simplest way would be downloading something like MaryTTS [1], read the documentation, and train your own voice model. It won't be perfect, but shouldn't be too hard.

The best results would probably be achieved by implementing DeepMind's WaveNet paper [2], but it might be too much for what you need.

I'm not really sure what to suggest in between those two. Some kind of convolutional NN, I guess?

[1] mary.dfki.de

[2] arxiv.org/abs/1609.03499

didibus · on Oct 30, 2018

I've been wondering, is nearest neighbour really ML? No part of the logic is learned from the data. It feels just like a glorified lookup table, where the look up is fuzzy to some predefined definition of nearest.

5minbreak · on Oct 30, 2018

k-nearest neighbors classification is one of the first non-linear supervised learning algorithms. Its predictions are derived from the data sample distances.

It is basically a glorified fuzzy lookup table, but then again, so can one view deep learning (fuzzy hierarchical localized lookup).

Pure memorization can even outperform logistic regression, especially with big data sets, so there is some recent debate as to what degree models memorize and to what degree they generalize.

didibus · on Oct 30, 2018

Interesting, though I don't totally see how deep learning would be similar. On deep learning, it is my understanding the weights are learned from the data. These are effectively constants, and represent logical rules.

So in essence, the rules which relates input to output are learned from the data in deep learning.

In nearest neighbour, the rule wasn't learned, we figured out the rule ourself: "use the nearest data point's result".

But in deep learning, the rule might be something like when feature x and y are between z range of each other and etc. And this rule is not defined by us, but by the weights which are learned from the data.

Effectively, deep learning thus learns the rules that define the relationship between input and class. But nearest neighbor is just a static rule that happens to be pretty general in essence, so it gives okay result for a lot of problems.

Not an AI expert, so take all this as my simple current understanding.

ipsa · on Oct 31, 2018

You could automatically encode a KNN model as a set of logical if-then rules: "if x1 > 10 and x2 < 3 then 4 nearest labels are [1, 1, 1, 0]" so the information is there. For KNN you could also train weights for every variable (how much should they count in the distance calculation?). For deep learning you have way more parameters and architecture choices than for nearest neighbors (mostly the distance metric and the number of neighbors to consider). After that, both learn a mapping from input data to a target.

ummonk · on Oct 31, 2018

What would you think about something like local linear least squares (where it uses linear regression but locally weights the linear regression so that it is designed to fit points closer to the target)?

Is it learning patterns in the data, or is it just a glorified lookup table?

Balgair · on Oct 30, 2018

> The technique (linear discriminate analysis) assumes that the data has a Gaussian distribution (bell curve), so it is a good idea to remove outliers from your data before hand.

Um, this seems very fishy to do, right? What am I missing here?

claytonjy · on Oct 30, 2018

Depends on your goal.

Outlier removal should be done much more carefully when the goal is inference; trying to test if your hypothesis is true. Here, outlier removal is a super easy way to accidentally p-hack. Current best-practice is to pre-register your analysis, including how you'll define and handle outliers.

For predictive goals, where the idea is to predict the class/value of unseen data, outlier removal is often a good way to keep your bias in check and not bias your model towards the outliers. The trade-off is that future outliers will be predicted as though they weren't, e.g. much closer to the mean than they should be. This is what the article is trying to do.

There's also a whole wide world of outlier & anomaly detection, where you want to say e.g. "this new data point is probably an outlier".

ur-whale · on Oct 30, 2018

No mention of Deep Learning in a top ML algorithm list in 2018? Kinda odd if you ask me.

Also, in the SVM section, no mention of kernel methods? (yet the picture shows a windy boundary). Also odd.

joe_the_user · on Oct 30, 2018

The first seven algorithms could be defended as "elementary" methods that would help someone work up to neural nets and deep learning. But once he starts talking about SVM, I think he's talking of a method as sophisticated as neural nets and deep learning.

Neural nets and SVM were competitors - competitively applicable and competitively difficult - in the aughts. Deep learning has now pulled away but not by the discovery of fundamentally more complicated methods. Rather, the process has involved lot and lots and lots of little refinements, through throwing lots of people, advanced-math intuitions and computing power at it, etc. Learning everything needed to create state-of-the-art results is hard (as far I can tell/guess) but the basics are reasonably simple.