I'm only starting out with all that machine-learning, NN stuff and, like many others, I want to ask for some guidance/resources/learning material. What I feel is especially lacking is something very broad and generic, some overview of existing techniques (but not as naïve as Ng's ML course, I assume). There exist a lot of estimators and classifiers, a lot of techniques and tricks to train models, and a lot of details on how to design a NN architecture. So how, for instance, do I even decide that Random Forest is not enough for this task and I want to build some specific kind of neural net? Or maybe I don't actually need any of these fancy famous techniques, and there exists some very well-defined statistical method to do what I want?
What should I read to start grokking this kind of thing? I feel quite ready to go full "DIY math PhD" mode and consume some heavy reading if necessary, but where do I even start?
So how, for instance, do I even decide that Random Forest is not enough for this task and I want to build some specific kind of neural net?
The problem here is that it's really hard to give generic advice. As an analogy this is like asking "how do I know if Rails is enough for this task".
The answer is usually "yes", but the specifics matter a lot.
So in this specific case (and I realize you aren't looking for specific advice here, but I think the principles are useful):
Random Forests are very powerful, and work really well for hundreds, maybe thousands of features, on large but not huge amounts of data and are fairly easy to train.
There are a large number of types of neural networks. One of the big advantages of deep neural networks is that they can reduce the need for manual feature engineering. For example, convolutional neural networks extract features from images that work better than any human-engineered features, and LSTMs (and variations) work well at extracting features from text. The problem with deep neural networks is that they (generally) need a lot of data to train.
So, as usual the answer is "it depends".
In industry, though, 90% of the time the question isn't "what classifier should I use". It's "how do I get the data"/"how do I extract features" and then "let's try all the classifiers and see what works best".
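To make the "try all the classifiers and see what works best" step concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the toy data generator, the three tiny classifiers, and the helper names (`make_blobs`, `cross_val_accuracy`) are hypothetical, and in practice you would more likely reach for a library such as scikit-learn. The point is only the shape of the workflow: hold out data, score every candidate the same way, and always include a dumb baseline.

```python
import random
from statistics import mean

def make_blobs(n_per_class=100, seed=0):
    """Hypothetical toy data: two noisy 2-D clusters labelled 0 and 1."""
    rng = random.Random(seed)
    data = []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (2.0, 2.0)]):
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    rng.shuffle(data)
    return data

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

class MajorityClass:
    """Baseline: always predict the most common training label."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self
    def predict(self, x):
        return self.label

class NearestCentroid:
    """Predict the class whose mean training point is closest."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            pts = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = tuple(mean(c) for c in zip(*pts))
        return self
    def predict(self, x):
        return min(self.centroids, key=lambda l: sq_dist(x, self.centroids[l]))

class OneNearestNeighbour:
    """Predict the label of the single closest training point."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, x):
        i = min(range(len(self.X)), key=lambda j: sq_dist(x, self.X[j]))
        return self.y[i]

def cross_val_accuracy(make_model, data, k=5):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, f in enumerate(folds) if j != i for row in f]
        X, y = [r[0] for r in train], [r[1] for r in train]
        model = make_model().fit(X, y)
        correct = sum(model.predict(x) == t for x, t in test)
        scores.append(correct / len(test))
    return mean(scores)

data = make_blobs()
results = {cls.__name__: cross_val_accuracy(cls, data)
           for cls in (MajorityClass, NearestCentroid, OneNearestNeighbour)}
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:>20}: {acc:.3f}")
```

If a fancier model cannot clearly beat the baseline under the same cross-validation, that is usually a sign the effort belongs in the data and features rather than in the classifier.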
"Try 'em all" is not just an answer, but the only answer.
The No Free Lunch Theorem says that averaged across all possible problems, no single classifier is the best; in fact, they're all equivalent.
However, you probably don't care about all possible problems, but a specific one. Over the last decade or so, we've discovered that deep learning works really well on certain classes of problems, particularly those that may have some kind of nested structure, as in object or speech recognition. If your problem resembles one of those, a deep neural network might be a good place to start.
To paraphrase the learnings of thousands of data scientists on years of Kaggle competitions:
A quick and dirty model for a baseline: Random Forest
Structured data: Use a boosted tree algorithm (specifically the XGBoost implementation of gradient boosting), ensembled with maybe Extra Trees, Random Forests and MLPs
Some kind of time component on large datasets: FTRL regression, XGB
Binary data (images or sound): Deep neural nets
Text: Try LSTMs, but this will often be beaten by manual feature engineering and Word2Vec derived features put into XGB.
LightGBM (https://github.com/Microsoft/LightGBM) is shaping up to beat XGBoost; it has mostly API parity and it won in benchmarks before a v2 with a new algorithm.
I tried LightGBM for a Kaggle. I couldn't get anywhere near XGB.
I was using the LambdaRank stuff. Given the boasting the LightGBM team had done I had assumed it would be close to XGB out-of-the-box for a ranking problem (since XGB only does pairwise ranking). It was far enough away that I had to ask if I was misinterpreting the output[1].
That was 6 months ago now, so maybe it has improved. I know they made big claims.
I don't think anyone has successfully used it for a high result in a Kaggle yet, which - for all its faults - is a good way to see what the maximum performance of a software package seems to be.
LibFFM is the other thing I should have mentioned previously as being worth trying.
Another less-recognised point is that in industry, you also need to ask "how can I maintain this?" and "what can go wrong with my algorithm?".
In one use case, a "blip" in your algorithm might mean showing the wrong kind of advertisement to a user. Not great, but ultimately no big deal. In another, it might mean automatically buying billions of dollars' worth of pumpkin futures (cf. Knight Capital).
In the latter case you need a much greater penalty on model complexity, and much more emphasis on interpretability.
While I agree with your point (and often use this in interview questions), that wasn't what caused the Knight Capital problem.
That was bad software engineering and deployment practices, and had nothing to do with interpretability of the model (actually it had little to do with the model at all). They repurposed a feature toggle, then misdeployed the code: http://pythonsweetness.tumblr.com/post/64740079543/how-to-lo...
I understand that this was an example, but I'm sure someone will misread it as what happened in that case.
Generally for structured data (i.e. each column represents a distinct type of information, such as 'revenue' or 'color') you'll want random forest or GBM.
For unstructured data, where you'd otherwise need lots of complex feature engineering, you'll generally want to let the model learn those features - so use deep learning. E.g. images, natural language, audio...
I've won competitions with random forests and teach deep learning - both definitely have their place, but they are generally for quite different types of data. (This may change in the future, however, with deep learning showing that it has the potential to work well for structured data too.)
(Don't worry about the No Free Lunch theorem - it has little to do with predictive modeling in the real world. Recent research shows that a random forest will give amongst the best results for the vast majority of real world datasets.)
Thanks, I'll try that as well. But then again, this is specifically about deep learning. I'm asking more about something generic, a systematic overview that would help me know that I'm using some specific technique for actual reasons, and not because "deep learning is cool". Something that would include the very basic, "manual" statistics approach as well as an intro to NNs. I mean, I probably know that I need a CNN when I'm presented with a picture, and sometimes I might guess that I want to use an RNN if I'm presented with a text I don't know how to parse, but when I want to predict something given a bunch of numbers and stuff, it is not all that obvious which approach is likely to be "the right one" and which one is probably "because fashion".
Even though you specifically say you are willing to go full-blown PhD and are interested in digging deep into algorithms etc., I strongly recommend working through the "Practical Deep Learning for Coders" course at fast.ai
It's free :)
It gives you an excellent feel for what is possible and they are very focused on solving interesting and practical problems right away. They explicitly try to take the "requires a math PhD" out of deep learning. Once you're through with the course you have a very solid practical overview and understanding and can solve tons of real world problems (it's almost a startup idea generator tbh.) and once you're at that stage it becomes tons easier to dive deep into specific algorithms and optimizations.
tl;dr: Take the course (they also walk you through setting up an AWS GPU server so no fancy hardware required) and you'll be able to solve real world problems with state of the art algorithms.
I'd definitely watch the first few episodes of Ng's stuff, up to and including logistic regression (unless you know all of that already, in which case: read papers and do practice projects for yourself--or compete in Kaggle if you don't have any application ideas)
The most common way to apply machine learning is supervised classification. The basic formula is: we learn a model (a set of weights) to approximately map data (a matrix X) to corresponding labels (a matrix Y). Anywhere you can use logistic regression to learn a set of weights, you can also use a Keras-based neural network.
If all of that makes sense to you already, I think you're well prepared to read Keras' documentation.
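As a rough illustration of "learn a set of weights that maps X to Y", here is a minimal logistic regression sketch in plain Python, trained by batch gradient descent on the log loss. The toy data, the function names (`train_logistic_regression`, `predict`) and the hyperparameters are all assumptions made up for this example, not anything from Ng's course or from Keras. A neural network with a single sigmoid output unit and no hidden layers is the same model; deeper networks stack more learned weights in front of it.

```python
import math
import random

def sigmoid(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=300):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the pre-sigmoid score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the average gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return label 1 if the modelled probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical toy data: points in a square, labelled 1 when x0 + x1 > 0.
rng = random.Random(0)
X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
y = [1 if x0 + x1 > 0 else 0 for x0, x1 in X]

w, b = train_logistic_regression(X, y)
accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)
print("weights:", w, "bias:", b, "training accuracy:", accuracy)
```

Note that the accuracy printed here is measured on the training data, which is fine for checking that the weights were learned but not for judging how well a model generalises.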
It surely does make sense to me, but I seriously think (maybe hope, even?) that the "hacking-driven" approach here is significantly overvalued, for sociological reasons. After all, these are all mathematical problems, and while I'm aware that NNs are pretty much unexplored space, there surely must exist a quite significant amount of knowledge at the level below NNs that can actually be learned systematically. All these various statistical methods the R-lang community is buzzing about which I'm not even aware of, some rationale about "why NN and not just a regression", etc. You know, the math.
If you just pick up a math book, you'll learn lots of stuff that you don't need to know. That's fine, but it strikes me as a good way to avoid actually doing anything and gaining practical experience.
If you hit a wall in practice because you don't understand the math, you'll usually have enough of an idea of the problem to ask more intelligent questions about what kind of math you need. That will, incidentally, help you understand the math better because you're coming to it out of an actual need rather than just seeing it mixed into a bunch of chapters.
Unless you're going to write a machine learning framework or be a researcher, the required math isn't too bad and it sounds like you might have enough of a background already. So don't be afraid to dive into something practical (like a Kaggle competition).
FWIW this is a really good blog for insight into the math and intuition behind deep learning: http://colah.github.io/ (I'm not sure if it's quite what you're looking for though)