"In contrast to deep neural networks which require great effort in hyper-parameter tuning, gcForest is much easier to train."
Hyperparameter tuning is not as much of an issue with deep neural networks anymore. Thanks to BatchNorm and more robust optimization algorithms, most of the time you can simply use Adam with a default learning rate of 0.001 and do pretty well. Dropout is not even necessary with many models that use BatchNorm nowadays, so tuning there generally isn't an issue either. Stacking many layers of 3x3 convs with stride 1 still works like magic.
Basically: deep NNs can work pretty well with little to no tuning these days. The defaults just work.
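To make that concrete, here's a minimal sketch of the recipe I mean (tf.keras; the block helper and filter counts are placeholders, not a prescription):

    import tensorflow as tf
    from tensorflow.keras import layers

    # The "defaults just work" recipe: 3x3 convs with stride 1, BatchNorm after
    # each conv, no dropout, and Adam at its stock learning rate of 1e-3.
    def conv_block(x, filters):
        x = layers.Conv2D(filters, 3, strides=1, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        return layers.ReLU()(x)

    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # the default anyway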
I couldn't disagree more. The defaults don't just work, and the network architecture itself could be considered a hyperparameter; in that case, what would be a reasonable default for all the types of problems ANNs are used for?
Are you using batch normalization? If you are, an issue I see all the time is folks not setting the EMA filter coefficient correctly. In Keras it defaults to something like 0.99, which in my mind makes no sense. I use something around 0.6 and life is good. You want to get an overall good measurement of the statistics, and in my mind the frequency cutoff at coef=0.99 is just way too high for most applications. You usually want something that filters out just about everything except very close to DC.
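For anyone looking for the knob: in (tf.)Keras it's the momentum argument of the BatchNormalization layer (default 0.99). The 0.6 below is just the value I happen to use, not a universal recommendation:

    from tensorflow.keras.layers import BatchNormalization

    # momentum is the EMA coefficient for the running mean/variance;
    # Keras defaults to 0.99. Lowering it is a one-argument change.
    bn = BatchNormalization(momentum=0.6)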
The response to "the defaults should work just fine without any hyperparameter tuning" is "try fiddling with the EMA filter coefficient hyperparameter" ?
It's like the joke of the mathematician giving an exposition of a complex proof. At one point he says "It is obvious that X", pauses, scratches his head, does a few calculations. Leaves room for twenty minutes and returns. Then continues "it is obvious that X" and goes to the next step.
Deep in the field, it's fine for machine learning experts to say "everything just works" [if you've mastered X, Y, Q esoteric fields and tuning methods] since they're welcome to "humble brag" as much as they want. But when this gets in the way of figuring out what really "just works" it's more of a problem.
I think they're referring to the momentum parameter at [1]. The exponential moving average (EMA) of the batch mean/variance is used in the batch normalizing transform (Algorithm 1 in [2]).
The momentum ranges from 0 to 1. If it's close to 1, which the default of 0.99 is, the EMA of the batch mean/variance will change slowly across batches. If it's close to 0, the EMA will be close to the mean/variance of the current batch.
The EMA acts as a low-pass filter. With a momentum close to 1, the EMA changes slowly, filtering out high frequencies and leaving only frequencies close to DC. Note that this is opposite to what grandparent says: 0.99 has a lower frequency cutoff than 0.6 does. So I'm not really sure what they're getting at there.
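A toy sketch of that update (illustrative names, not Keras internals):

    import numpy as np

    def update_running_stats(running_mean, running_var, batch, momentum=0.99):
        # The train-time update of the inference statistics is a plain EMA:
        # momentum near 1 -> running stats drift slowly (heavy smoothing),
        # momentum near 0 -> they track the current batch almost exactly.
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        running_mean = momentum * running_mean + (1 - momentum) * batch_mean
        running_var = momentum * running_var + (1 - momentum) * batch_var
        return running_mean, running_var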
They work well, it's just that you need a lot of patience (and know-how) to work with them. Also, GPUs are expensive. By the time you realize that you messed up, you have wasted a lot of time. Of course this is true of any ML algorithm out there. But what I'm trying to say is that it is possible an as-yet-unknown method exists that is less computationally complex.
One of the problems I see is that people abuse deep neural networks no end. One doesn't need to train a deep nn for recognizing structured objects like a coke can in a fridge. Simple HOG/SIFT/other feature engineering may be a faster and better bet for small-scale object recognition. However, expecting SIFT to outperform a deep neural net on ImageNet is out of the question. Thus, when it comes to deploying systems in a short time frame, one should keep an open mind.
> One doesn't need to train a deep nn for recognizing structured objects like a coke can in a fridge.
I disagree. Sure, you don't need a NN to recognize one Coke can in one fridge for your toy robot project. If you want to recognize all Coke cans in all fridges, for your real-world, consumer-ready Coke-fetching robot product? You're going to need a huge dataset of all the various designs of Coke cans out there, in all the different kinds of refrigerators, and your toy feature engineered approach is going to lose to a NN on that kind of varied dataset.
I'm serious. If you have to rely on mono, single-image inputs then yeah, an ImageNet-style net is going to do better. But it will also mistake every picture of a coke can for the real thing. It will be horrifically sensitive to malicious inputs. Much better would be to use 2 calibrated lenses and do 3D reconstruction, even if you're just doing the reconstruction as a sanity check for a NN to weed out the false positives.
Errm, hang on, are you saying that if you have a task of classifying unseen images given a labelled training set you should get a stereo camera or video camera and create another problem?
Which you can solve?
Because the problem is silly?
What if I say: "I will give you $10m to solve it, and if you fail, I will kill this very kind old monkey?"
Object recognition doesn't only exist in the subspace of labelled 2D images. It tends to be derived from a 3D space, which is a whole extra orthogonal data source that the "NN all the things" crowd is studiously ignoring.
Why, I'm not sure, but I'm guessing because it is hard/inaccurate to do with just NNs and parameter/network architecture tweaking. Possibly also because benchmarks with single mono images are much easier to make.
Just because it is hard with method A, and is harder to make benchmarks, doesn't mean method B isn't better.
Yes, but am I missing something when I say that, if the problem is to deal with labelled 2D images, declaring that you should be working with 3D images or short video sequences doesn't help?
Sure, if you are building a Robot and I say "use this camera and a deep network" and you say "It'll work better with stereo" well... yes super do that!
But if we are working with mono images I don't understand how the observation helps?
> If you want to recognize all Coke cans in all fridges, for your real-world, consumer-ready Coke-fetching robot product?
If you're stuck with a mono dataset, post collection, then sure use NN and call it a day. But even if you have video you can do 3D reconstruction just from baseline movement. You won't know scale, so you can't differentiate between big coke cans and little coke cans, but at least you can rule out pictures of coke cans.
> But an NN can completely mess up when a new refrigerator is used that wasn't part of the training set
Not if your training set is representative. And this is just as true of feature engineered approaches, the only difference is that dealing with real world variation requires a lot less work with NNs because once you add the variation to your dataset you're done. With feature engineering that's only the first step because now you have to figure out where the new variation is breaking your features and how to modify them to fix it.
And herein lies a prominent failure mode of a huge amount of this sort of work that I've seen: it's hard to just "add the variation to your dataset" when your dataset is one or more orders of magnitude too small to contain it. At that point all that remains is handwaving.
The right response to insufficient data is usually simplifying the modeling.
I'm not sure about that. The new GAN models over the past 2-3 months, like LS-GAN or WGAN, all seem to train much more stably. I've beaten up on WGAN with all sorts of strange tweaks and hyperparameter settings and while it may not work well, it's never catastrophically diverged on me the way DCGAN would at the drop of a hat.
I've always found it curious that Neural Networks get so much hype when xgboost (gradient boosted decision trees) is by far the most popular and accurate algorithm for most Kaggle competitions. While neural networks are better for image processing types of problems, there are a wide variety of machine learning problems where decision tree methods perform better and are much easier to implement.
The hype for neural networks is deserved. Some major contributions to the field resulted in increases in accuracy for fields like NLP, computer vision, structured data, machine translation, style transfer, etc.
XGBoost did not change much from the "Greedy function approximation: A gradient boosting machine." paper, but uses a few tricks to be much much faster, allowing for better tuning.
I'd say Tensorflow/Keras can handle a wider variety of problems, with the same or improved accuracy, than tree-based methods can. NNs do well on structured problems (the domain of tree-based methods), but also own computer vision and, increasingly, NLP. I agree on the pitfalls of applying neural nets vs. forests.
It is true that tree-based methods are academically a bit out of vogue: The exciting stuff is happening in the neural network space. You have more chance of getting published with deep learning (this used to be the other way around).
I agree that neural nets are state-of-the-art and do quite well on certain types of problems (NLP and vision, which are important problems). But a lot of data is structured (sales, churn, recommendations, etc), and it is so much easier to train an xgboost model than a neural net model. You need a very expensive computer or expensive cloud computing to train neural nets, and even then it is not easy. Ease of implementation is an important factor that gets overlooked in academia. And on non-NLP and non-image datasets, usually the single best Kaggle model is an xgboost model, which was probably developed in 1/10th the time it took to make a good neural net model. Xgboost has come a long way since it was first introduced, with early stopping being an example of a significant improvement.
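To illustrate how little ceremony an xgboost model with early stopping takes, here's a rough sketch (synthetic stand-in data, untuned parameters):

    import numpy as np
    import xgboost as xgb

    # Synthetic stand-in for a structured/tabular dataset.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    dtrain = xgb.DMatrix(X[:800], label=y[:800])
    dvalid = xgb.DMatrix(X[800:], label=y[800:])

    params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 6}

    # Early stopping: keep adding trees until the validation metric stops improving.
    booster = xgb.train(params, dtrain,
                        num_boost_round=1000,
                        evals=[(dvalid, "valid")],
                        early_stopping_rounds=50)
    print(booster.best_iteration)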
Academia does not have to run the model in production. It has few computational constraints, and most datasets do not have a feedback loop, requiring combating drift, debugging, and retraining. Papers are often accepted when they equal or beat state-of-the-art. Not many academics have to deal with the business side of running models in prod.
All of this leads to ease of implementation being overlooked. Especially on NLP, you see a lot of overengineering with deep neural nets (where the feature engineering is hidden inside the architecture). These models are hard to implement/reuse.
But yeah: academia/theoretical machine learning creates the very tools for applied machine learning.
There are lots of reasons, chief among them that Kaggle competition problems represent only a tiny fraction of ML problems.
On general-purpose ML, without much time spent rigging up an optimal solution, RF/XGBoost will perform better. But on many problems, e.g. vision, DL is vastly superior.
The other important point is that there are many new opportunities for researchers in DL, whereas the statistical ML approach to supervised problems is a much more established field.
I think the hype around CNN is the NLP aspect of it as it relates to AI. If you can hammer down NLP and translate voice or text to computer-legible commands, you've really improved the user experience.
On the other side of the CNN coin is the image recognition that's getting a lot of hype from the self driving auto crowd. I think any data scientist worth their salt understands how the different algorithms stack up against each other. You wouldn't use xgboost for a computer vision problem just like you wouldn't use CNN for a tabular data problem.
I don't know about the others, but the two vision datasets they compare on (MNIST and the face recognition one) are small, and the CNNs they compare against don't seem very state of the art.
It also seems each layer of random forest just concatenates a class distribution to the original feature vector. So this doesn't seem to get the same "hierarchy of features" benefit that you get in large-scale CNN and DNN.
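Roughly what I mean, as a sketch with sklearn forests of how I read the cascade (the paper's multi-grained scanning step is left out, and the paper uses k-fold/out-of-fold class vectors rather than in-sample ones):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

    def cascade_level(X_orig, y, X_level):
        # Fit a few forests on this level's features, then concatenate their
        # class-probability vectors back onto the ORIGINAL features to form
        # the next level's input. There is no joint optimization across levels.
        forests = [RandomForestClassifier(n_estimators=100),
                   ExtraTreesClassifier(n_estimators=100)]
        probas = []
        for f in forests:
            f.fit(X_level, y)
            probas.append(f.predict_proba(X_level))  # paper: out-of-fold estimates
        return np.hstack([X_orig] + probas)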
To your point that they are comparing on small datasets: I don't see that as a problem. If they achieve better results on small datasets, that is a great achievement, as often the bottleneck is the size of the dataset rather than computation time.
> often the bottleneck is the size of the dataset rather than computation time
That's generally true for DNNs, which is a good place to be if you have lots of data. This typically isn't true for tree based approaches, which is why they fell out of fashion in some problem domains; they don't generalize as well. This paper doesn't seem to change what we already know in this respect.
^ The authors note the time and effort it takes to create state-of-the-art CNNs, but their point-of-comparison CNNs look fairly simple. I don't see an AlexNet or similar for some of these tasks, even just as a point of comparison, relevant or not.
"if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems"
The No Free Lunch theorem is basically a consequence of the fact that almost all problems 'look random'; it doesn't really apply to the tiny subset that are of interest to humans.
An often-missed point is that NFL has a practical consequence: learning doesn't happen if the model architecture is not well adapted to the structure of the data. The mere presence of structure is not enough for any algorithm to do well; the model itself needs to be aligned with the structure in the data. This applies to animals and humans too: by inserting structural priors, evolution sacrificed performance on certain kinds of problems to do well on the kind we met in the natural world.
You can read more in: Antipredictable Sequences: Harder to Predict Than Random Sequences by Huaiyu Zhu and Wolfgang Kinzel
> it doesn't really apply to the tiny subset that are of interest to humans.
That's a very misleading TL;DR. NFLT certainly _applies_ -- deep networks are not immune to NFLT -- it's just that NFLT isn't very useful because we can't use it as a basis for decisions. You can't detect that your algorithm's performance is being limited as a consequence of NFLT; and even if it were consequential, you would just see that the algorithm wasn't working very well, and there are _much_ more likely causes for that than NFLT.
If you have two predictors p0 and p1 then it can be shown that the mixture p(data) = (p0(data) + p1(data)) / 2 incurs at most 1 bit of loss compared to the better of p0 and p1.
In other words, for any pair of machine learning algorithms there is an algorithm which only performs marginally worse than either of the two on any given problem, and which may perform arbitrarily better. NFL equalises these two cases by smearing tiny losses over vast regions of possibility-space that are vanishingly unlikely.
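The bound is just -log2((p0+p1)/2) <= -log2(max(p0,p1)/2) = min(-log2 p0, -log2 p1) + 1. A quick numeric sanity check of that claim:

    import numpy as np

    rng = np.random.default_rng(0)
    # Per-datum probabilities assigned by two predictors.
    p0, p1 = rng.uniform(1e-9, 1.0, size=(2, 100_000))

    mix_loss = -np.log2((p0 + p1) / 2)
    best_loss = np.minimum(-np.log2(p0), -np.log2(p1))
    assert np.all(mix_loss <= best_loss + 1 + 1e-9)  # at most one extra bit, always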
I often hear the claim that NFL just implies that different algorithms and learning strategies will succeed for different tasks. And that NFL has no practical consequences apart from that.
Is there a resource I can refer to, that is clear and explicit about the reasoning?
NFL is probably the most misquoted theorem in ML.
The subtle elements of the theorems (both for inference and search) have had a huge impact on the field of optimization theory. They touch on (Kolmogorov) complexity, incompleteness, the halting problem, later related work on the physical limits of inference (one cannot know everything about the universe if you are a part of it), and philosophy (is the universe itself a computer?).
That different algorithms and search strategies succeed on different tasks does not imply that we have to try them all: we can use prior knowledge of what worked on related tasks. Also, the tasks we care about are only a very small (explorable) subset of all possible tasks.
The fact that different algorithms are best for different tasks has more to do with computational complexity.
If we didn't care about resource usage then we could just pick a sufficiently expressive model class (e.g. all algorithms) to mix over, and perform Bayesian inference. You can look up AIXI and Solomonoff induction for more thoughts along these lines.
There is a theoretically optimal machine learning algorithm called Solomonoff Induction. Solomonoff Induction assumes a prior over all possible computer programs that could have produced the data. And it assumes that shorter, simpler programs are more likely than longer, more complex ones. Under such a prior, the no free lunch theorem doesn't apply.
NFLT assumes that all machine learning problems have infinite information. Infinite Kolmogorov complexity. That all positive and negative examples are labelled completely randomly without any underlying pattern or reason. Which is obviously untrue.
Solomonoff Induction isn't really an "algorithm" in the way we normally think of algorithms, as it isn't computable. There are computable approximations, but at that point, you lose the claim of theoretically optimal.
> There are computable approximations, but at that point, you lose the claim of theoretically optimal.
True, but I'm surprised more work isn't being done on "searching the space of programs that produce the data". There's very little research on this topic other than a few papers on minimum description length (MDL). I feel that this is probably the eventual route to AGI. We know what the "optimal" predictor is in theory; now work out the best time/memory approximation to it for practical purposes.
There's lots of work being done related to program synthesis and inductive logic programming. You can even view the more sophisticated recurrent neural nets, with more complex memory structures, as differentiable programs. Then you're searching the space of programs guided by gradients. SGD effectively acts as an additional prior (assumption) through the kind of solutions it tends towards.
The reason you (the general you) don't hear much about them unless you go looking is that the state of the art hasn't budged much in the past couple of decades. It's the same graph algorithms, searching, and sorting toy problems. The search space over programs is difficult to traverse, and it remains to be seen what the added compute power plus gradients gets us.
On a more practical level, the learning-to-search paradigm can be viewed as also searching for a particular program under certain strict constraints that make the search tractable. Probabilistic programming, where the priors and likelihoods are themselves complex programs instead of simple distributions from the exponential family, is effectively also a search for programs.
This is pedantic and I'm not even certain correct. Full SI will run forever without returning an answer, sure. But as it runs it will get closer and closer to the true answer. Eventually it will approach an answer within whatever degree of precision you want. And it will be more accurate than any alternative algorithm, so it's still optimal.
"All remaining problems" in this case means the universe of all potential functions. Everything. Only a vanishingly small subset of every kind of function or dataset are of interest or even observable in nature.
That doesn't apply to an ensemble of algorithms where the weights of a given member of the ensemble are adapted based on observations from the given domain. If it did, humans wouldn't be able to choose a good algorithm for specific cases, and obviously we can.
Deep neural networks can be thought of as ensembles of smaller neural networks, though of course each member of the ensemble is going to share some degree of algorithmic bias. This suggests that perhaps deep neural networks with heterogeneous activation functions and branching structures will perform better than homogeneous networks.
The No Free Lunch theorem doesn't really care how your algorithm does it, and an ensemble of algorithms is just another algorithm. It certainly applies.
The reason it's not actually all that useful for real-world applications is that the problems we value are a subset of "all possible problems", for want of a better explanation. Problems are typically smooth, that is, a small change in input variables results in a small change in 'fitness', which already limits how applicable the NFL theorem is.
The human brain is exactly such an ensemble of algorithms, and look at all the cognitive biases we have[1]. That our brain was developed for foraging, hunting, and reproducing could explain why the code we write is so buggy.
Look I'm just kidding around. Every algorithm has its advocates, and the arguments get tedious after a while.
> If it did, humans wouldn't be able to choose a good algorithm for specific cases, and obviously we can.
This is a surprisingly commonly held fallacy in some AI circles. It's the idea that humans are mathematically perfect. When you phrase it that way, it's fairly obviously false, but you still see a lot of people argue things like "NFL doesn't apply to ensembles because humans..." or "machines can never be as intelligent as humans because...".
The reality is that humans are subject to the same mathematical laws as machines. It's far more likely that my brain can't solve an NP-hard problem in polynomial time either. My brain can't beat random search on the set of all possible problems.
It doesn't imply that humans are mathematically perfect, but our brains are basically algorithm-generating algorithms; we're not just weighting a set of preexisting solutions. To generalize quite a bit, saying NFL applies to general intelligence ends up implying there are problems for which efficient algorithms exist but which intelligence is literally incapable of discovering. That seems pretty absurd to me.
The problem is that if our brains are not "magic" (essentially making this a religious argument), then they're operating via some algorithmic principles. If our brains are "algorithm generating algorithms", then I can in principle write an "algorithm generating algorithm" in silicon that does the same thing. And I know with mathematical certainty that my digital version is subject to NFL. So either my brain is too or we go back to religion to explain what non-physical process is responsible for our super-turing capabilities.
The second point is that NFL is often interpreted in a weird way, where we only think about "interesting problems". It is defined on the set of all search or optimization problems. It says if your algorithm is better than exhaustive search over some subset of problems, it must be worse on the complement of that set.
What does it mean for an algorithm to be efficient? Well, it's roughly speaking the number of steps it needed to take (assuming each step is the same amount of work, blah blah). OK, so an "efficient" algorithm must, by definition, prioritize some steps over others -- it's picking the "best" steps to take each time it has the choice. OK, so I'll just make up some instances of the problem that are custom-tailored so that what the algorithm thinks are the "best" steps always lead it in the wrong direction. Your algorithm will then be worse than exhaustive search on my set of problems, precisely because it's choosing to avoid the steps I know to be good -- I defined the problem to make that happen.
That is true of any algorithm you can conceive of. There will be problems for which the bias that makes the algorithm good on the problems you intended it to work on will be exactly the wrong bias.
It doesn't matter if you say, "Aha, I'll let my algorithm generate new algorithms! Gotcha!" I'll just design a set of problems for which your algorithm generating algorithm will generate the wrong algorithms.
Search is always about bias. Without bias, you have random search -- that's literally the textbook machine learning definition of bias. All NFL says is that if you have to worry about every possible problem, any bias you choose will be worse than random sometimes. There's no escape clause here. Your algorithm generating algorithm is still a search algorithm with its own biases, and it will still be worse than random on some subset of all possible problems.
It doesn't do well - e.g. on the toy example of MNIST they report 98.96% which is, frankly, rather horrible.
They claim that deep neural approaches get 98.75% or 99.05% by referencing obsolete, decade-old results, while in fact the state of the art exceeds 99.8% (i.e. a 0.2% error rate, five times lower than the 1.0% error rate reported in this paper). I have seen MNIST given as a homework exercise for undergraduate students in an ML/NN class, and getting 99.0% would indicate that your code has some serious bugs; a decent undergrad with no prior experience can get 99.7% on MNIST after a few lectures introducing the basics and a dozen hours of homework coding practice.
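For a sense of the baseline being brushed aside: a bog-standard small convnet along these lines (a sketch, untuned; the exact number varies from run to run) should already land around or above 99% on MNIST after a few epochs on a CPU:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train = x_train[..., None].astype("float32") / 255.0
    x_test = x_test[..., None].astype("float32") / 255.0

    # A deliberately boring small convnet: no tuning, no augmentation.
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.1)
    print(model.evaluate(x_test, y_test))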
It might be an alternative for deep neural networks, but I doubt it will be an alternative for deep convolutional neural networks. The key breakthrough for deep learning was the stacking of many convolutional layers and being able to train those (key elements include SGD for large scale, dropout or batch norm, etc.). Perhaps if they build some convolutional features for the decision tree they will be able to achieve comparable results, but it will start to look a lot like a deep convolutional neural net (i.e., the same thing with a different name).
If someone proposes a method as an alternative for a field, they need to test this method on the accepted benchmark dataset for that field. For object recognition in static images this dataset is the ImageNet competition. Computing power can be bought from AWS if no cluster is available. The lack of it can't be an argument.
Not saying that the paper has no reason to exist, I think it is generally well written and decision trees certainly deserve attention. If they can do representation learning on high level this is certainly something to look into. But it shouldn't claim to be an alternative to state-of-the-art deep learning if there is no data for this comparison. Everyone can solve MNIST (or even CIFAR).
That's exactly what I was going to post, but you beat me.
I found surprising that they claim competitive performance over neural networks just using a CPU. The advantage I think neural networks have over all other methods is how efficiently they run on GPUs. I would love to see other methods take advantage of modern hardware or parallelism and see if they can catch up with neural networks on image classification.
While optimizations to cost by ditching GPUs as a requirement are important (and presumably these systems benefit from GPU optimization as well, seems unclear from my skim of the paper), cheaper training is NOT just about saving your wallet.
A real emerging area of opportunity is having systems train new systems. This has numerous applications, including assisting DSEs in the construction of new systems or allowing expert systems to learn more over time and even integrate new techniques into a currently deployed system.
I'm not an expert here, but I'd like to be, so I'm definitely going to ask my expert friends more about this.
Yes, as a plugin, and on its way to make it to the trunk AFAIK. I've integrated it into deepdetect recently because even as a beta it works well and complements GPU based DNNs fairly naturally. Deep learning practitioners are already well equipped with GPUs so having XGBoost run on them as well is a good bonus!
None of these experiments actually do anything to show feature learning - if that is the claim, I would like to see a transfer learning experiment. I would be surprised if this works well, since they can't jointly optimize their layers (so you can't just use ImageNet to induce a good representation). It's not quite clear why we should think that trees will turn out to be inherently cheaper than a DNN with similar accuracy, unless perhaps the model structure encodes a prior which matches the distribution of the problem?
The method's performance on MNIST is relatively mediocre. You might think 98.96% is amazing, but it's about relative performance. It is a relatively easy exercise nowadays to get above 99% with neural nets. Even I can get that kind of performance with hand-written Python neural nets, on the CPU, with no convolutions.
For the rest of the (non-image) datasets, it's already common knowledge that boosting methods are competitive with neural nets.
Progress in this field is astonishing and it really propagates to the masses in the form of easy-to-use black boxes with a pinch of undergraduate-level maths. Just wow!