Wow, this paper is great. Back in November, I implemented the "Trend or No Trend: A Novel Nonparametric Method for Classifying Time Series"[1] paper in node.js and then hooked it up to Twitter's API. The algorithm was able to find trending topics pretty quickly and accurately. Does anyone know if the source for these algorithms is implemented anywhere?
These algorithms are implemented in just about every major language, as they are standard tools. I'd say your best bet for leveraging them is Python's scikit-learn (super good documentation).
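For instance, two of the paper's top 10 (k-means and k-NN) are a few lines each in scikit-learn. A minimal sketch on made-up toy data (the data and parameter choices here are just placeholders, not from the paper):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import KNeighborsClassifier

    # Toy data: two Gaussian clouds of 2-D points.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    # k-means (unsupervised): find 2 cluster centers.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)

    # k-NN (supervised): classify a new point by its 3 nearest neighbors.
    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    print(knn.predict([[4.5, 5.0]]))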
I'm surprised that there aren't any PCA/Feature-Extraction/Dimension-Reduce algos on this list. And there's also a really strong emphasis on clustering and classification.
Is this an artifact of how the list was put together or is clustering really that much more common than feature extraction? Is PCA thought to be too trivial to qualify?
PageRank is a similar eigenvalue-based algorithm, and I think PCA can be seen as a relaxation of k-means, so they sort of did.
But yes, if I wrote the list, eigenvector-based methods would be more prominent.
Also, if I wrote it, the tree-based methods would have been replaced with random forests, which are similar to CART but use the crucial ideas of sample bagging and random feature subsets, which have proven themselves over and over.
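To make the contrast concrete, here's a rough sketch (scikit-learn, made-up data purely for illustration) of a single CART-style tree versus a random forest, i.e., many such trees grown on bootstrap samples with a random feature subset considered at each split:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Made-up classification problem, just for illustration.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    tree = DecisionTreeClassifier(random_state=0)   # single CART-style tree
    forest = RandomForestClassifier(
        n_estimators=200,       # bag 200 trees grown on bootstrap samples
        max_features="sqrt",    # random feature subset considered at each split
        random_state=0,
    )

    print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
    print("forest:", cross_val_score(forest, X, y, cv=5).mean())

On problems like this the forest usually comes out ahead in cross-validated accuracy, which is the point.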
They also have a good selection of articles, though they tend to be somewhat deep-learning/neural-network focused. These algorithms have seen a big surge in interest and success of late but may not be as applicable to "data mining" (as opposed to, say, image or speech recognition).
I think PCA is better understood in the theorem-proof style of mathematics rather than algorithmically. For example, PCA can be understood as an optimization problem, whereas CART and k-NN and the like are best understood algorithmically. I know k-NN was introduced as an algorithm and only later figured out in terms of probability, etc.
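E.g., the optimization-problem view is just: find the unit vector w maximizing the projected variance w'Sw, which turns out to be the top eigenvector of the sample covariance S. A small numpy sketch of that view (toy data of my own, purely illustrative):

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)

    # Center the data and form the sample covariance matrix S.
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)

    # The direction maximizing projected variance w'Sw (with ||w|| = 1)
    # is the eigenvector of S with the largest eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(S)
    w = eigvecs[:, np.argmax(eigvals)]

    print("first principal component:", w)
    print("variance captured:", eigvals.max())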
Is it just me, or was anyone else taken aback by an academic article using a linkbait-esque title? I'm sure the title is perfectly reasonable, given that it's a survey paper, but the past few years of being bombarded with articles like "10 weird tricks to optimize your everything" immediately trips my skepticism meter. Funny how the internet warps the brain.
Unlike most Top 10 lists, which are deliberately designed with unexpected choices for shock value, the paper explains exactly how the 10 were chosen and attempts to justify why each choice was made.
If the paper was titled "Wow, The Top 10 Algorithms in Data Mining Are Incredible, And I Can't Believe Where Naive Bayes Placed," then yeah, that might be link-baity.
Statistics professors HATE him! Doctor's discovery revealed the secret to learning any problem with just 10 training samples. Find a separating hyperplane with this One Weird Trick.
Maybe I'm being selective about my memory, but I don't remember the awful linkbait titles being nearly so popular in 2007 when this was written. I don't think there's even the hint of imitation of Buzzfeed-style titling here. Also, I've taken a class from one of the authors, and I'd be surprised if he'd ever even seen those kinds of pages. :)
He's on about the problem of data mining being used by the NSA and corporate America to do Bad Things.
And what are you on about?
I'm guessing you're on about the "tool is only as good or evil as the wielder."
Trying to take this back out of the meta, our society needs to figure out how to stop hurting each other. I think a good dose of critical thinking skills for our youngsters would go a good distance toward helping that.
I guess that's sort of like saying computers are evil because they are used by the NSA. Is there even any evidence that they use it for anything significant? Let alone that this guy's work is contributing to it.
Computers have many applications, while data mining is generally used with working sets consisting of information about large groups of people - either to get a strategic leg up on them (say, price discrimination or advertising susceptibility), or to single out deviant individuals.
I don't think the entire field is strictly a universal evil per se, but with the current climate of centralization I think any good that it can bring is massively dwarfed by the bad.
Similarly, missiles themselves aren't a universal bad as they can be used for self defense, but in the current world basically their entire purpose is to preserve empires.
I don't think that's the case. In theory it can be used to "find deviants", but has that actually ever been done and to what degree? Improving advertisements isn't horrible.
It has a lot of applications in science, could vastly improve medicine, it's used to predict energy usage for power plants, filter spam, recommend movies or products, improve harvests for farming, etc. But I guess those things don't sell papers.
And lastly calling a technology "bad" isn't actually helpful. We can't stop bad people from using the technology or "uninvent" it. You'll just damage the people using it for good. See the Halo Effect where people told about the risks of nuclear power perceive the benefits as less (http://lesswrong.com/lw/lj/the_halo_effect/).
By "finding deviants", I mean exactly what the NSA is doing, and what every Keystone police department will be doing in ten years. And yes, improving advertisements is horrible - this is precisely one of the anti-human facets I am talking about.
> It has a lot of applications in science, could vastly improve medicine, it's used to predict energy usage for power plants, filter spam, recommend movies or products, improve harvests for farming, etc
You're right. My initial comment was way too harsh and broad, indicting the whole field rather than just the in-our-face web-scene applications.
> lastly calling a technology "bad" isn't actually helpful. We can't stop bad people from using the technology
Of course you can't uninvent something, but you can slow down or speed up its progress. Path dependence is a thing. Currently, most people do not feel empowered with respect to computers. The more we put them at the mercy of seemingly unpredictable computer systems that are actually trying to outwit them, the harder it is going to be for them to get over those feelings of disempowerment so that we can get to a place where people expect their computers to work for them.
I've seen no evidence that the NSA does that, and really it makes no sense. Doing a search for "abnormal" behavior would mostly get you people that visit odd websites, or are awake at atypical hours, or spend too much time online, etc. Nothing remotely useful. There is a lot of hype around data-mining but there are diminishing returns on what can be done with it.
I agree that ads are bad, but they aren't that bad. Targeted advertising is mostly just finding things that are relevant to the website or viewer. And it's so incredibly easy to block ads on the internet that I don't really have much sympathy for those who don't and complain about them. Ads were way worse in the age of television, where 25% of all content was commercials.
>Of course you can't uninvent something, but you can slow down or speed up its progress. Path dependence is a thing.
Perhaps, but for the most part the technology already exists. At its core it's just basic statistics or sometimes even simpler techniques. Diminishing returns were already hit long ago - I can't think of much that would improve the "bad" applications more than the "good".
>The more we put them at the mercy of seemingly unpredictable computer systems that are actually trying to outwit them, the harder it is going to be for them to get over those feelings of disempowerment so that we can get to a place where people expect their computers to work for them.
I.. what? On the rare occasions where machine learning is used in apps, it's to improve user experience, or do something that would otherwise be too complex or inefficient to code by hand. I'm sure someone, somewhere, abused the technology in their app - just like every programming technology ever. But in general I don't see how this is bad or even that it makes computers more "unpredictable".
Kevin Murphy's treatment of kernel methods in "Machine Learning: A Probabilistic Perspective" offers a take on SVMs that I don't often find in most discussions. He makes the case that SVMs are generally rather unusual from a Bayesian point of view. Apropos of other sparse vector machines, Murphy opines:
"Note that SVMs are very unnatural from a probabilistic point of view. First, they encode sparsity in the loss function rather than the prior. Second, they encode kernels by using an algorithmic trick, rather than being an explicit part of the model. Finally, SVMs do not result in probabilistic outputs, which causes various difficulties, especially in the multi-class classification setting.
"It is possible to obtain sparse, probabilistic, multi-class kernel-based classifiers, which work as well or better than SVMs, using techniques such as L1VM or RVM... However, we include a discussion of SVMs, despite their non-probabilistic nature, for two main reasons. First, they are very popular and widely used, so all students of machine learning should know about them. Second, they have some computational advantages over probabilistic methods in the structured output case... [specifically, SSVMs]
...
"The conclusion is as follows: if speed matters, use an RVM, but if well-calibrated probabilistic output matters (e.g., for active learning or control problems), use a GP. The only circumstances under which using an SVM seems sensible is the structured output case, where likelihood-based methods can be slow. (We attribute the enormous popularity of SVMs not to their superiority, but to ignorance of the alternatives, and also to the lack of high quality software implementing the alternatives.)"
----
the tl;dr is that SVMs are an important tool given their historical impact and popularity, but the modern data scientist has a range of cookie-cutter algorithms to choose from that could well be more useful.
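To make Murphy's "no probabilistic outputs" point concrete: a plain SVM gives you a signed distance to the separating hyperplane, and scikit-learn only produces class probabilities by fitting Platt scaling on top afterwards. A rough sketch on made-up data (details mine, not Murphy's):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, n_features=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # probability=True fits an extra Platt-scaling model on top of the SVM,
    # because the SVM itself only outputs distances to the separating hyperplane.
    svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X_tr, y_tr)

    print("margins      :", svm.decision_function(X_te[:3]))
    print("Platt-scaled :", svm.predict_proba(X_te[:3]))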
Okay, given a box of data, shove it into one of these algorithms, let the computer run, and now what? What do we have?

E.g., for some 'classifier', what the heck do we have? That is, what guarantees do we have? That is, is there something specific and guaranteed that the algorithm has actually done?

I know; I know; in some vague, unspecific sense, that seems to work in some practical cases, the algorithm will 'classify'. Still, we don't have a precise description of (1) in what cases the algorithm works and (2) how well. E.g., what is the rate (probability) of false classifications?

Sure, we'd need some hypotheses to say. Okay, what the heck are the algorithms assuming for hypotheses that would let there even be a rate of false classifications? Then, with those hypotheses, what are the probabilities of a false classification?

Without such considerations, about all we have is some code that runs.
The body of literature you are looking for is called statistical learning theory. If you are interested, you can start with Vapnik's two books, but that is just what I said: a start. The theory of learning has shifted quite a bit, but the fundamentals remain the same. Once you are up to speed with Vapnik you can start reading the proceedings of COLT and ALT (computational learning theory and algorithmic learning theory, respectively).
You will find the following three types of guarantees:
i) Strong assumptions on knowledge of the stochastic process generating the data, used to back guarantees on the algorithm.
ii) No assumptions on the data-generating process except that it is governed by some unknown independent stochastic process; prove guarantees on the output of the algorithm.
iii) Don't even assume that the data-generating process is stochastic, yet provide guarantees on the output.
What exactly is guaranteed will of course differ.
A little advice: smugness, as demonstrated with "...I figured it out 16 years ago, yo bitches", will not help your cause, whatever that might be. BTW, are you H. A. Dye? 'Cause Dye figured out group structures on measure-preserving transformations in 1959, a little bit earlier than 16 years ago... just saying.
I was commenting on the OP. I started reading it but found just 'algorithms' with no indication of anything like any guarantees or qualities for them.

Your i) sounds good. E.g., in stochastic dynamic programming with assumptions reasonably expected, e.g., essentially a Markov assumption, we can easily prove that the number crunching yields the decisions with the least possible expected cost from any 'measurable' means of manipulating the data. And once, when I argued this, I also included a proof that additional data independent of what we already have wouldn't help either. This result is, of course, 'obvious', although obvious is not a proof. The proof turned out to be just a light tap with Fubini's theorem.

So, good: some people actually argue that some of the algorithms actually have some desirable qualities. So, the world is not totally coming apart at the seams or being turned over to naive hackers who only know to watch the code run and then conclude qualitatively that it seems 'nice'.
On your ii), "some unknown independent stochastic process", that's not so clear. So, for some index set I and random variables X(i) for i in I, the set { X(i) | i in I } is independent? That's something of an unusual and, in practice, not very interesting stochastic process. Maybe some EEs would call that 'white noise'.
On your iii), sorry, but essentially everything is 'stochastic'. Even a 'deterministic' process is stochastic and, in practice, usually also Markov (past and future conditionally independent given the present).
Thanks for the reference to H. A. Dye. I went on to show how such a group leads easily to statistical hypothesis tests. That is, in particular, I was making sense out of 'resampling' and 'bootstrap' plans. In part I got the idea of considering 'measure preserving', of course, from ergodic theory, where measure-preserving transformations are central, but the transformations I used, and needed for the statistics, are the ones that ergodic theory throws out.
No, I'm not Dye. That's the first I've heard of Dye's paper. Thanks. At one point, long after publishing, I saw a remark in one of the volumes of Dunford and Schwartz, as I recall attributed to Norbert Wiener, that maybe something could be done with the 'periodic' case where, thus, there is a finite group.
Never heard of Vapnik. Thanks for the reference. From a simple Google search, found a review at The Bactra Review (occasional and eclectic book reviews by Cosma Shalizi) of "The Nature of Statistical Learning Theory" by V. N. Vapnik.
Okay, Vapnik is trying to do some mathematics to show some desirable mathematical properties of some of statistical learning theory, maybe in particular some of the algorithms. Good.
The review is a bit tough on Vapnik's writing from when he was working in Russia. That's sad: the famous Russian texts on probability and stochastic processes, usually from Springer, were exemplary, with beautifully done translations to English, e.g., E. Dynkin, R. Liptser, etc.
I can't find where I said "yo bitches".
But I do believe that a proposed algorithm should do something, that what it does should be clear, good, and demonstrated, hopefully with some theorems and proofs. Knuth's TAOCP was big on 'analysis' of algorithms, but from the OP and, say, Prof. Ng's course on machine learning, it appears that the analysis part is being omitted.
Broadly, it looks like (1) statistics, optimization, and stochastic processes are old fields and no longer very popular, at least in the US, where they might live in departments of mathematics which, however, in the US, mostly don't like those fields; (2) computing is a new field and has lots of grant money, interest, and students; (3) for an actual application, those old fields need computing to do the specified data manipulations; (4) computing wants to take on some of the problems the old fields worked on and, then, is at risk of putting poor versions of good old wine in new bottles with new labels. Is this a form of 'academic theft'? Well, if not, if the new wine is poor, then it's a form of academic dilution. Not good.
Well, it's an HN comment, not a scientific communication.
Scenario (i) is mostly considered uninteresting and "solved" because it assumes that the particular stochastic process generating the data is known (this was the pre-Vapnik model). In most practical cases, however, one does not know the corresponding measure from which the data was generated (and will be generated in the future). The old-school thinking was that one should estimate the generating distribution and then solve for the optimal decision based on the estimate. Vapnik showed that it is far better to do it directly, without bothering to estimate the distribution, which is in itself a non-trivial problem.
So (ii) is the more interesting case that machine learners care to handle, and Vapnik's was one of the first breakthrough fundamental results in this formalism. It applies to all algorithms that do something called structural risk minimization, not just one specific algorithm. His bound applies to all generating distributions, all samples, and all structural risk minimization algorithms! However, in this setup we still need to assume independence. Some form of independence, even a weak one, needs to be present. But it is quite remarkable that you can give guarantees on performance regardless of what actual distribution generated the data. I should also mention Leslie Valiant. Look up PAC learning and you will understand the basic premise of this style. One genuine criticism is that it is overly conservative (and it has to be, because it is giving guarantees that apply to the worst distribution, worst training sample, worst algorithm...).
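For flavor, the classic VC-style bound (stated from memory, so treat the constants as approximate): for a function class of VC dimension h and any generating distribution, with probability at least 1 - delta over an i.i.d. sample of size n, every f in the class satisfies

    R(f) \le \hat{R}_n(f) + \sqrt{ \frac{ h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\delta} }{ n } }

where R is the true risk and \hat{R}_n the empirical risk on the sample. Nothing on the right-hand side depends on which distribution generated the data, only on h, n, and delta.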
(iii) removes the need for independent draws. If you want to read up on these, search for "regret minimization." In this setup the guarantees hold for all sequences of data, even if adversarially generated by someone who knows the algorithm you will employ. Some geometric assumptions are necessary on the domain, and some function-analytic ones on the function that you are trying to minimize (typically convexity; non-differentiable functions are allowed).
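The quantity being bounded in that setting is the regret after T rounds, roughly (my paraphrase of the standard definition):

    \mathrm{Regret}_T = \sum_{t=1}^{T} \ell_t(x_t) - \min_{x \in K} \sum_{t=1}^{T} \ell_t(x)

i.e., the algorithm's cumulative loss minus the loss of the best single fixed decision in hindsight; the guarantees show this grows sublinearly in T (e.g., on the order of sqrt(T) for convex losses) no matter how the loss sequence was chosen.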
-- Shalizi's Review --
Shalizi writes good stuff, but this review of Vapnik's book is surprisingly mean-spirited. The field of course has progressed after Vapnik, and there are legitimate criticisms that one can raise, but the review exhibits bad form, I think. Prior to Vapnik there were very few results that one could state outside of the statistical physics community (which worked with strong assumptions, typically Gaussian noise, etc.).
-- Aside --
No, you did not say the words "yo bitches", but that was the attitude you demonstrated. Before casting judgments as you did, at least make an effort to familiarize yourself with the background. And that is exactly what it is: background. A machine learner will not even mention these things, because you are expected to know them if you are in this field.
-- Re: Analysis being omitted --
Think of it this way: one doesn't really need to know the precise chemistry of hydrocarbon combustion to drive a car; conversely, knowledge of such chemistry alone is useless unless it is teamed up with good engineering and, in this case, algorithms. There is a huge difference between showing that minimizing a particular function has nice properties and actually coming up with a piece of code to efficiently minimize such a function, not just in theory where constants can be conveniently swept under the rug, but also in practice. So don't be too quick to dismiss the "naive coders".
Most importantly, practice almost always precedes explanation and analysis in engineering. Usually practitioners find the algorithms in their hacky ways. The analyst/theorist smooths the kinks and explains why the method works so well after it has been demonstrated to work so well -- not always, but almost always.
Much of machine learning is done empirically. You can see how well an algorithm works on different problems, and a test set will estimate exactly how well. Mathematical guarantees are interesting, but really, who cares? I suppose the Bayesian machine learning camp is pretty into formal math. And really, methods are based on some intuition about how they will work - nothing is just random.
> Mathematical guarantees are interesting, but really who cares?
We should all care. The guarantees help us know what the heck we are doing and what the heck we've got when we are done. E.g., in statistics we like estimators that are unbiased and minimum variance -- that's 'goodness', good to know, and potentially valuable in practice. Estimators that are biased and with variance larger than necessary are not good. We like regression in part because it is 'least squares', which means it is a projection and, thus, in the mean-square sense, the most accurate approximation we can get. We like statistics with loss and risk functions because they tell us what to do at least cost. Etc.
Unbiasedness is oversold. It makes sense when you will also average the estimator; otherwise what matters is the expected error or expected loss (or probabilities thereof). Whether the error is in the bias part or the variance part, what do I care? Furthermore, not all losses decompose into bias and variance either.
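For concreteness, the standard textbook decomposition for squared error (my notation) is

    \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2 + \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]

i.e., mean squared error = bias^2 + variance, so for a given mean squared error it makes no difference how it is split between the two terms; and for losses other than squared error no such clean split need exist.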
I also think the fascination with least squares led us astray. It's a bad loss function for classification. Bad for neural networks that use the sigmoid function. Bad because it can create exponentially many local minima; on the other hand, with a sigmoid output, if you use KL divergence instead of square loss you will not have these problems.
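The usual way to see the sigmoid point (my sketch of the standard argument): for a single sigmoid unit p = sigma(z) with target y, the gradients of the two losses with respect to z are

    \frac{\partial}{\partial z}\,\tfrac{1}{2}(p - y)^2 = (p - y)\,\sigma'(z), \qquad \frac{\partial}{\partial z}\,\big[-y \ln p - (1 - y)\ln(1 - p)\big] = p - y

so square loss carries the extra factor sigma'(z), which vanishes when the unit saturates (flat plateaus), while the KL/cross-entropy gradient stays proportional to the error.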
Part of it is the old statistical vestige of analyzing everything using tools of asymptotic Normality.
But I agree with your general point (and upvoted your parent comment): one should care about what one is really getting after the algorithm has been applied, and phrase that in the language of mathematics. If the community does not do this, a) the snake-oil salesmen will come out of the woodwork, and b) machine learning will suffer the same fate as AI did in the AI winter; c) at the same time, one should not discourage open-ended practice and experimentation (that is how most new things are discovered; analysis comes later, mostly).
Moreover, your a), b) say some of what I think better than I said it. Otherwise you have a good critique of traditional statistics.
On traditional statistics, I voted with my research and did work that was distribution-free, also multidimensional, unusual for distribution-free. I did wonder if there could be another way, and what you wrote about Vapnik suggests that there can be.
This thread now is much deeper into the analysis of machine learning algorithms than I noticed in what I saw of Stanford professor Ng's course and several books and papers in PDF I downloaded and scanned. I would guess the level of Vapnik's work would be that of a graduate seminar and not a ugrad computer science course, and that the mathematical prerequisites would filter out nearly everyone with a ugrad CS degree and nearly all their professors.

Why? From some of my experience getting a paper of mine reviewed, 'measure preserving' totally blew away several editors in chief of famous CS journals and chaired professors of CS at famous research universities; one such person wrote me "Neither I nor anyone on my board of editors has the mathematical prerequisites to review your paper.", and a second such person wrote me a similar remark. The math prerequisites were not so advanced, say, just at the level of a graduate course in probability via, say, Loeve or Neveu, and fast and easy reading for, say, E. Cinlar, P. Billingsley, L. Breiman, J. Doob, A. Karr, E. Dynkin, etc. But the paper was 'CS' because it was on a CS problem. For AI, I cooked up the paper to improve on some of our work with 'expert systems' for the problem. I have to suspect that work in machine learning that needs some of Loeve or functional analysis from, say, Rudin will not be popular in CS departments.
This thread has become a good start on a grad seminar. Just now I can't follow up on the leads here; instead I have to finish off the code for two Web pages and get my new Web site live.
Machine learning is really one part probability theory, one part functional analysis, one part optimization, and another part algorithms. I think with your background you will enjoy it. The particular two papers by Vapnik and Chervonenkis are very readable and do not use anything more technical than Chernoff bounds; it's more about how they are used rather than what is used. CS students typically do not have probability in their curriculum, at least not measure theory, so you will see a lot of physicists turned machine learners in this field. You might have something interesting going on with your own research finding; if you have not published it, I would encourage you to do so. A possible venue is JMLR (the Journal of Machine Learning Research; its editors are not scared of a little measure theory).
BTW, a surprising fact is that Vapnik's bounds are dimensionality-independent, provided you are working with what are known as function classes of finite VC (Vapnik-Chervonenkis) dimension. Dimensionality is a big deal in machine learning because the data they are interested in is typically quite high dimensional.
> You might have something interesting going on with your own research finding, if you have not published it I would encourage you to do so.
'Information Sciences', 1999, "... Anomaly Detector". So, look for 'zero day' problems in server farms and networks. Get a multidimensional, distribution-free hypothesis test where we know the false alarm rate in advance and can adjust it exactly, in small steps, over a wide range.

What is novel is the 'applied probability' that permits the results on the false alarm rate.

The assumptions about the input data are realistic and, thus, do not permit doing as well as the Neyman-Pearson result, but, still, intuitively, the detection rate should be relatively high. There is a weaker sense, not in the paper, in which the detection rate is as high as possible. I.e., essentially, for the selected false alarm rate, the 'critical region' where the null hypothesis (a healthy system) is accepted has the least possible Lebesgue measure.

It appears that the input data can have dimensions 10, 100, 1000, etc. I have an algorithm that makes the computations fast, but it's not in the paper. The computing needed was a bit much when I published the paper, but that computing is quite reasonable now. For a system where we really care a lot about reliability, performance, security, etc., it could be worthwhile to deploy what I cooked up.

My current project is quite different but does need a server farm, and I might deploy my detector there. Then, with that demo, if people like my detector, I might let people buy it.
Perhaps, but IMHO the obsession with "provably optimal" is bad and will never lead to anything useful. The real world is messy and complicated. Search spaces are far too large to ever completely explore. We need heuristics to do things in a reasonable amount of time. Heuristics that exploit the structure of the problem we are trying to solve. And those are going to be difficult if not impossible to formally verify.
Okay: Googled K-fold. It's a case of resampling. To justify it, use a finite group of measure-preserving transformations, which I figured out and published about 16 years ago.
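For what it's worth, the mechanics of K-fold as usually practiced are just this (a scikit-learn sketch; the data and classifier are arbitrary stand-ins):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold
    from sklearn.neighbors import KNeighborsClassifier

    # Arbitrary stand-in data and classifier, just to show the mechanics.
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    # 5-fold: split the data into 5 parts; train on 4, test on the held-out 1,
    # rotate through all 5 parts, and report the per-fold accuracies.
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        clf = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))

    print(accuracies, np.mean(accuracies))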
But we still have not made clear what the heck the assumptions are.
Confusion matrix: Wikipedia has nothing on it, but Matlab's documentation does.

But there's a problem: suppose we have some data on cats and dogs and want to have a 'classifier'. So, we do what Matlab said: take half of the training data and build a classifier. Then we test the classifier with the other half of the data.
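For reference, the usual split-train-test-then-tabulate mechanics look like this (a scikit-learn sketch with made-up stand-ins for the cats-and-dogs data; rows of the matrix are true classes, columns are predicted classes):

    from sklearn.datasets import make_classification
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Made-up two-class data standing in for the cats-and-dogs example.
    X, y = make_classification(n_samples=200, n_features=8, random_state=0)

    # Train on half the data, test on the other half.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

    # Off-diagonal entries are the misclassifications on the held-out half.
    print(confusion_matrix(y_te, clf.predict(X_te)))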
But suppose the classifier finds three classifications instead of just the two, dogs and cats, we already know exist. So, if we apply such methods where we don't know in advance the number of classifications, we might get three classes and be quite wrong and not know it.
Again, the methodology looks like just pushing some data into some software with too little in assumptions up front and too little about what we are getting out on the back side.
Now you are talking about supervised vs. unsupervised learning, and it seems that you are talking about clustering, not classification. Classification problems are problems where you have a known set of classes and the classifier tries to estimate, with the given parameters, which group(s) the given sample belongs to. It doesn't 'invent' new classes.
Okay, then, a critique of the OP is that in classification the number of classes is an input to the algorithm, not an output. My view is that being clear about such details is crucial.
These algorithms are all good, but they seem to be missing recurrent ANNs, that is, the form of data-mining tool our brains are using.
I can add this paper as relevant for those who are interested:
http://orre.neurologic.org/pubdoc/abstract/complex2003-abstr...
[1] - http://dspace.mit.edu/bitstream/handle/1721.1/85399/87030495...