Top Algorithms in Data Mining (2008) [pdf] (uvm.edu)
177 points by Anon84 on March 23, 2014 | 43 comments



Wow, this paper is great. Back in November, I implemented the "Trend or No Trend: A Novel Nonparametric Method for Classifying Time Series" [1] paper in node.js and then hooked it up to Twitter's API. The algorithm was able to find trending topics pretty quickly and accurately. Does anyone know if the source for these algorithms is implemented anywhere?

[1] - http://dspace.mit.edu/bitstream/handle/1721.1/85399/87030495...


Is this that famous algorithm that's able to predict future Twitter trending topics pretty accurately? Is your implementation open source?


These algorithms are implemented in just about every major language, as they are standard tools. I'd say your best bet for leveraging them is Python's scikit-learn (super good documentation).
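
For instance, a few of the paper's top 10 (k-NN, naive Bayes, SVM, a CART-style tree, AdaBoost, k-means) are each a handful of lines there. A rough sketch with scikit-learn defaults and a toy dataset, just to show the shape of the API:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.cluster import KMeans

    X, y = load_iris(return_X_y=True)
    classifiers = [("k-NN", KNeighborsClassifier()),
                   ("naive Bayes", GaussianNB()),
                   ("SVM", SVC()),
                   ("CART-style tree", DecisionTreeClassifier()),
                   ("AdaBoost", AdaBoostClassifier())]
    for name, clf in classifiers:
        # 5-fold cross-validated accuracy for each classifier
        print(name, cross_val_score(clf, X, y, cv=5).mean())

    # k-means is unsupervised: ask for 3 clusters and inspect the labels
    print(KMeans(n_clusters=3, n_init=10).fit_predict(X)[:10])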


I'm surprised that there aren't any PCA/feature-extraction/dimensionality-reduction algos on this list. And there's also a really strong emphasis on clustering and classification.

Is this an artifact of how the list was put together or is clustering really that much more common than feature extraction? Is PCA thought to be too trivial to qualify?


PageRank is a similar eigenvector-based algorithm, and I think PCA can be seen as a relaxation of k-means, so they sort of did.

But yes, if I wrote the list, eigenvector-based methods would be more prominent.

Also, if I wrote it, the tree-based methods would have been replaced with random forests, which are similar to CART but add the crucial ideas of sample bagging and random feature subsets, ideas which have proven themselves over and over.
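
Something like this (a rough scikit-learn sketch, my own toy comparison rather than anything from the paper) shows those two ingredients, bagging plus random feature subsets, side by side with a single CART-style tree:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)

    single_tree = DecisionTreeClassifier(random_state=0)
    forest = RandomForestClassifier(
        n_estimators=200,     # many trees...
        bootstrap=True,       # ...each grown on a bootstrap sample (bagging)
        max_features="sqrt",  # ...with a random feature subset tried at each split
        random_state=0,
    )

    print("single tree  :", cross_val_score(single_tree, X, y, cv=5).mean())
    print("random forest:", cross_val_score(forest, X, y, cv=5).mean())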

I think it may just be out of date.


Any suggestions for something more recent?


The machine learning subreddit recently started an FAQ:

http://www.reddit.com/r/MachineLearning/comments/20i0vi/meta...

They also have a good selection of articles though they tend to be somewhat deep learning/neural network focused. These algorithms have seen a big surge in interest and success of late but may not be as applicable to "data mining" (as opposed to say image or speech recognition).


I think PCA is better understood in the theorem-proof style of mathematics rather than algorithmically. For example, PCA can be understood as an optimization problem, whereas CART and k-NN and the like are best understood algorithmically. I know k-NN was introduced as an algorithm and only later figured out in terms of probability, etc.


Algorithmically, it's just finding an eigensystem...
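
In NumPy that's pretty much the whole algorithm; a bare-bones sketch (center, covariance, eigendecomposition, projection), leaving aside the SVD route you'd use in practice for numerical stability:

    import numpy as np

    def pca(X, n_components):
        Xc = X - X.mean(axis=0)                 # center the data
        cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)  # eigensystem of a symmetric matrix
        order = np.argsort(eigvals)[::-1]       # largest-variance directions first
        W = eigvecs[:, order[:n_components]]
        return Xc @ W                           # project onto the principal components

    X = np.random.default_rng(0).normal(size=(200, 5))
    print(pca(X, 2).shape)  # (200, 2)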


Yes, but what's interesting about PCA is mathematical, not algorithmic.


This article was eventually expanded into a full-fledged book with more in-depth descriptions of each algorithm:

http://www.crcpress.com/product/isbn/9781420089646


Is it just me, or was anyone else taken aback by an academic article using a linkbait-esque title? I'm sure the title is perfectly reasonable, given that it's a survey paper, but the past few years of being bombarded with articles like "10 weird tricks to optimize your everything" immediately trips my skepticism meter. Funny how the internet warps the brain.


Unlike most Top 10 lists that are deliberately designed with unexpected choices for shock value, the paper cites exactly how the 10 were chosen and attempts to justify why each choice was made.

If the paper was titled "Wow, The Top 10 Algorithms in Data Mining Are Incredible, And I Can't Believe Where Naive Bayes Placed," then yeah, that might be link-baity.


Statistics professors HATE him! Doctor's discovery revealed the secret to learning any problem with just 10 training samples. Find a separating hyperplane with this One Weird Trick.

http://www.oneweirdkerneltrick.com/


Maybe I'm being selective about my memory, but I don't remember the awful linkbait titles being nearly so popular in 2007 when this was written. I don't think there's even the hint of imitation of Buzzfeed-style titling here. Also, I've taken a class from one of the authors, and I'd be surprised if he'd ever even seen those kinds of pages. :)


Q: What's the difference between working on data mining and designing missiles?

A: When you build missiles, you can at least take comfort knowing that they'll only be used to harm people who are far away.


What are you talking about?


He's on about the problem of data mining being used by the NSA and corporate America to do Bad Things.

And what are you on about?

I'm guessing you're on about the "tool is only as good or evil as the wielder."

Trying to take this back out of the meta, our society needs to figure out how to stop hurting each other. I think a good dose of critical thinking skills for our youngsters would go a good distance toward helping that.


I guess that's sort of like saying computers are evil because they are used by the NSA. Is there even any evidence that they do use it for anything significant? Let alone that this guy's work is contributing to it.


Computers have many applications, while data mining is generally used with working sets consisting of information about large groups of people - either to get a strategic leg up on them (say, price discrimination or advertising susceptibility), or to single out deviant individuals.

I don't think the entire field is strictly a universal evil per se, but with the current climate of centralization I think any good that it can bring is massively dwarfed by the bad.

Similarly, missiles themselves aren't a universal bad as they can be used for self defense, but in the current world basically their entire purpose is to preserve empires.


I don't think that's the case. In theory it can be used to "find deviants", but has that actually ever been done and to what degree? Improving advertisements isn't horrible.

It has a lot of applications in science, could vastly improve medicine, it's used to predict energy usage for power plants, filter spam, recommend movies or products, improve harvests for farming, etc. But I guess those things don't sell papers.

And lastly calling a technology "bad" isn't actually helpful. We can't stop bad people from using the technology or "uninvent" it. You'll just damage the people using it for good. See the Halo Effect where people told about the risks of nuclear power perceive the benefits as less (http://lesswrong.com/lw/lj/the_halo_effect/).


By "finding deviants", I mean exactly what the NSA is doing, and what every Keystone police department will be doing in ten years. And yes, improving advertisements is horrible - this is precisely one of the anti-human facets I am talking about.

> It has a lot of applications in science, could vastly improve medicine, it's used to predict energy usage for power plants, filter spam, recommend movies or products, improve harvests for farming, etc

You're right. My initial comment was way too harsh and broad for indicting the whole field rather than just the in-our-face web scene applications.

> lastly calling a technology "bad" isn't actually helpful. We can't stop bad people from using the technology

Of course you can't uninvent something, but you can slow down or speed up its progress. Path dependence is a thing. Currently, most people do not feel empowered with respect to computers. The more we put them at the mercy of seemingly unpredictable computer systems that are actually trying to outwit them, the harder it is going to be for them to get over those feelings of disempowerment so that we can get to a place where people expect their computers to work for them.


I've seen no evidence that the NSA does that, and really it makes no sense. Doing a search for "abnormal" behavior would mostly get you people that visit odd websites, or are awake at atypical hours, or spend too much time online, etc. Nothing remotely useful. There is a lot of hype around data-mining but there are diminishing returns on what can be done with it.

I agree that ads are bad, but they aren't that bad. Targeted advertising is mostly just finding things that are relevant to the website or viewer. And it's so incredibly easy to block ads on the internet that I don't really have much sympathy for those who don't and then complain about them. Ads were way worse in the age of television, where 25% of all content was commercials.

>Of course you can't uninvent something, but you can slow down or speed up it's progress. Path dependence is a thing.

Perhaps, but for the most part the technology already exists. At its core it's just basic statistics or sometimes even simpler techniques. Diminishing returns were hit long ago - I can't think of much that would improve the "bad" applications more than the "good".

>The more we put them at the mercy of seemingly unpredictable computer systems that are actually trying to outwit them, the harder it is going to be for them to get over those feelings of disempowerment so that we can get to a place where people expect their computers to work for them.

I.. what? On the rare occasions where machine learning is used in apps, it's to improve user experience, or do something that would otherwise be too complex or inefficient to code by hand. I'm sure someone, somewhere, abused the technology in their app - just like every programming technology ever. But in general I don't see how this is bad or even that it makes computers more "unpredictable".


Kevin Murphy's treatment of kernel methods in "Machine Learning: A Probabilistic Perspective" offers a take on SVMs that I don't often find in most discussions. He makes the case that SVMs are generally rather unusual from a Bayesian point of view. Apropos of other sparse vector machines, Murphy opines:

"Note that SVMs are very unnatural from a probabilistic point of view. First, they encode sparsity in the loss function rather than the prior. Second, they encode kernels by using an algorithmic trick, rather then being an explicit part of the model. Finally, SVMs do not result in probabilistic outputs, which causes various difficulties, especially in the multi-class classification setting.

"It is possible to obtain sparse, probabilistic, multi-class kernel-based classifiers, which work as well or better than SVMs, using techniques such as L1VM or RVM... However, we include a discussion of SVMs, despite their non-probabilistic nature, for two main reasons. First, they are very popular and widely used, so all students of machine learning should know about them. Second, they have some computational advantages over probabilistic methods in the structured output case... [specifically, SSVMs]

...

"The conclusion is as follows: if speed matters, use an RVM, but if well-calibrated probabilistic output matters (e.g., for active learning or control problems), use a GP. The only circumstances under which using an SVM seems sensible is the structured output case, where likelihood-based methods can be slow. (We attribute the enormous popularity of SVMs not to their superiority, but to ignorance of the alternatives, and also to the lack of high quality software implementing the alternatives.)"

----

the tl;dr is that SVMs are an important tool for their historical impact and popularity, but the modern data scientist has a range of cookie-cutter algorithms to choose from that could possibly be much more useful.
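
As a concrete illustration of the probabilistic-output point (my own toy sketch in scikit-learn, not Murphy's code; scikit-learn has no RVM, so a Gaussian process classifier stands in for the probabilistic alternative): an SVM only yields class probabilities via a bolted-on Platt-scaling step, while the GP classifier produces them by construction.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.gaussian_process import GaussianProcessClassifier

    X, y = make_classification(n_samples=300, n_features=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # probability=True fits a post-hoc sigmoid (Platt scaling) on the SVM scores
    svm = SVC(probability=True).fit(X_tr, y_tr)
    # the GP classifier gives (approximate) posterior class probabilities natively
    gpc = GaussianProcessClassifier().fit(X_tr, y_tr)

    print("SVM (Platt-scaled):", svm.predict_proba(X_te[:3]))
    print("Gaussian process  :", gpc.predict_proba(X_te[:3]))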


Okay, given a box of data, shove it into one of these algorithms, let the computer run, and now what? What do we have?

E.g., for some 'classifier', what the heck do we have? That is, what guarantees do we have? That is, is there something specific and guaranteed that the algorithm has actually done?

I know; I know; in some vague, unspecific sense that seems to work in some practical cases, the algorithm will 'classify'. Still, we don't have a precise description of (1) in what cases the algorithm works and (2) how well.

E.g., what is the rate (probability) of false classifications? Sure, we'd need some hypotheses to say. Okay, what the heck are the algorithms assuming for hypotheses that would let there even be a rate of false classifications? Then, with those hypotheses, what are the probabilities of a false classification?

Without such considerations, about all we have is some code that runs.

Uh, where's the beef?


The body of literature you are looking for is called statistical learning theory. If you are interested, you can start with Vapnik's two books, but that is what I said, a start. The theory of learning has shifted quite a bit, but the fundamentals remain the same. Once you are up to speed with Vapnik, you can start reading the proceedings of COLT and ALT (Computational Learning Theory and Algorithmic Learning Theory, respectively).

You will find the following three types of guarantees:

i) Strong assumptions about the stochastic process generating the data, used to back guarantees on the algorithm.

ii) No assumptions on the data-generating process except that it is governed by some unknown independent stochastic process; prove guarantees on the output of the algorithm.

iii) Don't even assume that the data-generation process is stochastic, yet provide guarantees on the output.

What exactly is guaranteed will of course differ.

A little advice: smugness, as demonstrated with "...I figured it out 16 years ago, yo bitches," will not help your cause, whatever that might be. BTW, are you H. A. Dye? 'Cause Dye figured out group structures on measure-preserving transformations in 1959, a little bit earlier than 16 years ago... just saying.


I was commenting on the OP. I started reading it but found just 'algorithms' with no indication of anything like any guarantees or qualities for them.

Your i) sounds good. E.g., in stochastic dynamic programming with assumptions reasonably expected, e.g., essentially a Markov assumption, we can easily prove that the number crunching yields the decisions with the least possible expected cost from any 'measurable' means of manipulating the data. And once when I argued this, I also included a proof that additional data independent of what we already have wouldn't help either. This result is, of course, 'obvious', although obvious is not a proof. The proof turned out to be just a light tap with Fubini's theorem.

So, good: Some people actually argue that some of the algorithms actually have some desirable qualities. So, the world is not totally coming apart at the seams or being turned over to naive hackers who only know to watch the code run and then conclude qualitatively that it seems 'nice'.

On your ii), "some unknown independent stochastic process", that's not so clear. So, for some index set I and random variables X(i) for i in I, the set

{ X(i) | i in I }

is independent? That's something of an unusual and, in practice, not very interesting stochastic process. Maybe some EEs would call that 'white noise'.

On your iii), sorry, but essentially everything is 'stochastic'. Even a 'deterministic' process is stochastic, in practice usually is also Markov (past and future conditionally independent given the present).

Thanks for the reference to H. A. Dye. I went on to show how such a group leads easily to statistical hypothesis tests. That is, in particular I was making sense out of 'resampling' and 'bootstrap' plans.

In part I got the idea of considering 'measure preserving', of course, from ergodic theory where measure preserving transformations are central, but the transformations I used, and needed for the statistics, are the ones that ergodic theory throws out.

No, I'm not Dye. That's the first I've heard of Dye's paper. Thanks. At one point, long after publishing, I saw a remark in one of the volumes of Dunford and Schwartz, as I recall attributed to Norbert Wiener, that maybe something could be done with the 'periodic' case where, thus, there is a finite group.

Never heard of Vapnik. Thanks for the reference. From a simple Google, found at

http://vserver1.cscs.lsa.umich.edu/~crshalizi/reviews/vapnik...

a review

The Bactra Review (occasional and eclectic book reviews by Cosma Shalizi): "The Nature of Statistical Learning Theory" by V. N. Vapnik

Okay, Vapnik is trying to do some mathematics to show some desirable mathematical properties of some of statistical learning theory, maybe in particular some of the algorithms. Good.

The review is a bit tough on Vapnik's writings from when he was writing in Russia. That's sad: The famous Russian texts on probability and stochastic processes, usually from Springer, were exemplary and with beautifully done translations to English, e.g., E. Dynkin, R. Lipster, etc.

I can't find where I said "yo bitches".

But I do believe that a proposed algorithm should do something, that what it does should be clear, good, and demonstrated, hopefully with some theorems and proofs. Knuth's TAOCP was big on 'analysis' of algorithms, but from the OP and, say, Prof. Ng's course on machine learning, it appears that the analysis part is being omitted.

Broadly, it looks like (1) statistics, optimization, stochastic processes are old fields and no longer very popular, at least in the US where they might be in departments of mathematics which, however, in the US, mostly don't like those fields; (2) computing is a new field and has lots of grant money, interest, and students; (3) for an actual application, those old fields need computing to do the specified data manipulations; (4) computing wants to take on some of the problems the old fields worked on and, then, is at risk of putting poor versions of good old wine in new bottles with new labels. This is a form of 'academic theft'? Well, if not, if the new wine is poor, then it's a form of academic dilution. Not good.


Well, it's an HN comment, not a scientific communication.

Scenario (i) is mostly considered uninteresting and "solved" because it assumes that the particular stochastic process generating the data is known (this was the pre-Vapnik model). In most practical cases, however, one does not know the corresponding measure from which the data was generated (and will be generated in the future). The old-school thinking was that one should estimate the generating distribution and then solve for the optimal decision based on the estimate. Vapnik showed that it is far better to do it directly, without bothering to estimate the distribution, which is in itself a non-trivial problem.

So (ii) is the more interesting case that machine learners care to handle, and Vapnik's was one of the first breakthrough fundamental results in this formalism. It applies to all algorithms that do something called structural risk minimization, not just one specific algorithm. His bound applies to all generating distributions, all samples, and all structural risk minimization algorithms! However, in this setup we still need to assume independence; some form of independence, even a weak one, needs to be present. But it is quite remarkable that you can give guarantees on performance regardless of what actual distribution generated the data. I should also mention Leslie Valiant: look up PAC learning and you will understand the basic premise of this style. One genuine criticism is that it is overly conservative (and it has to be, because it is giving guarantees that apply for the worst distribution, worst training sample, worst algorithm...).

(iii) Removes the need for independent draws. If you want to read up on these, search for "regret minimization." In this setup the guarantees hold for all sequences of data, even if adversarially generated by someone who knows the algorithm you will employ. Some geometric assumptions are necessary on the domain, and some functional-analytic ones on the function that you are trying to minimize (typically convexity; non-differentiable functions are allowed).

   -- Shalizi's Review --
Shalizi writes good stuff, but this review of Vapnik's book is surprisingly mean-spirited. The field has of course progressed after Vapnik and there are legitimate criticisms that one can raise, but the review exhibits bad form, I think. Prior to Vapnik there were very few results that one could state outside of the statistical physics community (which worked with strong assumptions, typically Gaussian noise, etc.).

    -- Aside --
No, you did not say the words "yo bitches", but that was the attitude you demonstrated. Before casting judgments as you did, at least make an effort to familiarize yourself with the background. And that is exactly what it is: background. A machine learner will not even mention these things, because you are expected to know them if you are in this field.

    -- Re: Analysis being omitted  --
Think of it this way: one doesn't really need to know the precise chemistry of hydrocarbon combustion to drive a car; conversely, knowledge of such chemistry alone is useless unless it is teamed up with good engineering, in this case an algorithm. There is a huge difference between showing that minimizing a particular function has nice properties and actually coming up with a piece of code to efficiently minimize such a function, not just in theory, where constants can be conveniently swept under the rug, but also in practice. So don't be too quick to dismiss the "naive coders".

Most importantly, practice almost always precedes explanation and analysis in engineering. Usually practitioners find the algorithms in their hacky ways. The analyst/theorist smooths out the kinks and explains why the method works so well after it has been demonstrated to work so well -- not always, but almost always.


Much of machine learning is done empirically. You can see how well an algorithm works on different problems, and a test set will estimate exactly how well. Mathematical guarantees are interesting, but really, who cares? I suppose the Bayesian machine learning camp is pretty into formal math. And really, methods are based on some intuition about how they will work - nothing is just random.


> Much of machine learning is done empirically.

Yes, it does seem so.

> Mathematical guarantees are interesting, but really who cares?

We should all care. The guarantees help us know what the heck we are doing and what the heck we've got when we are done. E.g., in statistics we like estimators that are unbiased and minimum variance -- that's 'goodness' and good to know and potentially valuable in practice. Estimators that are biased and with variance larger than necessary are not good. We like regression in part because it is 'least squares', which means it is a projection and, thus, in the mean-square sense, the most accurate approximation we can get. We like statistics with loss and risk functions because they tell us what to do at least cost. Etc.


Unbiasedness is oversold. It makes sense when you also average the estimator; otherwise what matters is the expected error or expected loss (or probabilities thereof). Let the error be in the bias part or the variance part, what do I care. Furthermore, not all losses decompose into bias and variance either.

I also think the fascination with least squares led us astray. It's a bad loss function for classification. Bad for neural networks that use the sigmoid function. Bad because it can create exponentially many local minima; on the other hand, with the sigmoid function, if you use KL divergence instead of squared loss you will not have these problems.
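
A toy numerical illustration of that sigmoid point (my own numbers, just to make the gradients concrete): with squared loss, the gradient with respect to the pre-activation carries a sigma'(z) factor that vanishes when the unit saturates, while the cross-entropy (KL) gradient is simply p - y.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    y = 1.0      # true label
    z = -8.0     # pre-activation: confidently wrong, sigmoid is saturated
    p = sigmoid(z)

    grad_squared = 2 * (p - y) * p * (1 - p)  # d/dz of (p - y)^2
    grad_xent = p - y                         # d/dz of -[y log p + (1-y) log(1-p)]

    print(grad_squared)  # about -7e-4: the sigmoid factor nearly kills the signal
    print(grad_xent)     # about -1.0: a strong corrective gradient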

Part of it is the old statistical vestige of analyzing everything using tools of asymptotic Normality.

But I agree with your general point (and upvoted your parent comment): one should care about what one is really getting after the algorithm has been applied, and phrase that in the language of mathematics. If the community does not do this, a) the snake-oil salesmen will come out of the woodwork, and b) machine learning will suffer the same fate as AI did in the AI winter; c) at the same time, one should not discourage open-ended practice and experimentation (that is how most new things are discovered; analysis comes later, mostly).


On your last paragraph, we agree.

Moreover, your a), b) say some of what I think better than I said it.

Otherwise you have a good critique of traditional statistics.

On traditional statistics, I voted with my research and did work that was distribution-free, also multidimensional, unusual for distribution-free. I did wonder if there could be another way, and what you wrote about Vapnik suggests that there can be.

This thread now is much deeper into the analysis of machine learning algorithms than I noticed in what I saw of Stanford professor Ng's course and several books and papers in PDF I downloaded and scanned. I would guess the level of Vapnik's work would be that of a graduate seminar and not a ugrad computer science course, and that the mathematical prerequisites would filter out nearly everyone with a ugrad CS degree and nearly all their professors. Why? From some of my experience getting a paper of mine reviewed, measure-preserving transformations totally blew away several editors in chief of famous CS journals and chaired professors of CS at famous research universities; one such person wrote me "Neither I nor anyone on my board of editors has the mathematical prerequisites to review your paper," and a second such person wrote me a similar remark. The math prerequisites were not so advanced, say, just at the level of a graduate course in probability via, say, Loeve or Neveu, and fast and easy reading for, say, E. Cinlar, P. Billingsley, L. Breiman, J. Doob, A. Karr, E. Dynkin, etc. But the paper was 'CS' because it was on a CS problem. For AI, I cooked up the paper to improve on some of our work with 'expert systems' for the problem. I have to suspect that work in machine learning that needs some of Loeve or functional analysis from, say, Rudin will not be popular in CS departments.

This thread has become a good start on a grad seminar.

Just now I can't follow up on the leads here; instead I have to finish off the code for two Web pages and get my new Web site live.


Machine learning is really one part probability theory, one part functional analysis, one part optimization, and another part algorithms. I think with your background you will enjoy it. The particular two papers by Vapnik and Chervonenkis are very readable and do not use anything more technical than Chernoff bounds; it's more about how they are used rather than what is used. CS students typically do not have probability in their curriculum, at least no measure theory, so you will see a lot of physicists turned machine learners in this field. You might have something interesting going on with your own research finding; if you have not published it, I would encourage you to do so. A possible venue is JMLR (Journal of Machine Learning Research; their editors are not scared of a little measure theory).

BTW, a surprising fact is that Vapnik's bounds are dimensionality-independent, provided you are working with what are known as function classes with finite VC (Vapnik-Chervonenkis) dimension. Dimensionality is a big deal in machine learning because the data they are interested in is typically quite high dimensional.


> You might have something interesting going on with your own research finding, if you have not published it I would encourage you to do so.

'Information Sciences', 1999, "... Anomaly Detector". So, look for 'zero day' problems in server farms and networks. Get a multidimensional, distribution-free hypothesis test where you know the false alarm rate in advance and can adjust it exactly in small steps over a wide range.

What is novel is the 'applied probability' that permits the results with false alarm rate.

The assumptions about the input data are realistic and, thus, do not permit doing as well as the Neyman-Pearson result, but, still, intuitively, the detection rate should be relatively high. There is a weaker sense, not in the paper, in which the detection rate is as high as possible. I.e., essentially, for the selected false alarm rate, the 'critical region' where the null hypothesis (a healthy system) is accepted has the least possible Lebesgue measure.

It appears that the input data can have dimensions 10, 100, 1000, etc. I have an algorithm that makes the computations fast, but it's not in the paper.

The computing needed was a bit much when I published the paper, but that computing is quite reasonable now. For a system where one really cares a lot about reliability, performance, security, etc., it could be worthwhile to deploy what I cooked up.

My current project is quite different but does need a server farm, and I might deploy my detector there. Then with that demo, if people like my detector, I might let people buy it.


Perhaps, but IMHO the obsession with "provably optimal" is bad and will never lead to anything useful. The real world is messy and complicated. Search spaces are far too large to ever completely explore. We need heuristics to do things in a reasonable amount of time. Heuristics that exploit the structure of the problem we are trying to solve. And those are going to be difficult if not impossible to formally verify.


You could start with Google and these terms (a quick sketch follows the list):

- K-fold

- Confusion matrix
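
For example, a minimal scikit-learn sketch (toy data, default settings) tying the two together: k-fold cross-validation to get held-out predictions, and a confusion matrix to see what gets mistaken for what:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    # 5-fold CV: every sample is predicted by a model that never saw it in training
    y_pred = cross_val_predict(KNeighborsClassifier(), X, y, cv=5)
    # rows are the true classes, columns the predicted classes
    print(confusion_matrix(y, y_pred))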


Okay: Googled K-fold. It's a case of resampling. To justify it, use a finite group of measure preserving transformations, which I figured out and published about 16 years ago.

But we still have not made clear what the heck the assumptions are.

Confusion matrix: Wikipedia has nothing on it, but Matlab's documentation does.

But there's a problem: Suppose we have some data on cats and dogs and want to have a 'classifier'. So, we do what Matlab said: Take half of the training data and build a classifier. Then we test the classifier with the other half of the data.

But suppose the classifier finds three classifications instead of just the two, dogs and cats, that we already know exist. So, if we apply such methods where we don't know the number of classes in advance, we might get three classes and be quite wrong and not know it.

Again, the methodology looks like just pushing some data into some software with too little in assumptions up front and too little about what we are getting out on the back side.


Now you are talking about supervised vs. unsupervised learning, and it seems that you are talking about clustering, not classification. Classification problems are problems where you have a known number of classes and the classifier tries to estimate, with given parameters, what group(s) the given sample belongs to. It doesn't 'invent' new classes.


Okay, then, a critique of the OP is that in classification the number of classes is an input to the algorithm, not an output. My view is that being clear about such details is crucial.


These algorithms are all good, but they seem to be missing recurrent ANNs, that is, the form of data-mining tool our brains use. I can add this paper as relevant for those who are interested: http://orre.neurologic.org/pubdoc/abstract/complex2003-abstr...



It was published in 2007, not 2008 ;)


i understand some of these words





