
Okay, given a box of data, shove it into one of these algorithms, let the computer run, and now what? What do we have?

E.g., for some 'classifier', what the heck do we have? That is, what guarantees do we have? That is, is there something specific and guaranteed that the algorithm has actually done?

I know; I know; in some vague, unspecific sense that seems to work in some practical cases, the algorithm will 'classify'. Still, we don't have a precise description of (1) in what cases the algorithm works and (2) how well.

E.g., what is the rate (probability) of false classifications? Sure, we'd need some hypotheses to say. Okay, what the heck are the algorithms assuming for hypotheses that would let there even be a rate of false classifications? Then, with those hypotheses, what is the probability of a false classification?

Without such considerations, about all we have is some code that runs.

Uh, where's the beef?




The body of literature you are looking for is called statistical learning theory. If you are interested, you can start with Vapnik's two books, but that is just what I said, a start. The theory of learning has shifted quite a bit, but the fundamentals remain the same. Once you are up to speed with Vapnik you can start reading the proceedings of COLT and ALT (Computational Learning Theory and Algorithmic Learning Theory, respectively).

You will find the following three types of guarantees:

i) Strong assumptions about knowledge of the stochastic process generating the data, used to back guarantees on the algorithm.

ii) No assumptions on the data-generating process except that it is governed by some unknown independent stochastic process; guarantees are proved on the output of the algorithm.

iii) Don't even assume that the data-generating process is stochastic, yet provide guarantees on the output.

What exactly is guaranteed will of course differ.

A little advice: smugness, as demonstrated with "... I figured it out 16 years ago, yo bitches," will not help your cause, whatever that might be. BTW, are you H. A. Dye? 'Cause Dye figured out group structures on measure-preserving transformations in 1959, a little bit earlier than 16 years ago... just saying.


I was commenting on the OP. I started reading it but found just 'algorithms' with no indication of anything like any guarantees or qualities for them.

Your i) sounds good. E.g., in stochastic dynamic programming with assumptions reasonably expected, e.g., essentially a Markov assumption, we can easily prove that the number crunching yields the decisions with the least possible expected cost achievable by any 'measurable' means of manipulating the data. And once, when I argued this, I also included a proof that additional data independent of what we already have wouldn't help either. This result is, of course, 'obvious', although obvious is not a proof. The proof turned out to be just a light tap with Fubini's theorem.
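To make that concrete, here is a minimal sketch in Python of finite-horizon backward induction under a Markov assumption; the transition probabilities, costs, and horizon are made-up illustrative numbers, not from any real application:

    import numpy as np

    n_states, n_actions, horizon = 3, 2, 5

    # P[a, s, s'] = probability of moving from state s to s' under action a
    P = np.array([
        [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]],   # action 0
        [[0.5, 0.4, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],   # action 1
    ])
    # c[s, a] = immediate cost of taking action a in state s
    c = np.array([[1.0, 2.0], [0.5, 0.3], [2.0, 1.0]])

    # Backward induction: V[t, s] = least possible expected cost-to-go from s at time t
    V = np.zeros((horizon + 1, n_states))
    policy = np.zeros((horizon, n_states), dtype=int)
    for t in range(horizon - 1, -1, -1):
        Q = c + (P @ V[t + 1]).T      # Q[s, a] = c[s, a] + E[ V[t+1, next state] | s, a ]
        policy[t] = Q.argmin(axis=1)  # the action with least expected cost-to-go
        V[t] = Q.min(axis=1)

    print("least expected total cost from each starting state:", V[0])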

So, good: some people actually argue that some of the algorithms actually have some desirable qualities. So, the world is not totally coming apart at the seams or being turned over to naive hackers who only know to watch the code run and then conclude qualitatively that it seems 'nice'.

On your ii), "some unknown independent stochastic process", that's not so clear. So, for some index set I and random variables X(i) for i in I, the set

{ X(i) | i in I }

is independent? That's something of an unusual and, in practice, not very interesting stochastic process. Maybe some EEs would call that 'white noise'.

On your iii), sorry, but essentially everything is 'stochastic'. Even a 'deterministic' process is stochastic and, in practice, usually also Markov (past and future conditionally independent given the present).

Thanks for the reference to H. A. Dye. I went on to show how such a group leads easily to statistical hypothesis tests. That is, in particular I was making sense out of 'resampling' and 'bootstrap' plans.

In part I got the idea of considering 'measure preserving', of course, from ergodic theory where measure preserving transformations are central, but the transformations I used, and needed for the statistics, are the ones that ergodic theory throws out.

No, I'm not Dye. That's the first I heard of Dye's paper. Thanks. At one point long after publishing I saw a remark in one of the volumes of Dunford and Schwartz, as I recall attributed to Norbert Wiener, that maybe something could be done with the 'periodic' case where, thus, there is a finite group.

Never heard of Vapnik. Thanks for the reference. From a simple Google, found at

http://vserver1.cscs.lsa.umich.edu/~crshalizi/reviews/vapnik...

a review

The Bactra Review (occasional and eclectic book reviews by Cosma Shalizi): 'The Nature of Statistical Learning Theory' by V. N. Vapnik.

Okay, Vapnik is trying to do some mathematics to show some desirable mathematical properties of some of statistical learning theory, maybe in particular some of the algorithms. Good.

The review is a bit tough on Vapnik's writings from when he was writing in Russia. That's sad: the famous Russian texts on probability and stochastic processes, usually from Springer, were exemplary, with beautifully done translations to English, e.g., E. Dynkin, R. Liptser, etc.

I can't find where I said "yo bitches".

But I do believe that a proposed algorithm should do something, that what it does should be clear, good, and demonstrated, hopefully with some theorems and proofs. Knuth's TAOCP was big on 'analysis' of algorithms, but from the OP and, say, Prof. Ng's course on machine learning it appears that the analysis part is being omitted.

Broadly, it looks like (1) statistics, optimization, and stochastic processes are old fields and no longer very popular, at least in the US, where they might be in departments of mathematics which, however, mostly don't like those fields; (2) computing is a new field and has lots of grant money, interest, and students; (3) for an actual application, those old fields need computing to do the specified data manipulations; (4) computing wants to take on some of the problems the old fields worked on and, then, is at risk of putting poor versions of good old wine in new bottles with new labels. Is this a form of 'academic theft'? Well, if not, and the new wine is poor, then it's a form of academic dilution. Not good.


Well, it's an HN comment, not a scientific communication.

Scenario (i) is mostly considered uninteresting and "solved" because it assumes that the particular stochastic process generating the data is known (this was the pre-Vapnik model). In most practical cases, however, one does not know the corresponding measure from which the data was generated (and will be generated in the future). The old-school thinking was that one should estimate the generating distribution and then solve for the optimal decision based on the estimate. Vapnik showed that it is far better to do it directly, without bothering to estimate the distribution, which is itself a non-trivial problem.
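To illustrate the contrast (this is just a toy sketch of the two styles on synthetic one-dimensional data, not Vapnik's actual argument): the old-school route estimates the class-conditional densities and then applies Bayes' rule; the direct route fits a decision rule without ever estimating a density.

    import numpy as np

    rng = np.random.default_rng(0)
    x0 = rng.normal(-1.0, 1.0, 500)              # class 0 samples
    x1 = rng.normal(+1.0, 1.0, 500)              # class 1 samples
    x = np.concatenate([x0, x1])
    y = np.concatenate([np.zeros(500), np.ones(500)])

    # (a) Estimate each class-conditional density (here a Gaussian), then use Bayes' rule.
    m0, s0, m1, s1 = x0.mean(), x0.std(), x1.mean(), x1.std()
    def generative(t):
        logp0 = -0.5 * ((t - m0) / s0) ** 2 - np.log(s0)
        logp1 = -0.5 * ((t - m1) / s1) ** 2 - np.log(s1)
        return (logp1 > logp0).astype(float)

    # (b) Fit a decision rule directly (logistic regression by gradient descent),
    #     never estimating a density at all.
    w, b = 0.0, 0.0
    for _ in range(2000):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))
        w -= 0.1 * np.mean((p - y) * x)
        b -= 0.1 * np.mean(p - y)
    def direct(t):
        return (w * t + b > 0).astype(float)

    print("generative training error:", np.mean(generative(x) != y))
    print("direct training error:    ", np.mean(direct(x) != y))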

So (ii) is the more interesting case that machine learners care to handle, and Vapnik's was one of the first breakthrough fundamental results in this formalism. It applies to all algorithms that do something called structural risk minimization, not just one specific algorithm. His bound applies to all generating distributions, all samples, and all structural risk minimization algorithms! However, in this setup we still need to assume independence; some form of independence, even a weak one, needs to be present. But it is quite remarkable that you can give guarantees on performance regardless of what actual distribution generated the data. I should also mention Leslie Valiant: look up PAC learning and you will understand the basic premise of this style. One genuine criticism is that it is overly conservative (and it has to be, because it is giving guarantees that apply for the worst distribution, worst training sample, worst algorithm...).
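For a feel of what such a guarantee looks like, here is one common textbook form of a VC-style bound in Python; exact constants differ across statements of the theorem, so treat these as illustrative:

    import math

    def vc_bound(emp_risk, n, vc_dim, delta):
        # Upper bound on true risk that holds with probability >= 1 - delta,
        # uniformly over a hypothesis class of VC dimension vc_dim,
        # from n independent samples, whatever the generating distribution.
        slack = math.sqrt((vc_dim * (math.log(2.0 * n / vc_dim) + 1.0)
                           + math.log(4.0 / delta)) / n)
        return emp_risk + slack

    # E.g. 5% training error, 100,000 samples, VC dimension 50, 99% confidence:
    print(vc_bound(0.05, 100_000, 50, 0.01))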

(iii) Removes the need for independent draws. If you want to read up on these, search for "regret minimization." In this setup the guarantees hold for all sequences of data, even if adversarially generated by someone who knows the algorithm you will employ. Some geometric assumptions are necessary on the domain and some function-analytic ones on the function that you are trying to minimize (typically convexity; non-differentiable functions are allowed).
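A minimal sketch of the regret-minimization setup, with online gradient descent on a convex squared loss; the "adversarial" sequence here is just randomly generated for illustration:

    import numpy as np

    T, d = 1000, 5
    rng = np.random.default_rng(1)
    zs = rng.normal(size=(T, d))     # stand-in for a sequence chosen by an adversary

    x = np.zeros(d)                  # the learner's decision
    cum_loss = 0.0
    for t in range(T):
        z = zs[t]                            # revealed only after we commit to x
        cum_loss += np.sum((x - z) ** 2)     # this round's convex loss
        grad = 2.0 * (x - z)
        x = x - 0.5 * grad / np.sqrt(t + 1)  # step sizes ~ 1/sqrt(t)

    best_fixed = zs.mean(axis=0)             # best single decision in hindsight
    best_loss = np.sum((zs - best_fixed) ** 2)
    # Regret = our cumulative loss minus the best fixed decision's loss;
    # the theory says it grows only on the order of sqrt(T).
    print("regret:", cum_loss - best_loss)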

   -- Shalizi's Review --
Shalizi writes good stuff, but this review of Vapnik's book is surprisingly mean-spirited. The field of course has progressed since Vapnik and there are legitimate criticisms that one can raise, but the review exhibits bad form, I think. Prior to Vapnik there were very few results that one could state outside of the statistical physics community (which worked with strong assumptions, typically Gaussian noise, etc.).

    -- Aside --
No, you did not say the words "yo bitches", but that was the attitude you demonstrated. Before casting judgments as you did, at least make an effort to familiarize yourself with the background. And that is exactly what it is: background. A machine learner will not even mention these things, because you are expected to know them if you are in this field.

    -- Re: Analysis being omitted  --
Think of it this way: one doesn't really need to know the precise chemistry of hydrocarbon combustion to drive a car; conversely, knowledge of such chemistry alone is useless unless it is teamed up with good engineering, and in this case algorithms. There is a huge difference between showing that minimizing a particular function has nice properties and actually coming up with a piece of code to efficiently minimize such a function, not just in theory where constants can be conveniently swept under the rug, but also in practice. So don't be too quick to dismiss the "naive coders".

Most importantly, practice almost always precedes explanation and analysis in engineering. Usually practitioners find the algorithms in their hacky ways. The analyst/theorist smooths out the kinks and explains why the method works so well after it has been demonstrated to work so well -- not always, but almost always.


Much of machine learning is done empirically. You can see how well an algorithm works on different problems, and a test set will estimate exactly how well. Mathematical guarantees are interesting, but really who cares? I suppose the Bayesian machine learning camp is pretty into formal math. And really, methods are based on some intuition about how they will work - nothing is just random.
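E.g., a sketch of the purely empirical estimate: hold out a test set, count the errors, and attach a crude interval to the estimate (the counts below are made up, and the interval is a simple Hoeffding-style one, nothing fancy):

    import math

    n_test, n_errors, delta = 2000, 117, 0.05
    err_hat = n_errors / n_test
    # Hoeffding-style half-width: the true error rate is within this of the
    # estimate with probability >= 1 - delta, assuming an independent test sample.
    half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * n_test))
    print(f"estimated test error: {err_hat:.3f} +/- {half_width:.3f}")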


> Much of machine learning is done empirically.

Yes, it does seem so.

> Mathematical guarantees are interesting, but really who cares?

We should all care. The guarantees help us know what the heck we are doing and what the heck we've got when we are done. E.g., in statistics we like estimators that are unbiased and of minimum variance -- that's 'goodness', good to know and potentially valuable in practice. Estimators that are biased and with variance larger than necessary are not good. We like regression in part because it is 'least squares', which means it is a projection and, thus, in the mean-square sense, the most accurate approximation we can get. We like statistics with loss and risk functions because they tell us what to do at least cost. Etc.
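E.g., a small sketch of the 'least squares is a projection' point, on synthetic data: the residual comes out orthogonal to the columns of X, which is exactly the projection property.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta                     # the projection of y onto the column space of X
    residual = y - y_hat
    # Orthogonality of the residual to every column of X is the projection property:
    print(np.abs(X.T @ residual).max())  # ~ 0, up to floating-point error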


Unbiasedness is oversold. It makes sense when you also average the estimator; otherwise what matters is the expected error or expected loss (or probabilities thereof). Whether the error is in the bias part or the variance part, what do I care? Furthermore, not all losses decompose into bias and variance.

I also think the fascination with least squares led us astray. It's a bad loss function for classification. Bad for neural networks that use the sigmoid function. Bad because it can create exponentially many local minima; on the other hand, with the sigmoid function, if you use KL divergence instead of square loss you will not have these problems.
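A quick numerical illustration of that last point, for one saturated sigmoid unit with target 1: the square-loss gradient carries an extra p(1 - p) factor and nearly vanishes when the unit is confidently wrong, while the KL/cross-entropy gradient does not.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    y, z = 1.0, -6.0          # target is 1, but the pre-activation is badly wrong
    p = sigmoid(z)            # ~ 0.0025: saturated and confidently wrong

    grad_square = (p - y) * p * (1.0 - p)   # d/dz of 0.5*(p - y)^2: has the p*(1-p) factor
    grad_kl = p - y                         # d/dz of the cross-entropy (KL) loss
    print(grad_square)        # ~ -0.0025  -> learning stalls
    print(grad_kl)            # ~ -0.9975  -> learning proceeds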

Part of it is the old statistical vestige of analyzing everything using tools of asymptotic Normality.

But I agree with your general point (and upvoted your parent comment): one should care about what one is really getting after the algorithm has been applied, and phrase that in the language of mathematics. If the community does not do this, a) the snake-oil salesmen will come out of the woodwork, and b) machine learning will suffer the same fate as AI did in the AI winter; c) at the same time, one should not discourage open-ended practice and experimentation (that is how most new things are discovered; analysis comes later, mostly).


On your last paragraph, we agree.

Moreover, your a), b) say some of what I think better than I said it.

Otherwise you have a good critique of traditional statistics.

On traditional statistics, I voted with my research and did work that was distribution-free and also multidimensional, which is unusual for distribution-free work. I did wonder if there could be another way, and what you wrote about Vapnik suggests that there can be.

This thread now is much deeper into the analysis of machine learning algorithms than I noticed in what I saw of Stanford professor Ng's course and several books and papers in PDF I downloaded and scanned. I would guess the level of Vapnik's work would be that of a graduate seminar and not a ugrad computer science course and that the mathematical prerequisites would filter out nearly everyone with a ugrad CS degree and nearly all their professors. Why? From some of my experience getting a paper of mine reviewed, measure preserving totally blew away several editors in chief of famous CS journals and chaired professors of CS at famous research universities; one such person wrote me "Neither I nor anyone on my board of editors has the mathematical prerequisites to review your paper.", and a second such person wrote me a similar remark. The math prerequisites were not so advanced, say, just at the level a graduate course in probability via, say, Loeve or Neveu and fast and easy reading for, say, E. Cinlar, P. Billingsley, L. Breiman, J. Doob, A. Karr, E. Dynkin, etc. But the paper was 'CS' because it was on a CS problem. For AI, I cooked up the paper to improve on some of our work with 'expert systems' for the problem. I have to suspect that work in machine learning that needs some of Loeve or functional analysis from, say, Rudin will not be popular in CS departments.

This thread has become a good start on a grad seminar.

Just now I can't follow up on the leads here; instead I have to finish off the code for two Web pages and get my new Web site live.


Machine learning is really one part probability theory, one part functional analysis, one part optimization, and another part algorithms. I think with your background you will enjoy it. The particular two papers by Vapnik and Chervonenkis are very readable and do not use anything more technical than Chernoff bounds; it's more about how they are used than what is used. CS students typically do not have probability in their curriculum, at least not measure theory, so you will see a lot of physicists turned machine learners in this field. You might have something interesting going on with your own research finding; if you have not published it, I would encourage you to do so. A possible venue is JMLR (the Journal of Machine Learning Research; its editors are not scared of a little measure theory).

BTW, a surprising fact is that Vapnik's bounds are dimensionality-independent, provided you are working with what are known as function classes of finite VC (Vapnik-Chervonenkis) dimension. Dimensionality is a big deal in machine learning because the data machine learners are interested in are typically quite high-dimensional.


> You might have something interesting going on with your own research finding; if you have not published it, I would encourage you to do so.

'Information Sciences', 1999, "... Anomaly Detector". So, look for 'zero day' problems in server farms and networks. Get a multidimensional, distribution-free hypothesis test where we know the false alarm rate in advance and can adjust it exactly in small steps over a wide range.

What is novel is the 'applied probability' that permits the results with false alarm rate.

The assumptions about the input data are realistic and, thus, do not permit doing as well as the Neyman-Pearson result, but, still, intuitively, the detection rate should be relatively high. There is a weaker sense, not in the paper, in which the detection rate is as high as possible. I.e., essentially, for the selected false alarm rate, the 'critical region' where the null hypothesis (a healthy system) is accepted has the least possible Lebesgue measure.

It appears that the input data can have dimensions 10, 100, 1000, etc. I have an algorithm that makes the computations fast, but it's not in the paper.

The computing needed was a bit much when I published the paper, but that computing is quite reasonable now. For a system where we really care a lot about reliability, performance, security, etc., it could be worthwhile to deploy what I cooked up.

My current project is quite different but does need a server farm, and I might deploy my detector there. Then with that demo, if people like my detector, I might let people buy it.


Perhaps, but IMHO the obsession with "provably optimal" is bad and will never lead to anything useful. The real world is messy and complicated. Search spaces are far too large to ever completely explore. We need heuristics to do things in a reasonable amount of time. Heuristics that exploit the structure of the problem we are trying to solve. And those are going to be difficult if not impossible to formally verify.


you could start with Google and with terms:

- K-fold

- Confusion matrix


Okay: Googled K-fold. It's a case of resampling. To justify it, use a finite group of measure preserving transformations, which I figured out and published about 16 years ago.
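For the record, the mechanics are simple enough. A minimal sketch (synthetic data, and a trivial nearest-centroid 'classifier' just to show the resampling pattern being discussed):

    import numpy as np

    rng = np.random.default_rng(3)
    X = np.concatenate([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)

    K = 5
    folds = np.array_split(rng.permutation(len(y)), K)

    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        # "Train": compute one centroid per class from the K-1 training folds
        c0 = X[train][y[train] == 0].mean(axis=0)
        c1 = X[train][y[train] == 1].mean(axis=0)
        # "Test": assign each held-out point to the nearer centroid
        pred = (np.linalg.norm(X[test] - c1, axis=1)
                < np.linalg.norm(X[test] - c0, axis=1)).astype(int)
        errors.append(np.mean(pred != y[test]))

    print("per-fold error:", np.round(errors, 3), "mean:", round(float(np.mean(errors)), 3))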

But we still have not made clear what the heck the assumptions are.

Confusion matrix: Wikipedia has nothing on it, but Matlab's documentation does.

But there's a problem: Suppose we have some data on cats and dogs and want to have a 'classifier'. So, we do what Matlab said: Take half of the training data and build a classifier. Then we test the classifier with the other half of the data.
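The bookkeeping itself is easy enough; a sketch with made-up labels and predictions for the cat/dog case:

    import numpy as np

    # 0 = cat, 1 = dog; labels and predictions here are made up for illustration
    true_labels = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
    predictions = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

    confusion = np.zeros((2, 2), dtype=int)
    for t, p in zip(true_labels, predictions):
        confusion[t, p] += 1          # rows: true class, columns: predicted class
    print(confusion)

    # Off-diagonal counts are misclassifications; their share of the total is an
    # empirical estimate of the false-classification rate on the held-out half.
    print("error rate:", (confusion[0, 1] + confusion[1, 0]) / confusion.sum())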

But suppose the classifier finds three classifications instead of just the two, dogs and cats, that we already know exist. So, if we apply such methods where we don't know in advance the number of classifications, we might get three classes and be quite wrong and not know it.

Again, the methodology looks like just pushing some data into some software with too little in assumptions up front and too little about what we are getting out on the back side.


Now you are talking about supervised vs. unsupervised learning, and it seems that you are talking about clustering, not classification. Classification problems are problems where you have a given number of classes and the classifier tries to estimate, with the given parameters, which group(s) the given sample belongs to. It doesn't 'invent' new classes.


Okay, then, a critique of the OP is that in classification the number of classes is an input to the algorithm, not an output. My view is that being clear about such details is crucial.



