Okay, given a box of data, shove it into
one of these algorithms, let the
computer run, and now what? What do we
have?
E.g., for some 'classifier', what the heck do
we have? That is, what guarantees do we have?
That is, is there something specific and guaranteed
that the algorithm has actually done?
I know; I know; in some vague, unspecified sense
that seems to work in some practical cases,
the algorithm will 'classify'. Still, we don't
have a precise description of (1) in what cases
the algorithm works and (2) how well.
E.g., what is the rate (probability)
of false classifications?
Sure, we'd need some hypotheses to say.
Okay, what the heck are the algorithms
assuming for hypotheses that would even let there
be a rate of false classifications?
Then with those hypotheses,
what are the probabilities of a false
classification?
Without such considerations, about all we
have is some
code that runs.
The body of literature you are looking for is called statistical learning theory. If you are interested, you can start with Vapnik's two books, but, as I said, that is just a start. The theory of learning has shifted quite a bit, but the fundamentals remain the same. Once you are up to speed with Vapnik, you can start reading the proceedings of COLT and ALT (Computational Learning Theory and Algorithmic Learning Theory, respectively).
You will find the following three types of guarantees:
i) Strong assumptions on knowledge of the stochastic process generating the data, which back guarantees on the algorithm.
ii) No assumptions on the data-generating process except that it is governed by some unknown independent stochastic process; prove guarantees on the output of the algorithm.
iii) Don't even assume that the data-generating process is stochastic, yet provide guarantees on the output.
What exactly is guaranteed will of course differ.
A little advice: smugness, as demonstrated with "...I figured it out 16 years ago, yo bitches", will not help your cause, whatever that might be. BTW, are you H. A. Dye? 'Cause Dye figured out group structures on measure-preserving transformations in 1959, a little bit earlier than 16 years ago... just saying.
I was commenting on the OP. I started reading it
but found just 'algorithms' with no indication of
anything like any guarantees or qualities for them.
Your i) sounds good. E.g., in stochastic dynamic programming,
under assumptions one can reasonably expect, e.g., essentially
a Markov assumption, we can easily prove that the
number crunching yields the decisions with the least
possible expected cost among all 'measurable' means
of manipulating the data. Once, when I argued this,
I also included a proof that additional data independent of
what we already have wouldn't help either. This result is, of course,
'obvious', although 'obvious' is not a proof. The proof turned out
to be just a light tap with Fubini's theorem.
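To make that concrete, here is a minimal sketch of the kind of backward induction meant here, for a finite-horizon, finite-state Markov decision problem; all names and numbers below are made-up illustrations, not anything from the argument above:

    # Minimal sketch: backward induction for a finite-horizon Markov
    # decision problem. Under the Markov assumption, this recursion
    # yields the policy with least possible expected total cost.
    # All numbers are made up for illustration.
    import numpy as np

    T = 3                         # number of decision stages
    n_states, n_actions = 4, 2

    rng = np.random.default_rng(0)
    # P[a, s, s2] = probability of moving s -> s2 under action a
    P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
    # C[a, s] = immediate expected cost of taking action a in state s
    C = rng.uniform(0.0, 1.0, size=(n_actions, n_states))

    V = np.zeros(n_states)                  # terminal cost taken as 0
    policy = np.zeros((T, n_states), dtype=int)
    for t in reversed(range(T)):
        Q = C + P @ V                       # cost now plus expected cost-to-go
        policy[t] = Q.argmin(axis=0)        # best action in each state
        V = Q.min(axis=0)                   # new cost-to-go

    print("least possible expected cost from each starting state:", V)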
So, good: Some people actually argue that some of the
algorithms actually have some desirable qualities. So,
the world is not totally coming apart at the seams
or being turned over to naive hackers who only know to
watch the code run and then conclude qualitatively that
it seems 'nice'.
On your ii), "some unknown independent stochastic process",
that's not so clear. So, for some index set I and
random variables X(i) for i in I, the set
{ X(i) | i in I }
is independent? That's something of an unusual
and, in practice, not very interesting stochastic
process. Maybe some EEs would call that 'white noise'.
On your iii), sorry, but essentially everything is 'stochastic'.
Even a 'deterministic' process is stochastic and, in practice,
usually also Markov (past and future conditionally
independent given the present).
Thanks for the reference to H. A. Dye. I went on to
show how such a group leads easily to statistical
hypothesis tests. That is, in particular I was
making sense out of 'resampling' and 'bootstrap' plans.
In part I got the idea of considering 'measure preserving', of course,
from ergodic theory where measure preserving
transformations are central, but the transformations
I used, and needed for the statistics, are the ones
that ergodic theory throws out.
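To give the flavor with a toy case (a minimal sketch, not the construction from the paper): under a null hypothesis of exchangeability, the finite group of permutations of the pooled data preserves the joint distribution, and that symmetry alone yields an exact test, e.g., the usual two-sample permutation test:

    # Minimal sketch of a permutation test. The finite group here is the
    # group of permutations of the pooled sample; under the null hypothesis
    # of exchangeability each group element preserves the joint distribution,
    # which is what makes the p-value exact. Data are made up.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, size=30)        # sample 1
    y = rng.normal(0.5, 1.0, size=30)        # sample 2

    pooled = np.concatenate([x, y])
    observed = abs(x.mean() - y.mean())

    n_perm = 9999
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)       # apply a random group element
        count += abs(perm[:30].mean() - perm[30:].mean()) >= observed

    p_value = (count + 1) / (n_perm + 1)
    print("permutation p-value:", p_value)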
No, I'm not Dye. That's the first I heard of Dye's
paper. Thanks. At one point, long after publishing,
I saw a remark in one of the volumes of Dunford and
Schwartz, as I recall, attributed to Norbert Wiener,
that maybe something could be done with
the 'periodic' case where, thus, there is a finite
group.
Never heard of Vapnik. Thanks for the reference. From
a simple Google search, I found, at
The Bactra Review: Occasional and eclectic book reviews by Cosma Shalizi,
a review of 'The Nature of Statistical Learning Theory'
by V. N. Vapnik.
Okay, Vapnik is trying to do some mathematics
to show some desirable mathematical properties of
statistical learning theory, maybe in particular of some of the algorithms. Good.
The review is a bit tough on Vapnik's writings from
when he was writing in Russia. That's sad:
The famous Russian texts on probability and stochastic
processes, usually from Springer, were exemplary, with
beautifully done translations to English, e.g., E. Dynkin,
R. Lipster, etc.
I can't find where I said "yo bitches".
But
I do believe that a proposed algorithm should
do something, that what it does should be
clear, good, and demonstrated, hopefully with
some theorems and proofs. Knuth's TAOCP was
big on 'analysis' of algorithms, but from
the OP and, say, Prof. Ng's course on machine
learning, it appears that the analysis part is
being omitted.
Broadly, it looks like (1) statistics, optimization,
and stochastic processes are old fields
and no longer very popular, at least in the US
where they might be in departments of mathematics
which, however, in the US, mostly don't like
those fields;
(2) computing is a new field and has
lots of grant money, interest, and students;
(3) for an actual application, those old fields need computing to
do the specified data manipulations;
(4) computing wants to take on some of the
problems the old fields worked on and, then,
is at risk of putting poor versions of good old
wine in new bottles with new labels. Is this
a form of 'academic theft'? Well, if not, if
the new wine is poor, then it's a form of
academic dilution. Not good.
Well, it's an HN comment, not a scientific communication.
Scenario (i) is mostly considered uninteresting and "solved" because it assumes that the particular stochastic process generating the data is known (this was the pre-Vapnik model). In most practical cases, however, one does not know the corresponding measure from which the data was generated (and will be generated in the future). The old-school thinking was that one should estimate the generating distribution and then solve for the optimal decision based on the estimate. Vapnik showed that it is far better to do it directly, without bothering to estimate the distribution, which is in itself a non-trivial problem.
So (ii) is the more interesting case that machine learners care to handle, and Vapnik's was one of the first breakthrough fundamental results in this formalism. It applies to all algorithms that do something called structural risk minimization, not just one specific algorithm. His bound applies to all generating distributions, all samples, and all structural risk minimization algorithms! However, in this setup we still need to assume independence; some form of independence, even a weak form, needs to be present. But it is quite remarkable that you can give guarantees on performance regardless of what actual distribution generated the data. I should also mention Leslie Valiant. Look up PAC learning and you will understand the basic premise of this style. One genuine criticism is that it is overly conservative (and it has to be, because it is giving guarantees that apply to the worst distribution, worst training sample, worst algorithm...).
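For the record, one standard (and deliberately rough) statement of such a guarantee: for a function class of VC dimension d and an i.i.d. sample of size n, with probability at least 1 - \delta every f in the class satisfies

    R(f) \le \hat{R}_n(f) + \sqrt{ \frac{ d\,(\ln(2n/d) + 1) + \ln(4/\delta) }{ n } }

where R is the true risk and \hat{R}_n the empirical risk. The exact constants vary from source to source, but note that nothing about the generating distribution appears on the right-hand side.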
(iii) Removes the need for independent draws. If you want to read up on these, search for "regret minimization." In this setup the guarantees hold for all sequences of data, even ones adversarially generated by someone who knows the algorithm you will employ. Some geometric assumptions are necessary on the domain, and some function-analytic ones on the function you are trying to minimize (typically convexity; non-differentiable functions are allowed).
-- Shalizi's Review --
Shalizi writes good stuff, but this review of Vapnik's book is surprisingly mean-spirited. The field has of course progressed since Vapnik, and there are legitimate criticisms one can raise, but the review exhibits bad form, I think. Prior to Vapnik there were very few results that one could state outside of the statistical physics community (which worked with strong assumptions, typically Gaussian noise, etc.).
-- Aside --
No, you did not say the words "yo bitches", but that was the attitude you demonstrated. Before casting judgments as you did, at least make an effort to familiarize yourself with the background. And that is exactly what it is, background. A machine learner will not even mention these things, because you are expected to know them if you are in this field.
-- Re: Analysis being omitted --
Think of it this way: one doesn't really need to know the precise chemistry of hydrocarbon combustion to drive a car; conversely, knowledge of such chemistry alone is useless unless it is teamed up with good engineering and, in this case, algorithms. There is a huge difference between showing that minimizing a particular function has nice properties and actually coming up with a piece of code that efficiently minimizes such a function, not just in theory, where constants can be conveniently swept under the rug, but also in practice. So don't be too quick to dismiss the "naive coders".
Most importantly, practice almost always precedes explanation and analysis in engineering. Usually practitioners find the algorithms in their hacky ways. The analyst/theorist smooths the kinks and explains why the method works so well after it has been demonstrated to work so well -- not always, but almost always.
Much of machine learning is done empirically. You can see how well an algorithm works on different problems, and a test set will estimate exactly how well. Mathematical guarantees are interesting, but really, who cares? I suppose the Bayesian machine learning camp is pretty into formal math. And really, methods are based on some intuition about how they will work -- nothing is just random.
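(For what it's worth, the test-set estimate itself carries a simple guarantee: if the m test examples are i.i.d. and independent of training, Hoeffding's inequality gives, with probability at least 1 - \delta,

    | \hat{\mathrm{err}}_{\mathrm{test}} - \mathrm{err}_{\mathrm{true}} | \le \sqrt{ \frac{\ln(2/\delta)}{2m} }

so a few thousand test points pin the error down to within a few percent.)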
> Mathematical guarantees are interesting, but really who cares?
We should all care. The guarantees help us know
what the heck we are doing and what the heck we've
got when we are done. E.g., in statistics we
like estimators that are unbiased and minimum
variance -- that's 'goodness' and good to know
and potentially valuable in practice. Estimators
that are biased and with variance larger than
necessary are not good. We like regression in part
because it is 'least squares', which means it is
a projection and, thus, in the mean-square sense, the
most accurate approximation we can get. We like
statistics with loss and risk functions because
they tell us what to do at least cost. Etc.
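To illustrate the projection point with a toy numerical check (a minimal sketch; the data are made up): the least-squares fit X b is the orthogonal projection of y onto the column space of X, so the residual is orthogonal to every column, and no other coefficient vector does better in mean square:

    # Minimal sketch: least squares as orthogonal projection.
    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 3))                      # design matrix (made up)
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    b, *_ = np.linalg.lstsq(X, y, rcond=None)          # least-squares coefficients
    residual = y - X @ b

    # Residual is (numerically) orthogonal to the column space of X ...
    print(np.allclose(X.T @ residual, 0.0, atol=1e-8))
    # ... so any other coefficient vector gives a larger mean-square error.
    other = b + np.array([0.1, 0.0, 0.0])
    print(np.mean(residual ** 2) < np.mean((y - X @ other) ** 2))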
Unbiasedness is oversold. It makes sense when you also average the estimator; otherwise what matters is the expected error or expected loss (or probabilities thereof). Let the error be in the bias part or the variance part; what do I care? Furthermore, not all losses decompose into bias and variance either.
I also think the fascination with least squares has led us astray. It's a bad loss function for classification. Bad for neural networks that use the sigmoid function. Bad because it can create exponentially many local minima; on the other hand, with the sigmoid function, if you use KL divergence instead of square loss you will not have these problems.
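A small numerical illustration of that point (a sketch with made-up pre-activations): for a single sigmoid output p = sigmoid(z) with label y, the square-loss gradient in z carries a factor p(1 - p) that vanishes when the unit saturates, even when the prediction is badly wrong, while the cross-entropy (KL-to-the-label) gradient is just p - y:

    # Sketch: gradient wrt the pre-activation z for a sigmoid unit, p = sigmoid(z).
    #   d/dz of (1/2)(p - y)^2            = (p - y) * p * (1 - p)   <- saturates
    #   d/dz of -[y ln p + (1-y) ln(1-p)] = (p - y)                 <- does not
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    y = 1.0
    for z in (-8.0, -2.0, 0.0, 2.0):                   # made-up pre-activations
        p = sigmoid(z)
        grad_sq = (p - y) * p * (1 - p)
        grad_ce = p - y
        print(f"z={z:+.1f}  p={p:.4f}  square-loss grad={grad_sq:+.6f}  "
              f"cross-entropy grad={grad_ce:+.6f}")
    # At z = -8 the unit is confidently wrong, yet the square-loss gradient
    # is nearly zero while the cross-entropy gradient stays near -1.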
Part of it is the old statistical vestige of analyzing everything using tools of asymptotic Normality.
But I agree with your general point (and upvoted your parent comment): one should care about what one is really getting after the algorithm has been applied, and phrase that in the language of mathematics. If the community does not do this, a) the snake-oil salesmen will come out of the woodwork and b) machine learning will suffer the same fate as AI in the AI winter; c) at the same time, one should not discourage open-ended practice and experimentation (that is how most new things are discovered; analysis comes later, mostly).
Moreover, your a), b) say some of what I think better than I said it.
Otherwise you have a good critique of traditional
statistics.
On traditional statistics, I voted with my research
and did work that was distribution-free and also
multidimensional, which is unusual for distribution-free work. I did wonder
if there could be another way, and what you wrote
about Vapnik suggests that there can be.
This thread now is much deeper into the analysis of
machine learning algorithms than I noticed in
what I saw of Stanford professor Ng's course and
several books and papers in PDF I downloaded and
scanned. I would guess the level
of Vapnik's work would be that of a graduate seminar
and not a ugrad computer science course and that the
mathematical prerequisites would filter out nearly
everyone with a ugrad CS degree and nearly all
their professors. Why? From some of my experience getting
a paper of mine reviewed, measure preserving totally
blew away several editors in chief of famous
CS journals and chaired professors of CS at
famous research universities; one such person wrote
me "Neither I nor anyone on my board of editors
has the mathematical prerequisites to review your
paper.", and a second such person wrote me
a similar remark. The math prerequisites were not so
advanced, say, just at the level of a graduate course
in probability via, say, Loeve or Neveu, and fast
and easy reading for, say, E. Cinlar, P. Billingsley,
L. Breiman, J. Doob, A. Karr, E. Dynkin, etc. But the paper was 'CS' because it
was on a CS problem. For AI, I cooked up the paper
to improve on some of our work with 'expert systems'
for the problem. I have to suspect that work
in machine learning that needs some of Loeve
or functional analysis from, say, Rudin will
not be popular in CS departments.
This thread has become a good start on a grad seminar.
Just now I can't follow up on the leads here;
instead I have to finish off the code for
two Web pages and get my new Web site live.
Machine learning is really one part probability theory, one part functional analysis, one part optimization, and another part algorithms. I think with your background you will enjoy it. The particular two papers by Vapnik and Chervonenkis are very readable and do not use anything more technical than Chernoff bounds; it's more about how they are used rather than what is used. CS students typically do not have probability in their curriculum, at least no measure theory, so you will see a lot of physicists turned machine learners in this field. You might have something interesting going on with your own research finding; if you have not published it, I would encourage you to do so. A possible venue is JMLR (Journal of Machine Learning Research; their editors are not scared of a little measure theory).
BTW, a surprising fact is that Vapnik's bounds are dimensionality-independent, provided you are working with what are known as function classes of finite VC (Vapnik-Chervonenkis) dimension. Dimensionality is a big deal in machine learning because the data of interest is typically quite high-dimensional.
> You might have something interesting going on with your own research finding, if you have not published it I would encourage you to do so.
'Information Sciences', 1999, "... Anomaly Detector". So look
for 'zero day' problems in server farms and networks. Get
a multidimensional, distribution-free hypothesis test
where you know the false alarm rate in advance and can adjust it
exactly in small steps over a wide range.
What is novel is the 'applied probability' that permits
the results on the false alarm rate.
The assumptions about the input data are realistic and,
thus, do not permit doing as well as the Neyman-Pearson
result, but, still, intuitively, the detection rate
should be relatively high. There is a weaker sense,
not in the paper, where the detection rate is
as high as possible. I.e., essentially, for the
selected false alarm rate, the 'critical region',
where the null hypothesis (a healthy system) is accepted,
has the least possible Lebesgue measure.
It appears that the input data can have dimensions
10, 100, 1000, etc. I have an algorithm that makes
the computations fast, but it's not in the paper.
The computing needed was a bit much when I published
the paper, but that computing is quite reasonable
now. For a system where one really cares a lot about
reliability, performance, security, etc., it could
be worthwhile to deploy what I cooked up.
My current project is quite different but does need
a server farm, and I might deploy my detector there.
Then with that demo, if people like my detector,
I might let people buy it.
Perhaps, but IMHO the obsession with "provably optimal" is bad and will never lead to anything useful. The real world is messy and complicated. Search spaces are far too large to ever completely explore. We need heuristics to do things in a reasonable amount of time. Heuristics that exploit the structure of the problem we are trying to solve. And those are going to be difficult if not impossible to formally verify.
Okay: Googled K-fold. It's a case
of resampling. To justify it, use
a finite group of measure preserving
transformations, which I figured out
and published about 16 years ago.
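For reference, a minimal sketch of the K-fold mechanics themselves (plain numpy, made-up data, and a deliberately trivial nearest-class-mean 'classifier'; it says nothing about which assumptions justify the procedure):

    # Minimal sketch of K-fold cross-validation.
    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 2))                      # made-up features
    y = (X[:, 0] + 0.3 * rng.normal(size=100) > 0).astype(int)

    K = 5
    folds = np.array_split(rng.permutation(len(y)), K)

    accuracies = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        # 'Classifier': nearest class mean, fit on the training part only.
        means = np.array([X[train_idx][y[train_idx] == c].mean(axis=0)
                          for c in (0, 1)])
        dists = np.linalg.norm(X[test_idx][:, None, :] - means[None, :, :], axis=2)
        pred = dists.argmin(axis=1)
        accuracies.append((pred == y[test_idx]).mean())

    print("per-fold accuracy:", np.round(accuracies, 3))
    print("mean accuracy:", np.mean(accuracies))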
But we still have not made clear what
the heck the assumptions are.
Confusion matrix: Wikipedia has nothing
on it, but Matlab's documentation does.
But there's a problem: Suppose
we have some data on cats and dogs
and want to have a 'classifier'.
So, we do what Matlab said: Take
half of the training data and build
a classifier. Then we test the
classifier with the other half of the
data.
But suppose the classifier finds three
classifications instead of just the
two, dogs and cats, we already know
exist. So, if we apply such methods
where we don't know in advance the
number of classifications, we might
get three classes and be quite wrong
and not know it.
Again, the methodology looks like just
pushing some data into some software
with too little in the way of assumptions up front
and too little about what we are getting
out on the back side.
Now you are talking about supervised vs. unsupervised learning, and it seems that you are talking about clustering, not classification. Classification problems are problems where you have a given number of classes, and the classifier tries to estimate, with given parameters, what group(s) the given sample belongs to. It doesn't 'invent' new classes.
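To make the distinction concrete, a small sketch (using scikit-learn, assuming it is available; the 'cats' and 'dogs' data are made up): a classifier gets its set of labels from the training data, while a clustering routine such as k-means gets the number of clusters as a parameter, and choosing that number is the user's problem:

    # Sketch: classification (labels given) vs clustering (number of
    # clusters is an input parameter, not something the algorithm discovers).
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    cats = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
    dogs = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))
    X = np.vstack([cats, dogs])
    y = np.array([0] * 50 + [1] * 50)                  # the two classes are given

    clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    print("classes the classifier knows about:", clf.classes_)    # [0 1], from y

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # we chose 3
    print("cluster labels produced:", np.unique(km.labels_))      # 3, right or not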
Okay, then, a critique of the OP is that
in classification the number of classes is
an input to the algorithm, not an output.
My view is that being clear about such
details is crucial.
Uh, where's the beef?