Hacker News
Every Model Learned by Gradient Descent Is Approximately a Kernel Machine (arxiv.org)
406 points by scottlocklin on Dec 6, 2020 | 107 comments



I'm confused by findings like this one, and I'm hoping someone here can educate me.

There are many known universal approximators. Deep networks are one. SVMs are one. Heck, cubic splines are one, and they've been in use for nearly a hundred years IIRC.

The problem has never been one of finding a sufficiently powerful approximator. It has been training that approximator. My understanding of the significant advancement made by deep learning is that we finally figured out how to train a specific kind of universal approximator in a way such that it finds very good separation surfaces for what used to be impossible-to-solve classification problems.

But it should be no surprise to anyone that there exist, in theory, other universal approximators that approximately reproduce the same separation surfaces, should it? I'd expect any universal approximator to be powerful enough to reproduce the separation surfaces, hence the meaning of the word "universal". The problem was always finding the right weights, not finding the right approximator architecture.

Am I missing something?


50/50 I can shed some light or am way over my head:

The point of the paper is that if you train a deep learning model with gradient descent then the resulting model is effectively a kernel machine, regardless of model architecture.

The nice thing about a kernel machine is that it is simple (just one hidden layer), and we are able to analyze it more effectively and conveniently.

So, I think the contribution here isn't "these sets of universal approximators are equivalent" but rather "we have this effective but opaque deep learning thing; it turns out it's actually a kernel machine in retrospect, so we can bring 'kernel tooling' to analyze the deep learning model".


Does this essentially mean that any multi-layer RNN can be reasonably approximated by a 1-layer network (something like a perceptron) for the "playback" purposes, that is, for recognition / transformation, not learning?

This may have colossal practical implications, as long as the approximation stays good enough.


Hmmm, I think that's not precise and my use of "architecture" was misleading.

If we're thinking in terms of "universal approximators", an RNN is a way to make a sequence of approximate functions for a sequence of inputs.

But it's still a sequence of functions, not a single function.

For a 1-layer network to have the same ability as an RNN (taking an unbounded amount of context), it would need infinite width, which is a no-go.


I would be skeptical about thinking of networks this way without empirically verifying it yourself.

The only useful trick I’ve found like that, is that a stack of linear layers with no activation function is equivalent to a single larger layer. Sometimes it enables some clever optimizations on TPUs, since you want one of the dimensions to be a multiple of 128. (I haven’t actually used that trick, but it’s in my back pocket.)
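
For concreteness, a minimal NumPy sketch of that collapse (my own toy shapes, nothing TPU-specific):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16))    # batch of 8 inputs, dimension 16
    W1 = rng.normal(size=(16, 64))  # first linear layer (no activation)
    W2 = rng.normal(size=(64, 32))  # second linear layer (no activation)

    stacked = (x @ W1) @ W2         # two linear layers applied in sequence
    fused = x @ (W1 @ W2)           # one "larger" layer with weights W1 @ W2

    assert np.allclose(stacked, fused)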

But thinking of an entire model as a single layer seems strange. A single layer has to have some kind of meaning. To me, it means “a linear mapping followed by a nonlinear activation function.” So is the claim that there exists a sufficiently complicated activation function that approximates any given model? Because that sounds an awful lot like the activation function itself might be “the model”. Except that makes no sense, because activation functions don’t use model weights; the linear multiply before the activation does that.

So it quickly takes me in circles. I don’t have a good intuition for models yet though.


Wouldn't this one-layer network be a lot less "compressive" than the multi-layer net, and in some sense "duplicate" subnetworks in earlier layers?


Does that mean that, in theory, we can uncover an underlying model (a theorem, effectively) that the network is approximating, and so remove some of the uncertainty?


If I understand the paper (which is questionable), that's what the author is aiming for. E.g. he's saying 1) We can make these amazing black boxes 2) We don't really understand them 3) But when we make them with gradient descent they end up being almost kernel machines 4) We know a lot about kernel machines, so we can use that to "remove some uncertainty"


An issue with deep learning models is that they are very hard to analyse from a theoretical mathematical perspective, to prove things about them.

Kernels have been studied thoroughly from a theoretical perspective and people have proven things about them.

The goal of papers such as these is to find ways in which deep neural networks and kernel methods are similar, so that theoretical results and tools found for kernel methods may be adapted to deep neural networks.


Finding the right architecture, or more generally the right model, is very much still the main problem.

You should be careful with the meaning you ascribe to the word 'universal'. The list of universal approximators is massive, and the sub-list of universal approximators that can be trained with OLS is still substantial. Still these models can differ significantly:

- How efficient are they (in #parameters required for a certain error) for specific tasks? There is a known 'maximum efficiency' for general tasks, but in high dimensions this efficiency is terrible, such that many models will fail terribly on high-dimensional data. Hence, you should pick a model that is exceptionally good for a specific task, although it might be less efficient for other tasks.

- How well can the model cope with noise? If your dependent variable is severely distorted (think financial data) then you need a model that can balance between interpolating datapoints and averaging out the noise.

Just to name my two favorite properties. The first one is _kind of_ related to learnability, since an inefficient model is often pretty much impossible to learn.


My understanding from the read is that the point is to show "how" they're equivalent, as opposed to how to actually construct such an approximator or learn it.

Similar to showing a problem falls in NP, you can reduce the problem down to another problem in NP and be done with it.


Agree, but also think the result may be too general to be useful. Proving that you can rewrite any network learned with gradient descent this way kinda suggests that the architecture doesn't matter, but we know that's not true. Eg, why are networks with skip connections SO much better than networks without? What about batch normalization? This makes me suspicious that it's a nice theoretical result a bit too high level to be useful. Yes, it was proved years ago that you can approximate an arbitrary function with a wide enough two-layer net, but it's not a terribly practical way to approach the world. Now we have architectures much better than two-layer networks, and, for that matter, SVMs.

There are a number of problems with SVMs: complexity for training and inference scales with the amount of training data, which is pretty sad panda for complex problems.

Extremely spicy/cynical take: it's not cool to say "you all should go look at all these possible applications" when the thrust of the paper is to prop up the relevance of an obsolete approach. You gotta do the actual work to close the gap if you still want your PhD to be worth something...

That said, I haven't read the paper terribly closely, and am always happy to be proven wrong!


I'd be curious if re-framing a trained neural network model as an SVM gives you insight into its support vectors, and maybe a little understanding of why the NN works the way it does?


>suggests that the architecture doesn't matter, but we know that's not true. Eg, why are networks with skip connections SO much better than networks without? What about batch normalization?

Is this true though, or does network architecture only matter in terms of efficiency? This is non-rhetorical, I really don't know much about deep learning. :) I guess I'm asking: with enough data and compute, is architecture still relevant?


These things matter a lot in practice. Imagine a giant million dimensional loss surface, where each point is a set of weights for the model. Then the gradient is pushing us around on this surface, trying to find a 'minimum.' Current understanding (for a while, actually) is that you never really hit minima so much as giant mostly-flat regions where further improvement maybe takes a million years. The loss surfaces for models with skip connections seem to be much, much nicer.

https://papers.nips.cc/paper/2018/file/a41b3bb3e6b050b6c9067...

In effect, there's a big gap between an existence proof and actually workable models, and the tricks of the trade do quite a lot to close the gap. (And there are almost certainly more tricks that we're still not aware of! I'm still amazed at how late in the game batch normalization was discovered.)

OTOH, so long as you're using the basic tricks of the trade, IME architecture doesn't really matter much. Our recent kaggle competition for birdsong identification was a great example of this: pretty much everyone reported that the difference between five or so 'best practices' feature extraction architectures (various permutations of resnet/efficientnet) was negligible.


Thank you so much for your response, that example makes a lot of sense. Algorithmically speaking in computer science, we can formalize efficiency with complexity theory.

Can we do the same with neural networks? Is there a formalization of why 'skip connections' (which I know nothing about) are better, why transformers are more efficient than recurrence, etc?

Is it useful to talk about their complexity or universal properties or size (and I realize this is muddled up a bit by the fact that hardware architecture can sometimes trump efficiency).


Classically, the equivalent of complexity theory in machine learning is statistical learning theory, where the main question is: if I have a magical algorithm that always finds the function in my class that fits the data best, how big does my dataset (which is a sample from some unknown probability distribution) need to be to ensure that the function I pick is almost as good as the best function in the class with high probability? This is known as PAC (probably approximately correct) learning.

For many non-deep machine learning models like support vector machines (SVMs), "find the function in my class that fits the data best" can be posed as a convex optimization problem. So the magical algorithm actually exists. In this setting, the PAC analysis is the natural thing to study. (Typically, we find that you need more samples if your function class is more expressive, which agrees with the observation that deep learning requires huge data sets.)
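
As a concrete illustration of "the magical algorithm actually exists" for SVMs, here's a minimal sketch (using scikit-learn, my choice, and toy data I made up): the SVM training objective is convex, so the solver reliably lands near the global optimum regardless of initialization.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # roughly linearly separable labels

    # Fitting a linear SVM is a convex optimization problem,
    # so there are no bad local optima to worry about.
    clf = LinearSVC(C=1.0).fit(X, y)
    print(clf.score(X, y))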

In deep learning, the magical algorithm doesn't exist. The optimization problem is nonconvex, but everyone uses stochastic gradient descent (SGD), an algorithm that is only guaranteed to find the global optimum on convex problems. Theory suggests that SGD will often converge on a local optimum that is significantly worse than the global optimum. However, in practice this doesn't happen much! If the network is big enough and all the algorithm hyperparameters are tuned well, and you run the deep learning algorithm with different random seeds, the result will be about equally good every time.

ML theory people working in deep learning tend to focus on this phenomenon: why does SGD usually find good local optima? This is totally different from the PAC analysis, and the analogy with computational complexity is less crisp.


Skip connections have very good motivation (see one of my other comments in this thread), and attention is decently well motivated, especially as an improvement in the translation space where it was first introduced. I don't think there's any formal proof that attention >> convolutions with a wide receptive field.

It would be fantastic to have better measures of problem complexity. My thinking at this point is that huge parameter size makes it easier to search for a solution, but once we've found one, there should be interesting ways to simplify the function you've found. Recall above: there are many equivalent models with the same loss when the learning slows down... Some of these equivalent models have lots of zeros in them. We find that often you can prune 90% of the weights and still have a perfectly good model. Eventually you hit a wall, where it gets hard to prune more without large drops in model quality; this /might/ correspond to the actual problem complexity somehow, but the pruned model you happened to find may not actually be the best, and there may be better methods we haven't discovered yet.
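
A minimal sketch of the magnitude-pruning mechanics mentioned above (my own toy example; a random matrix stands in for trained weights, so don't expect the small quality drop you'd see on a genuinely prunable trained model):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 256))           # stand-in for a trained weight matrix
    x = rng.normal(size=256)

    threshold = np.quantile(np.abs(W), 0.90)  # keep only the largest 10% of weights
    W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

    print(np.mean(W_pruned == 0.0))           # fraction of weights now zero (~0.9)
    # Relative change in the layer's output: small for genuinely prunable trained
    # weights, large for this random stand-in.
    print(np.linalg.norm((W - W_pruned) @ x) / np.linalg.norm(W @ x))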


> why are networks with skip connections SO much better than networks without?

What are the leading theories for why this seems to be the case? Less nodes to capture and direct decisions?


Oh there's plenty of good explanation in the neural network literature (my eli5: the skip connections make the default mapping an identity instead of a zero mapping; you can start by doing no harm, and improve from there). The method was suggested by knowledge from differential equations. All I'm saying is that the "everything is secretly an SVM" viewpoint is probably too coarse to explain these interesting and effective structural differences.
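
A minimal sketch of that eli5 (my own toy example): with a skip connection, zero weights give you the identity rather than the zero mapping.

    import numpy as np

    def plain_block(x, W):
        return np.maximum(W @ x, 0.0)       # ReLU(Wx): zero weights destroy the input

    def residual_block(x, W):
        return x + np.maximum(W @ x, 0.0)   # x + ReLU(Wx): zero weights pass x through

    x = np.array([1.0, -2.0, 3.0])
    W = np.zeros((3, 3))                    # "do no harm" starting point

    print(plain_block(x, W))                # [0. 0. 0.]
    print(residual_block(x, W))             # [ 1. -2.  3.]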


> It has been training that approximator.

Also, it's about the representation of the approximator itself, and data compression. For cubic splines to approximate some NN you would likely need an enormous number of intervals covering the input space.


Also, cubic splines don't extrapolate like the other mentioned models


An even simpler universal approximator that the authors overlooked for some reason is just to use a very large hash map. There are some practical issues around generalization and storage but it has very predictable precision.


"Here we show that every model learned by this method [SGD], regardless of architecture, is approximately equivalent to a kernel machine [i.e., a support vector machine or SVM] with a particular type of kernel" -- a type of kernel which Domingos, the author, calls a "path kernel."

As defined in the paper, a "path kernel" measures, for any two data points, how similarly a model varies (specifically, how similarly the model's gradients change) at those two data points during training via SGD. This isn't exactly your usual, plain-vanilla, radial-basis type of kernel.

We've known for a long time that SVMs are universal approximators, i.e., in theory they can approximate any target function. The importance of this work is that it has found a new, surprising, deep connection between any model trained via SGD and SVMs, which are well understood :-)


Great explanation of the intuitive understanding of the path kernel which seems to be the main takeaway from this paper.

One minor technical correction: the proof relied on the continuous gradient flow, not SGD. So it's proven for GD and likely true for SGD, and your intuitive explanation likely still holds, but it's not obvious.


You're absolutely right. I substituted SGD for GD without giving it any thought because everyone uses SGD!


They sketch the argument for SGD, but they don't know if it actually holds (see Remark 5 in the paper).


For everyone saying "oh wow, we can go back to SVMs now, huge speedups, etc" - that's more than a little premature. This is purely a formal equivalence, but not very useful computationally.

The math here is pretty much first-year undergraduate level calculus, and it's worth going through section 2 since it's quite clearly written (thanks Prof Domingos).

Essentially, what the author does is show that any model trained with "infinitely-small step size full-batch gradient descent" (i.e. a model following a gradient flow) can be written in a "kernelized" form

   y(x) = \sum_{(x_i, y_i) \in L} a_i K(x, x_i) + b.

The intuition most people have for SVMs is that the constants a_i are, well, constant, that the a_i are sparse, and that the kernel function K(x, x_i) is cheap to compute (partly why it's called the 'kernel trick').

However, none of those properties apply here, which is why this isn't particularly exciting computationally. The "trick" is that both a_i and K(x, x_i) involve path integrals along the gradient flow for the input x, and so doing a single inference is approximately as expensive as training the model from scratch (if I understand this correctly).
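
To make the "path" part concrete, here's a rough sketch of accumulating a path-kernel-like quantity during full-batch gradient descent on a toy 1-D model (my own construction; it ignores the loss-derivative weighting in the a_i and the normalization the paper actually uses). Note you need the whole training trajectory just to evaluate K(x, x_i):

    import numpy as np

    # Toy model f(x) = a * tanh(w * x), trained by full-batch gradient descent
    # on squared error.
    def f(x, a, w):
        return a * np.tanh(w * x)

    def grad_params(x, a, w):
        # gradient of f with respect to the parameters (a, w) at input x
        t = np.tanh(w * x)
        return np.array([t, a * x * (1.0 - t ** 2)])

    X = np.array([-2.0, -1.0, 1.0, 2.0])
    Y = np.tanh(1.5 * X)                    # targets from a "true" function
    a, w, lr = 0.1, 0.1, 0.05

    x_test = 0.7
    path_kernel = np.zeros(len(X))          # K(x_test, x_i), accumulated along the path

    for step in range(500):
        g_test = grad_params(x_test, a, w)
        for i, xi in enumerate(X):
            # dot product of parameter gradients at the two points, at this step
            path_kernel[i] += g_test @ grad_params(xi, a, w)
        # full-batch gradient descent step on squared error
        residual = f(X, a, w) - Y
        da = np.sum(residual * np.tanh(w * X))
        dw = np.sum(residual * a * X * (1.0 - np.tanh(w * X) ** 2))
        a, w = a - lr * da, w - lr * dw

    print(path_kernel)                      # similarity of x_test to each training point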


> y(x) = \sum_{(x_i, y_i) \in L} a_i K(x, x_i) + b.

HN feature request: parse inline latex math please


See also W. Brendel and M. Bethge, "Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet," https://arxiv.org/abs/1904.00760 (which is referenced in this paper), which explains: "[t]his suggests that the improvements of DNNs over previous bag-of-feature classifiers in the last few years is mostly achieved by better fine-tuning rather than by qualitatively different decision strategies."


I do have a tangential technical question for someone who knows more math than I do.

A kernel SVM always finds (one of) the globally best-fit separating hyperplanes in the kernel space.

A gradient descent model explicitly converges to one of the nearest local minimas by definition.

Does this paper conclude that the local minimas that neural networks converge to are among the many equivalent global minima? Won't this be a major revelation by itself?


I don't have a strong math background, but I think during the optimisation process of finding the hyperplane, the solver (the algorithm that attempts to find the best separating hyperplane) uses a soft margin to allow misclassified instances. Its tolerance is controlled by a hyper-parameter, so it will compromise to find the best fit within the set parameter. So it is a 'best solution' with a condition. However, there are many variations of implementations from different solvers to handle it.

Example: https://towardsdatascience.com/support-vector-machine-simply...


Isn't one of the features of high-dimensional spaces that local minima are rare and there's usually some direction that slopes towards a lower loss?


I mean, obviously not in the general case. Cryptography is designed to be a high dimensional space with pretty much no slope.

I'd assume that a lot of the binary decision tree / chip level optimizations are similar: almost no slope worth analyzing.


Sure, and unbreakable crypto is notoriously difficult to make. I wouldn't expect a situation like that to come up in a real-world problem.


Binary decision trees are basically super optimizers: minimizing the number of gates needed to make a chip's logic (multiply circuits or whatever).

> Sure, and unbreakable crypto is notoriously difficult to make

Not really. It's the unbreakable AND efficient requirement combined that's hard.

Just run anything over a random xor shift add kernel about 1 million times and it's unbreakable (so long as xor / shift forms a bijection). It's just inefficient.

Finding the smallest number of iterations that works is the hard part. Usually, people try to break it (ex, AES was originally broken over 4 iterations) then double or triple your best attempt, and you are set.

AES is now broken over 5 iterations IIRC, but there's still a long way to go to break the full 10 rounds of AES-128.

More modern primitives (e.g. SHA-512) are like 80 iterations of a simpler kernel.

Even 'broken' ciphers like TEA or RC4 are hard to break in practice if you just increase the iteration count up the wazzoo. The problem is: AES gets to security in just 10 rounds.

So to be better than AES requires both efficiency AND security. After all, 100 iterations of AES is going to be more secure if you don't care about efficiency. (10x more iterations than the default spec)


In the context of this paper, you can think of the "gradient descent step" as optimizing the parameters of the kernel (i.e. the parameters that generate the kernel space). There are no explicit optimality guarantees beyond those of standard gradient descent.

The "SVM step" would still find a global optima within the kernel space, but the qualifications of the previous step mean that the kernel space generated might be useless.


I don't think the output of the conversion is guaranteed to be equivalent to a hyperplane learned by an SVM.

I didn't have time to read the paper, but reading the abstract I don't see a claim that the gradient descent model approximated by a kernel machine is equivalent to an optimal fit obtained by SVM maximum margin hyperplane fitting.

I assume one likely ends up with different hyperplane fits from converting a NN/gradient-desc-learned model to kernel machine vs learning a kernel machine directly via SVM learning.


Nitpick: Minima is already plural.


Doesn’t gradient descent use a convex cost function so that it always generates a global minimum?


No, not necessarily. The objective functions used to train neural networks are generally non-convex (the nets themselves being non-convex as well), but are traditionally trained using stochastic gradient descent (and its variants).


This really threw me off when I first started learning about ANNs, coming from a traditional econometrics background. B-but ... the parameters aren’t identified!


The claim here is a bit misleading, as already pointed out by other comments, since the kernel is an evolving one that is essentially learned after seeing the data.

Contrary to many related works that compare wide neural networks to kernel methods, our recent work shows that one can study a feature learning infinite-width limit with realistic learning rate.

https://arxiv.org/abs/2011.14522

We identified what separates the kernel regime (e.g., NTK) and the feature learning regime. In the infinite-width limit, OP's work could belong to either regime depending on the parametrization, i.e. the path kernel is either equal to the NTK or performing feature learning.

It's an incredibly interesting research topic. Please feel free to comment with thoughts on our work :)


The kernel they find is a function of the gradient descent path, which is a function of the data. So no, it's nothing at all like a normal kernel machine, where we pick the kernel before seeing the data.

It also only applies to the continuous limit of non-stochastic GD, far from the real training methods used.

We don't gain any understanding either; understanding implies predictive power about some new situation, and I don't see any, nor does the paper suggest any.

Looks like yet another attempt to attract attention by "understanding" NNs. Look, humans can't explain or understand how we drive, speak, translate, play chess, etc, so why should we expect to understand how models that do these work? Of course, we can understand the principles of the training process, and in fact we already do: the theory of SGD is well understood.


> humans can't explain or understand how we drive, speak, translate, play chess, etc, so why should we expect to understand how models that do these work?

This implies there's no point in pursuing explainability, but many domains involve inferences where the significant predictors are much easier to abstract at a useful level.

For example, if a DLNN could make suggestions as to how to tune a greenhouse given certain yield objectives, then it's reasonable to pursue heuristic techniques aiming to explain what about the parameters most significantly led to the given suggestions.


,,Look, humans can't explain or understand how we drive, speak, translate, play chess, etc, so why should we expect to understand how models that do these work?''

I agree with you, but also it's amazing how much deepmind has achieved by putting neuroscientists and machine learning experts in the same room, and trying to make systems that work inside the human brain work efficiently on metal.

If you look at this talk from 2010, Demis was already listing attention as an example (attention was responsible for the recent improvement in protein folding prediction):

https://www.youtube.com/watch?v=F5PSyu7booU


As far as I'm aware, attention does not even attempt biological plausibility, nor was it in any way inspired by biology. The issues attention addresses are very specific to the sequential nature of so-called Recurrent Neural Networks. The first issue is known as exploding / vanishing gradients: basically, as you keep multiplying some vector with matrices you will either explode that vector to infinity or squeeze it to zero, and the same happens with derivatives. The second issue is that you cannot parallelize a sequential operation. Attention addresses these issues by removing recurrence, using a specific invented mathematical structure. There was no name for it, but "attention" gives good intuition for what that mathematical structure is trying to do. Kind of like quantum chromodynamics uses the term "colors" in a way that has nothing to do with light, photons or even the electromagnetic force.


>As far as I'm aware, attention does not even attempt biological plausibility, nor was it in any way inspired by biology.

It may not have been the intention, but associative memory is one of the only mechanisms that computational neuroscientists can agree on broadly. There's been recent work on energy-based models that suggest biologically plausible methods adjacent to attention. [0]

[0] https://arxiv.org/abs/2008.06996


Absolutely, modern NN architectures have been inspired by biological ones, despite their massive differences.

Even in cases like attention, the modern version (the one that actually works in GPT-3, AlphaFold2, etc.) has little in common with either the English word or what we think of as attention. It's a formula with two matmuls and a softmax: softmax(AB)C. In particular, it doesn't necessarily look anywhere at all; it's just a weighted sum of the inputs. Nothing like the hard attention used by the human visual cortex. It's not even that different from a convolution where you allow the weights to be a function of the input.
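
For reference, a minimal NumPy sketch of that softmax(AB)C shape, with A = Q K^T and the usual 1/sqrt(d) scaling (single head, no masking; the shapes are made up):

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # softmax(Q K^T / sqrt(d)) V: each output row is a weighted sum of rows of V
        d = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d))
        return weights @ V

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(5, 8))       # 5 query positions, dimension 8
    K = rng.normal(size=(7, 8))       # 7 key positions
    V = rng.normal(size=(7, 8))       # values aligned with the keys
    print(attention(Q, K, V).shape)   # (5, 8)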

So the inspiration might have come from humans, but the actual architectures have largely come from pure trial and error, with limited, difficult to explain intuition on what tends to work.


Actually self attention is a generalization of convolution:

https://openreview.net/pdf?id=HJlnC1rKPB

,,This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis.''


How do we really know that brains use hard attention?


Eyes only have high resolution in a tiny spot.


Most definitely, what I meant though is how do we know mental attention is not essentially a big softmax (over some context, e.g. short or long term memory)?

The eye physically focusing on one thing at a time seems like a special case (and this fact isn't even true of all animals, e.g. many prey species), and not a part of the brain's attention mechanism.


Oh, I don't know, it very well might! However, the brain's physical structure doesn't look conducive to 4096x4096 matmuls!


Actually, the retinal anisotropy is a feature distinct from attention. You can fixate on a point in a scene and then attend to any other point in the scene. This is indeed how both animal and human attention experiments in the field are set up, to control for eye movements.


"Perhaps the most significant implication of our result for deep learning is that it casts doubt on the common view that it works by automatically discovering new representations of the data, in contrast with other machine learning methods, which rely on predefined features (Bengio et al., 2013). As it turns out, deep learning also relies on such features, namely the gradients of a predefined function, and uses them for prediction via dot products in feature space, like other kernel machines."

Huh. So the implication here is that a deep network can never generalize results to input classes that were not explicit in the training set? And would this result apply for networks trained with something other than gradient-based methods?


So, in the end, it rephrases a statement from "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" [https://arxiv.org/abs/1806.07572], but in a way which is kind of misleading.

The assertion has been known to the community since at least 2018, if not well before.

I find this article and the buzz around it a little awkward.


To be fair, the article you cite is from 2018, and the OP is from 2012 (if the arXiv dates are right).


The OP article was submitted on Mon, 30 Nov 2020 23:02:47 - the arXiv identifier 2012.xyz means it was published in the 12th month of 2020, not 2012.


Oh God that's embarrassing. I really thought that's how arXiv identifiers worked, and have been under the wrong impression for some time.


If, as in this paper, we allow ourselves to set the kernel after seeing the data, then the statement in the title is trivial: if my learning algorithm outputs function f, I can take the kernel K(x,x')=f(x)*f(x').

The result is interesting insofar as the path kernel is interesting, which requires some more thought.


If I'm understanding correctly, it doesn't just set the kernel after seeing the data, but also after training the entire model, because the path kernel can't be defined without the optimization process to define the path.

I can't tell if this paper is a useful insight or not.


Interesting to compare this with [1], which shows that for some concept classes neural networks trained with SGD are much easier to train than kernel methods (i.e., require less data).

In other words, even though these two types of model are in a way equivalent, one can be much easier to train than the other for certain concepts (no free lunch).

[1] https://arxiv.org/abs/2001.04413


Interestingly deep networks aren't ever really learned by gradient descent.

No! Really !

They are learned by interaction with a network selection process of varying degrees of hokeyness that involves decisions being taken by humans about the characteristics of the network, training data, training process and testing processes. At the end -> the model! Gradient descent is v.critical to this, but it's not the only thing going on by a long way.

Also dropout.


This sounds interesting, but a little over my head. Can anyone offer an explanation for a software engineer with no AI/ML background?


Part of the reason this is interesting is that deep learning has to have some kind of inductive bias to perform as well as we've seen (given a learning problem, it's biased toward learning in particular ways). In general though, a neural network can approximate _any_ function, so reining in that complexity and uncovering which functions are actually learnable (or efficiently learnable) by deep learning is an important research direction. This paper says that the functions uncovered by deep learning (with caveats) are precisely those which are close to functions represented by a different learning technique, which is notable because this new class of functions is not "all functions" and because the characterization is explainable in some sense, giving insight into how deep learning works. That connection also winds up having a bunch of other interesting implications that somebody else can cover if they'd like.


“You don’t need a fancy model, you can just find a similar example directly from the training data and get similar results”


If you look at how they actually "translate" the fancy model to the simple one, it requires fully fitting the original model (and keeping track of the evolution of gradients over the training). So it wouldn't make training more efficient, but perhaps it would be useful in inference or probing the characteristics of the original model.


This has always anecdotally appeared to be the case when investigating the predictions of neural nets. Particularly when it comes time to answer the question “what does this model not handle”


Defining ‘similar’ robustly is the meat of the problem, and what we’re finding deep NNs to do well.


Not an explanation, but a benefit could be that SVMs can be evaluated much faster and are more explainable (* Citation needed, I know).


Don’t kernel SVMs need a full pass through the data they were trained on to make predictions? How is that faster?


No, they require a full pass over the support vectors, which are potentially a much smaller set. (That's part of why everyone was so excited about SVMs when they were invented.) The support vectors are the training points with nonzero dual coefficients, i.e. the ones that end up on or inside the margin, sufficiently close to the decision boundary.
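
A minimal scikit-learn sketch of that point (my own toy data; scikit-learn isn't mentioned above): prediction only touches the support vectors, not the full training set.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # circular decision boundary

    clf = SVC(kernel="rbf", C=1.0).fit(X, y)
    # Inference cost scales with the support vectors, not with all 1000 points.
    print(len(X), clf.support_vectors_.shape[0])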


Fair enough, but the number of support vectors for non-trivial problems is still pretty large (as I understand, but could be wrong), e.g. 20-30% of the dataset. Having to iterate over 30% of, say, ImageNet on each batch of predictions seems infeasible.


You only need the "Support Vectors" to make predictions, not the whole dataset.


Neural nets, at the same time, require multiple passes through the data (epochs). If we can train a model in one epoch instead of 10000 epochs, that's a breakthrough!


Epochs are more about the training data than the model... If you've got a big enough dataset, one epoch or less is fine!


True, but it sounds like you’re just shifting computation from training to inference. And I’m not sure that’s a very good trade off to make, you’re likely to predict on much more data than you trained on (e.g. ranking models at google, fb, etc)


Not sure I get your point; both DNNs and SVMs require one forward pass for inference, so there is no difference there. If an SVM model can converge in one epoch, how is that not more efficient than the status quo with DNNs?


For kernel SVMs, one needs to keep around part of the training data (the support vectors) right? With DNNs, after training, all you need are the model parameters. For very large datasets, keeping around even a small part of your training data may not be feasible.

Furthermore, the number of parameters does not (necessarily) grow with the size of the training data, and the parameters can be reused if you get more data, quantized, pruned, etc. There's not really an easy way to do these things with SVMs as far as I understand.


Given that the discussion here seems to stray into conventional kernels and SVMs instead of the generalized version dubbed the "path kernel", it seems a lot of comments or observations here on vanilla SVMs may not be applicable to what's being described in the paper.

It would be neat if this paper were accompanied by minimalist code demonstrating the approximate equivalence on some toy "deep" networks, so misconceptions could be avoided in the peanut gallery here (like the idea that every inference / evaluation requires integrating the path kernel from scratch, which is clearly not proposed; the paper just describes how training the original deep network is in some sense equivalent to integrating the path kernel).


The observation is that every deep net f(x) trained on a dataset of (xi,yi) pairs using gradient descent can be written as f(x)=sum_i a_i K(x,xi) + b, which looks like a kernel machine. But in fact a and b depend on x, and K depends on the entire dataset. So the paper is in fact saying f(x)=sum_i a_i(x) K(x, x1,...,xn, y1,...,yn, xi) + b(x). If you were thinking “my deep net is just a kernel SVM with flavor,” you’d need a LOT of flavor for the equivalence to hold.


From a skim, I think one asterisk which may be a gap between the claim in the title and what's actually shown in the paper is that the theorem focuses on gradient descent-trained models which minimize a loss function which is the sum of loss L(y_i, y*_i) on points from a given dataset. While that's clearly very broad, I think it _doesn't_ include things like GANs, where parts of the model produce fake data to train against.


The claim also applies to GANs as you can simply use a masking function to indicate which inputs were used for each optimization timestep, like the author suggests for stochastic gradient descent in remark 5.


Another thought: are neural nets just a weird analog / floating point kind of probabilistic data structure or lossy compression algorithm?


Since it hasn't been mentioned, Stan Kriventsov has very nicely summed up the paper on his blog: https://www.dl.reviews/2020/12/14/neural-networks-kernel-mac...


It is worth noting that the author is also the author of a very nice book popularizing machine learning, "The Master Algorithm".

https://homes.cs.washington.edu/~pedrod/


>> If gradient descent is limited in its ability to learn representations, better methods for this purpose are a key research direction. Current nonlinear alternatives include predicate invention (Muggleton and Buntine, 1988) and latent variable discovery in graphical models (Elidan et al., 2000).

Hah! Fancy seeing that here! Predicate invention is a main line of my PhD research.

Briefly, predicate invention is the ability of Inductive Logic Programming (ILP) systems to learn their own inductive bias. It is in a sense similar to feature learning or learning-to-learn. ILP systems learn logic programs from examples usually by searching the space of programs defined by a set of sub-programs, called the background knowledge (BK), and a language bias that determines the structure of learned programs. Predicate invention then means learning new BK and language bias to change the program search space while searching it.

The reference in Domingos's article is the first description of the concept, which was for a long time more theoretical than practical: ILP approaches could only perform a limited form of predicate invention, e.g. they could only invent BK programs of fixed structure, or could not invent recursive programs, etc. Things changed in 2013 with a new approach, Meta-Interpretive Learning (MIL). MIL systems are for the first time capable of unconstrained predicate invention, including the invention of mutually recursive programs. Full disclosure: my PhD research is on MIL.

Here are some more recent references on predicate invention in MIL:

Meta-interpretive learning of higher-order dyadic datalog: Predicate invention revisited (IJCAI 2013):

https://www.ijcai.org/Proceedings/13/Papers/231.pdf

Bias reformulation for one-shot function induction (ECAI 2014):

http://www.doc.ic.ac.uk/~shm/Papers/metabias.pdf

Logical minimisation of meta-rules within meta-interpretive learning (ICLP 2015):

https://www.doc.ic.ac.uk/~shm/Papers/minmeta.pdf

Like I say, this is my field of study and, as you can probably tell, I'm very excited about it, so I'm happy to answer questions; email in my profile.


Is this a new finding? I'm not an expert on the deeper mathematical side of ML, but I remember a friend of mine already telling me about something that sounded exactly like this (ca. 2016-17).


The fact that kernel machines can approximate other models isn't new. I think the novel idea is that they have explicitly constructed the "translation" between the arbitrary model to the kernel machine, and it's quite clean.


This could have very good real-world consequences. If this is true, then it makes sense to first attack a problem with SVMs and then, if the results are encouraging, go the deep learning route if needed.


This is a really cool new bit of intuition:

> A key property of path kernels is that they combat the curse of dimensionality by incorporating derivatives into the kernel: two data points are similar if the candidate function’s derivatives at them are similar, rather than if they are close in the input space. This can greatly improve kernel machines’ ability to approximate highly variable functions (Bengio et al., 2005). It also means that points that are far in Euclidean space can be close in gradient space, potentially improving the ability to model complex functions. (For example, the maxima of a sine wave are all close in gradient space, even though they can be arbitrarily far apart in the input space.)
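
A tiny worked example of that parenthetical (my own construction): for f_w(x) = sin(w * x), the parameter gradient is x * cos(w * x), which is (numerically) zero at every maximum, so the maxima coincide in gradient space even though they are far apart in input space.

    import numpy as np

    w = 1.0
    maxima = np.pi / 2 + 2 * np.pi * np.arange(4)  # inputs at successive maxima of sin(w*x)

    grads = maxima * np.cos(w * maxima)            # d/dw of sin(w*x) at those inputs

    print(maxima)  # far apart in input space: ~1.57, 7.85, 14.14, 20.42
    print(grads)   # essentially identical (all ~0) in gradient space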


Does this include transformers?


Watch your stock NVIDIA :)


Why, aren't GPUs pretty useful in training SVMs as well?


SVMs are solved via convex optimization methods which have taken more time to get on the GPU train.

On the other hand there are GPU accelerated SVM training such as: https://github.com/Xtra-Computing/thundersvm

A github or Google search will reveal other GPU accelerated SVM training.


SVMs can also be trained using gradient descent.


Well, and though the author shows an approximate equivalence, which helps us understand a class of models better, it's not obvious that it's preferable to use SVMs of the type described. In particular, it seems like it would often be preferable to deal with model weights (even if they are mathematically a "superposition" of datapoints) rather than to ship around and revisit the whole dataset.


Looking at you, deep learning.


A linear SVM can in turn be expressed as a very shallow neural network. The main difference is that with SVMs you put all your effort into transforming inputs for the model (e.g. all the popular kernels) while with neural networks usually most of the effort goes into clever model architectures.
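
A minimal sketch of that correspondence (my own toy example): a linear SVM is a single linear layer trained with hinge loss plus L2 weight decay.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = np.sign(X[:, 0] - X[:, 1])                 # labels in {-1, +1}
    n = len(X)

    w, b, lr, lam = np.zeros(4), 0.0, 0.1, 0.01

    # One linear "layer", trained by subgradient descent on mean hinge loss + L2 penalty.
    for _ in range(200):
        margins = y * (X @ w + b)
        mask = margins < 1.0                       # only margin violators contribute
        grad_w = lam * w - (y[mask][:, None] * X[mask]).sum(axis=0) / n
        grad_b = -y[mask].sum() / n
        w, b = w - lr * grad_w, b - lr * grad_b

    print(np.mean(np.sign(X @ w + b) == y))        # training accuracy of the "one-layer net"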


There is probably a ton of isomorphism between different models. It may come down to what is easiest to understand and fastest to implement in code.


See also:

A visual proof that neural networks can approximate any function

https://news.ycombinator.com/item?id=19708620


So a "neural network" is actually a type of parameterized mathematical function that can be fit to any curve including higher dimensional surfaces, etc.?



Yes.


Pretty sure it means a complex algorithm can be approximated by a super simple model.


Thing is, gradient descent is not really a complex algorithm.



