For everyone saying "oh wow, we can go back to SVMs now, huge speedups, etc" - that's more than a little premature. This is a purely formal equivalence, not a computationally useful one.
The math here is pretty much first-year undergraduate level calculus, and it's worth going through section 2 since it's quite clearly written (thanks Prof Domingos).
Essentially, what the author shows is that any model trained with "infinitesimally small step size, full-batch gradient descent" (i.e. a model following a gradient flow) can be written in the "kernelized" form
y(x) = \sum_{(x_i, y_i) \in L} a_i K(x, x_i) + b.
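(For reference, and going from my reading of section 2, so the exact normalization may be off: K here is the paper's "path kernel", the tangent kernel integrated along the curve c(t) that the weights trace out during training,

K(x, x_i) = \int_{c(t)} \nabla_w y(x) \cdot \nabla_w y(x_i) dt,

and a_i is essentially the loss derivative L'(y_i*, y_i) averaged along that same path, while b is the initial model's output.)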
The intuition most people have for SVMs is that the constants a_i are, well, constant, that they are sparse (most are zero, so only the support vectors matter), and that the kernel function K(x, x_i) is cheap to evaluate (partly why it's called the 'kernel trick').
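For contrast, here's what that classic picture looks like in code; the support vectors, coefficients, and kernel below are made-up toy values, not from any real trained SVM:

    import numpy as np

    # Classic SVM-style prediction: fixed coefficients a_i (sparse: only the
    # support vectors survive training) and a cheap closed-form kernel.
    def rbf(x, xi, gamma=1.0):
        return np.exp(-gamma * np.sum((x - xi) ** 2))

    support_vectors = np.array([[0.0, 1.0], [1.0, -1.0]])  # the only points kept
    a = np.array([0.7, -0.4])  # frozen once training is done
    b = 0.1

    def y(x):
        return sum(a_i * rbf(x, x_i) for a_i, x_i in zip(a, support_vectors)) + b

    print(y(np.array([0.5, 0.5])))  # one prediction = a handful of cheap kernel evals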
However, none of those properties apply here, which is why this isn't particularly exciting computationally.
The "trick" is that both a_i and K(x, x_i) involve path integrals along the gradient flow for the input x, and so doing a single inference is approximately as expensive as training the model from scratch (if I understand this correctly).