The observation is that every deep net f(x) trained on a dataset of (x_i, y_i) pairs using gradient descent can be written as f(x) = sum_i a_i K(x, x_i) + b, which looks like a kernel machine. But the a_i and b actually depend on x, and K depends on the entire training set, so what the paper is really saying is f(x) = sum_i a_i(x) K(x; x_1,...,x_n, y_1,...,y_n; x_i) + b(x). If you were thinking “my deep net is just a kernel SVM with flavor,” you’d need a LOT of flavor for the equivalence to hold.
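To make that x-dependence concrete, here is a minimal numeric sketch, not the paper's own code: all names, the toy net, and the hyperparameters below are my assumptions (a one-hidden-layer tanh net, squared loss, full-batch gradient descent with a small step; the paper's result is the continuous-time limit of this). It accumulates the tangent kernel along the training path for a query point x*, so you can see that K(x*, x_i), the weights a_i(x*), and the "bias" b(x*) = f_init(x*) are all built from the trajectory, which is itself shaped by the whole dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed for illustration).
n = 8
X = rng.normal(size=(n,))
y = np.sin(X)

# One-hidden-layer net f(x) = v . tanh(w * x), parameters (w, v).
h = 16
w = rng.normal(size=h) * 0.5
v = rng.normal(size=h) * 0.5

def f(w, v, x):
    return v @ np.tanh(w * x)

def grad_f(w, v, x):
    # Gradient of the network *output* (not the loss) w.r.t. (w, v).
    z = np.tanh(w * x)
    return np.concatenate([v * (1 - z**2) * x, z])

x_star = 0.3                 # held-out query point x*
lr, steps = 1e-3, 2000

f0_star = f(w, v, x_star)    # b(x*): the model's output at initialization
contrib = np.zeros(n)        # per-example contribution along the path
K_path = np.zeros(n)         # accumulated path kernel K(x*, x_i)

for _ in range(steps):
    g_star = grad_f(w, v, x_star)
    g = np.array([grad_f(w, v, xi) for xi in X])    # (n, 2h)
    dL = np.array([f(w, v, xi) for xi in X]) - y    # dL/df for squared loss
    k = g @ g_star                                  # tangent kernel this step
    contrib += lr * dL * k
    K_path += lr * k
    # Full-batch gradient step on 0.5 * sum_i (f(x_i) - y_i)^2.
    upd = (dL[:, None] * g).sum(axis=0)
    w -= lr * upd[:h]
    v -= lr * upd[h:]

a = -contrib / K_path        # a_i(x*): kernel-weighted path average of -L'
lhs = f(w, v, x_star)
rhs = a @ K_path + f0_star   # sum_i a_i(x*) K(x*, x_i) + b(x*)
print(lhs, rhs)              # agree up to the O(lr) discretization error
```

Rerun this with a different x_star, or perturb a single training pair, and K_path, a, and f0_star all change, which is exactly the point: the "kernel machine" has a kernel, coefficients, and bias that are each functions of the query and of the full training set.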