But the authors successfully show how to train CNNs, RNNs, and LSTM RNNs without backpropagation, i.e., every layer learning only via local rules, without having to wait for gradients to be backpropagated to all layers before the entire model can move on to the next sample.
As I understand it, this work has paved a path for training very large networks in massively parallel, fully distributed hardware -- in the not too distant future.
> But the authors successfully show how to train CNNs, RNNs, and LSTM RNNs without backpropagation, i.e., every layer learning only via local rules
The basic version of this was shown in [1], as mentioned by the ICLR review:
"Specifically, the original paper by Whittington & Bogacz (2017) demonstrated that for MLPs, predictive coding converges to backpropagation using local learning rules."
That Whittington & Bogacz didn't extend to complex ANN architectures, but it would have been very surprising if what they showed didn't extend to other ANNs.
OTOH, while local-only updates are great, they don't help much if the overall algorithm needs vastly more iterations. Again, from the ICLR review: "The increase in computational cost (of 100x) is mentioned quite late and seems to be glossed over a bit."
In my view, there's a big difference between successfully training, say, LSTM RNNs and successfully training "vanilla" MLPs.
This work opens the door for using new kinds of massively parallel "neuromorphic" hardware to implement orders of magnitude more layers and units, without requiring greater communications bandwidth between layers, because the model no longer needs to wait until gradients have back-propagated from the last to the first layer before moving on to the next sample.
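To make that locality point concrete: in a scheme like the sketch above, each hidden layer's update reads only its own state and its immediate neighbors', so in principle every layer could sit on its own device and exchange only activations and errors with adjacent layers. A toy restructuring of the same update (again, just my illustration, not anything from the paper):

```python
import numpy as np

def local_layer_update(x_prev, x_here, x_next, W_in, W_out, lr_x=0.1):
    """One relaxation step for a hidden layer, using only adjacent-layer signals."""
    f, df = np.tanh, lambda a: 1.0 - np.tanh(a) ** 2
    eps_here = x_here - W_in @ f(x_prev)      # this layer's prediction error
    eps_next = x_next - W_out @ f(x_here)     # the next layer's prediction error
    return x_here + lr_x * (-eps_here + df(x_here) * (W_out.T @ eps_next))

# Each call needs data only from layers l-1, l, and l+1, so updates for
# different layers could run concurrently on separate devices.
```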
Scaling backpropagation to GPT-3 levels and beyond (think trillions of dense connections) is very hard -- it requires a lot of complicated plumbing and bookkeeping.
Wouldn't you want to be able to throw 100x, 1000x, or even 1Mx more fully distributed computing power at problems? This work has paved a path pointing in that direction :-)