I seem to recall that there was a recent theory paper that got a best paper award, but I can't find it.
If I remember correctly, their counter-intuitive result was that big overparameterized models could learn more efficiently, and were less likely to get trapped in poor regions of the optimization space.
[This is also similar to how introducing multimodal training gives an escape hatch to get out of tricky regions.]
So with this hand-wavy argument, it might be the case that two-phase training is needed: a large, overcomplete pretraining phase focused on assimilating all the knowledge, followed by a second phase that makes the model compact. Alternatively, there could be a single hyperparameter that controls overcompleteness vs. compactness, and you adjust it over training.
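If that knob really is just one hyperparameter, a minimal sketch could look like the following (my own toy construction in PyTorch, not from the paper I'm half-remembering): train a deliberately wide model, keep the sparsity penalty at zero for the first half of training, then ramp an L1 coefficient so the second half squeezes the network toward compactness. All names and values here (WIDTH, compactness_coeff, the 1e-4 ceiling) are illustrative assumptions.

```python
import torch
import torch.nn as nn

WIDTH = 4096  # deliberately overcomplete hidden width
model = nn.Sequential(nn.Linear(784, WIDTH), nn.ReLU(), nn.Linear(WIDTH, 10))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

def compactness_coeff(step, total_steps, max_coeff=1e-4):
    # Phase 1 (first half): no sparsity pressure -- pure knowledge assimilation.
    # Phase 2 (second half): linearly ramp an L1 penalty to squeeze the model.
    half = total_steps // 2
    if step < half:
        return 0.0
    return max_coeff * (step - half) / half

def loss_fn(logits, targets, step, total_steps):
    task_loss = nn.functional.cross_entropy(logits, targets)
    l1 = sum(p.abs().sum() for p in model.parameters())
    return task_loss + compactness_coeff(step, total_steps) * l1
```

Magnitude-pruning the near-zero weights at the end would recover the explicit two-phase version of the same idea.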
I don't see that as counter-intuitive at all. If you have a barrier in your cost function in a 1d model, you have to cross over it no matter what. In 2d it could be just a mound that you can go around. More dimensions mean more ways to go around.
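A toy cost function (my own construction, just to make the geometric point concrete):

$$f_{1d}(x) = (x^2 - 1)^2, \qquad f_{2d}(x, y) = (x^2 - 1)^2 \, e^{-y^2}$$

In 1d the two minima at x = ±1 are separated by a barrier of height 1 at x = 0, and any path between them has to climb it. In 2d the same barrier sits along y = 0, but it decays as |y| grows, so the two minima are connected by a nearly flat route that goes around the mound instead of over it. Every extra dimension adds another candidate detour of this kind.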
This is also how the human brain works. A young baby will have something more similar to a fully connected network, whereas an elderly brain will be more of a sparse, minimally connected feed-forward net. The questions are (1) can this be adjusted dynamically in silico, and (2) if we succeed in that, does fine-tuning still work?
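On (1), something like this already exists in the dynamic sparse training literature (e.g. RigL, Evci et al. 2020): keep a binary connectivity mask and periodically prune weak connections while growing new ones where the gradients suggest they would help. A rough sketch, with illustrative drop fraction and scoring choices of my own:

```python
import torch

def update_mask(weight, grad, mask, drop_frac=0.1):
    """Deactivate the weakest active weights; activate inactive ones with the largest gradients."""
    w, g, m = weight.view(-1), grad.view(-1), mask.view(-1)
    n_swap = int(drop_frac * m.sum().item())
    if n_swap == 0:
        return mask
    # Score against the *current* mask so a weight dropped in this update
    # is not immediately regrown by the grow step below.
    score_drop = torch.where(m > 0, w.abs(), torch.full_like(w, float("inf")))
    score_grow = torch.where(m > 0, torch.full_like(g, -1.0), g.abs())
    m[torch.topk(score_drop, n_swap, largest=False).indices] = 0.0  # prune weakest active
    m[torch.topk(score_grow, n_swap, largest=True).indices] = 1.0   # grow most promising inactive
    return mask
```

You would call this every few hundred steps and multiply each weight matrix by its mask after every optimizer step; whether fine-tuning still works on the resulting sparse net (question 2) is exactly the kind of thing you'd have to test empirically.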