Thanks for the loss function reference! I wonder if there’s something waiting to be discovered here about doing gradient descent but only taking each step with some probability. Definitely something to think about; I can’t imagine this idea hasn’t been explored before.
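Just to make the idea concrete, here’s a rough sketch of what I have in mind (names and parameters are my own, and the toy quadratic loss is just for illustration): ordinary gradient descent, except a coin flip decides whether each proposed step is actually applied.

```python
import numpy as np

rng = np.random.default_rng(0)

def probabilistic_gd(grad, x0, lr=0.1, p=0.5, steps=200):
    """Gradient descent where each step is applied only with probability p.

    grad: function returning the gradient at a point
    p:    probability of accepting each proposed step
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        if rng.random() < p:  # flip a coin before taking the step
            x = x - lr * grad(x)
    return x

# toy example: f(x) = ||x||^2, whose gradient is 2x
x_final = probabilistic_gd(lambda x: 2 * x, x0=[3.0, -2.0])
```

On a convex toy problem like this, skipping steps at random just slows convergence on average; whether it does anything more interesting on nonconvex losses is exactly the open question.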
Thanks a lot for the insightful comments; I now see that work in a very new light, even after knowing about it for years!