The problem with looking for a theoretical reason why one method should be chosen over another is that you run into the "No Free Lunch theorem"[1]:

"any two optimization algorithms are equivalent when their performance is averaged across all possible problems"
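
To make that statement concrete, here is a minimal sketch that brute-forces a toy case: every function from a 3-point domain to {0, 1}, searched by three non-repeating strategies. The domain, the strategies, and the "best value found after k evaluations" score are all assumptions chosen for illustration; they are not taken from the theorem's paper.

```python
from itertools import product

DOMAIN = (0, 1, 2)   # toy search space (assumption for illustration)
VALUES = (0, 1)      # possible objective values; lower is better

def fixed_order(order):
    """Strategy that visits points in a fixed order, ignoring what it sees."""
    def choose(history):
        return order[len(history)]
    return choose

def adaptive(history):
    """Strategy whose second pick depends on the first observed value."""
    if not history:
        return 0
    if len(history) == 1:
        return 1 if history[0][1] == 0 else 2
    visited = {x for x, _ in history}
    return next(x for x in DOMAIN if x not in visited)

def best_after(choose, f, k):
    """Lowest value found after k distinct evaluations of f (a tuple indexed by x)."""
    history = []
    for _ in range(k):
        x = choose(history)
        history.append((x, f[x]))
    return min(y for _, y in history)

def average_performance(choose, k):
    """Average of best_after over all 8 possible functions DOMAIN -> VALUES."""
    functions = list(product(VALUES, repeat=len(DOMAIN)))
    return sum(best_after(choose, f, k) for f in functions) / len(functions)

if __name__ == "__main__":
    strategies = {
        "order 0,1,2": fixed_order((0, 1, 2)),
        "order 2,0,1": fixed_order((2, 0, 1)),
        "adaptive": adaptive,
    }
    for k in (1, 2, 3):
        for name, choose in strategies.items():
            print(k, name, average_performance(choose, k))
```

For each k, all three strategies print the same average. Any one strategy can beat another on a particular function, but the advantage disappears once you average over every possible function.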

Once you accept that, you start looking at practical considerations.

Having said that, if you do want to do the math, you might like the course from Oxford/Nando de Freitas (now at DeepMind/Oxford)[2].