Stochastic gradient descent is a "crappy" method that works. In theory it ought to converge slowly, and it involves annoying tuning parameters because you want the learning rate to decay with each iteration. In practice, though, it works quite well: a large number of successful neural-net/deep-learning implementations use SGD rather than second-order (quasi-Newton) or even momentum methods.
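To make that concrete, here is a minimal pointwise-SGD sketch with a decaying learning rate. The least-squares loss, the 1/t decay schedule, and all constants are illustrative assumptions, not anyone's specific recipe.

```python
import numpy as np

def sgd(grad, w0, data, lr0=0.1, decay=1e-3, epochs=10, seed=0):
    """grad(w, row) returns the gradient of the per-point loss at parameters w."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    t = 0
    for _ in range(epochs):
        for row in data[rng.permutation(len(data))]:   # visit points in random order
            lr = lr0 / (1.0 + decay * t)               # step size shrinks over time
            w -= lr * grad(w, row)
            t += 1
    return w

# Example: least-squares regression, one data point per update (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
data = np.hstack([X, y[:, None]])

def lsq_grad(w, row):
    x, target = row[:-1], row[-1]
    return 2.0 * (x @ w - target) * x

w_hat = sgd(lsq_grad, np.zeros(3), data)
```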
Pointwise SGD tends to be uncommon and too noisy; mini-batch sampling is more common. On each iteration, you train on a random subset (say, 10,000 points chosen with or without replacement from the full dataset). If you have fewer than 10k data points, you generally don't need stochastic algorithms at all.
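A minimal sketch of the mini-batch idea, assuming a user-supplied `grad_batch` function that returns the average gradient over a batch; the batch size, learning rate, and step count are illustrative, not recommendations.

```python
import numpy as np

def minibatch_sgd(grad_batch, w0, X, y, batch_size=10_000, lr=0.01,
                  steps=1_000, replace=False, seed=0):
    """Each step: sample a batch (with or without replacement), take one gradient step."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = X.shape[0]
    for _ in range(steps):
        idx = rng.choice(n, size=min(batch_size, n), replace=replace)
        w -= lr * grad_batch(w, X[idx], y[idx])
    return w
```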
Does gradient descent apply to problems with many local extrema, or is it restricted to purely globally convex problems?
There are a variety of approaches:
* Multiple starting points. Often gets the job done and finds a good local minimum. The idea is that better (lower) local minima usually have larger basins of attraction ("drainage regions"), a sort of "deep is usually broad" hypothesis that is often, but not always, true.
* Momentum. Instead of moving with a velocity proportional to the current gradient, you move with a velocity proportional to a smoothed average of previous gradients (see the sketch after this list). This makes it far less likely that you end up at a bad local minimum or some other stationary point (at a local maximum or saddle point, the gradient is zero).
* Other stochastic devices such as simulated annealing or (for neural nets) drop-out.
* Dimensionality reduction and regularization (weight decay) don't necessarily solve the "multiple local minima" problem but often make the search space smaller and the problem, therefore, more tractable. (The main purpose of DR and regularization, however, is to improve generalization performance.)
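To illustrate the momentum bullet above: here is a minimal sketch, assuming a user-supplied `grad(w)` function; the smoothing factor `beta` and learning rate are illustrative assumptions. Because the velocity is an exponentially smoothed average of past gradients, a single zero-gradient point (saddle point, local maximum) does not stop the search.

```python
import numpy as np

def sgd_momentum(grad, w0, lr=0.01, beta=0.9, steps=1_000):
    """Momentum update: move with a smoothed average of past gradients."""
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * grad(w)   # smoothed gradient ("velocity")
        w -= lr * v                              # move with the velocity, not the raw gradient
    return w
```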