Popular Optimization Algorithms
Discover the most frequently used alternatives to gradient descent and the intuition behind them.
Concerns with SGD
This basic version of SGD comes with some limitations and problems that can negatively affect training.
- If the loss function changes quickly in one direction and slowly in another, the updates keep overshooting along the steep direction, producing strong oscillations while making only slow progress along the shallow one, so training becomes very slow (as illustrated in the sketch below).
- If the loss function has a local minimum or a saddle point, SGD is very likely to get stuck there, unable to “jump out” and continue toward a better minimum. This happens because the gradient becomes zero at such points, so the weights are not updated at all.

A saddle point is a point on the surface of the graph of a function where the slopes (derivatives) are all zero but which is not a local extremum (neither a local maximum nor a local minimum) of the function.
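To make these two failure modes concrete, here is a minimal NumPy sketch. The particular loss functions, the learning rate, and the number of steps are illustrative choices, not values from the text. The first loop shows a plain gradient-descent update oscillating on a loss that is much steeper in one direction than the other; the second shows that the update vanishes exactly at a saddle point because the gradient there is zero.

```python
import numpy as np

# --- 1. Ill-conditioned loss: f(w) = 0.5 * (w1^2 + 25 * w2^2) ---
# The loss is 25x steeper along w2 than along w1, so a single learning
# rate either overshoots along w2 or crawls along w1.
def grad_elongated(w):
    return np.array([w[0], 25.0 * w[1]])

lr = 0.07                       # illustrative learning rate
w = np.array([-2.0, 1.0])       # illustrative starting point
for step in range(5):
    w = w - lr * grad_elongated(w)
    print(f"step {step}: w1={w[0]:+.3f}  w2={w[1]:+.3f}")
# w2 flips sign at every step (oscillation along the steep direction),
# while w1 shrinks only slowly (slow progress along the shallow one).

# --- 2. Saddle point: f(w) = w1^2 - w2^2 has a saddle at the origin ---
def grad_saddle(w):
    return np.array([2.0 * w[0], -2.0 * w[1]])

w = np.array([0.0, 0.0])        # start exactly at the saddle point
update = lr * grad_saddle(w)    # the gradient here is (0, 0) ...
print("update at the saddle point:", update)  # ... so the weights never move
```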