Nesterov Momentum
Explore how the Nesterov momentum method improves gradient descent for non-convex problems by maintaining a velocity vector that helps escape local optima. Learn to implement this technique using the Rosenbrock function and visualize its convergence to the global optimum.
Need for momentum
As shown in the figure below, imagine a ball falling down a valley. If its momentum (mass × velocity) is large enough, the ball can roll through shallow dips along the way instead of settling in them, eventually coming to rest at the lowest point of the valley.
When applied to non-convex optimization, gradient descent is not guaranteed to converge to the globally optimal solution. It often gets stuck at a local optimum because the gradient vanishes there, so further updates make no progress.
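A minimal sketch makes this failure mode concrete. The tilted double-well function, starting point, and learning rate below are illustrative choices, not part of this lesson:

```python
def f(x):
    # Tilted double-well: a local minimum near x ≈ 0.96 and a
    # deeper global minimum near x ≈ -1.03 (illustrative example).
    return x**4 - 2 * x**2 + 0.3 * x

def grad(x):
    return 4 * x**3 - 4 * x + 0.3

x, lr = 2.0, 0.01          # start to the right of the local minimum
for _ in range(1000):
    x -= lr * grad(x)      # vanilla gradient descent update

print(f"x = {x:.3f}, f(x) = {f(x):.3f}")
# Prints x ≈ 0.96: the gradient vanishes at the local minimum,
# so the iterate never reaches the global minimum near x ≈ -1.03.
```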
Similar to the “ball falling down a valley” situation above, we also need a sense of momentum in non-convex optimization to escape a local optimum. The Nesterov momentum is a popular technique that mimics this behavior by maintaining a velocity vector that is an exponential moving average of negative gradients.
How does the Nesterov momentum work?
At every step, the method updates the velocity vector and then moves the parameters in the direction of that velocity. In simple terms, the velocity vector is an average direction that can still drive updates even when the current gradient is zero. The distinctive feature of the Nesterov variant is that the gradient is evaluated at a "look-ahead" position, i.e., where the current velocity would carry the parameters.
The Nesterov momentum update at a time step $t$ is given as:

$$v_{t+1} = \mu v_t - \eta \, \nabla f(x_t + \mu v_t)$$

$$x_{t+1} = x_t + v_{t+1}$$

where $\mu \in [0, 1)$ is the momentum coefficient, $\eta > 0$ is the learning rate, $v_t$ is the velocity vector, and $\nabla f(x_t + \mu v_t)$ is the gradient evaluated at the look-ahead point $x_t + \mu v_t$.
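As a concrete illustration, here is a minimal NumPy sketch of this update applied to the Rosenbrock function mentioned above. The starting point and hyperparameters ($\eta$, $\mu$, iteration count) are illustrative choices, not prescribed by the lesson:

```python
import numpy as np

def rosenbrock(p, a=1.0, b=100.0):
    # f(x, y) = (a - x)^2 + b * (y - x^2)^2, global minimum at (a, a^2) = (1, 1)
    x, y = p
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_grad(p, a=1.0, b=100.0):
    x, y = p
    return np.array([
        -2.0 * (a - x) - 4.0 * b * x * (y - x**2),
        2.0 * b * (y - x**2),
    ])

def nesterov_momentum(grad_fn, x0, lr=2e-4, mu=0.9, steps=20_000):
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                 # velocity vector, v_0 = 0
    for _ in range(steps):
        g = grad_fn(x + mu * v)          # look-ahead gradient at x_t + mu * v_t
        v = mu * v - lr * g              # v_{t+1} = mu * v_t - eta * gradient
        x = x + v                        # x_{t+1} = x_t + v_{t+1}
    return x

x_star = nesterov_momentum(rosenbrock_grad, x0=[-1.2, 1.0])
print(x_star, rosenbrock(x_star))        # should approach the optimum (1.0, 1.0)
```

Note that, compared with standard (heavy-ball) momentum, the only change is where the gradient is computed: evaluating it at the look-ahead point $x_t + \mu v_t$ lets the method correct the velocity before overshooting.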