Adaptive Gradient Descent
Explore AdaGrad, the Adaptive Gradient algorithm, which automatically adjusts the learning rate for each parameter based on past gradients. Understand how it stabilizes convergence in non-convex optimization, and see its implementation and visualization using NumPy and Matplotlib.
The stability of gradient descent depends heavily on the algorithm's step size. A step size that is too large can cause oscillations or overshooting of the parameters, whereas one that is too small can lead to slow convergence or increase the chances of getting stuck in local minima. The Adaptive Gradient Algorithm (AdaGrad) is an optimization algorithm that adapts the learning rate for each parameter based on its past gradients. Using AdaGrad, we avoid the need to manually tune the learning rate during the optimization process.
What is AdaGrad?
The main idea of AdaGrad is to scale the gradient for each parameter by the inverse square root of the sum of squares of the past gradients. This means that parameters with large gradients will have smaller updates, and parameters with small gradients will have larger updates.
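This scaling can be illustrated with a small NumPy sketch (the accumulated values below are illustrative, not taken from any particular run): two parameters receive the same current gradient, but the one with a larger history of squared gradients gets the smaller update.

```python
import numpy as np

# Illustrative accumulated sums of squared past gradients for two parameters:
# the first has a history of large gradients, the second of small ones.
G = np.array([100.0, 0.01])
g = np.array([1.0, 1.0])  # same current gradient for both parameters
lr = 0.1

# AdaGrad scales each update by the inverse square root of the accumulated sum.
step = lr * g / np.sqrt(G + 1e-8)
print(step)  # the parameter with the large gradient history gets the smaller step
```

Here the first parameter's update is 100 times smaller than the second's, even though both see the same current gradient.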
The update rule of AdaGrad at a time step $t$ is:

$$G_t = G_{t-1} + g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}}\, g_t$$

where $g_t = \nabla_\theta J(\theta_t)$ is the gradient of the objective at step $t$, $G_t$ is the running sum of squared gradients, $\eta$ is the base learning rate, and $\epsilon$ is a small constant (e.g., $10^{-8}$) that prevents division by zero. All operations are element-wise, so each parameter effectively gets its own learning rate $\eta / \sqrt{G_t + \epsilon}$.
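The update rule translates directly into NumPy. The sketch below is a minimal implementation; the quadratic objective, the starting point, and the hyperparameter values are illustrative choices, not prescribed by the algorithm.

```python
import numpy as np

def adagrad(grad_fn, theta0, lr=0.1, eps=1e-8, n_steps=100):
    """Minimize a function with AdaGrad, given a function that returns its gradient."""
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)      # per-parameter sum of squared gradients
    history = [theta.copy()]
    for _ in range(n_steps):
        g = grad_fn(theta)
        G += g ** 2               # accumulate squared gradients element-wise
        theta = theta - lr * g / np.sqrt(G + eps)  # scaled, per-parameter step
        history.append(theta.copy())
    return theta, np.array(history)

# Example: minimize f(x, y) = x^2 + 10*y^2, whose minimum is at the origin.
grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
theta_min, path = adagrad(grad, [3.0, 2.0], lr=1.0, n_steps=500)
print(theta_min)  # close to the minimum at [0, 0]
```

The returned `path` array holds every iterate, which makes it easy to plot the optimization trajectory with Matplotlib (e.g., `plt.plot(path[:, 0], path[:, 1])`) and visualize how the step sizes shrink as gradients accumulate.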