
Adaptive Gradient Descent

Explore how Adaptive Gradient Descent (AdaGrad) adjusts the learning rate for each parameter by scaling its gradient by the accumulated sum of past squared gradients. This lesson walks through the algorithm's step-by-step implementation to improve stability and convergence when optimizing non-convex functions.


The stability of gradient descent depends heavily on the step size of the algorithm. Choosing a step size that is too large can lead to oscillations or overshooting of the parameters, whereas one that is too small can lead to slow convergence or increase the chances of getting stuck in a local minimum. The Adaptive Gradient Algorithm (AdaGrad) is an optimization algorithm that adapts the learning rate for each parameter based on its past gradients. With AdaGrad, we avoid the need to manually tune the learning rate during the optimization process.

What is AdaGrad?

The main idea of AdaGrad is to scale the gradient for each parameter by the inverse square root of the sum of squares of that parameter's past gradients. This means that parameters with large accumulated gradients receive smaller updates, while parameters with small accumulated gradients receive larger updates.
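As a quick illustration, the snippet below is a minimal NumPy sketch (the gradient histories and hyperparameter values are made-up numbers, not from the lesson) comparing the effective step size for a parameter with consistently large gradients against one with consistently small gradients.

```python
import numpy as np

# Illustrative gradient histories for two parameters (assumed values).
learning_rate = 0.1
epsilon = 1e-8  # small constant to avoid division by zero
past_gradients = {
    "w1": np.array([5.0, 4.0, 6.0]),   # consistently large gradients
    "w2": np.array([0.1, 0.05, 0.2]),  # consistently small gradients
}
current_gradient = {"w1": 5.0, "w2": 0.1}

for name, history in past_gradients.items():
    G = np.sum(history ** 2)                     # sum of squared past gradients
    step = learning_rate / np.sqrt(G + epsilon)  # per-parameter effective rate
    update = step * current_gradient[name]
    print(f"{name}: effective rate = {step:.4f}, update = {update:.4f}")
```

Even though the current gradient of `w1` is 50 times larger than that of `w2`, the resulting updates end up on a comparable scale, which is exactly the per-parameter balancing effect AdaGrad aims for.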

The update rule of AdaGrad at time $t$ is given as follows:

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}} \, \nabla_{\theta} J(\theta_{t-1})$$

where $G_t = G_{t-1} + (\nabla_{\theta} J(\theta_{t-1}))^2$ is the accumulated sum of squared gradients, $\eta$ is the base learning rate, and $\epsilon$ is a small constant that prevents division by zero.
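The sketch below shows one way this update rule could be implemented with NumPy. The objective $J(\theta) = \theta_1^2 + 10\,\theta_2^2$ and the hyperparameter values are illustrative assumptions, not part of the original lesson.

```python
import numpy as np

def grad_J(theta):
    """Gradient of the illustrative objective J(theta) = theta[0]**2 + 10 * theta[1]**2."""
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

def adagrad(theta0, eta=0.5, epsilon=1e-8, num_steps=100):
    """Run AdaGrad from theta0 and return the final parameters."""
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)  # accumulated sum of squared gradients, G_t
    for _ in range(num_steps):
        g = grad_J(theta)                        # gradient at theta_{t-1}
        G += g ** 2                              # G_t = G_{t-1} + g^2 (element-wise)
        theta -= eta / np.sqrt(G + epsilon) * g  # adaptive per-parameter step
    return theta

print(adagrad([2.0, 1.0]))  # both coordinates move toward the minimum at (0, 0)
```

Note that $G_t$ only grows over time, so the effective learning rate shrinks monotonically for every parameter; this is what makes AdaGrad stable without manual learning-rate tuning, but it can also slow progress late in training.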