
Limitations of Gradient Descent

Understand the limitations of gradient descent when applied to non-convex optimization problems in machine learning. Learn how local optima, intractability with large datasets, sensitivity to starting points, and learning rate choices affect convergence and model performance. This lesson helps you evaluate where gradient descent may fall short and why adjustments are necessary.

We have seen that gradient descent works well for convex optimization, where any local optimum is also the global optimum. In this chapter, we look at some of the limitations of gradient descent and how to address them.

Intractability

Consider a machine learning problem where we want to minimize the discrepancy between the model prediction $f_\theta(x_i)$ and the ground-truth label $y_i$, as follows:

$$J(\theta) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\bigl(f_\theta(x_i),\, y_i\bigr)$$

Here, $\mathcal{L}$ is an arbitrary loss function, such as cross-entropy, that measures the discrepancy between the predicted and the ground-truth value, and $N$ is the number of training examples. The gradient descent update for the objective above at any time step $t$ can be written as follows:

$$\theta^{(t+1)} = \theta^{(t)} - \eta\, \nabla_\theta J\bigl(\theta^{(t)}\bigr)$$

where $\eta$ is the learning rate.
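As a concrete illustration of this update rule, here is a minimal full-batch gradient descent sketch in NumPy. It assumes, purely for illustration, a linear model $f_\theta(x) = \theta^\top x$ and a squared-error loss $\mathcal{L}$ so that the gradient has a closed form; the helper names (`predict`, `loss`, `grad`) are our own. The key detail is that every single update aggregates gradients over all $N$ training examples.

```python
import numpy as np

# Minimal full-batch gradient descent sketch (illustrative assumptions:
# linear model f_theta(x) = theta @ x, squared-error loss L).

def predict(theta, X):
    return X @ theta  # f_theta(x_i) for every example at once

def loss(theta, X, y):
    # J(theta) = (1/N) * sum_i L(f_theta(x_i), y_i) with squared-error L
    return np.mean((predict(theta, X) - y) ** 2)

def grad(theta, X, y):
    # Aggregates the per-example gradients over ALL N examples;
    # this full pass over the data is repeated at every step.
    N = X.shape[0]
    return (2.0 / N) * X.T @ (predict(theta, X) - y)

def gradient_descent(X, y, lr=0.1, steps=100):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # theta^(t+1) = theta^(t) - eta * grad J(theta^(t))
        theta = theta - lr * grad(theta, X, y)
    return theta

# Toy usage: 1,000 examples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_theta + 0.01 * rng.normal(size=1000)
theta_hat = gradient_descent(X, y)
print(loss(theta_hat, X, y))  # should be close to zero
```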

To compute the gradient $\nabla_\theta J(\theta)$, we need to aggregate the gradients $\nabla_\theta \mathcal{L}(f_\theta(x_i), y_i)$ ...