Gradient Descent

Learn about the gradient descent algorithm and the mathematics behind it.

We learned about the binary search algorithm. One of its benefits is that it can optimize a function even if we don't know its formula. On the other hand, the method requires the function to be monotonic, and any constraint must split the x-axis into two separable, contiguous regions. We can't rely on binary search to solve all optimization problems because these conditions are too restrictive, and it's not clear how to extend the method to multivariate problems.

In this lesson, we’re going to learn another method that requires different conditions and can be naturally extended to solve problems with many variables. This method is called the gradient descent algorithm. We’ll learn what gradient descent is, its advantages and disadvantages, when to use it, and how to implement it using Python.

Gradient descent is widely used in many fields, such as robotics and video game development. But undoubtedly, the main application of gradient descent (and its variants) is in machine learning. Many algorithms, like linear regression and neural networks, can learn because a gradient descent variant carries out the process of automatic learning (which is just solving an optimization problem).

Let’s master gradient descent then!

Gradient descent requirements

As with binary search, we can apply gradient descent only under certain conditions. If some of these conditions aren’t met, then the algorithm isn’t guaranteed to converge. That means that we might not find the minimum using gradient descent.

First of all, the target function should be differentiable and convex. A differentiable function is a function whose first-order derivatives are all defined over its whole domain. These tend to be smooth functions when we draw them, without corners or discontinuities (points where the function isn't defined or jumps). Nondifferentiable functions, on the other hand, always have some kind of discontinuity, corner, or cusp.

Differentiable functions are smooth and continuous

In the previous figure, the function in the middle has a corner at x = 0, and the function on the right is not defined at that same point. That's why those functions are not differentiable at x = 0.
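
To make the idea of a corner concrete, here's a minimal Python sketch (not part of the lesson's implementation) that approximates the left and right derivatives numerically. For a smooth function like x², both one-sided estimates agree at x = 0; for |x|, they disagree, which signals the corner.

```python
def one_sided_derivatives(f, x, h=1e-6):
    """Approximate the left and right derivatives of f at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right

# Smooth function: both estimates are close to 0 at x = 0
print(one_sided_derivatives(lambda x: x**2, 0.0))

# |x| has a corner at x = 0: the estimates disagree (about -1 and 1)
print(one_sided_derivatives(abs, 0.0))
```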

A convex function is a function that always curves upward. If we have a convex function of one variable and we draw the curve that describes it, then the segment connecting any two points on that curve always lies on or above the curve and never crosses below it.

Concave and convex functions

There’s a mathematical way of defining convexity. We’re going to include it here for the sake of formality, but don’t worry if it seems too complicated. A function f is convex if it has the following property:

f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)

for every pair of points x_1, x_2 in the function's domain and every \lambda \in [0, 1].
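
As an illustrative sanity check (not a proof), the following Python sketch tests this inequality at randomly sampled points; the helper name and the test interval are our own choices. A convex function such as x² never violates the inequality, while a non-convex one such as x³ − x does for some choices of points.

```python
import random

def looks_convex(f, a, b, trials=10_000):
    """Numerically test the convexity inequality at random points in [a, b].

    This only checks sampled points; it is not a proof of convexity.
    """
    for _ in range(trials):
        x1, x2 = random.uniform(a, b), random.uniform(a, b)
        lam = random.uniform(0.0, 1.0)
        lhs = f(lam * x1 + (1 - lam) * x2)
        rhs = lam * f(x1) + (1 - lam) * f(x2)
        if lhs > rhs + 1e-9:  # small tolerance for floating-point error
            return False
    return True

print(looks_convex(lambda x: x**2, -10, 10))      # True: x**2 is convex
print(looks_convex(lambda x: x**3 - x, -10, 10))  # False: not convex
```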