Gradient Descent

Discover the math behind gradient descent and deepen your understanding of it by exploring graphical representations.

Background

Let’s look for a better train() algorithm. The job of train() is to find the parameters that minimize the loss, so let’s start by focusing on loss() itself:

def loss(X, Y, w, b):
  # Mean squared error of the model's predictions against the labels
  return np.average((predict(X, w, b) - Y) ** 2)

Look at this function’s arguments. X and Y contain the input variables and the labels, so they never change from one call of loss() to the next. To keep the discussion simple, let’s also temporarily fix b at 0. So now the only variable is w.

How does the loss change as w changes? We put together a program that plots loss() for w ranging from −1 to 4, and draws a green cross on its minimum value. Let’s look at the following graph:

Let’s call it the loss curve. The entire idea of train() is to find that marked spot at the bottom of the curve: the value of w that gives the minimum loss. At that w, the model approximates the data points as well as it can.
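
Such a plot is straightforward to produce. Here is a minimal sketch of what that plotting program might look like, assuming matplotlib is available and pizza.txt is the dataset loaded later in this lesson; predict() is the linear model from the previous lesson, which the formula further below writes as wx + b:

import numpy as np
import matplotlib.pyplot as plt

def predict(X, w, b):
  return X * w + b

def loss(X, Y, w, b):
  return np.average((predict(X, w, b) - Y) ** 2)

X, Y = np.loadtxt("pizza.txt", skiprows=1, unpack=True)

# Compute the loss for many values of w, keeping b fixed at 0
weights = np.linspace(-1, 4, 200)
losses = [loss(X, Y, w, 0) for w in weights]
plt.plot(weights, losses)

# Draw a green cross on the minimum loss
best = np.argmin(losses)
plt.plot(weights[best], losses[best], "gX", markersize=12)

plt.xlabel("w")
plt.ylabel("loss")
plt.show()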

Now imagine that the loss curve is a valley, and a hiker is standing somewhere in it. The hiker wants to reach their basecamp, right where the marked spot is—but it’s dark, and they can only see the terrain right around their feet. To find the basecamp, they can follow a simple approach: walk in the direction of the steepest downward slope. If the terrain does not contain holes or cliffs (and our loss function does not), then each step will take the hiker closer to the basecamp.

To convert that idea into running code, we need to measure the slope of the loss curve. In mathspeak, that slope is called the gradient of the curve. By convention, the gradient at a certain point is an arrow that points directly uphill from that point, like this:

To measure the gradient, we can use a mathematical tool called the derivative of the loss with respect to the weight, written as ∂L/∂w. More formally, the derivative at a certain point measures how the loss L changes at that point for small variations of w. Imagine we increase the weight just a little bit: what will happen to the loss?

In the case of the diagram here, the derivative would be a negative number, meaning that the loss decreases when w increases. If the hiker were standing on the right side of the diagram, the derivative would be positive, meaning that the loss increases when w increases. At the minimum point of the curve (the point marked with the cross), the curve is level, and the derivative is zero.

Note that the hiker would have to walk in the direction opposite to the gradient to approach the minimum. So, in the case of a negative derivative like the one in the picture, they would take a step in the positive direction. The size of the hiker’s steps should also be proportional to the derivative. If the derivative is a big number (either positive or negative), this means the curve is steep, and the basecamp is far away. So the hiker can take big steps with confidence. As they approach the basecamp, the derivative becomes smaller, and so do their steps.
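
To see these ideas in numbers before formalizing them, here is a small sketch that estimates the slope at a given w by nudging it a tiny bit (a finite difference) and then takes one hiker step. The data points, the starting w, the step size, and epsilon are all made up for illustration:

import numpy as np

def predict(X, w, b):
  return X * w + b

def loss(X, Y, w, b):
  return np.average((predict(X, w, b) - Y) ** 2)

# Made-up data points, purely for illustration
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.1, 4.2, 5.9, 8.1])

w, lr, epsilon = 0.5, 0.01, 1e-6

# Estimate the slope of the loss curve at w with a tiny nudge
slope = (loss(X, Y, w + epsilon, 0) - loss(X, Y, w, 0)) / epsilon
print("slope:", slope)  # negative here, so downhill lies toward larger w

# Step in the direction opposite to the slope, proportionally to its size
w_next = w - lr * slope
print(loss(X, Y, w, 0), "->", loss(X, Y, w_next, 0))  # the loss decreases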

The algorithm described above is called gradient descent or GD for short. Implementing it requires a tiny bit of math.

The math behind gradient descent

Let’s discuss how to crack gradient descent with math. First, let’s rewrite the mean squared error loss in mathematical notation:

\large{L = \frac{1}{m} \sum_{i=1}^{m} \left((wx_i + b) - y_i\right)^2}

In this notation, the ∑ symbol stands for “sum.” Also, the m in the preceding formula stands for “the number of examples.”

In English, the formula reads as follows:

Add the squared errors of all the examples, from example number one to example number m, and divide the result by the number of examples.

Remember that the x’s and y’s are constants: they are the values of the input variable and the labels, respectively. The m is also a constant, because the number of examples never changes. So is b, because we temporarily fixed it at zero. We’ll reintroduce b in a short while, but for the moment, the only value that varies in this formula is w.

Now we have to calculate the direction and size of the gradient, that is, the derivative of L with respect to w. We’ll use calculus to calculate this derivative:

\large{\frac{\partial L}{\partial w} = \frac{2}{m} \sum_{i=1}^{m} x_i \left((wx_i + b) - y_i\right)}

The derivative of the loss is similar to the loss itself, except that the power of 2 is gone, each element of the sum has been multiplied by x, and the final result is multiplied by 2. (That is exactly what the chain rule produces when we differentiate each squared term.) We can plug any value of w into this formula and calculate the gradient at that point.

Here’s the same formula converted to code where b is fixed at 0:

def gradient(X, Y, w):
  # Derivative of the loss with respect to w, with b fixed at 0
  return 2 * np.average(X * (predict(X, w, 0) - Y))
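
As a quick sanity check (our own addition, not part of this lesson’s code), we can compare the analytic gradient with the finite-difference estimate from earlier. This snippet assumes loss() and gradient() are in scope and that X and Y hold the data:

w, epsilon = 1.5, 1e-6
numeric = (loss(X, Y, w + epsilon, 0) - loss(X, Y, w, 0)) / epsilon
analytic = gradient(X, Y, w)
print(analytic, numeric)  # expect the two values to be nearly identical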

Now that we have a function to calculate the gradient, let’s rewrite train() to do gradient descent.

Downhill riding

Here is train(), updated for gradient descent:

def train(X, Y, iterations, lr):
  w = 0
  for i in range(iterations):
    print("Iteration %4d => Loss: %.10f" % (i, loss(X, Y, w, 0)))
    # Step in the direction opposite to the gradient
    w -= gradient(X, Y, w) * lr
  return w

This version of train() is much terser than the previous one. With gradient descent, we do not need any ifs: we just initialize w and then repeatedly step in the direction opposite to the gradient. Remember, the gradient points uphill, and we want to go downhill. The lr hyperparameter is still there, but now it tells us how large each step should be in proportion to the gradient.

When we do gradient descent, we have to decide when to stop it. The old version of train() returned after a maximum number of iterations, or when it failed to decrease the loss any further, whichever came first. With gradient descent, the loss could in theory decrease forever, inching toward the minimum in smaller and smaller steps, without ever quite reaching it. So, when should we stop taking those steps?

We could decide to stop when the gradient becomes small enough, because that means we are very close to the minimum. This code, however, follows a less refined approach: when we call train(), we tell it how many iterations to run. More iterations lead to a lower loss, but since the loss decreases progressively more slowly, at a certain point we can just decide that the additional precision is not worth the wait.
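
For comparison, here is a minimal sketch of that more refined stopping rule. The function name train_until_flat and the default tolerance are our own choices, not part of this lesson’s code:

def train_until_flat(X, Y, lr, tolerance=0.0001):
  w = 0
  # Keep stepping until the curve is nearly level under our feet
  while abs(gradient(X, Y, w)) > tolerance:
    w -= gradient(X, Y, w) * lr
  return w

Note that this variant still needs a sensible lr: if the steps are too large, the hiker overshoots the basecamp back and forth, and the gradient never drops below the tolerance.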

We’ll learn to choose good values for hyperparameters such as iterations and lr in the chapter “Let’s Do Development.”

For now, after trying a bunch of different values, we end up with iterations=100 and lr=0.001, which seem to result in a low enough, precise enough loss:

Run the code and notice the change after each iteration:

import numpy as np

X, Y = np.loadtxt("pizza.txt", skiprows=1, unpack=True)
w = train(X, Y, iterations=100, lr=0.001)
print("\nw=%.10f" % w)

The loss drops at each iteration. After 100 iterations, GD gets so close to the minimum that we cannot see the difference between the last two losses. It seems like our gradient descent algorithm is doing its job.