Gradient Descent
Discover the math behind gradient descent and deepen your understanding through graphical representations.
Background
Let’s look for a better train() algorithm. The job of train() is to find the parameters that minimize the loss, so let’s start by focusing on loss() itself:
def loss(X, Y, w, b):
    # mean squared error between the model's predictions and the labels
    return np.average((predict(X, w, b) - Y) ** 2)
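This function relies on NumPy and on a predict() function defined earlier. As a reminder of the working assumption used throughout this section, here is a minimal sketch of the linear model that predict() is presumed to implement:

import numpy as np

def predict(X, w, b):
    # assumed linear model: scale the inputs by the weight and add the bias
    return X * w + b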
Look at this function’s arguments. The X and Y arguments contain the input variables and the labels, so they never change from one call of loss() to the next. To make the discussion simple, let’s also temporarily fix b at 0. So now the only variable is w.
How does the loss change as w changes? To find out, we put together a program that plots loss() for a range of w values and draws a green cross on its minimum value. A sketch of that program, and the graph it produces, follow below.
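Here is a minimal sketch of such a plotting program. It assumes the predict() and loss() functions above, that X and Y have already been loaded, and that Matplotlib is available; the sampled range of w is an illustrative choice.

import numpy as np
import matplotlib.pyplot as plt

# sample the loss over a range of w values, keeping b fixed at 0
weights = np.linspace(-1.0, 4.0, 200)
losses = np.array([loss(X, Y, w, 0) for w in weights])

# plot the loss curve and mark its minimum with a green cross
plt.plot(weights, losses)
plt.plot(weights[np.argmin(losses)], losses.min(), "gx", markersize=12)
plt.xlabel("w")
plt.ylabel("loss")
plt.show()

Running it produces a plot of the loss against w, with the green cross marking the lowest point.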
Let’s call it the loss curve. The entire idea of train() is to find the marked spot at the bottom of the curve. That spot is the value of w that gives the minimum loss. At that value, the model approximates the data points as well as it can.
Now imagine that the loss curve is a valley, and there is a hiker standing somewhere in this valley. The hiker wants to reach their basecamp, right where the marked spot is, but it’s dark, and they can only see the terrain right around their feet. To find the basecamp, they can follow a simple approach: walk in the direction of the steepest downward slope. If the terrain does not contain holes or cliffs (and our loss function does not), then each step will take the hiker closer to the basecamp.
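To make the analogy concrete, here is a rough sketch of the hiker’s strategy in code. The names slope_of_loss() and walk_downhill(), the step size, and the iteration count are all illustrative inventions; slope_of_loss() estimates the slope numerically with a finite difference, which is the measurement the next paragraphs formalize with a derivative.

def slope_of_loss(X, Y, w, h=1e-6):
    # crude numerical estimate of the slope of the loss curve at w (b fixed at 0)
    return (loss(X, Y, w + h, 0) - loss(X, Y, w - h, 0)) / (2 * h)

def walk_downhill(X, Y, iterations=1000, step=0.01):
    # the hiker's strategy: repeatedly take a small step against the slope
    w = 0.0
    for _ in range(iterations):
        w -= step * slope_of_loss(X, Y, w)
    return w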
To convert that idea into running code, we need to measure the slope of the loss curve. In mathspeak, that slope is called the gradient of the curve. By convention, the gradient at a certain point is an arrow that points directly uphill from that point, like this:
To measure the gradient, we can use a mathematical tool called the derivative of the loss with respect to the weight, which is written as ∂L/∂w.
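As a preview, under the same assumptions as before (a linear predict() with b fixed at 0), that derivative has a simple closed form for the mean squared error loss: ∂L/∂w = 2 · mean(x · (x · w − y)). A sketch of it in code, with gradient() as an illustrative name:

def gradient(X, Y, w):
    # exact derivative of the mean squared error loss with respect to w,
    # assuming b is fixed at 0 so that predict(X, w, 0) is just X * w
    return 2 * np.average(X * (predict(X, w, 0) - Y))

Unlike the finite-difference estimate above, this gives the exact slope at any w in a single pass over the data.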