Gradient Descent

Learn about what gradient descent is, why visualizing it is important, and the model being used.

Introduction to gradient descent

According to Wikipedia:

“Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function.”

But if I were to describe it, I would say:

Gradient descent is an iterative technique commonly used in Machine Learning and Deep Learning to find the best possible set of parameters (coefficients) for a given model, set of data points, and loss function, starting from an initial, and usually random, guess.
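To make "iterative" and "initial guess" a little more tangible, here is a minimal sketch of the idea on a toy problem. It is an illustration only, not the model used in this lesson: the function `loss`, its minimum at 3, the starting point, the learning rate, and the number of steps are all assumptions chosen for this example.

```python
# Toy problem (an assumption for illustration): minimize (p - 3)^2,
# a differentiable function whose minimum is at p = 3.
def loss(p):
    return (p - 3.0) ** 2


def gradient(p):
    # Derivative of (p - 3)^2 with respect to p.
    return 2.0 * (p - 3.0)


def gradient_descent(start, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from an initial guess."""
    p = start
    for _ in range(n_steps):
        p = p - learning_rate * gradient(p)
    return p


# A fixed starting point keeps the run reproducible; in practice the
# initial guess is usually random.
p_final = gradient_descent(start=-4.0)
print(p_final, loss(p_final))
```

Each pass through the loop moves the guess a small step in the direction that reduces the loss, so after enough steps `p_final` ends up very close to 3.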

Why visualize gradient descent?

I believe the way gradient descent is usually explained lacks intuition. Students and beginners are left with a bunch of equations and general rules; this is not the way one should learn such a fundamental topic.

If you really understand how gradient descent works, you will also understand how the characteristics of your data and your choice of hyper-parameters (e.g., mini-batch size and learning rate) have an impact on the speed of the model training.

But really understanding something does not mean merely working through the equations by hand; that does not develop intuition either. Rather, understanding means visualizing the effects of different settings or telling a story to illustrate the concept. That is how you develop intuition.

With that being said, we will cover the five basic steps you need to go through to use gradient descent. We will also show you the corresponding NumPy code and explain many fundamental concepts along the way.

But first, we need some data to work with. Instead of using some external dataset, we will:

  • Define which model we want to train to better understand gradient descent.

  • Generate synthetic data for that model.


The model must be simple and familiar, so you can focus on the inner workings of gradient descent.

So, we will stick with a model that is as simple as it can be: a linear regression with a single feature x, which has the following equation:

y = b + wx + ϵ

In this model, we use a feature (x) to try to predict the value of a label (y). There are three elements in our model:

  • Parameter b is the bias (or intercept), which tells us the expected (average) value of y when x is zero.

  • Parameter w is the weight (or slope), which tells us how much y increases (on average) if we increase x by one unit.

  • And that last term (why does it always have to be a Greek letter?), epsilon, is there to account for the inherent noise, which is the error we cannot remove.
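As a rough sketch of how synthetic data for such a model could be generated, the snippet below draws random values for x, adds Gaussian noise, and computes y from the equation above. The "true" values of b and w, the number of points, the noise scale, and the seed are all assumptions picked for illustration. It uses only Python's standard library to stay dependency-free; the same idea carries over to NumPy.

```python
import random

# Assumed "true" parameters for illustration only.
TRUE_B = 1.0   # bias (intercept)
TRUE_W = 2.0   # weight (slope)
N = 100        # number of data points (assumption)

rng = random.Random(42)  # fixed seed so the data is reproducible

# Feature x: N values drawn uniformly from [0, 1).
x = [rng.random() for _ in range(N)]

# Noise epsilon: Gaussian with mean 0 and standard deviation 0.1 (assumption).
epsilon = [rng.gauss(0.0, 0.1) for _ in range(N)]

# Label y follows the model: y = b + w * x + epsilon.
y = [TRUE_B + TRUE_W * xi + ei for xi, ei in zip(x, epsilon)]

print(x[:3])
print(y[:3])
```

Because the parameters used to generate the data are known, it is easy to check later how close a trained model gets to them.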

We can also conceive the very same model structure in a less abstract way:

Salary = minimum wage + increase per year * years of experience + noise

And to make it even more concrete, let us say that the minimum wage is $1,000 (the currency and time frame are not important). If you have no experience, your salary is going to be the minimum wage (parameter b).

In addition to this, let us assume that (on average) you get a $2,000 increase (parameter w) for every year of experience you have. So, if you have two years of experience, you are expected to earn a salary of $5,000 (that is, $1,000 + 2 × $2,000). However, your actual salary turns out to be $5,600 (lucky you!). Since the model cannot account for that extra $600, it is, technically speaking, noise.
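The arithmetic of the salary example can be written out as a short sketch. The function name `expected_salary` is hypothetical and used only for illustration; the numbers are the ones from the example.

```python
def expected_salary(years, b=1000.0, w=2000.0):
    """Model prediction without noise: b + w * years."""
    return b + w * years


predicted = expected_salary(2)   # 1000 + 2000 * 2
actual = 5600.0                  # the salary actually observed
noise = actual - predicted       # the part the model cannot explain

print(predicted)  # 5000.0
print(noise)      # 600.0
```

With zero years of experience, `expected_salary(0)` returns the minimum wage of 1000.0, which is exactly the role of parameter b.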


Try to solve this short quiz to test your understanding of the concepts explained in this lesson:


In a simple linear regression model with a feature (x) and a label (y), the linear equation of the model is y = b + wx. What are the terms b and w, respectively?


Slope, intercept


Intercept, slope


Bias, input


Input, bias
