Deep Learning with PyTorch Step-by-Step: Part I - Fundamentals

Gradient Descent

Learn what gradient descent is, why visualizing it matters, and which model we will use.

Why visualize gradient descent?

I believe the way gradient descent is usually explained lacks intuition. Students and beginners are left with a bunch of equations and general rules; this is not the way one should learn such a fundamental topic.

If you really understand how gradient descent works, you will also understand how the characteristics of your data and your choice of hyperparameters (e.g., mini-batch size and learning rate) affect how quickly the model trains.

But really understanding something does not mean merely working through the equations by hand; that does not develop intuition either. Rather, understanding means visualizing the effects of different settings or telling a story to illustrate the concept. That is how you develop intuition.

With that said, we will cover the five basic steps you need to go through to use gradient descent. We will also show you the corresponding NumPy code and explain many fundamental concepts along the way.

But first, we need some data to work with. Instead of using some external dataset, we will:

Define which model we want to train to better understand gradient descent.
Generate synthetic data for that model.

Model

The model must be simple and familiar, so you can focus on the inner workings of gradient descent.

So, we will stick with the simplest model possible: a linear regression with a single feature x, which has the following equation:

$y = b + w x + \epsilon$

In this model, we use a feature (x) to try to predict the value of a label (y). There are three elements in our model:

Parameter b is the bias (or intercept), which tells us the expected value of y when x is zero.
Parameter w is the weight (or slope), which tells us how much y increases (on average) if we increase x by one unit.
And that last term (why does it always have to be a Greek letter?), epsilon, is there to account for the inherent noise, which is the error we cannot remove.
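
To make these three elements concrete, here is a minimal sketch that draws noisy points from the equation above. It uses only Python's standard library to stay dependency-free. The values b = 1 and w = 2, the noise scale, the seed, and the function name `generate_data` are assumptions picked purely for illustration; nothing in the lesson so far fixes them.

```python
import random


def generate_data(b, w, n_points, noise_std, seed=42):
    """Return lists (xs, ys) drawn from y = b + w * x + epsilon."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    xs = [rng.random() for _ in range(n_points)]  # feature x, uniform in [0, 1)
    # epsilon is zero-mean Gaussian noise: the part the model cannot explain
    ys = [b + w * x + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys


# Assumed "true" parameters, chosen only for illustration
true_b, true_w = 1.0, 2.0
xs, ys = generate_data(true_b, true_w, n_points=100, noise_std=0.1)

# With the noise switched off, every point lies exactly on the line b + w * x
clean_xs, clean_ys = generate_data(true_b, true_w, n_points=100, noise_std=0.0)
```

Setting `noise_std` to zero removes epsilon entirely, which is a quick way to see that b and w alone describe the straight line and that the noise term is what scatters the points around it.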

We can also think of the very same model structure in a less abstract way:

Salary = minimum wage + increase per year * years of experience + noise

To make it even more concrete, let us say that the minimum wage is $1,000 (the currency and time frame are not important). If you have no experience, your salary is going to be the minimum wage (parameter b).

In addition, let us assume that (on average) you get a $2,000 increase (parameter w) for every year of experience you have. So, if you have two years of experience, you are expected to earn a salary of $1,000 + 2 × $2,000 = $5,000. However, your actual salary turns out to be $5,600 (lucky you!). Since the model cannot account for that extra $600, your extra money is noise, technically speaking.
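
The arithmetic of the salary example can be checked with a few lines of Python. This is only a sketch: the function name `expected_salary` and its argument names are hypothetical, while the numbers come straight from the example above.

```python
def expected_salary(years, minimum_wage=1000, increase_per_year=2000):
    """Model prediction without noise: b + w * x."""
    return minimum_wage + increase_per_year * years


predicted = expected_salary(2)  # what the model expects for two years of experience
actual = 5600                   # the salary actually observed
noise = actual - predicted      # the part the model cannot account for

print(predicted, noise)  # 5000 600
```

Calling `expected_salary(0)` returns the minimum wage, which matches the role of the bias b: the expected value of y when x is zero.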
