Tune Learning Rate and Batch Size
Explore how to effectively tune the learning rate and batch size hyperparameters in neural networks. Understand the impact of different learning rates on gradient descent behavior and model convergence. Learn to balance training speed and stability by experimenting with batch sizes to reduce loss and optimize training outcomes.
Tuning the learning rate
We’ll start with our old hyperparameter, lr, which has been with us since almost the beginning of this course. Chances are, we already tuned it, maybe by trying a few random values. It’s time to be more precise about tuning lr.
To understand the trade-off of different learning rates, let’s go back to the basics and visualize gradient descent. The following diagrams show a few steps of GD along a one-dimensional loss curve, with three different values of lr. The red cross marks the starting point, and the green cross marks the minimum:
Let’s remember what lr does. The bigger it is, the larger each step of GD is. The first diagram uses a small lr, so the algorithm takes tiny steps towards the minimum. The second example uses a larger lr, which results in bolder steps and a faster descent.
However, we cannot just set a very large lr and blaze towards the minimum at ludicrous speed, as the third diagram shows. In this case, lr is so large that each step of gradient descent lands farther from the goal than it started. Not only does this training process fail to find the minimum, it fails to converge at all, increasing the loss at every step instead of decreasing it. In fact, if we have a smooth loss function with a single minimum, we can prove mathematically that batch gradient descent always finds that minimum, as long as lr is sufficiently small. With a large lr, all bets are off.
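To make the trade-off concrete, here is a minimal sketch of GD on a toy one-dimensional loss, L(w) = w², not the course’s classifier. The three calls mirror the three diagrams:

```python
# Gradient descent on the toy loss L(w) = w**2, whose gradient is 2*w
# and whose minimum sits at w = 0. Illustrative sketch only.
def gradient_descent(lr, steps=20, w=5.0):
    for _ in range(steps):
        w -= lr * 2 * w              # each step moves by lr times the gradient
    return w

small = gradient_descent(lr=0.01)    # tiny steps: still far from the minimum
medium = gradient_descent(lr=0.1)    # bolder steps: much closer to 0
large = gradient_descent(lr=1.1)     # too large: every step overshoots, w diverges
```

With lr=1.1, each update multiplies w by −1.2, so instead of shrinking towards 0, w grows in magnitude at every step, just like the third diagram.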
Now we’ve seen that a very small lr can slow down GD, and a very large lr can derail it completely. We need to strike a balance. So let’s dust off compare.py and try a few values of lr.
The following playground shows the effect of the learning rate on the loss. Here, TIME is set to 20 seconds of training per learning rate. Increasing the training time produces more accurate results, which are shared after the code widget.
We can expect slightly different results each time we run this program, because it initializes its weights randomly. To view the loss comparison curve for the different learning rates, wait until the training phase finishes and the server starts, then click the app link below.
The lr has a wide range of reasonable values, so it does not make sense to try values on a linear scale, such as 0.1, 0.2, and 0.3. Instead, the previous code tries an exponential scale: 0.001, 0.01, 0.1, and 1. In some cases, we might have to find even bigger or, more frequently, smaller values of lr. In our case, it seems that these values span a big enough range:
Training: lr=0.001
Loss: 1.64907297 (3 epochs completed, 1543 total steps)
Training: lr=0.01
Loss: 0.53636065 (3 epochs completed, 1600 total steps)
Training: lr=0.1
Loss: 0.23230808 (3 epochs completed, 1552 total steps)
Training: lr=1
Loss: 0.07311269 (3 epochs completed, 1572 total steps)
See those spikes with lr=1? This value seems slightly too large, causing some steps of GD to land further from the minimum than they started from. On the other hand, it isn’t large enough to derail the algorithm entirely, and it yields a lower loss than the other values we tried. Also, after a few minutes of training, it seems that the instability fades away, probably because each step of GD becomes so small that even a large lr is not enough to make it diverge.
In the long term, an lr of 1 seems to make up for its early instability. We’ll stick with this value and move on to the next hyperparameter.
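The sweep we just ran can be sketched as a loop over an exponential scale of learning rates. The train() below is a hypothetical stand-in that minimizes the same toy quadratic loss, so unlike our real network, lr=1 merely oscillates here; the point is the structure of the comparison, not which value wins:

```python
# Hypothetical sketch of the kind of sweep compare.py runs: train once per
# learning rate on an exponential scale and record the final loss.
# train() stands in for a real training loop, minimizing L(w) = w**2.
def train(lr, steps=100, w=5.0):
    for _ in range(steps):
        w -= lr * 2 * w
    return w ** 2                                   # final loss

losses = {lr: train(lr) for lr in [0.001, 0.01, 0.1, 1]}
best_lr = min(losses, key=losses.get)               # lowest final loss wins
```

The exponential scale is the key idea: each candidate is ten times the previous one, so a handful of runs covers several orders of magnitude.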
Tuning the batch size
We have already compared the effects of different batch sizes in Batches Large and Small. Back then, we found that batches could speed up training and shorten the development cycle. Now that we are near the end of that cycle and have set all the other hyperparameters, let’s compare batch sizes again:
The following playground shows the effect of the batch size on the loss after training the neural network for 10 seconds per batch size. Increasing the training time produces more accurate results, as shown after the code widget, where TIME is set to 300 seconds per batch size.
We can expect slightly different results each time we run this program, because it initializes its weights randomly. To view the effect of the batch size on the loss graph, wait until the training phase finishes and the server starts, then click the app link below.
This time, we skip the test on stochastic GD, that is, a batch size of 1. The last time we tried it, stochastic GD went nowhere in terms of performance. Instead, we focus on three more promising batch sizes (64, 128, and 256), plus batch GD, which puts all the examples in one batch. Below is the resulting diagram if we set TIME to 300 seconds:
The losses are so close together that it’s hard to make out which batch size is doing better. Let’s look at the exact numbers:
Training: batch_size=60000
Loss: 0.18655365 (241 epochs completed, 241 total steps)
Training: batch_size=256
Loss: 0.11773560 (2 epochs completed, 678 total steps)
Training: batch_size=128
Loss: 0.12600472 (1 epochs completed, 668 total steps)
Training: batch_size=64
Loss: 0.14859866 (0 epochs completed, 656 total steps)
The numbers show that after 5 minutes of training, a batch size of 256 results in a lower loss than the 128 that we have used so far. Let’s switch to 256 from now on.
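As a sketch of what a batch size of 256 means mechanically (the loader below is illustrative, not the course’s code), splitting 60,000 training examples into batches of 256 yields 235 GD steps per epoch, which is why the mini-batch runs above complete far more steps than batch GD in the same time:

```python
# Illustrative sketch, not the course's loader: split a training set of
# 60,000 examples into mini-batches of the tuned size, 256.
def batches(examples, batch_size=256):
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

training_set = list(range(60000))        # stand-in for 60,000 training examples
steps_per_epoch = sum(1 for _ in batches(training_set))   # 235 with size 256
last_batch = list(batches(training_set))[-1]              # remainder batch of 96
```

By contrast, batch GD puts all 60,000 examples in one batch, so each epoch is a single step, which matches the output above, where 241 epochs took exactly 241 steps.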
We went through every hyperparameter in our neural network, and now we have good values for all of them. That does not mean we must stick with those values forever. We could probably shave a few more decimal points off the loss by cycling through the hyperparameters a second or third time. That being said, we made a lot of progress tuning our network.