...
Using Dropout to Combat Overfitting
Explore how dropout techniques enhance neural network models.
A major shortcoming of the baseline model was overfitting. In large models, overfitting is commonly caused by a phenomenon called coadaptation, which can be addressed with dropout. Both the coadaptation issue and its resolution with dropout are explained below.
What’s coadaptation?
If all the weights in a deep learning network are learned together, it's common for some nodes to have more predictive capability than others.
In such a scenario, because the network is trained iteratively, these powerful nodes start to suppress the weaker ones. They usually constitute only a small fraction of all the nodes. However, over many iterations, only these powerful nodes are trained, and the rest stop participating.
This phenomenon is called coadaptation. It's difficult to prevent with traditional L1 and L2 regularization because they, too, regularize based on the predictive capability of the nodes. As a result, the traditional methods become close to deterministic in choosing and rejecting weights, so a strong node gets stronger and a weak one gets weaker.
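For reference, a sketch of what these traditional penalties look like, with $\mathcal{L}(w)$ as the unregularized loss, $w_i$ as the network weights, and $\lambda$ as the regularization strength (notation introduced here only for illustration):

$$
\mathcal{L}_{L_1}(w) = \mathcal{L}(w) + \lambda \sum_i |w_i|,
\qquad
\mathcal{L}_{L_2}(w) = \mathcal{L}(w) + \lambda \sum_i w_i^2
$$

Both penalties are deterministic functions of the weights themselves, so the choice of which weights are favored or shrunk is effectively the same in every iteration, which echoes the point above about traditional regularization being unable to break coadaptation.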
A major fallout of coadaptation is that expanding the size of the neural network does not help.
This had been a severe issue in deep learning for a long time. Then, around 2012, dropout, a new regularization approach, emerged.
Dropout resolved coadaptation, which naturally revolutionized deep learning. With dropout, deeper and broader networks became possible.
What is dropout?
Dropout changed the approach to learning weights. Instead of learning all the network weights collectively, dropout trains a subset of them in each batch training iteration.
The illustrations above and below show how the model weights are trained during a batch iteration, using a simple example with four nodes. The usual training, without dropout, is shown in the illustration above. In this scheme, all the nodes are active, so all the weights are trained together.
On the other hand, with dropout, only a subset of the nodes is kept active during batch learning. The three images in the illustration above correspond to three different batch iterations. Half of the nodes are switched off in each batch iteration, while the weights of the remaining nodes are trained. After iterating through all the batches, the weights are returned as the average of their batch-wise estimates.
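To make this concrete, here is a minimal sketch of how dropout is typically added to a fully connected network, assuming a TensorFlow/Keras setup. The input shape, layer sizes, and loss are placeholders for illustration, not the baseline model from this course.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical fully connected network; sizes are illustrative only.
model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),   # in each training batch, roughly half of these
                           # activations are randomly switched off
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()
```

Note that dropout is applied only during training. Keras uses inverted dropout: the kept activations are scaled up by 1 / (1 − rate) at training time, so at inference all nodes are active and no extra scaling is needed.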
This technique acts as network regularization. However, to someone familiar with traditional methods, dropout may not appear to be regularization at first. Yet, there are some commonalities.
Like ...