# What is weight decay?


Weight decay is a widely used type of regularization. It is also known as $l_2$ regularization. But before we dive into weight decay, let's understand why we need regularization in the first place.

When training our model, we often run into the problem of overfitting, where our model performs perfectly on the training data but fails to generalize well enough to be practically useful.

There are many ways to counter overfitting. A common misconception is that simply increasing the number of samples in the dataset will fix it. This may help in some scenarios, but people often waste a lot of resources collecting new data when a simpler remedy exists. Let's take polynomial fitting as an example.

Consider two polynomial fits to the same training data: the figure on the left, (a), generalizes well, while the figure on the right, (b), gives better results on the training data but overfits. Say that figure (a) fits using a low-degree hypothesis function such as:

$$h_{\theta}(x) = \theta_0 + \theta_1 x + \theta_2 x^2$$

Here, $h_{\theta}$ is the hypothesis function, with $\theta$ as the model parameters and $x$ as the input. Similarly, the hypothesis function for figure (b) would be the same polynomial with extra higher-order terms, for instance:

$$h_{\theta}(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$

Notice that the only difference is that the model in figure (b) has additional higher-order features, which make it harder for the model to generalize. The overfitting can be removed simply by driving the parameters of those extra features to zero. The same reasoning applies to logistic regression.
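To make this concrete, here is a small sketch of my own (not from the original article) that fits a degree-2 and a degree-4 polynomial to noisy quadratic data with NumPy, then compares training error against error on noise-free held-out points:

```python
import numpy as np

# Toy illustration: fit a degree-2 and a degree-4 polynomial to noisy
# quadratic data, then compare training error against error on a
# noise-free grid of held-out points.
rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 8)
y_train = 1.0 + 2.0 * x_train + 3.0 * x_train**2 + rng.normal(0, 0.3, x_train.size)

x_test = np.linspace(-1, 1, 100)
y_test = 1.0 + 2.0 * x_test + 3.0 * x_test**2  # true underlying curve

results = {}
for degree in (2, 4):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The higher-degree fit always achieves a training error at least as low as the quadratic's, which is exactly why training error alone cannot reveal overfitting.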

Note: Here we diagnosed overfitting by plotting the hypothesis. This may not be possible in other problems, where hundreds of features can make such a plot impractical.

### Weight decay

One way to counter overfitting would be to simply remove the unneeded features. Weight decay takes a softer approach: instead of removing features, it penalizes the corresponding parameters, pushing them close to zero. Let's take the cost function for linear regression and add the regularization term to it:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

Here, $J(\theta)$ is the cost function with $n$ parameters, $\lambda$ is the regularization parameter, and $m$ is the total number of training examples. The choice of $\lambda$ determines how strongly we penalize the parameters: if it is too large, the model underfits; if it is too small, the regularization is too weak to prevent overfitting.
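As a quick sketch (the function and variable names here are my own, not from the text), the regularized cost can be computed in NumPy like so, assuming the first column of `X` is the all-ones intercept column:

```python
import numpy as np

# l2-regularized linear regression cost. X is the design matrix with a
# leading column of ones, theta is the parameter vector, and lam is the
# regularization parameter lambda from the text.
def linreg_cost(theta, X, y, lam):
    m = y.size
    residuals = X @ theta - y
    mse_term = (residuals @ residuals) / (2 * m)       # squared-error term
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)  # theta_0 not penalized
    return mse_term + reg_term
```

Setting `lam=0` recovers the unregularized cost.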

Note: By convention, we do not penalize the intercept parameter $\theta_0$; penalizing it makes little to no difference in practice.

Similarly, the cost function for logistic regression with weight decay would be:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_{\theta}(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
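A matching sketch for the logistic case (again, names like `logreg_cost` are my own): the cross-entropy term plus the same $l_2$ penalty that skips $\theta_0$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# l2-regularized logistic regression cost; y holds 0/1 labels and the
# first column of X is assumed to be the intercept column.
def logreg_cost(theta, X, y, lam):
    m = y.size
    h = sigmoid(X @ theta)
    cross_entropy = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)  # theta_0 skipped
    return cross_entropy + reg_term
```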

### Comparison with $l_1$ regularization

Since weight decay is $l_2$ regularization, you might wonder what $l_1$ regularization looks like. For linear regression, it replaces the squared penalty with the sum of absolute values:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}|\theta_j|\right]$$

The only difference from the equation before is that we take the absolute magnitude of the parameters instead of squaring them. The same substitution works for logistic regression.

The practical difference between $l_1$ and $l_2$ regularization is that $l_1$ performs feature selection: the weights of uninformative features are driven exactly to zero, and only the useful ones survive. In $l_2$ regularization, the parameters are pushed close to zero but rarely exactly to zero. The key benefits of using $l_2$ regularization are as follows:

• It is easier to calculate the derivative of the $l_2$ penalty, since it is smooth everywhere (the $l_1$ penalty is not differentiable at zero).
• $l_2$ regularization often gives better predictive performance, especially when most features carry some useful signal.
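To see the zeroing behavior directly, here is a toy comparison of my own construction: one soft-thresholding step (the proximal update for an $l_1$ penalty) next to the closed-form $l_2$ shrinkage, applied to the same weight vector:

```python
import numpy as np

w = np.array([3.0, -0.2, 0.05, -1.5])
lam = 0.3  # penalty strength

# l1 proximal step (soft-thresholding): small weights snap exactly to zero.
l1_w = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# l2 proximal step: every weight is scaled toward zero but stays nonzero.
l2_w = w / (1.0 + lam)

print(l1_w)  # the two small entries become exactly 0.0
print(l2_w)  # all entries shrink, but none are exactly 0.0
```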
###### Question

Just as $l_1$ and $l_2$ regularization are applied during model training, what could be an analogous regularization technique on the data side of an application?
