What is weight decay?

Weight decay is a widely used form of regularization. It is often referred to as $l_2$ regularization, since the two coincide for plain gradient descent. But before we dive into weight decay, let's understand why we need regularization in the first place.

When training a model, we often run into the problem of overfitting: the model performs very well on the training data but fails to generalize to unseen data well enough to be practically useful.

There are many different ways to counter overfitting. A common misconception is that increasing the number of samples in the dataset always fixes it. This helps in some scenarios, but people often waste a lot of resources collecting new data when the real issue is an overly flexible model. Let's take an example of polynomial fitting to understand.
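The polynomial-fitting example can be sketched in a few lines of NumPy. This is a minimal illustration, not the original experiment: the noisy sine data, the sample sizes, and the helper name `fit_and_score` are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy sine curve sampled at 10 points.
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.shape)

# Held-out points from the same underlying (noise-free) curve.
x_test = np.linspace(0.05, 0.95, 50)
y_test = np.sin(2 * np.pi * x_test)


def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polynomial.polynomial.polyfit(x_train, y_train, degree)
    train_pred = np.polynomial.polynomial.polyval(x_train, coeffs)
    test_pred = np.polynomial.polynomial.polyval(x_test, coeffs)
    train_mse = float(np.mean((train_pred - y_train) ** 2))
    test_mse = float(np.mean((test_pred - y_test) ** 2))
    return train_mse, test_mse


for degree in (1, 3, 9):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree={degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

The training error can only shrink as the degree grows, and a degree-9 polynomial passes through all 10 training points. Its error on the held-out points is typically much worse than its training error, which is the signature of overfitting.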

Note: To detect overfitting here, we plotted our hypothesis. This may not be possible in other problems, since there can be hundreds of features and a model with that many dimensions cannot be plotted directly.

Weight decay

One way to counter the problem of overfitting would be to just remove the unneeded features. Weight decay instead keeps every feature but penalizes large parameter values, pushing the parameters of less useful features close to zero. Let's take the cost function for regression and add the regularization term to it.
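A common way to write this regularized cost is shown below. It assumes a squared-error loss and a hypothesis written as $h_\theta(x)$, with $(x^{(i)}, y^{(i)})$ denoting the $i$-th training example; this notation is an assumption based on the usual convention.

$$
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]
$$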

Here, $J(\theta)$ is the cost function with $n$ parameters. The term $\lambda$ is called the regularization parameter, and $m$ is the total number of training examples. The choice of $\lambda$ determines how strongly we penalize the parameters. If it is too large, the model will underfit; if it is too small, the penalty has little effect and the model can still overfit.

Note: By convention, we do not penalize the bias parameter $\theta_0$. In practice, this makes little to no difference.
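The effect of $\lambda$ can be sketched with plain gradient descent on a regularized squared-error cost. This is a minimal illustration under stated assumptions: a linear hypothesis, synthetic data, and hypothetical helper names (`regularized_cost`, `regularized_gradient`, `gradient_descent`). The learning rate and step count are chosen only to keep this toy problem stable.

```python
import numpy as np


def regularized_cost(theta, X, y, lam):
    """Squared-error cost with an l2 penalty; theta[0] (the bias) is not penalized."""
    m = X.shape[0]
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)
    return (residual @ residual + penalty) / (2 * m)


def regularized_gradient(theta, X, y, lam):
    """Gradient of regularized_cost with respect to theta."""
    m = X.shape[0]
    grad = X.T @ (X @ theta - y) / m
    grad[1:] += (lam / m) * theta[1:]  # the penalty skips the bias term
    return grad


def gradient_descent(X, y, lam, lr=0.1, steps=5000):
    """Run a fixed number of gradient steps starting from zero."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= lr * regularized_gradient(theta, X, y, lam)
    return theta


# Hypothetical data: a column of ones for the bias plus three features.
rng = np.random.default_rng(0)
m = 50
features = rng.normal(size=(m, 3))
X = np.hstack([np.ones((m, 1)), features])
y = 2.0 + features @ np.array([1.5, -2.0, 0.0]) + rng.normal(scale=0.1, size=m)

for lam in (0.0, 10.0, 100.0):
    theta = gradient_descent(X, y, lam)
    print(f"lambda={lam}: weight norm={np.linalg.norm(theta[1:]):.4f}, "
          f"cost={regularized_cost(theta, X, y, lam):.4f}")
```

As $\lambda$ grows, the norm of the penalized weights shrinks toward zero, while the bias is left free to fit the mean of the targets.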

Similarly, the cost function for logistic regression with weight decay would be:
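One standard form, assuming the sigmoid hypothesis $h_\theta(x) = 1 / (1 + e^{-\theta^T x})$ and the usual cross-entropy loss (both assumptions following common convention), is:

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2
$$

As in the regression case, the penalty sums over $j = 1, \dots, n$, so $\theta_0$ is not penalized.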

Comparison with $l_1$ regularization

The difference between $l_1$ regularization and $l_2$ regularization is that $l_1$ regularization effectively performs feature selection: only the weights of the useful features survive, and the rest are driven to exactly zero. In $l_2$ regularization, by contrast, the parameters are made close to zero but not exactly zero. The key benefits of using $l_2$ regularization are as follows:

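The derivative of the $l_2$ penalty is easier to calculate, since it is differentiable everywhere, whereas the $l_1$ penalty is not differentiable at zero.
$l_2$ regularization often gives better predictive performance, particularly when most features carry some signal.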
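The contrast between the two penalties can be sketched on a single parameter. Assume a simplified one-dimensional problem (an assumption for illustration, not the full regression setting): given an unregularized estimate $z$, minimize $\frac{1}{2}(\theta - z)^2$ plus a penalty. Both cases have closed-form solutions; the function names below are hypothetical.

```python
import numpy as np


def l2_shrink(z, lam):
    """Minimizer of 0.5 * (theta - z)**2 + 0.5 * lam * theta**2."""
    return z / (1.0 + lam)


def l1_shrink(z, lam):
    """Minimizer of 0.5 * (theta - z)**2 + lam * |theta| (soft thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


# Hypothetical unregularized estimates: two large, two small.
z = np.array([3.0, -1.5, 0.4, -0.2])
lam = 0.5

print("l2:", l2_shrink(z, lam))
print("l1:", l1_shrink(z, lam))
```

The $l_2$ solution scales every value down by the same factor, so small weights get smaller but stay nonzero. The $l_1$ solution subtracts a fixed amount and clips at zero, so any estimate with magnitude below `lam` becomes exactly zero.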
