# What is weight decay?


Weight decay is a widely used type of regularization. It is also known as $l_2$ regularization. But before we dive into weight decay, let's understand why we need regularization in the first place.

When training our model, we often run into the problem of overfitting, where our model performs perfectly on the training data but fails to generalize well enough to be practically useful.

There are many ways to counter overfitting. A common misconception is that simply increasing the number of samples in the dataset will fix it. This may help in some scenarios, but people often waste a lot of resources collecting new data when a simpler remedy exists. Let's take polynomial fitting as an example.

Consider two polynomial fits to the same training data: the figure on the left, (a), generalizes well, while the figure on the right, (b), gives better results on the training data but overfits. Say that figure (a) fits using a low-degree hypothesis function such as:

$$h_{\theta}(x) = \theta_0 + \theta_1 x + \theta_2 x^2$$

Here, $h_{\theta}$ is the hypothesis function, with $\theta$ as the model parameters and $x$ as the input. Similarly, the hypothesis function for figure (b) would be the same polynomial with extra higher-order terms, for instance:

$$h_{\theta}(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$

Notice that the only difference is that the model in figure (b) has additional higher-order features, which make it harder for the model to generalize. The overfitting can be removed simply by driving the parameters of those extra features to zero. The same reasoning applies to logistic regression.
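To make this concrete, here is a small sketch of my own (not from the original article) that fits a degree-2 and a degree-4 polynomial to noisy quadratic data with NumPy, then compares training error against error on noise-free held-out points:

```python
import numpy as np

# Toy illustration: fit a degree-2 and a degree-4 polynomial to noisy
# quadratic data, then compare training error against error on a
# noise-free grid of held-out points.
rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 8)
y_train = 1.0 + 2.0 * x_train + 3.0 * x_train**2 + rng.normal(0, 0.3, x_train.size)

x_test = np.linspace(-1, 1, 100)
y_test = 1.0 + 2.0 * x_test + 3.0 * x_test**2  # true underlying curve

results = {}
for degree in (2, 4):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The higher-degree fit always achieves a training error at least as low as the quadratic's, which is exactly why training error alone cannot reveal overfitting.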

Note: Here we diagnosed overfitting by plotting the hypothesis. This may not be possible in other problems, where hundreds of features can make such a plot impractical.

### Weight decay

One way to counter overfitting would be to simply remove the unneeded features. Weight decay takes a softer approach: instead of removing features, it penalizes the corresponding parameters, pushing them close to zero. Let's take the cost function for linear regression and add the regularization term to it:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

Here, $J(\theta)$ is the cost function with $n$ parameters, $\lambda$ is the regularization parameter, and $m$ is the total number of training examples. The choice of $\lambda$ determines how strongly we penalize the parameters: if it is too large, the model underfits; if it is too small, the regularization is too weak to prevent overfitting.
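As a quick sketch (the function and variable names here are my own, not from the text), the regularized cost can be computed in NumPy like so, assuming the first column of `X` is the all-ones intercept column:

```python
import numpy as np

# l2-regularized linear regression cost. X is the design matrix with a
# leading column of ones, theta is the parameter vector, and lam is the
# regularization parameter lambda from the text.
def linreg_cost(theta, X, y, lam):
    m = y.size
    residuals = X @ theta - y
    mse_term = (residuals @ residuals) / (2 * m)       # squared-error term
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)  # theta_0 not penalized
    return mse_term + reg_term
```

Setting `lam=0` recovers the unregularized cost.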

Note: By convention, we do not penalize the intercept parameter $\theta_0$; penalizing it makes little to no difference in practice.

Similarly, the cost function for logistic regression with weight decay would be:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_{\theta}(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
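A matching sketch for the logistic case (again, names like `logreg_cost` are my own): the cross-entropy term plus the same $l_2$ penalty that skips $\theta_0$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# l2-regularized logistic regression cost; y holds 0/1 labels and the
# first column of X is assumed to be the intercept column.
def logreg_cost(theta, X, y, lam):
    m = y.size
    h = sigmoid(X @ theta)
    cross_entropy = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)  # theta_0 skipped
    return cross_entropy + reg_term
```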

### Comparison with $l_1$ regularization

Since weight decay is $l_2$ regularization, you might wonder what $l_1$ regularization looks like. For linear regression, it replaces the squared penalty with the sum of absolute values:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}|\theta_j|\right]$$

The only difference from the equation before is that we take the absolute magnitude of the parameters instead of squaring them. The same substitution works for logistic regression.

The practical difference between $l_1$ and $l_2$ regularization is that $l_1$ performs feature selection: the weights of uninformative features are driven exactly to zero, and only the useful ones survive. In $l_2$ regularization, the parameters are pushed close to zero but rarely exactly to zero. The key benefits of using $l_2$ regularization are as follows:

• It is easier to calculate the derivative of the $l_2$ penalty, since it is smooth everywhere (the $l_1$ penalty is not differentiable at zero).
• $l_2$ regularization often gives better predictive performance, especially when most features carry some useful signal.
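To see the zeroing behavior directly, here is a toy comparison of my own construction: one soft-thresholding step (the proximal update for an $l_1$ penalty) next to the closed-form $l_2$ shrinkage, applied to the same weight vector:

```python
import numpy as np

w = np.array([3.0, -0.2, 0.05, -1.5])
lam = 0.3  # penalty strength

# l1 proximal step (soft-thresholding): small weights snap exactly to zero.
l1_w = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# l2 proximal step: every weight is scaled toward zero but stays nonzero.
l2_w = w / (1.0 + lam)

print(l1_w)  # the two small entries become exactly 0.0
print(l2_w)  # all entries shrink, but none are exactly 0.0
```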
###### Question

Just as $l_1$ and $l_2$ regularization are applied during model training, what could be an analogous regularization technique on the data side of an application?
