What is natural gradient descent?
Usage of gradient descent
Neural networks largely work by attempting to model the data distribution, tuning their weights so that the loss between their predictions and the true targets is minimized.
Note: The gradient descent technique is entirely responsible for this tuning, as it determines both the direction and the distance by which the weights are steered.
How is natural gradient descent different?
To see the difference, it is essential to first visualize how normal gradient descent affects the model weights.
Motivation
Normal gradient descent alters all of the weights in a direction that minimizes the loss.
It is essential to note that the step size is controlled carefully with the help of a learning rate, alpha, so that the algorithm does not step over narrow local minima. A small learning rate prevents the algorithm from overstepping and potentially oscillating around a minimum indefinitely. Multiplying the gradient by the learning rate limits the size of each update: θ ← θ − α∇L(θ).
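The update rule above can be sketched as follows. This is a minimal illustration on a toy quadratic loss; the loss function, learning rate, and iteration count are illustrative choices, not part of the original text.

```python
import numpy as np

# Toy quadratic loss with its minimum at w = [3, 3] (illustrative choice).
def loss(w):
    return np.sum((w - 3.0) ** 2)

def grad(w):
    return 2.0 * (w - 3.0)  # gradient of the loss with respect to w

alpha = 0.1        # learning rate: limits the size of each step
w = np.zeros(2)    # initial weights

for _ in range(100):
    w = w - alpha * grad(w)  # step in the direction that reduces the loss

print(np.round(w, 3))  # w converges toward the minimum at [3, 3]
```

Note that every component of w is scaled by the same constant alpha, which is exactly the limitation discussed next.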
This approach, however, runs into a problem.
It would be flawed to assume that every parameter has an equivalent effect on the neural network. Parameters vary widely in how much they contribute to the output, so there are better ways to tune them than simply constraining every update with the same small, constant learning rate.
Note: From another perspective, we can model the distribution of the neural network's predicted values after each iteration and tune the parameters with steps that limit how much the network's output distribution changes.
Method
KL divergence and the Fisher information matrix
The KL divergence between the probability distributions that model the output of the neural network, before and after the parameter update, is calculated and kept within a small bound, epsilon, to prevent gradient descent from overstepping.
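This constraint can be sketched as below for a network with a categorical (softmax) output. The two output distributions and the epsilon bound are illustrative stand-ins for a real before/after pair of network outputs.

```python
import numpy as np

# KL(p || q) = sum_i p_i * log(p_i / q_i), for discrete distributions.
def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p_before = np.array([0.7, 0.2, 0.1])  # output distribution before the step
p_after  = np.array([0.6, 0.3, 0.1])  # output distribution after the step

epsilon = 0.05  # illustrative bound on the allowed divergence
d = kl_divergence(p_before, p_after)
print(d, d < epsilon)  # accept the step only if the divergence stays small
```

A step that would push the divergence beyond epsilon is considered too large: the output distribution would change too drastically in a single update.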
The Fisher information matrix is used to deduce the curvature that corresponds to a particular KL divergence. Hence, the Fisher matrix can be used to map the KL divergence constraint onto the steepness of the loss surface.
The Fisher information matrix, G, can be calculated as below for the probability distribution that models the output of a model:

G = (1/N) Σᵢ ∇θ log p(yᵢ | θ) ∇θ log p(yᵢ | θ)ᵀ

where p(y | θ) is the probability distribution of the output and yᵢ is the i-th of N output samples obtained from p(y | θ).
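The averaged outer product above can be sketched as follows. The score vectors here are synthetic stand-ins for ∇θ log p(yᵢ | θ), which in practice come from backpropagation through the model's log-probability.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
# One synthetic score vector per output sample, standing in for
# grad_theta log p(y_i | theta).
scores = rng.normal(size=(N, d))

# Empirical Fisher: G = (1/N) * sum_i s_i s_i^T
G = scores.T @ scores / N

print(G.shape)              # a d x d matrix
print(np.allclose(G, G.T))  # G is symmetric by construction
```

Being an average of outer products, G is symmetric and positive semi-definite, which is what lets it act as a curvature metric.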
Scaling and learning rate
The natural gradient descent scaling is applied by multiplying the model loss gradient with the inverse of the Fisher information matrix: G⁻¹∇L(θ).
A learning rate equivalent, beta, can be defined for a small epsilon value (the bound on the KL divergence) as follows:

β = √( 2ε / (∇L(θ)ᵀ G⁻¹ ∇L(θ)) )
Step size and parameter update
The step size, s = βG⁻¹∇L(θ), can then be calculated; it determines the amount by which each parameter is altered along each axis. Finally, all the parameters are updated as θ ← θ − s, and the model is ready to be fed inputs and generate outputs again. The process is repeated until the model loss converges to a value close to zero.
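A single natural gradient step can be sketched end to end as below. The toy loss, the assumed Fisher matrix, and the epsilon value are all illustrative; in a real network, G would be computed from the output distribution as described above, and G⁻¹ would typically be applied implicitly rather than by explicit inversion.

```python
import numpy as np

def grad_loss(theta):
    return 2.0 * (theta - 1.0)  # gradient of the toy loss sum((theta - 1)^2)

theta = np.array([0.0, 0.0])
G = np.array([[2.0, 0.3],
              [0.3, 1.0]])      # assumed Fisher information matrix
epsilon = 0.01                  # bound on the allowed KL divergence

g = grad_loss(theta)
nat_g = np.linalg.solve(G, g)   # natural gradient: G^{-1} * grad

# Learning rate equivalent derived from the KL constraint:
beta = np.sqrt(2.0 * epsilon / (g @ nat_g))

s = beta * nat_g                # step size along each axis
theta = theta - s               # parameter update

print(theta)
```

Note how the step along each axis differs, because the Fisher matrix rescales the gradient per direction, unlike the single constant alpha of normal gradient descent.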
Conclusion
On the whole, natural gradient descent refines classic gradient descent with a better-motivated way to alter the parameters. Updating every parameter uniformly, without any account of how pivotal each one is, is efficiently circumvented by natural gradient descent.