What is natural gradient descent?
Usage of gradient descent
Neural networks largely work by attempting to model the data distribution, tuning their weights so that the loss between their predictions and the true targets is minimized.
Note: The gradient descent technique is entirely responsible for this tuning, as it determines both the direction and the distance by which the weights are steered.
How is natural gradient descent different?
To see the difference, it is essential to first visualize how normal gradient descent affects the model weights.
Motivation
Normal gradient descent alters all of the weights in a direction that minimizes the loss.
It is essential to note that the step size is controlled carefully with the help of a learning rate, alpha, so that the algorithm does not step over narrow local minima. A small learning rate prevents the algorithm from overstepping and potentially oscillating around a minimum indefinitely. Multiplying the gradient by the learning rate limits the size of each update: θ ← θ − α∇L(θ).
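The update rule above can be sketched as follows. This is a minimal illustration on a toy quadratic loss; the loss function, learning rate, and iteration count are illustrative choices, not part of the original text.

```python
import numpy as np

# Toy quadratic loss with its minimum at w = [3, 3] (illustrative choice).
def loss(w):
    return np.sum((w - 3.0) ** 2)

def grad(w):
    return 2.0 * (w - 3.0)  # gradient of the loss with respect to w

alpha = 0.1        # learning rate: limits the size of each step
w = np.zeros(2)    # initial weights

for _ in range(100):
    w = w - alpha * grad(w)  # step in the direction that reduces the loss

print(np.round(w, 3))  # w converges toward the minimum at [3, 3]
```

Note that every component of w is scaled by the same constant alpha, which is exactly the limitation discussed next.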
This approach, however, runs into a problem.
It would be flawed to assume that every parameter has an equivalent effect on the neural network. Parameters vary widely in how much they contribute to the output, so there are better ways to tune them than simply constraining every update with the same small, constant learning rate.
Note: From another perspective, we can model the distribution of the neural network's predicted values after each iteration and tune the parameters with steps that limit how much the network's output distribution changes.
Method
KL divergence and the Fisher information matrix
The KL divergence between the probability distributions that model the output of the neural network, before and after the parameter update, is calculated and kept within a small bound, epsilon, to prevent gradient descent from overstepping.
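This constraint can be sketched as below for a network with a categorical (softmax) output. The two output distributions and the epsilon bound are illustrative stand-ins for a real before/after pair of network outputs.

```python
import numpy as np

# KL(p || q) = sum_i p_i * log(p_i / q_i), for discrete distributions.
def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p_before = np.array([0.7, 0.2, 0.1])  # output distribution before the step
p_after  = np.array([0.6, 0.3, 0.1])  # output distribution after the step

epsilon = 0.05  # illustrative bound on the allowed divergence
d = kl_divergence(p_before, p_after)
print(d, d < epsilon)  # accept the step only if the divergence stays small
```

A step that would push the divergence beyond epsilon is considered too large: the output distribution would change too drastically in a single update.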
The Fisher information matrix is used to deduce the curvature that corresponds to a particular KL divergence. Hence, the Fisher matrix can be used to map the KL divergence constraint onto the steepness of the loss surface.
The Fisher information matrix, G, can be calculated as below for the probability distribution that models the output of a model:

G = (1/N) Σᵢ ∇θ log p(yᵢ | θ) ∇θ log p(yᵢ | θ)ᵀ

where p(y | θ) is the probability distribution of the output and yᵢ is the i-th of N output samples obtained from p(y | θ).
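The averaged outer product above can be sketched as follows. The score vectors here are synthetic stand-ins for ∇θ log p(yᵢ | θ), which in practice come from backpropagation through the model's log-probability.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
# One synthetic score vector per output sample, standing in for
# grad_theta log p(y_i | theta).
scores = rng.normal(size=(N, d))

# Empirical Fisher: G = (1/N) * sum_i s_i s_i^T
G = scores.T @ scores / N

print(G.shape)              # a d x d matrix
print(np.allclose(G, G.T))  # G is symmetric by construction
```

Being an average of outer products, G is symmetric and positive semi-definite, which is what lets it act as a curvature metric.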
Scaling and learning rate
The natural gradient descent scaling is applied by multiplying the model loss gradient with the inverse of the Fisher information matrix: G⁻¹∇L(θ).
A learning rate equivalent, beta, can be defined for a small epsilon value (the bound on the KL divergence) as follows:

β = √( 2ε / (∇L(θ)ᵀ G⁻¹ ∇L(θ)) )
Step size and parameter update
The step size, s = βG⁻¹∇L(θ), can then be calculated; it determines the amount by which each parameter is altered along each axis. Finally, all the parameters are updated as θ ← θ − s, and the model is ready to be fed inputs and generate outputs again. The process is repeated until the model loss converges to a value close to zero.
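A single natural gradient step can be sketched end to end as below. The toy loss, the assumed Fisher matrix, and the epsilon value are all illustrative; in a real network, G would be computed from the output distribution as described above, and G⁻¹ would typically be applied implicitly rather than by explicit inversion.

```python
import numpy as np

def grad_loss(theta):
    return 2.0 * (theta - 1.0)  # gradient of the toy loss sum((theta - 1)^2)

theta = np.array([0.0, 0.0])
G = np.array([[2.0, 0.3],
              [0.3, 1.0]])      # assumed Fisher information matrix
epsilon = 0.01                  # bound on the allowed KL divergence

g = grad_loss(theta)
nat_g = np.linalg.solve(G, g)   # natural gradient: G^{-1} * grad

# Learning rate equivalent derived from the KL constraint:
beta = np.sqrt(2.0 * epsilon / (g @ nat_g))

s = beta * nat_g                # step size along each axis
theta = theta - s               # parameter update

print(theta)
```

Note how the step along each axis differs, because the Fisher matrix rescales the gradient per direction, unlike the single constant alpha of normal gradient descent.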
Conclusion
On the whole, natural gradient descent refines classic gradient descent with a better-motivated way to alter the parameters. Updating every parameter uniformly, without any account of how pivotal each one is, is efficiently circumvented by natural gradient descent.