What is natural gradient descent?

Usage of gradient descent

Neural networks work by attempting to model the data distribution, tuning their weights so that the loss is minimized. The loss is the difference between the model's output and the actual dependent variable. Like all supervised learning mechanisms, a set of labeled data instances is provided to the network in the training phase. In each iteration, the features undergo forward propagation and produce predicted values, and the weights are then adjusted so as to minimize the loss.

Note: The gradient descent technique determines both the direction and the distance that the weights are steered in.

How is natural gradient descent different?

To see the difference, it helps to first visualize how normal gradient descent affects the model weights.

Motivation

Normal gradient descent alters all of the weights in the direction that minimizes loss.

It is essential to note that the step size is controlled carefully with the help of a learning rate, alpha. This ensures that the algorithm does not skip over narrow local minima. A small learning rate prevents the algorithm from overstepping and potentially oscillating indefinitely. Multiplying by the learning rate limits the distance (measured as the Euclidean distance between the old parameter values and the new parameter values) that each parameter axis is steered along. Hence, all parameters are bound and constricted by the exact same learning rate.
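The update above can be sketched in a few lines of NumPy; the function name and the toy values are illustrative, not from the article:

```python
import numpy as np

def gradient_descent_step(params, grad, alpha=0.01):
    # Every parameter moves along its axis by alpha * gradient:
    # the same scalar learning rate constrains all axes equally.
    return params - alpha * grad

params = np.array([1.0, -2.0, 0.5])
grad = np.array([0.2, -0.4, 0.1])   # gradient of the loss w.r.t. params
updated = gradient_descent_step(params, grad)
```

Note that a parameter with a large influence on the output and one with almost none both get their updates scaled by the same alpha, which is exactly the limitation the next section addresses.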


This approach consequently runs into a problem.

It would be flawed to assume that every parameter has an equivalent effect on the neural network. The extent of each parameter's contribution varies widely, so there are better ways to tune the parameter values than to simply constrict every update by the same small, constant learning rate.

Note: From another perspective, we can model the distribution of the neural network's predicted values after each iteration and tune the parameters with steps that limit how much the output distribution of the network is allowed to change.

Method

KL divergence and Fisher information matrix

The KL divergence between the probability distributions that model the output of the neural network, before and after the tuning of parameters, is computed and constrained to remain within a small epsilon. This prevents gradient descent from overstepping.
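For concreteness, here is a minimal sketch of the KL divergence between two discrete output distributions (the function name and the toy distributions are illustrative assumptions, not part of the article):

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) for discrete distributions p and q over the same support.
    # Zero when p == q, positive otherwise.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

before = [0.5, 0.5]        # output distribution before the update
after = [0.6, 0.4]         # output distribution after the update
divergence = kl_divergence(after, before)
```

The constraint in the text amounts to requiring `divergence <= epsilon` for each parameter update.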

The Fisher information matrix is used to deduce the curvature metric that corresponds to a particular KL divergence. Hence, the Fisher matrix maps a KL divergence budget onto the local steepness of the loss surface.

The Fisher information matrix, G, can be estimated as below for the probability distribution that models the output of a model.

  • X_n ≈ P_{θ_t}, where P_{θ_t} is the probability distribution of the output

  • Let X_i be the i-th sample of the N output samples obtained from X_n

  • G = (1/N) Σ_{i=1}^{N} [∇_θ log P_{θ_t}(X_i)] [∇_θ log P_{θ_t}(X_i)]^T
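The empirical estimate above can be sketched directly in NumPy; the function name and the shape convention are assumptions for illustration:

```python
import numpy as np

def empirical_fisher(grad_log_probs):
    # grad_log_probs: (N, d) array whose row i is the gradient of
    # log P_theta(X_i) with respect to the d parameters theta.
    # Averaging the outer products of the rows gives a (d, d) estimate of G.
    N = grad_log_probs.shape[0]
    return grad_log_probs.T @ grad_log_probs / N

# Two samples, two parameters: each row is one per-sample score vector.
grads = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
G = empirical_fisher(grads)
```

By construction G is symmetric positive semi-definite, which is what allows it to serve as a curvature metric in the update rule that follows.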

Scaling and learning rate

The natural gradient descent scaling is applied by multiplying the model's loss gradient by the inverse of the Fisher information matrix.

A learning rate equivalent, beta, can be defined for a small epsilon value as β = √(2ε / (∇L^T G⁻¹ ∇L)), so that a single update moves the output distribution by at most epsilon in KL divergence.

Step size and parameter update

The step size, s, can then be calculated to determine the amount by which each parameter should be altered along its axis. Finally, all the parameters are updated, and the model is ready to accept inputs and generate outputs again. The process is repeated until the model loss converges to a value close to zero.
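Putting the pieces together, a single natural gradient update can be sketched as below. The beta formula follows the standard KL-budget derivation, and the function name and toy values are illustrative assumptions:

```python
import numpy as np

def natural_gradient_step(params, grad, fisher, epsilon=0.01):
    # Precondition the loss gradient by the inverse Fisher matrix.
    # Solving F x = grad is more stable than forming F^{-1} explicitly.
    nat_grad = np.linalg.solve(fisher, grad)
    # Learning-rate equivalent derived from the KL budget epsilon:
    # beta = sqrt(2 * epsilon / (grad^T F^{-1} grad))
    beta = np.sqrt(2.0 * epsilon / (grad @ nat_grad))
    # Step size along each axis is beta * nat_grad; apply the update.
    return params - beta * nat_grad

params = np.zeros(2)
grad = np.array([3.0, 4.0])
fisher = np.eye(2)          # identity Fisher reduces to ordinary descent
new_params = natural_gradient_step(params, grad, fisher, epsilon=0.5)
```

With an identity Fisher matrix the update reduces to plain gradient descent with an adaptive step; a non-identity Fisher rescales each direction by how strongly it moves the output distribution.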

Conclusion

On the whole, natural gradient descent refines the classic gradient descent algorithm with a better-founded way of altering the parameters. Blindly changing parameters without any account of how pivotal each one is can be efficiently circumvented by natural gradient descent.

