What is the Fisher information matrix?
Overview
Fisher information is a statistical quantity that captures how much an observed random instance of a variable tells us about its true parameter value. A probability distribution may depend on many parameters; in that case, there is a different Fisher information value for each parameter, and these values are collected together in the Fisher information matrix.
Definition
We can compute the Fisher information using the formula shown below:

$$I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2}\right]$$

Here, $f(X;\theta)$ is the likelihood of observing the data $X$ under the parameter value $\theta$, and $\frac{\partial}{\partial\theta}\log f(X;\theta)$ is the derivative of the log-likelihood, known as the score.

Alternatively, because the expected value of the score is zero, we can write the Fisher information as the variance of the score:

$$I(\theta) = \mathrm{Var}\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)$$
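The sketch below is a minimal Python check of this definition. It assumes, purely for illustration, that the observations are Poisson-distributed, for which the closed-form Fisher information is $I(\theta) = 1/\theta$:

```python
import numpy as np

# Monte Carlo check that the Fisher information equals the variance of the
# score. Model (an assumption for illustration): X ~ Poisson(theta), whose
# closed-form Fisher information is I(theta) = 1 / theta.

rng = np.random.default_rng(0)
theta = 4.0                          # true parameter value
x = rng.poisson(theta, size=100_000)

# Score of the Poisson model: d/dtheta [x*log(theta) - theta - log(x!)]
score = x / theta - 1.0

print(f"Variance of the score: {score.var():.4f}")
print(f"Closed form 1/theta:   {1 / theta:.4f}")
```

The two printed values agree up to sampling noise, which is the content of the "variance of the score" formulation.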
Decoding Fisher information
Initially, in most probabilistic applications, we have little information about how close the parameter values our model operates on are to their true values. An example is neural networks, where we have few clues about the right model parameters. However, we instantiate the training process with a reasonable approximation of the parameter values.
An example of a neuron
For explanation, let's consider the example of a neuron that is trained to predict the number of fish in a pond on the basis of input features.
The use of likelihood
Likelihood answers the question of how plausible a certain parameter value is, given a certain observed output.
We can quantify likelihood as follows, for a given parameter value $\theta$ and an observed output $y$:

$$L(\theta \mid y) = f(y;\theta)$$

For instance, suppose the observed number of fish is some value $y$. The likelihood $L(\theta \mid y)$ then tells us how plausible each candidate parameter value $\theta$ is, given that observation.
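As a concrete sketch, assume a Poisson model for the fish count and a hypothetical observation of $y = 7$ fish; neither the model nor the number comes from the original example. The likelihood then scores candidate parameter values:

```python
from scipy.stats import poisson

# Hypothetical setup: the fish count is assumed Poisson(theta), and we
# observed y = 7 fish (both the model and the number are illustrative).
observed_fish = 7

# The likelihood scores each candidate theta by how plausible it makes
# the observation; it peaks at theta = 7 here.
for theta in [3.0, 5.0, 7.0, 9.0]:
    likelihood = poisson.pmf(observed_fish, theta)
    print(f"theta = {theta:>4}: L(theta | y=7) = {likelihood:.4f}")
```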
The advantage of taking the log
It is convenient to take the logarithm of the likelihood function because it makes differentiation with respect to the parameter value easier. Also, a goal of training is to reach the optimal point where the parameters are closest to their true values.
When modeled on a graph, this is a maximization problem, and we need to reach the point where this maximum is achieved. For many common models (for example, those in the exponential family), the log-likelihood is concave, which makes the maximum easy to find, whereas such concavity can't be promised for the raw likelihood itself.
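Continuing the hypothetical Poisson setup, the sketch below evaluates the log-likelihood over a grid of parameter values and reads off the maximizer, which for a single Poisson observation is simply $\hat{\theta} = y$:

```python
import numpy as np
from scipy.stats import poisson

# Grid search over the log-likelihood for the hypothetical observation y = 7.
# Under the Poisson model the log-likelihood is concave in theta, so the
# grid maximum sits at the maximum-likelihood estimate (theta_hat = y).
observed_fish = 7
thetas = np.linspace(0.5, 15.0, 291)          # grid with step 0.05
log_likelihood = poisson.logpmf(observed_fish, thetas)

theta_hat = thetas[np.argmax(log_likelihood)]
print(f"Maximum-likelihood estimate: {theta_hat:.2f}")  # ~7.00
```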
The log-likelihood's rate of change
When we take the first derivative of the log-likelihood with respect to the parameter $\theta$, we obtain the score:

$$s(\theta) = \frac{\partial}{\partial\theta}\log L(\theta \mid y)$$

Conceptually, as per our example, the score for a candidate $\theta$ measures how quickly the log-likelihood of the observed number of fish changes as $\theta$ moves; it points "uphill" toward more plausible parameter values and vanishes at the maximum.
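Under the same hypothetical Poisson model, the score has the closed form $s(\theta) = y/\theta - 1$, and its sign shows which direction improves the fit:

```python
# Score of the hypothetical Poisson model for the observation y = 7:
# s(theta) = y / theta - 1. It is positive below the maximum-likelihood
# estimate, zero at it, and negative above it.
observed_fish = 7

for theta in [3.0, 7.0, 11.0]:
    score = observed_fish / theta - 1.0
    print(f"theta = {theta:>4}: score = {score:+.3f}")
```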
Variance
We can view the derivative of the log-likelihood, i.e., the score, as just another random variable that follows its own probability distribution. Consequently, it also possesses a variance that can be computed.
Variance provides intuition into the spread associated with the rate of change of the log-likelihood with respect to the parameter $\theta$; as noted in the definition above, this variance of the score is exactly the Fisher information.
On the whole, a higher variance of the score indicates that the output carries more information about the value of the true parameter, since the log-likelihood is more sharply peaked around it. This gives us the basis of parameter tuning in techniques such as natural gradient descent.
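To make that concrete, here is a minimal natural gradient ascent sketch on the same hypothetical Poisson model; it illustrates the general recipe of rescaling the gradient by the inverse Fisher information, not a production implementation:

```python
# Natural gradient ascent on the hypothetical Poisson model: the ordinary
# gradient (the score) is rescaled by the inverse Fisher information,
# I(theta) = 1/theta, so the update adapts to how informative the data is.
observed_fish = 7
theta = 2.0                                    # initial guess
learning_rate = 0.5

for step in range(8):
    score = observed_fish / theta - 1.0        # gradient of log-likelihood
    fisher = 1.0 / theta                       # closed-form Fisher information
    theta += learning_rate * score / fisher    # natural gradient step
    print(f"step {step}: theta = {theta:.4f}")
```

Here each update simplifies to $\theta \leftarrow \theta + \eta\,(y - \theta)$, so the estimate converges smoothly toward the maximum-likelihood value $y = 7$.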
Conclusion
Practically, Fisher information tells us how much the observed data reveals about a model's proposed parameters. Hence, it is pivotal in determining how the parameters can be tuned to fit the distribution better.