What is the Fisher information matrix?
Overview
Fisher information is a statistical quantity that captures how much an observed random instance of a variable tells us about its true parameter value. A probability distribution may depend on many parameters; in that case, there is a different Fisher information value for each parameter, and these values are collected together in the Fisher information matrix.
Definition
We can compute the Fisher information using the formula shown below:

$$I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2}\right]$$

Here, $f(X;\theta)$ is the likelihood of observing the data $X$ under the parameter value $\theta$, and $\frac{\partial}{\partial\theta}\log f(X;\theta)$ is the derivative of the log-likelihood, known as the score.

Alternatively, because the expected value of the score is zero, we can write the Fisher information as the variance of the score:

$$I(\theta) = \mathrm{Var}\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)$$
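The sketch below is a minimal Python check of this definition. It assumes, purely for illustration, that the observations are Poisson-distributed, for which the closed-form Fisher information is $I(\theta) = 1/\theta$:

```python
import numpy as np

# Monte Carlo check that the Fisher information equals the variance of the
# score. Model (an assumption for illustration): X ~ Poisson(theta), whose
# closed-form Fisher information is I(theta) = 1 / theta.

rng = np.random.default_rng(0)
theta = 4.0                          # true parameter value
x = rng.poisson(theta, size=100_000)

# Score of the Poisson model: d/dtheta [x*log(theta) - theta - log(x!)]
score = x / theta - 1.0

print(f"Variance of the score: {score.var():.4f}")
print(f"Closed form 1/theta:   {1 / theta:.4f}")
```

The two printed values agree up to sampling noise, which is the content of the "variance of the score" formulation.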
Decoding Fisher information
Initially, in most probabilistic applications, we have little information about how close the parameter values our model operates on are to their true values. An example is neural networks, where we have few clues about the right model parameters. However, we instantiate the training process with a reasonable approximation of the parameter values.
An example of a neuron
For explanation, let's consider the example of a neuron that is trained to predict the number of fish in a pond on the basis of input features.
The use of likelihood
Likelihood answers the question of how plausible a certain parameter value is, given a certain observed output.
We can quantify likelihood as follows, for a given parameter value $\theta$ and an observed output $y$:

$$L(\theta \mid y) = f(y;\theta)$$

For instance, suppose the observed number of fish is some value $y$. The likelihood $L(\theta \mid y)$ then tells us how plausible each candidate parameter value $\theta$ is, given that observation.
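As a concrete sketch, assume a Poisson model for the fish count and a hypothetical observation of $y = 7$ fish; neither the model nor the number comes from the original example. The likelihood then scores candidate parameter values:

```python
from scipy.stats import poisson

# Hypothetical setup: the fish count is assumed Poisson(theta), and we
# observed y = 7 fish (both the model and the number are illustrative).
observed_fish = 7

# The likelihood scores each candidate theta by how plausible it makes
# the observation; it peaks at theta = 7 here.
for theta in [3.0, 5.0, 7.0, 9.0]:
    likelihood = poisson.pmf(observed_fish, theta)
    print(f"theta = {theta:>4}: L(theta | y=7) = {likelihood:.4f}")
```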
The advantage of taking the log
It is convenient to take the logarithm of the likelihood function because it makes differentiation with respect to the parameter value easier. Also, a goal of training is to reach the optimal point where the parameters are closest to their true values.
When modeled on a graph, this is a maximization problem, and we need to reach the point where this maximum is achieved. For many common models (for example, those in the exponential family), the log-likelihood is concave, which makes the maximum easy to find, whereas such concavity can't be promised for the raw likelihood itself.
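Continuing the hypothetical Poisson setup, the sketch below evaluates the log-likelihood over a grid of parameter values and reads off the maximizer, which for a single Poisson observation is simply $\hat{\theta} = y$:

```python
import numpy as np
from scipy.stats import poisson

# Grid search over the log-likelihood for the hypothetical observation y = 7.
# Under the Poisson model the log-likelihood is concave in theta, so the
# grid maximum sits at the maximum-likelihood estimate (theta_hat = y).
observed_fish = 7
thetas = np.linspace(0.5, 15.0, 291)          # grid with step 0.05
log_likelihood = poisson.logpmf(observed_fish, thetas)

theta_hat = thetas[np.argmax(log_likelihood)]
print(f"Maximum-likelihood estimate: {theta_hat:.2f}")  # ~7.00
```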
The log-likelihood's rate of change
When we take the first derivative of the log-likelihood with respect to the parameter $\theta$, we obtain the score:

$$s(\theta) = \frac{\partial}{\partial\theta}\log L(\theta \mid y)$$

Conceptually, as per our example, the score for a candidate $\theta$ measures how quickly the log-likelihood of the observed number of fish changes as $\theta$ moves; it points "uphill" toward more plausible parameter values and vanishes at the maximum.
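Under the same hypothetical Poisson model, the score has the closed form $s(\theta) = y/\theta - 1$, and its sign shows which direction improves the fit:

```python
# Score of the hypothetical Poisson model for the observation y = 7:
# s(theta) = y / theta - 1. It is positive below the maximum-likelihood
# estimate, zero at it, and negative above it.
observed_fish = 7

for theta in [3.0, 7.0, 11.0]:
    score = observed_fish / theta - 1.0
    print(f"theta = {theta:>4}: score = {score:+.3f}")
```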
Variance
We can view the derivative of the log-likelihood, i.e., the score, as just another random variable that follows its own probability distribution. Consequently, it also possesses a variance that can be computed.
Variance provides intuition into the spread associated with the rate of change of the log-likelihood with respect to the parameter $\theta$; as noted in the definition above, this variance of the score is exactly the Fisher information.
On the whole, a higher variance of the score indicates that the output carries more information about the value of the true parameter, since the log-likelihood is more sharply peaked around it. This gives us the basis of parameter tuning in techniques such as natural gradient descent.
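To make that concrete, here is a minimal natural gradient ascent sketch on the same hypothetical Poisson model; it illustrates the general recipe of rescaling the gradient by the inverse Fisher information, not a production implementation:

```python
# Natural gradient ascent on the hypothetical Poisson model: the ordinary
# gradient (the score) is rescaled by the inverse Fisher information,
# I(theta) = 1/theta, so the update adapts to how informative the data is.
observed_fish = 7
theta = 2.0                                    # initial guess
learning_rate = 0.5

for step in range(8):
    score = observed_fish / theta - 1.0        # gradient of log-likelihood
    fisher = 1.0 / theta                       # closed-form Fisher information
    theta += learning_rate * score / fisher    # natural gradient step
    print(f"step {step}: theta = {theta:.4f}")
```

Here each update simplifies to $\theta \leftarrow \theta + \eta\,(y - \theta)$, so the estimate converges smoothly toward the maximum-likelihood value $y = 7$.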
Conclusion
Practically, Fisher information tells us how much the observed data reveals about a model's proposed parameters. Hence, it is pivotal in determining how the parameters can be tuned to fit the distribution better.