Computing Continuous Posterior Distribution
In the previous lesson, we posed and solved a problem in Bayesian reasoning involving only discrete distributions, and then proposed a variation on the problem in which the prior is changed to a continuous distribution, while the likelihood function still produces a discrete distribution.
Continuous Posterior Given an Observation of Discrete Result
The question is: what is the continuous posterior when we are given an observation of the discrete result?
More specifically, the problem we gave was: suppose we have a prior in the form of a process which produces random values between 0.0 and 1.0. We sample from that process and produce a coin that comes up heads with the given probability. We flip the coin; it comes up heads. What is the posterior distribution of coin probabilities?
Here’s one way to think about it: Suppose we stamp the probability of the coin coming up heads onto the coin. We mint and then flip a million of those coins once each. We discard all the coins that came up tails. The question is: what is the distribution of probabilities stamped on the coins that came up heads?
Let’s remind ourselves of Bayes’ Theorem. For prior P(A) and likelihood function P(B|A), the posterior is:

P(A|B) = P(A) P(B|A) / P(B)

Remembering of course that P(A|B) is logically a function that takes a B and returns a distribution of A, and similarly for P(B|A).
But so far we’ve only seen examples of Bayes’ Theorem for discrete distributions. Fortunately, it turns out that we can do almost the same arithmetic in our weighted non-normalized distributions and get the correct result.
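To see why non-normalized weights are good enough, here is a quick numeric sanity check in Python (the coin names and weights below are invented for illustration, not code from this series):

```python
# Sanity check: Bayes' Theorem works on non-normalized weights, because
# the overall scale of the prior cancels when we normalize the posterior.
prior = {"fair": 3.0, "biased": 1.0}             # weights, not probabilities
likelihood_heads = {"fair": 0.5, "biased": 0.9}  # P(Heads | coin)

unnormalized = {c: prior[c] * likelihood_heads[c] for c in prior}
total = sum(unnormalized.values())
posterior = {c: w / total for c, w in unnormalized.items()}
# Scaling every prior weight by the same positive constant leaves
# `posterior` unchanged.
```

Multiplying both prior weights by, say, ten changes `total` by the same factor, so the posterior probabilities come out identical; that is the cancellation we will rely on below.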
A formal presentation of the continuous version of Bayes’ Theorem, and a proof that it is correct, would require some calculus and would distract from where we want to go in this lesson, so we are just going to wave our hands here. Rest assured that we could put this on a solid theoretical foundation if we chose to.
Let’s think about this in terms of our type system. If we have a prior:
IWeightedDistribution<double> prior = // whatever;
and a likelihood:
Func<double, IWeightedDistribution<Result>> likelihood = // whatever;
then what we want is a function:
Func<Result, IWeightedDistribution<double>> posterior = // ???
Let’s suppose our result is Heads. The question is: what is posterior(Heads).Weight(d) equal to, for any d we care to choose? We just apply Bayes’ Theorem on the weights. That is, this expression should be equal to:

prior.Weight(d) * likelihood(d).Weight(Heads) / ???.Weight(Heads)
We have a problem; we do not have an IWeightedDistribution&lt;Result&gt; to get Weight(Heads) from to divide through. That is: we need to know what the probability is of getting Heads if we sample a coin from the mint, flip it, and do not discard it.
We could estimate it by repeated computation. We could call:

likelihood(prior.Sample()).Sample()

a billion times; the fraction of those samples that are Heads is the weight of Heads.
That sounds expensive though. Let’s give this some more thought.
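To make the cost of that approach concrete, here is a small Python sketch of the brute-force estimate (with a uniform mint standing in for the prior, and far fewer than a billion flips):

```python
import random

random.seed(2)  # fixed seed so the estimate is reproducible

def sample_prior():
    # Stand-in prior: the mint produces coin biases uniformly on [0, 1].
    return random.random()

def flip(bias):
    # The likelihood: flip a coin with the given bias once.
    return "Heads" if random.random() < bias else "Tails"

# Estimate ???.Weight(Heads) by brute force: mint a coin, flip it, repeat.
n = 200_000
heads = sum(1 for _ in range(n) if flip(sample_prior()) == "Heads")
estimate = heads / n  # should be close to 0.5 for a uniform prior
```

Even 200,000 iterations only pins the constant down to a couple of decimal places, and a billion iterations is a lot of work for a number we are about to discover we do not need.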
But do we really need to know what ???.Weight(Heads) is? Whatever it is, it is a positive constant, right? And we have already abandoned the requirement that weights have to be normalized so that the area under the curve is exactly 1.0.
Positive constants do not affect proportionality.
We do not need to compute the denominator at all to solve a continuous Bayesian inference problem; we just assume that the denominator is a positive constant, and so we can ignore it.
posterior(Heads) must produce a distribution such that posterior(Heads).Weight(d) is proportional to:

prior.Weight(d) * likelihood(d).Weight(Heads)
But that is just a non-normalized weight function, and we already have the gear to produce a distribution when we are given a non-normalized weight function: we can use our Metropolis class from two lessons ago. It can take a non-normalized weight function, an initial distribution, and a proposal distribution, and produce a weighted distribution from it.
Notice that we don’t even need a distribution that we can sample from; all we need is its weight function.
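To make that concrete, here is a minimal Python sketch of the Metropolis technique applied to our problem. It assumes a uniform prior on the mint, so after observing heads the non-normalized posterior weight of bias d is just 1 * d (and zero outside [0, 1]); the true posterior is then Beta(2, 1), whose mean is 2/3.

```python
import random

def metropolis(weight, initial, propose, n_samples, burn=1_000):
    """Sample from the distribution proportional to `weight`, given only
    the non-normalized weight function and a symmetric proposal."""
    x, wx = initial, weight(initial)
    samples = []
    for i in range(n_samples + burn):
        candidate = propose(x)
        wc = weight(candidate)
        # Accept with probability min(1, wc / wx); wx stays positive
        # because a zero-weight candidate is always rejected.
        if wc >= wx or random.random() < wc / wx:
            x, wx = candidate, wc
        if i >= burn:
            samples.append(x)
    return samples

def posterior_weight(d):
    # Uniform prior weight (1) times likelihood of heads (d); zero outside.
    return d if 0.0 <= d <= 1.0 else 0.0

random.seed(1)
samples = metropolis(posterior_weight, 0.5,
                     lambda x: x + random.uniform(-0.25, 0.25), 100_000)
mean = sum(samples) / len(samples)  # should be near 2/3
```

Nowhere did we normalize the weight function or compute ???.Weight(Heads); the sampler only ever looks at ratios of weights, which is exactly why the unknown positive constant is irrelevant.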
That was all very abstract, so let’s look at the example we proposed last time: a mint with poor quality control produces coins with a particular probability of coming up heads; that’s our prior.
Therefore we’ll need a PDF that has zero weight for all values less than 0.0 and greater than 1.0. We don’t care if it is normalized or not.
Remember, this distribution represents the quality of coins that come from the mint, and the value produced by sampling this distribution is the bias of the coin towards heads, where 0.0 is “double-tailed” and 1.0 is “double-headed”.
Let’s choose a beta distribution.
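As a preview, a non-normalized weight function for a beta distribution is only a couple of lines; here is a Python sketch (the normalizing constant is deliberately omitted, since a Metropolis-style sampler never needs it, and the parameters 2.0, 2.0 are just an example):

```python
def beta_weight(alpha, beta):
    # Non-normalized Beta(alpha, beta) density: x^(alpha-1) * (1-x)^(beta-1),
    # with zero weight everywhere outside the open interval (0, 1).
    def weight(x):
        if x <= 0.0 or x >= 1.0:
            return 0.0
        return x ** (alpha - 1) * (1.0 - x) ** (beta - 1)
    return weight

w = beta_weight(2.0, 2.0)  # a hump peaked at 0.5, zero at both endpoints
```

This satisfies both requirements above: it is zero outside the unit interval, and we never bothered to divide by the beta function that would normalize it.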