Bayesian inference#

Bayesian inference is a method of statistical inference in which probability is used to update beliefs about a model’s parameters based on available evidence or data.

It offers a conceptual framework for estimating unknown variables while accounting for uncertainty.

By employing a model that describes dependencies of random variables, probability theory can be used to infer all the unknown quantities. All uncertainties, either in observations or model parameters, are modeled as probability distributions.

In short, Bayesian inference is the process of deducing properties of a probability distribution from data using Bayes’ theorem. It incorporates the idea that probability should include a measure of belief about a prediction or outcome.

Group Task

To better understand the role of prior beliefs and subjective probability, discuss with your neighbour the following questions:

  • What is the probability that it will rain tomorrow?

  • What is the probability that the next president of your country will be a woman?

  • What is the probability that aliens built the pyramids?

How do these questions compare to the probability that a die will roll a 6?

Such questions, unlike the die, cannot be answered by “long-run” probability, i.e., probability obtained from multiple repeated runs of the same experiment. A certain degree of belief is involved.

Priors and “subjective” probability are foundational for Bayesian inference!

Bayesian and frequentist paradigms#

In frequentist statistics, probability is interpreted as the frequency of occurrence of events: the frequentist probability of an event is the limit of its relative frequency of occurrence as the experiment is repeated a very large number of times.

The Bayesian approach describes prior knowledge about the parameters governing a phenomenon through probability distributions. New knowledge about these parameters is provided by newly observed data through the likelihood function, which is the probability distribution of the observed data conditioned on the parameters. Through Bayes’ theorem, the prior probability distribution of the parameters is updated with the likelihood function of the observed data, yielding a posterior probability distribution of the parameters.

Probability density#

We have seen examples of probability density functions (PDFs) in the previous chapter. What are the characteristics of a PDF in general?

If \(X\) is a random variable with a probability density function \(p(x),\) the probability of the event that \(X\) is in the interval \((a,b)\) can be computed as

\[ p(X\in(a,b)) = \int_a^b p(x)dx. \]
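
As a quick numerical illustration, the sketch below (assuming SciPy is available, and choosing a standard normal density and the interval purely for illustration) integrates the density over \((a,b)\) and compares the result with the difference of CDF values:

```python
# Numerical check of p(X in (a, b)) = ∫_a^b p(x) dx for a standard normal density.
from scipy import stats, integrate

a, b = -1.0, 2.0                       # hypothetical interval
pdf = stats.norm(loc=0, scale=1).pdf   # density p(x)

# Integrate the density over (a, b)
prob_integral, _ = integrate.quad(pdf, a, b)

# The same probability via the cumulative distribution function
prob_cdf = stats.norm.cdf(b) - stats.norm.cdf(a)

print(prob_integral, prob_cdf)  # both ≈ 0.818
```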

The sum and product rule#

The sum and product rules for probability densities take the form

\[\begin{split} \begin{align*} p(y) = \int p(x,y)dx \quad &\text{- sum rule},\\ p(x,y) = p(y|x) p(x) = p(x|y) p(y) \quad &\text{- product rule}.\\ \end{align*} \end{split}\]

The probability of \(y\), \(p(y)\), is called the marginal probability.

The product rule specifies that the joint probability distribution of two variables can be expressed as the product of a conditional distribution \(p(y|x)\) and a marginal distribution \(p(x)\), or vice-versa.
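
As a concrete sketch, the integrals become sums for a small, made-up discrete joint distribution (all numbers below are hypothetical):

```python
# Sum and product rules for a 2x2 discrete joint distribution p(x, y).
import numpy as np

# Rows index x, columns index y; the entries are hypothetical and sum to 1.
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

# Sum rule: marginals are obtained by summing out the other variable.
p_x = p_xy.sum(axis=1)   # p(x)
p_y = p_xy.sum(axis=0)   # p(y)

# Product rule: p(x, y) = p(y | x) p(x).
p_y_given_x = p_xy / p_x[:, None]
print(np.allclose(p_y_given_x * p_x[:, None], p_xy))  # True
```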

The marginal distribution#

The marginal distribution refers to the probability distribution of one or more variables in a subset of a larger joint probability distribution, obtained by summing or integrating over the other variables.

It provides insights into the individual probabilities or frequencies of specific outcomes for one or more variables, independent of the other variables in a dataset or probability distribution.

For the related quantities \((z, x, y)\), we can specify the product rule

\[ p(z, x, y) = p(z|x, y)p(x, y). \]

Prediction of the unknown \(z\) can be obtained by integration over all the different explanations \((x, y)\):

\[ p(z) = \int p(z|x, y)p(x, y)dx dy. \]

The likelihood function \(p(z|x, y)\) gives the probability of the unknown \(z\) for a particular explanation \((x,y)\), and \(p(x,y)\) gives the weights for every possible explanation.
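
A discrete sketch of this weighted averaging, with randomly generated (purely hypothetical) tables standing in for \(p(x,y)\) and \(p(z|x, y)\):

```python
# Discrete analogue of p(z) = ∫ p(z | x, y) p(x, y) dx dy:
# average the conditional model over all explanations (x, y).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint weights p(x, y) over 3 x-values and 4 y-values.
p_xy = rng.random((3, 4))
p_xy /= p_xy.sum()

# Hypothetical conditional p(z | x, y) over 2 z-values for every (x, y).
p_z_given_xy = rng.random((2, 3, 4))
p_z_given_xy /= p_z_given_xy.sum(axis=0, keepdims=True)

# Weighted average over the explanations gives the marginal p(z).
p_z = np.einsum('zxy,xy->z', p_z_given_xy, p_xy)
print(p_z, p_z.sum())  # a valid distribution over z: sums to 1
```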

Bayes’ theorem#

From the product rule, and with the symmetry property \(p(x|y)p(y) = p(y|x)p(x)\), we immediately derive Bayes’ rule:

\[ p(y|x) = \frac{p(x|y)p(y)}{p(x)} = \frac{p(x|y)p(y)}{\int p(x|y)p(y)dy} \]

which is the key element in Bayesian inference since it defines the posterior density of \(y\), \(p(y|x)\), after including information \(x\) through the conditional probability model \(p(x|y)\). The marginal probability of \(x\), \(p(x)\), acts as a normalization constant ensuring that \(p(y|x)\) is a proper probability density function.
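
For a quantity \(y\) taking only two values, the integral in the denominator becomes a sum and Bayes’ rule can be evaluated directly; the prior and likelihood numbers below are hypothetical:

```python
# Bayes' rule for a two-valued y and a single observation x (hypothetical numbers).
import numpy as np

p_y = np.array([0.3, 0.7])          # prior p(y) over two hypotheses
p_x_given_y = np.array([0.9, 0.2])  # p(x | y): probability of the observed x under each hypothesis

# Denominator: p(x) = sum_y p(x | y) p(y)
p_x = np.sum(p_x_given_y * p_y)

# Posterior p(y | x) = p(x | y) p(y) / p(x)
p_y_given_x = p_x_given_y * p_y / p_x
print(p_y_given_x, p_y_given_x.sum())  # a proper distribution: sums to 1
```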

Bayesian modeling and inference#

Bayesian modeling consists of describing, in mathematical form, all observable data \(y\) and unobservable (“latent”) parameters \(\theta\) by defining the joint probability distribution of data and parameters \(p(y, \theta)\).

We define probability models for the observed quantities given the parameters, \(p(y|\theta)\), and for the unobserved quantities about which we wish to learn, \(p(\theta)\), and combine them through the product rule into a joint probability distribution:

\[ p(y, \theta) = p(y|\theta)p(\theta). \]

The observational model

\[p(y|θ)\]

is a probabilistic model of the observed data that relates \(y\) to the unknown parameters \(\theta\) we want to learn. This model represents the evidence provided by the data. It is the main source of information and is called the likelihood function.

The distribution

\[p(θ)\]

denotes a prior probability distribution for the parameters that encodes our prior knowledge about them. This prior can be informative or non-informative, depending on how much reliable information (knowledge) is available about the parameters. This is one of the key features that differentiates the Bayesian approach from the frequentist one: probability distributions are defined for the unknown quantities (parameters) and combined with the likelihood function.
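
As a sketch of such a model, consider Bernoulli observations with a Beta prior on the success probability \(\theta\) (an illustrative choice; the data and prior parameters below are hypothetical). The joint density is simply likelihood times prior:

```python
# A sketch of a joint model p(y, theta) = p(y | theta) p(theta):
# Bernoulli observations with a Beta prior on theta.
import numpy as np
from scipy import stats

y = np.array([1, 0, 1, 1, 0, 1])  # hypothetical coin-flip data
a, b = 2.0, 2.0                   # hypothetical Beta prior parameters

def log_joint(theta):
    """log p(y, theta) = log p(y | theta) + log p(theta)."""
    log_lik = stats.bernoulli.logpmf(y, theta).sum()   # observational model (likelihood)
    log_prior = stats.beta.logpdf(theta, a, b)         # prior on theta
    return log_lik + log_prior

print(log_joint(0.5), log_joint(0.9))  # higher value = more probable (y, theta) pair
```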

Group Task: Diagnosing cause of headache

Imagine a situation where you need to make a decision concerning your health. You have a headache and can choose between two doctors:

Doctor 1:

  • Has a mental model for the cause of pain.

  • Performs tests.

Doctor 2:

  • Has a mental model for the cause of pain.

  • Has access to the patient’s chronic history.

  • Performs tests.

Which doctor do you choose? Can you make sense of which parts are the data, likelihood and prior in this scenario?

Inference without priors is like a doctor who does not know the patient’s history!

Group Task: Diagnosing COVID-19

We know that the probability of having fever this time of the year is 10%, the probability of having COVID is 7%, and among all people who have COVID, 70% of them have fever.

If you’re a doctor, you don’t know whether someone has COVID until you test them, but they may present with a high temperature and you want to reason about whether to isolate them on that basis! So you are interested in knowing the chance that someone has COVID given that they have a high temperature.

Find the probability that a patient has COVID given they have high temperature (fever).

Parameter inference#

Obtaining the posterior distribution of the unknown parameters is the key element of the Bayesian approach. By restating Bayes’ rule in terms of \(y\) and \(\theta\), we get a formula showing how to compute a posterior:

\[ p(\theta | y) = \frac{p(y |\theta)p(\theta)}{p(y)} = \frac{p(y|\theta)p(\theta)}{ \int p(y|\theta)p(\theta) d\theta} \]

The denominator of Bayes’ rule,

\[ p(y) = \int p(y|\theta)p(\theta) d \theta \]

is the marginal likelihood, also known as the evidence of the model, since it integrates the likelihood over the prior distribution of the parameters. The marginal likelihood normalizes the posterior into a proper probability distribution. The final inference will be a compromise between the evidence provided by the data and the prior information.
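
On a one-dimensional parameter space this integral can be approximated on a grid. The sketch below reuses the hypothetical Bernoulli-Beta model from above and uses the resulting evidence to normalise the posterior:

```python
# Grid approximation of the evidence p(y) = ∫ p(y | theta) p(theta) d(theta)
# for the hypothetical Bernoulli-Beta model used earlier.
import numpy as np
from scipy import stats

y = np.array([1, 0, 1, 1, 0, 1])   # hypothetical coin-flip data
a, b = 2.0, 2.0                    # hypothetical Beta prior parameters

theta_grid = np.linspace(0.001, 0.999, 1000)
dtheta = theta_grid[1] - theta_grid[0]

likelihood = np.array([stats.bernoulli.pmf(y, t).prod() for t in theta_grid])
prior = stats.beta.pdf(theta_grid, a, b)

evidence = np.sum(likelihood * prior) * dtheta        # p(y)
posterior = likelihood * prior / evidence             # normalised posterior on the grid
print(evidence, np.sum(posterior) * dtheta)           # second number ≈ 1
```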

Predictive inference#

The posterior distribution of the parameters can be used to model the uncertainty of predictions \(\tilde{y}\) for new observations. The posterior predictive distribution of \(\tilde{y}\) is obtained by marginalizing out the joint posterior of predictions \(\tilde{y}\) and model parameters \(\theta\) over the model parameters:

\[ p(\tilde{y}|y) = \int p(\tilde{y}, \theta|y)d \theta = \int p(\tilde{y}|\theta, y)p(\theta|y)d\theta. \]

The predictive distribution can also be seen as averaging the predictions of the model \(p(\tilde{y}|\theta, y)\) over the posterior distribution of the model \(p(\theta|y)\).
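
A sketch of this averaging for the hypothetical Bernoulli-Beta model: draw \(\theta\) from a grid approximation of the posterior, then draw one prediction \(\tilde{y}\) per posterior draw:

```python
# Posterior predictive p(y_tilde | y): average the observational model over
# draws of theta from a grid approximation of the posterior (hypothetical model as above).
import numpy as np
from scipy import stats

y = np.array([1, 0, 1, 1, 0, 1])   # hypothetical coin-flip data
a, b = 2.0, 2.0                    # hypothetical Beta prior parameters

theta_grid = np.linspace(0.001, 0.999, 1000)
unnorm = np.array([stats.bernoulli.pmf(y, t).prod() for t in theta_grid]) \
         * stats.beta.pdf(theta_grid, a, b)
weights = unnorm / unnorm.sum()               # grid posterior as a discrete distribution

rng = np.random.default_rng(1)
theta_draws = rng.choice(theta_grid, size=5000, p=weights)
y_tilde = rng.binomial(1, theta_draws)        # one predicted flip per posterior draw

print(y_tilde.mean())                         # ≈ p(y_tilde = 1 | y)
```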

Bayesian inference in a nutshell: likelihood, prior, posterior#

To summarise, here are all of the components of Bayesian inference:

Bayes’ theorem in terms of data \(y\) and model parameters \(\theta\):

\[ p(\theta | y) = \frac{p(y | \theta)p(\theta)}{p(y)} \]
  • The denominator \(p(y)\) is the normaliser or evidence.

  • \(p(\theta)\) is the prior.

  • \(p(y | \theta)\) is the likelihood.

  • and \(p(\theta| y)\) is the posterior.

For this reason, you will often see Bayes’ rule summarised as

posterior \(\propto\) prior \(\times\) likelihood

which ignores the denominator since it is a constant (independent of \(\theta\)). This posterior summarises our belief state about the possible values of \(\theta\).
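
The sketch below illustrates why the constant can be dropped: for the hypothetical Bernoulli-Beta model used earlier, the unnormalised product of prior and likelihood already has the same shape, and hence the same maximiser, as the normalised posterior:

```python
# The unnormalised product prior(theta) * likelihood(theta) already has the shape
# of the posterior: dividing by p(y) only rescales it (hypothetical model as above).
import numpy as np
from scipy import stats

y = np.array([1, 0, 1, 1, 0, 1])
theta_grid = np.linspace(0.001, 0.999, 1000)
dtheta = theta_grid[1] - theta_grid[0]

unnorm = np.array([stats.bernoulli.pmf(y, t).prod() for t in theta_grid]) \
         * stats.beta.pdf(theta_grid, 2.0, 2.0)
posterior = unnorm / (np.sum(unnorm) * dtheta)

# Normalisation does not depend on theta, so the maximiser (and shape) is unchanged.
print(theta_grid[np.argmax(unnorm)] == theta_grid[np.argmax(posterior)])  # True
```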

Role of priors#

One of the key distinctions of the Bayesian approach lies in its incorporation of prior knowledge regarding model parameters. By specifying priors, we are forced to explicitly state all of our assumptions about the model structure and its parameters. At the same time, priors are also the main source of critique of Bayesian inference due to the subjectivity they introduce.

Some non-obvious advantages of Bayesian inference, which are not easy to spot at first glance, are its

  • ability to work with small data,

  • ability to perform model regularisation.

How can we perform Bayesian inference?#

What does it take?

  • Data,

  • A generative model,

  • Our beliefs before seeing the data.

What does it make?

  • The values of parameters that could give rise to the observed data in the form of a distribution.

How can we perform it?

  • Analytically

    Solving the maths! This is an elegant approach. However, it is rarely available in real life.

  • Numerically

    • Rather than deriving a posterior distribution in the closed form, we can use computational tools to sample from the posterior. The obtained samples describe the distributions of parameters.

    • We achieve this by exploring the space of parameters to find the most probable combinations of parameters.

    • Further, we treat the obtained samples as data from which to extract information about the parameters, such as the mean, a credible interval, or other statistics (see the sketch below).
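
Below is a minimal random-walk Metropolis sketch for the hypothetical Bernoulli-Beta model used earlier; it is meant only to illustrate the idea of sampling from the posterior and then summarising the draws, not to be an efficient implementation:

```python
# A minimal random-walk Metropolis sketch: sample theta from the posterior of the
# hypothetical Bernoulli-Beta model, then summarise the draws.
import numpy as np
from scipy import stats

y = np.array([1, 0, 1, 1, 0, 1])   # hypothetical coin-flip data
a, b = 2.0, 2.0                    # hypothetical Beta prior parameters

def log_post(theta):
    """Unnormalised log-posterior: log-likelihood + log-prior."""
    if not 0 < theta < 1:
        return -np.inf
    return stats.bernoulli.logpmf(y, theta).sum() + stats.beta.logpdf(theta, a, b)

rng = np.random.default_rng(42)
theta, draws = 0.5, []
for _ in range(5000):
    proposal = theta + rng.normal(scale=0.1)           # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                               # accept the proposal
    draws.append(theta)                                # otherwise keep the current state

draws = np.array(draws[1000:])                         # discard burn-in
print(draws.mean(), np.percentile(draws, [5, 95]))     # posterior mean and 90% interval
```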

Numerical methods#

  • Markov Chain Monte Carlo (MCMC) family of algorithms, e.g.,

    • Metropolis-Hastings

    • Gibbs

    • Hamiltonian Monte Carlo (HMC)

    • No-U-Turn sampler (NUTS)

    • further variants such as SGHMC, LDHMC, etc.

  • Variational Bayes

  • Approximate Bayesian Computation (ABC)

  • Particle filters

  • Laplace approximation

More on this later! First, let’s discuss some analytics and point estimates.

Task 15: Point estimates for Bernoulli-Beta coin flips

Consider the coin flipping problem with the probability of heads (‘success’) modelled as \(\theta\). The experiment was repeated \(n\) times, and we observed \(h\) ‘successes’. Assume that the coin flips follow the Bernoulli distribution \(\mathcal{Bern}(\theta)\).

  • Derive the maximum likelihood estimate (MLE) for the parameter \(\theta\)

\[ \hat{\theta}_\text{MLE} = {\arg \max}_\theta p(y | \theta) \]
  • How does the amount of data affect the MLE estimate?

  • Derive the maximum a posteriori estimate (MAP) for the parameter \(\theta\), using a \(\mathcal{Beta}(a,b)\) prior

\[ \hat{\theta}_\text{MAP} = {\arg \max}_\theta p(y | \theta)p(\theta) \]
  • How do the amount of data and the parameters of the Beta distribution affect the MAP estimate?

  • Visualise MAP as a function of \(n, h, a, b\) (a numerical sketch for checking your estimates on a grid is given at the end of this section).

  • What is the difference between the two approaches? Does a point estimate tell us anything about our uncertainty or the distribution from which we draw the estimate?

It turns out that, under the independence assumption, minimising the KL divergence is the same as maximising the likelihood!

\[ \arg \min_\theta \mathrm{KLD} (p(y \mid \theta^*) \mid\mid p(y \mid \theta)) = \arg \max_\theta p(y \mid \theta) \]
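
As referenced in the task above, here is a numerical sketch that locates the MLE and MAP on a grid, which can be used to check the closed-form derivations; the values of \(n, h, a, b\) below are hypothetical:

```python
# Numerical sketch for Task 15: locate the MLE and MAP on a grid instead of
# deriving them in closed form (n, h, a, b values below are hypothetical).
import numpy as np
from scipy import stats

n, h = 20, 14          # number of flips and observed heads
a, b = 2.0, 5.0        # Beta prior parameters

theta = np.linspace(0.001, 0.999, 2000)
log_lik = stats.binom.logpmf(h, n, theta)              # log p(y | theta)
log_prior = stats.beta.logpdf(theta, a, b)             # log p(theta)

theta_mle = theta[np.argmax(log_lik)]                  # maximises the likelihood
theta_map = theta[np.argmax(log_lik + log_prior)]      # maximises likelihood x prior

print(theta_mle, theta_map)   # compare with your closed-form derivations
```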