{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bayesian inference\n",
    "\n",
    "Bayesian inference is a method of statistical inference in which probability is used to update <font color='orange'>beliefs</font> about model's parameters based on available <font color='orange'>evidence</font> or <font color='orange'>data</font>.\n",
    "\n",
    "It offers a conceptual framework for estimating unknown variables while accounting for <font color='orange'>uncertainty</font>. \n",
    "\n",
    "By employing a <font color='orange'>model</font> that describes dependencies of random variables, probability theory can be used to infer all the unknown quantities. All uncertainties, either in observations or model parameters, are modeled as probability distributions.\n",
    "\n",
    "In short, <font color='orange'>Bayesian inference is the process of deducing properties of a probability distribution from data using Bayes’ theorem</font>. It incorporates the idea that probability should include a measure of belief about a prediction or outcome.\n",
    "\n",
    "`````{admonition} Group Task\n",
    "To better understand the role of prior <font color='orange'>beliefs and subjective probability</font>, discuss with your neighbour the following questions:\n",
    "\n",
    "- What is the probability that it will rain tomorrow?\n",
    "- What is the probability that the next president of your country will be a woman?\n",
    "- What is the probability that aliens built the pyramids?\n",
    "\n",
    "How do these questions compare to the probability that a die will roll a 6?\n",
    "`````\n",
    "\n",
    "Such questions, unlike the die, cannot be answered by \"long-run\" probability, i.e., probability obtained from multiple repeated runs of the same experiment. A certain degree of <font color='orange'>belief</font> is involved.\n",
    "\n",
    "<font color='orange'>Priors</font> and \"subjective\" probability are foundational for Bayesian inference!\n",
    "\n",
    "## Bayesian and frequentist paradigms\n",
    "\n",
    "In <font color='orange'>frequentist</font> statistics, probability has to be seen as <font color='orange'>frequency of occurrence of events</font>, i.e. the frequentist probability of an event is the <font color='orange'>limit</font> of its relative frequency of occurrence when the experiment is repeated a very large number of times.\n",
    "\n",
    "The Bayesian approach describes prior knowledge about the parameters governing a phenomena through probability distributions. New knowledge about the parameters governing a phenomena is provided by new observed data described by the likelihood function, which is the probability distribution of the observed data conditioned on the parameters governing the phenomena. Through Bayes’ theorem, <font color='orange'>prior probability distribution</font> of the parameters governing the phenomena <font color='orange'>is updated</font> with the observed data likelihood function, obtaining a <font color='orange'>posterior probability</font> distribution of the parameters governing the phenomena.\n",
    "\n",
    "\n",
    "## Probability density\n",
    "\n",
    "We have seen examples of probability density functions (PDF) in the previous chapter. What are characteristics of a PDF in general?\n",
    "\n",
    "If $X$ is a random variable with a probability density function $p(x),$ the probability of the event that $X$ is in the interval $(a,b)$ can be computed as\n",
    "\n",
    "```{margin}\n",
    "For discrete variables, integration turns into summation.\n",
    "```\n",
    "$$\n",
    "p(X\\in(a,b)) = \\int_a^b p(x)dx.\n",
    "$$\n",
    "\n",
    "## The sum and product rule\n",
    "\n",
    "The sum and product rules for probability densities take the form\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "p(y) = \\int p(x,y)dx \\quad &\\text{- sum rule},\\\\\n",
    "p(x,y) = p(y|x) p(x) = p(x|y) p(y) \\quad &\\text{- product rule}.\\\\\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "The probability of $y$, $p(y)$, is called the <font color='orange'>marginal</font> probability.\n",
    "\n",
    "The product rule specifies that the <font color='orange'>joint</font> probability distribution of two variables can be expressed as the product of a <font color='orange'>conditional distribution</font> $p(y|x)$ and a marginal distribution $p(x)$, or vice-versa.\n",
    "\n",
    "## The marginal distribution\n",
    "\n",
    "The marginal distribution refers to the probability distribution of one or more variables in a subset of a larger joint probability distribution, obtained by summing or integrating over the other variables.\n",
    "\n",
    "It provides insights into the individual probabilities or frequencies of specific outcomes for one or more variables, independent of the other variables in a dataset or probability distribution.\n",
    "\n",
    "For the related quantities $(z, x, y)$, we can specify the product rule\n",
    "\n",
    "$$\n",
    "p(z, x, y) = p(z|x, y)p(x, y),\n",
    "$$\n",
    "\n",
    "Prediction of the unknown $z$ can be obtained by integration over all the different explanations $(x, y)$:\n",
    "\n",
    "$$\n",
    "p(z) = \\int p(z|x, y)p(x, y)dx dy.\n",
    "$$\n",
    "\n",
    "The likelihood function $p(z|x, y)$ gives the probability of the unknown $z$ for a particular explanation $(x,y)$, and $p(x,y)$ gives the weights for every possible explanation.\n",
    "\n",
    "\n",
    "## The Bayes' theorem\n",
    "\n",
    "From the product rule, and with the symmetry property $p(x|y)p(y) = p(y|x)p(x)$, we immediately derive the Bayes’ rule :\n",
    "\n",
    "$$\n",
    "p(y|x) =  \\frac{p(x|y)p(y)}{p(x)} = \\frac{p(x|y)p(y)}{\\int p(x|y)p(y)dy}\n",
    "$$\n",
    "\n",
    "which is the key element in Bayesian inference since it defines the <font color='orange'>posterior density</font> of $y$, $p(y|x)$, after including information $x$ through the conditional probability model $p(x|y)$. The marginal probability of $x$, $p(x)$, makes a normalization constant ensuring that $p(y|x)$ is a proper probability density function."
   ]
  },
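  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hedged numerical sketch with hypothetical numbers (not taken from any dataset), let $y$ denote which of two coins was picked, a fair one or a biased one, each with prior probability $0.5$. Let $x$ be the event that a single flip shows heads, with $p(x|y=\\text{fair}) = 0.5$ and $p(x|y=\\text{biased}) = 0.8$. Since $y$ is discrete, the integral in the denominator becomes a sum:\n",
    "\n",
    "$$\n",
    "p(y=\\text{biased}|x) = \\frac{0.8 \\times 0.5}{0.5 \\times 0.5 + 0.8 \\times 0.5} = \\frac{0.4}{0.65} \\approx 0.62.\n",
    "$$\n",
    "\n",
    "Observing heads raises the probability of the biased coin from the prior value $0.5$ to roughly $0.62$."
   ]
  },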
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bayesian modelling and inference\n",
    "\n",
    "Bayesian modeling consists in describing in a mathematical form all <font color='orange'>observable data</font> $y$, and unobservable <font color='orange'>(\"latent\") parameters</font> $θ$ through defining the joint probability distribution of data and parameters $p(y, \\theta)$.\n",
    "\n",
    "We define probability models for the observed quantities given the parameters $p(y|θ)$, and unobserved quantities about we wish to learn, $p(θ)$, and combine them through the product rule in a joint probability distribution:\n",
    "\n",
    "$$\n",
    "p(y, \\theta) = p(y|\\theta)p(\\theta).\n",
    "$$\n",
    "\n",
    "The <font color='orange'>observational model</font> \n",
    "\n",
    "$$p(y|θ)$$ \n",
    "\n",
    "```{margin}\n",
    "The likelihood here is no different from the likelihood in the frequentist approach, where it also links observed data to the unknown parameters.\n",
    "```\n",
    "is a probabilistic model of the observed data that relates $y$ with the unknown parameters $\\theta$ we want to learn. This model represents the evidence provided by the data. It is the main source of information and is called <font color='orange'>likelihood function</font>. \n",
    "\n",
    "The distribution \n",
    "\n",
    "$$p(θ)$$ \n",
    "\n",
    "denotes a <font color='orange'>prior probability distribution</font> for the parameters, that encodes our prior knowledge about the parameters. This probability distribution can be an informative or non-informative prior distribution, depending on the reliable information (knowledge) available for the parameters. This is one of the key features that differentiate from the frequentist approach, i.e. probability distributions are defined for the unknown quantities (parameters) and combined with the likelihood function.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "`````{admonition} Group Task: Diagnosing cause of headache\n",
    "\n",
    "Imagine a situation where you need make a decision concerning your health. You have a headache, and can choose between two doctors:\n",
    "\n",
    "**Doctor 1:**\n",
    "- Has a mental model for the cause of pain.\n",
    "- Performs tests.\n",
    "\n",
    "**Doctor 2:**\n",
    "- Has a mental model for the cause of pain.\n",
    "- Has access to the patient's chronic history.\n",
    "- Performs tests.\n",
    "\n",
    "Which doctor do you choose? Can you make sense of which parts are the <font color='red'>`data`</font>, <font color='teal'>`likelihood`</font> and <font color='purple'>`prior`</font> in this scenario?\n",
    "`````\n",
    "\n",
    "Inference without priors is like a doctor who does not know the patient's history!\n",
    "\n",
    "`````{admonition} Group Task: Diagnosing COVID-19\n",
    "\n",
    "We know that the probability of having fever this time of the year is 10%, the probability of having COVID is 7%, and among all people who have COVID, 70% of them have fever.\n",
    "\n",
    "If you're a doctor, you don't know whether someone has COVID until you test them, but they may present with a high temperature and you want to reason whether to isolate them on that basis! So you are interested in knowing the chance that someone has COVID given they have a high temperature.\n",
    "\n",
    "Find the probability that a patient has COVID given they have high temperature (fever).\n",
    "`````"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parameter inference\n",
    "\n",
    "Obtaining the posterior distribution of the unknown parameters is the key element of the Bayesian approach. By reframing the Bayes' rule in terms of $y$ and $\\theta$, we get a formula showing how to compute a posterior:\n",
    "\n",
    "```{margin}\n",
    "Note that $p(y)$ does not depend on parameters $\\theta$. Hence, in many practical situaltions, it is enough to compute the posterior up to a constant. We can (and often will) write $p(\\theta|y) \\propto p(y|\\theta)p(\\theta).$\n",
    "```\n",
    "\n",
    "$$\n",
    "p(\\theta | y) = \\frac{p(y |\\theta)p(\\theta)}{p(y)} = \\frac{p(y|\\theta)p(\\theta)}{ \\int p(y|\\theta)p(\\theta) d\\theta}\n",
    "$$\n",
    "\n",
    "The denominator of Bayes’ rule, \n",
    "\n",
    "$$\n",
    "p(y) = \\int p(y|\\theta)p(\\theta) d \\theta\n",
    "$$\n",
    "\n",
    "is the marginal likelihood, as it integrates the likelihood over the prior information of parameters, also known as the <font color='orange'>evidence</font> of the model. The marginal likelihood normalizes the posterior into a proper probability distribution. The final inference will be a compromise between the evidence provided by the data and the prior information.\n",
    "\n",
    "### Predictive inference\n",
    "\n",
    "The posterior distribution of the parameters can be used to model the uncertainty of predictions $\\tilde{y}$ for new observations. The posterior predictive distribution of $\\tilde{y}$ is obtained by marginalizing out the joint posterior of predictions $\\tilde{y}$ and model parameters $\\theta$ over the model parameters:\n",
    "\n",
    "$$\n",
    "p(\\tilde{y}|y) = \\int p(\\tilde{y}, \\theta|y)d \\theta = \\int p(\\tilde{y}|\\theta, y)p(\\theta|y)d\\theta.\n",
    "$$\n",
    "\n",
    "The predictive distribution can also be seen as averaging the predictions of the model $p(\\tilde{y}|\\theta, y)$ over the posterior distribution of the model $p(\\theta|y)$."
   ]
  },
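  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal worked sketch of the posterior predictive, under two assumptions made only for illustration: the parameter takes just two values, and new observations are conditionally independent of the old ones given the parameter, so that $p(\\tilde{y}|\\theta, y) = p(\\tilde{y}|\\theta)$. Suppose $\\theta$ is the probability of heads of a coin, with hypothetical posterior $p(\\theta=0.5|y) = 0.4$ and $p(\\theta=0.8|y) = 0.6$. The integral turns into a sum, and the probability that the next flip shows heads ($\\tilde{y}=1$) is\n",
    "\n",
    "$$\n",
    "p(\\tilde{y}=1|y) = \\sum_{\\theta} p(\\tilde{y}=1|\\theta)p(\\theta|y) = 0.5 \\times 0.4 + 0.8 \\times 0.6 = 0.68.\n",
    "$$\n",
    "\n",
    "The prediction lies between the two candidate values of $\\theta$, weighted by how plausible each one is after seeing the data."
   ]
  },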
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bayesian inference in the nutshell: likelihood, prior, posterior\n",
    "\n",
    "To summarise, here are all of the component of Bayesian inference:\n",
    "\n",
    "The Bayes' theorem in terms of <font color='orange'>data</font> $y$ and <font color='orange'>model parameters</font> $\\theta$:\n",
    "\n",
    "$$\n",
    "p(\\theta | y) = \\frac{p(y | \\theta)p(\\theta)}{p(y)}\n",
    "$$\n",
    "\n",
    "* The denominator $p(y)$ is the *normaliser* or <font color='orange'>evidence</font> .\n",
    "* $p(\\theta)$ is the <font color='orange'>prior</font> .\n",
    "* $p(y | \\theta)$ is the <font color='orange'>likelihood</font> .\n",
    "* and $p(\\theta| y)$ is the <font color='orange'>posterior</font>.\n",
    "\n",
    "For this reason, you will often see Bayes rule summarised as \n",
    "\n",
    "`posterior` $\\propto$ `prior` $\\times$ `likelihood`\n",
    "\n",
    "which ignores the denominator since it is a constant (independent of $y$). This posterior summarises our belief state about the possible values of $\\theta$.\n"
   ]
  },
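  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The role of the normaliser can be seen in a small sketch with assumed numbers (chosen here purely for illustration). Restrict $\\theta$, the probability of heads, to the three values $0.25$, $0.5$ and $0.75$ with a uniform prior of $1/3$ each, and let the data $y$ be one particular sequence of three flips containing two heads, so that $p(y|\\theta) = \\theta^2(1-\\theta)$. Because the prior is constant, it cancels in the normalisation and the posterior is proportional to the likelihood:\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "\\theta = 0.25: \\quad & \\theta^2(1-\\theta) = 0.046875,\\\\\n",
    "\\theta = 0.5: \\quad & \\theta^2(1-\\theta) = 0.125,\\\\\n",
    "\\theta = 0.75: \\quad & \\theta^2(1-\\theta) = 0.140625.\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "These unnormalised values sum to $0.3125$. Dividing each by the sum gives the posterior probabilities $0.15$, $0.40$ and $0.45$, which add up to one."
   ]
  },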
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Role of priors\n",
    "\n",
    "One of the key distinctions of the Bayesian approach lies in its incorporation of prior knowledge regarding model parameters. By stating those priors, we are forced to explicitly state all of out assumptions about the model structure and its parameters. At the same time, priors are also the main source of critique for Bayesian inference due to the subjectivity which they bring.\n",
    "\n",
    "Some non-obvious advantages of Bayesian inference, which are not easy to spot at the first glance, is their\n",
    "- ability to work with small data,\n",
    "- ability to perform model regularisation.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How can we perform Bayesian inference?\n",
    "\n",
    "What does it take?\n",
    "\n",
    "- Data,\n",
    "- A generative model,\n",
    "- Our beliefs before seeing the data.\n",
    "\n",
    "What does it make?\n",
    "\n",
    "- The values of parameters that could give rise to the observed data **in the form of a distribution**.\n",
    "\n",
    "\n",
    "How can we perform it?\n",
    "\n",
    "- **Analytically**\n",
    "        \n",
    "     Solving the maths! This is an elegant approach. However, it is rarely available in real life.\n",
    "\n",
    "- **Numerically**\n",
    "\n",
    "    - Rather than deriving a posterior distribution in the closed form, we can use computational tools to **sample** from the posterior. The obtained samples describe the distributions of parameters.\n",
    "    \n",
    "    - We achieve this by exploring the space of parameters to find the most probable combinations of parameters.\n",
    "    \n",
    "    - Further we treat the obtained sampled as new data, to extract information about parameters, such as mean, credible interval or other statistics.\n",
    "\n",
    "### Numerical methods\n",
    "\n",
    "- Markov Chain Monte Carlo (MCMC) family of algorithm, e.g.,\n",
    "  * Metropolis-Hastings\n",
    "  * Gibbs\n",
    "  * Hamiltonian Monte Carlo (HMC)\n",
    "  * No-U-Turn sampler (NUTS)\n",
    "  * further variants such as SGHMC, LDHMC, etc\n",
    "- Variational Inference\n",
    "- Approximate Bayesian Computation (ABC)\n",
    "- Particle filters\n",
    "- Laplace approximation\n",
    "\n",
    "More on this later! First, let's discuss some analytics and point estimates.    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`````{admonition} Task 15: Point estimates for Bernoulli-Beta coin flips\n",
    ":class: tip\n",
    "Consider the coin flipping problem with probability of heads ('success') being modelled as $\\theta$. The experiment was repeated $n$ times, and we observed $h$ 'sucesses'. Assume that the coin flips follow the Bernoulli distribution $\\mathcal{Bern}(\\theta)$.\n",
    "\n",
    "- Derive the <font color='orange'>maximum likelihood estimate (MLE)</font> for parameter $\\theta$\n",
    "\n",
    "$$\n",
    "\\hat{\\theta}_\\text{MLE} = {\\arg \\max}_\\theta p(y | \\theta)\n",
    "$$\n",
    "\n",
    "- How does the amount of data affect the MLE estimate?\n",
    "\n",
    "- Derive the <font color='orange'>maximum aposteriori estimate (MAP)</font> for parameter $\\theta$, using a $\\mathcal{Beta}(a,b)$ prior\n",
    "\n",
    "$$\n",
    "\\hat{\\theta}_\\text{MAP} = {\\arg \\max}_\\theta p(y | \\theta)p(\\theta)\n",
    "$$\n",
    "\n",
    "- How does the amount of data and parameters of the Beta distribution affect the MAP estimate?\n",
    "\n",
    "- Visualise MAP as a function of $n, h, a, b$.\n",
    "\n",
    "- What is the difference between the two approaches?  Does a point estimate tell us anything about our uncertainty or the distribution from which we draw the estimate?\n",
    "\n",
    "`````"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It turns out that under the independense assumption minimising the KL divergence is the same as maximising the likelihood estimate!\n",
    "\n",
    "$$\n",
    "\\arg \\min_\\theta \\mathrm{KLD} (p(y \\mid \\theta^*) \\mid\\mid p(y \\mid \\theta)) = \\arg \\max_\\theta p(y \\mid \\theta)\n",
    "$$\n"
   ]
  },
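  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A sketch of why this holds, assuming independent and identically distributed observations $y_1, \\dots, y_n$ drawn from $p(y \\mid \\theta^*)$ (the indexed notation is introduced here for illustration). Writing out the KL divergence gives\n",
    "\n",
    "$$\n",
    "\\mathrm{KLD} (p(y \\mid \\theta^*) \\mid\\mid p(y \\mid \\theta)) = \\mathbb{E}_{p(y \\mid \\theta^*)}[\\log p(y \\mid \\theta^*)] - \\mathbb{E}_{p(y \\mid \\theta^*)}[\\log p(y \\mid \\theta)].\n",
    "$$\n",
    "\n",
    "The first term does not depend on $\\theta$, so minimising the KL divergence is equivalent to maximising the expected log-likelihood. Replacing the expectation by an average over the observed sample,\n",
    "\n",
    "$$\n",
    "\\mathbb{E}_{p(y \\mid \\theta^*)}[\\log p(y \\mid \\theta)] \\approx \\frac{1}{n}\\sum_{i=1}^{n} \\log p(y_i \\mid \\theta) = \\frac{1}{n} \\log \\prod_{i=1}^{n} p(y_i \\mid \\theta),\n",
    "$$\n",
    "\n",
    "where the product is the likelihood of the whole sample under independence. Maximising it over $\\theta$ is exactly maximum likelihood estimation. The correspondence is approximate for finite $n$ and becomes exact as $n \\to \\infty$."
   ]
  },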
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}