Model comparison#

In practice, it often occurs that we have several candidate models at hand and need to choose the best one for the given data.

This is a tricky task: increasing model complexity by introducing more parameters typically improves the fit to the observed data, but it also creates the risk of overfitting.

Hence, the models we are looking for should not just describe the observed data well but, ideally, the entire “true” data-generating process. We need tools that quantify the degree of “closeness” to the true model. Note that in this context a model refers to the distributional family as well as the parameter values.

We can use the Kullback–Leibler divergence (KLD) to measure the degree of “closeness” between two models \(M_0\) and \(M_1\):

\[ \text{KLD}(M_0 \parallel M_1) = \int p_{M_0}(y) \log \left( \frac{p_{M_0}(y)}{p_{M_1}(y)} \right) dy = \int p_{M_0}(y) \log p_{M_0}(y) dy - \int p_{M_0}(y) \log p_{M_1}(y) dy \]
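As a quick numerical illustration, the KLD between two fully specified models can be evaluated by numerical integration. The sketch below (with purely illustrative parameter values) integrates the definition with scipy and checks the result against the closed-form expression for the KLD between two normal distributions.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Two fully specified "models" (illustrative parameter choices)
p0 = stats.norm(loc=0.0, scale=1.0)   # M0
p1 = stats.norm(loc=1.0, scale=1.5)   # M1

# KLD(M0 || M1) = ∫ p0(y) log( p0(y) / p1(y) ) dy, evaluated numerically
kld, _ = quad(lambda y: p0.pdf(y) * (p0.logpdf(y) - p1.logpdf(y)), -np.inf, np.inf)

# Closed-form KLD between two normals as a sanity check
mu0, s0, mu1, s1 = 0.0, 1.0, 1.0, 1.5
kld_exact = np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

print(kld, kld_exact)  # the two values should agree closely
```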

Task 19

Assume that the ‘true’ model is \(M_0\) and the two candidate models are \(M_1\) and \(M_2\):

  • \(M_0: y \sim \mathcal{N}(3,2)\)

  • \(M_1: y \sim \mathcal{N}(3.5,2.5)\)

  • \(M_2: y \sim \text{Cauchy}(3,2)\)

For these models,

  • Plot the three densities

  • Argue which model you think is better, \(M_1\) or \(M_2\), assuming that the ground truth is given by \(M_0\). (A plotting sketch to get you started is given below.)
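A minimal plotting sketch for the task follows; it assumes the second parameter of the normal distributions is the standard deviation and the second parameter of the Cauchy distribution is its scale.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Candidate models; second parameters assumed to be sd / Cauchy scale
m0 = stats.norm(loc=3.0, scale=2.0)      # M0: "true" model
m1 = stats.norm(loc=3.5, scale=2.5)      # M1
m2 = stats.cauchy(loc=3.0, scale=2.0)    # M2

y = np.linspace(-10, 16, 500)
plt.plot(y, m0.pdf(y), label="$M_0$: N(3, 2)")
plt.plot(y, m1.pdf(y), label="$M_1$: N(3.5, 2.5)")
plt.plot(y, m2.pdf(y), label="$M_2$: Cauchy(3, 2)")
plt.xlabel("y")
plt.ylabel("density")
plt.legend()
plt.show()
```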

Note that the first term in \(\text{KLD}(M_0 \parallel M_1)\), \(\int p_{M_0}(y) \log p_{M_0}(y) dy\), does not depend on the candidate model and is therefore the same for every candidate. Hence, we only need to compare models on the second term \(\int p_{M_0}(y) \log p_{M_1}(y) dy\), which is the expected log predictive density (elpd):

\[ \int p_{M_0}(y) \log p_{M_1}(y) dy = \mathbb{E}_{p_{M_0}}[ \log p_{M_1}]. \]

Since this term enters the KLD with a negative sign, a larger elpd corresponds to a smaller divergence from the true model; hence, a model with a larger elpd is preferred over a model with a smaller elpd.
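To make this concrete, \(\mathbb{E}_{p_{M_0}}[\log p_{M_1}]\) can be approximated by Monte Carlo: draw samples from \(M_0\) and average the candidate model’s log-density over those draws. The sketch below does this for the Task 19 models, under the same parameterisation assumption as in the plotting sketch above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

m0 = stats.norm(loc=3.0, scale=2.0)      # true model M0
m1 = stats.norm(loc=3.5, scale=2.5)      # candidate M1
m2 = stats.cauchy(loc=3.0, scale=2.0)    # candidate M2

# Monte Carlo approximation of E_{M0}[ log p_M(y) ] for each candidate
y = m0.rvs(size=200_000, random_state=rng)
elpd_m1 = m1.logpdf(y).mean()
elpd_m2 = m2.logpdf(y).mean()

print(f"elpd(M1) ≈ {elpd_m1:.3f}, elpd(M2) ≈ {elpd_m2:.3f}")
# the candidate with the larger (less negative) value is preferred
```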

The problem is that in reality we never know the true model \(M_0\), so the elpd has to be approximated from the observed data. Several numerical metrics are commonly used for this purpose in the literature, such as information criteria and cross-validation.

Information criteria#

Akaike Information Criterion (AIC)

\[ \text{AIC} = - 2 \log p(y|\hat{\theta}_\text{MLE}) + 2k, \]

where the first term is \(-2\) times the log-likelihood evaluated at the maximum likelihood estimate \(\hat{\theta}_\text{MLE}\), and \(k\) is the number of parameters.

A lower AIC value indicates a better trade-off between model fit and complexity, implying a better model.

AIC works best when the probability distribution under the candidate model is approximately normal and the sample size is much larger than the number of parameters. No posterior distribution is used, since the criterion is computed only from the MLE, and it does not take any prior information into account.

Bayesian Information Criterion (BIC)

\[ \text{BIC} = - 2 \log p(y|\hat{\theta}_\text{MLE}) + k \ln(n), \]

where \(n\) is the number of data points.

BIC is derived using the Laplace approximation. It is only valid for sample size \(n\) much larger than the number \(k\) of parameters in the model. The BIC is independent of the prior and generally penalizes free parameters more strongly than the Akaike information criterion.
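To illustrate how both criteria are computed, the sketch below fits a simple normal model by maximum likelihood (where the MLE is available in closed form) on simulated data and evaluates the AIC and BIC formulas directly; the data and sample size are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=100)   # simulated data (illustration only)
n = y.size

# MLE for a normal model: sample mean and (biased) sample standard deviation
mu_hat = y.mean()
sigma_hat = y.std(ddof=0)
k = 2                                          # number of estimated parameters

log_lik = stats.norm(loc=mu_hat, scale=sigma_hat).logpdf(y).sum()

aic = -2 * log_lik + 2 * k
bic = -2 * log_lik + k * np.log(n)
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")
```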

Watanabe–Akaike Information Criterion (WAIC)

\[\begin{split} \begin{align*} \text{WAIC} &= - 2 \sum_{i=1}^{n} \log \mathbb{E}[p(y_i | \theta, y)] + 2p_\text{WAIC} \\ &\approx -2 \left( \sum_{i=1}^{n} \log \left( \frac{1}{S} \sum_{s=1}^{S} p(y_i|\theta_s) \right) - \sum_{i=1}^{n} \text{Var}_s \left( \log p(y_i|\theta_s) \right) \right), \end{align*} \end{split}\]

where \( \mathbb{E}[p(y_i | \theta, y)]\) is the posterior mean of the likelihood of the \(i\)-th observation, \(n\) is the number of data points, \(S\) is the number of posterior samples, \(\theta_s\) is the \(s\)-th posterior draw, and \(p_\text{WAIC} = \sum_{i=1}^{n} \text{Var}_s \left( \log p(y_i|\theta_s) \right)\) plays the role of the effective number of parameters.

The WAIC incorporates prior information, and the use of the pointwise log-likelihood makes it more robust in cases where the posterior distribution deviates from normality.
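The WAIC can be computed directly from an \(S \times n\) matrix of pointwise log-likelihood values. The sketch below does this for a toy normal-mean model with known variance and a flat prior, where the posterior for \(\mu\) is available in closed form; with a real model, the matrix would come from your MCMC output, and ArviZ’s az.waic performs the same computation.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(1)

# Toy model: y ~ N(mu, 1) with a flat prior on mu, so mu | y ~ N(ybar, 1/n)
y = rng.normal(loc=0.5, scale=1.0, size=50)
n, S = y.size, 4000
mu_post = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(n), size=S)

# S x n matrix of pointwise log-likelihoods log p(y_i | theta_s)
log_p = stats.norm.logpdf(y[None, :], loc=mu_post[:, None], scale=1.0)

# lppd_i = log( (1/S) * sum_s p(y_i | theta_s) ), computed stably via logsumexp
lppd = logsumexp(log_p, axis=0) - np.log(S)

# p_WAIC = sum_i Var_s( log p(y_i | theta_s) )
p_waic = log_p.var(axis=0, ddof=1).sum()

waic = -2 * (lppd.sum() - p_waic)
print(f"WAIC = {waic:.2f}, effective number of parameters = {p_waic:.2f}")
```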

Leave-One-Out Cross Validation#

Cross-validation splits the available sample into \(k\) parts. The model is then fit on \(k-1\) parts, and predictions are made for the remaining part.

A special case is \(k=n\): each time, \(n-1\) data points are used to estimate the model parameters, and the elpd is estimated for the single observation that was left out. This is called leave-one-out cross-validation (LOO-CV). See Vehtari, Gelman, and Gabry (2016) for the details of how the LOO elpd can be estimated from posterior samples.

We can use tools from the ArviZ library to help us perform model comparison.
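As a minimal sketch of the ArviZ workflow, the example below builds two InferenceData objects by hand with az.from_dict, reusing the toy normal-mean setup from the WAIC example (the second “model” is deliberately mis-centred for illustration). With a probabilistic programming library such as PyMC or Stan, you would instead pass the fitted InferenceData objects, with their stored pointwise log-likelihoods, directly to az.loo and az.compare.

```python
import arviz as az
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(loc=0.5, scale=1.0, size=50)
n, S = y.size, 2000

def make_idata(loc_shift):
    """Build an InferenceData with a posterior for mu and pointwise log-likelihoods.

    loc_shift is a purely illustrative mis-specification of the posterior location.
    """
    mu_post = rng.normal(loc=y.mean() + loc_shift, scale=1.0 / np.sqrt(n), size=(1, S))
    log_lik = stats.norm.logpdf(y[None, None, :], loc=mu_post[..., None], scale=1.0)
    return az.from_dict(
        posterior={"mu": mu_post},        # shape: (chain, draw)
        log_likelihood={"y": log_lik},    # shape: (chain, draw, n)
    )

idata_a = make_idata(loc_shift=0.0)   # "model A"
idata_b = make_idata(loc_shift=0.5)   # "model B", deliberately shifted

print(az.loo(idata_a))                                        # PSIS-LOO elpd for one model
print(az.compare({"model_A": idata_a, "model_B": idata_b}))   # ranking of the two models
```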

Task 20

Download the kidiq dataset (Gelman & Hill, 2007), which contains data from a survey of adult American women and their children. Dating from 2007, it has 434 observations and 4 variables:

  • kid_score: child’s IQ

  • mom_hs: binary/dummy (0 or 1) if the child’s mother has a high school diploma

  • mom_iq: mother’s IQ

  • mom_age: mother’s age

with

```python
import pandas as pd

!wget -O kidiq.csv https://github.com/TuringLang/TuringGLM.jl/raw/main/data/kidiq.csv

df = pd.read_csv('kidiq.csv')
```

Construct a model predicting kid_score:

\[ \text{kidscore}_i \sim \mathcal{N}(\mu_i, \sigma^2), \]
  • Build separate models for the following variations of \(\mu_i\):

    • \(\mu_i = \beta_0 + \beta_1 \text{mom_iq}_i\)

    • \(\mu_i = \beta_0 + \beta_1 \text{mom_iq}_i + \beta_2 \text{mom_hs}_i\)

    • \(\mu_i = \beta_0 + \beta_1 \text{mom_iq}_i + \beta_2 \text{mom_hs}_i + \beta_3 \text{mom_iq}_i \times \text{mom_hs}_i\)

    • \(\mu_i = \beta_0 + \beta_1 \text{mom_iq}_i + \beta_2 \text{mom_hs}_i + \beta_3 \text{mom_iq}_i \times \text{mom_hs}_i + \beta_4 \text{mom_age}_i\)

  • Show the Bayesian workflow for each of the models.

  • Compare the models with each other. Which model would you choose as the final one, and why? (A minimal modelling sketch is given below.)
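A minimal sketch of how the first model could be set up and compared is given below, using PyMC as one possible choice of probabilistic programming framework; the priors are illustrative assumptions and not part of the task, and the remaining models differ only in how mu is constructed.

```python
import arviz as az
import pandas as pd
import pymc as pm

df = pd.read_csv("kidiq.csv")

with pm.Model() as model_1:                        # mu_i = beta0 + beta1 * mom_iq_i
    beta0 = pm.Normal("beta0", mu=0, sigma=100)    # weakly informative priors (assumed)
    beta1 = pm.Normal("beta1", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=20)
    mu = beta0 + beta1 * df["mom_iq"].values
    pm.Normal("kid_score", mu=mu, sigma=sigma, observed=df["kid_score"].values)
    idata_1 = pm.sample(idata_kwargs={"log_likelihood": True}, random_seed=0)

# Fit idata_2, idata_3, idata_4 analogously (adding mom_hs, the interaction, mom_age),
# then rank the models by their estimated elpd:
# az.compare({"m1": idata_1, "m2": idata_2, "m3": idata_3, "m4": idata_4})
print(az.loo(idata_1))
```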