comparing-models.Rmd
This vignette demonstrates model comparison in MALECOT. It covers:
This tutorial assumes some prior knowledge about MALECOT, so if you are completely new to the program we recommend working through the simpler bi-allelic tutorial first.
When carrying out Bayesian analysis it is useful to make a distinction between two types of analysis:
As an example, we might create a model \(\mathcal{M}\) in which we write down the probability of the observed data \(x\) as a function of the unknown allele frequencies \(p\). In other words, we write down the likelihood \(\mbox{Pr}(x \: | \: p, \mathcal{M})\).
In Bayesian parameter estimation we are trying to get at the posterior probability \(\mbox{Pr}(p \: | \: x, \mathcal{M})\). We typically do this using MCMC, which produces a series of draws from the posterior distribution.
But what if we want to know the posterior probability of the model, rather than the parameters of the model? In other words, we want to know \(\mbox{Pr}(\mathcal{M} \: | \: x)\). Calculating this quantity requires that we integrate over all unknown parameters:
\[\mbox{Pr}(\mathcal{M} \: | \: x) = \int \mbox{Pr}(\mathcal{M}, p \: | \: x) dp\]
This is often an extremely high-dimensional integral - for example in MALECOT there could easily be hundreds of unknown parameters - making it computationally infeasible by most methods. Regular MCMC cannot help us here because it only produces draws from the posterior distribution rather than normalised values.
One way around this is to use an advanced MCMC technique known as thermodynamic integration (TI). In TI we run multiple MCMC chains, each at a different “rung” on a temperature ladder. The hotter the chain, the flatter the target distribution. The log-likelihoods over all chains are then combined in a single calculation that - by what can only be described as mathematical magic! - is asymptotically equal to \(\log[\mbox{Pr}(x \: | \: \mathcal{M})]\). We can then apply a prior over models, for example giving each model equal weight, to arrive at the desired posterior value \(\mbox{Pr}(\mathcal{M} \: | \: x)\). Generalised thermodynamic integration (GTI) differs from regular TI in that it uses a slightly different calculation when combining information across rungs that leads to lower bias and higher precision.
Hopefully this is enough background to run thermodynamic MCMC in MALECOT, and to understand the results. For those eager to understand all the mathematical details, see this vignette.
For the sake of this tutorial we will use simulated bi-allelic data drawn from \(K = 3\) subpopulations. We create a new project and bind this data:
# simulate data
mysim <- sim_data(data_format = "biallelic", n = 100, L = 24, K = 3)
# create project and bind data
myproj <- malecot_project()
myproj <- bind_data_biallelic(myproj, df = mysim$data, ID_col = 1, pop_col = 2)
For our first parameter set we will assume a very basic model with default parameters; which means a Poisson prior on COI, a flat prior on allele frequencies and no error estimation:
# create parameter set
myproj <- new_set(myproj, name = "simple model")
We will then run the MCMC for values of \(K\) from 1 to 5. What makes this thermodynamic MCMC rather than regular MCMC is the rungs
argument, which dictates the number of rungs on the temperature ladder. We will opt for 10 rungs for now. Be warned - this MCMC will take considerably longer to run than regular MCMC, so be prepared to go make yourself a cup of tea!
# run thermodynamic MCMC
myproj <- run_mcmc(myproj, K = 1:5, burnin = 1e4, converge_test = 1e2,
samples = 1e4, rungs = 10, pb_markdown = TRUE)
## Running MCMC for K = 1
## Burn-in phase
##
|
|=================================================================| 100%
## converged within 200 iterations
## Sampling phase
##
|
|=================================================================| 100%
## completed in 10.2658 seconds
##
## Running MCMC for K = 2
## Burn-in phase
##
|
|=================================================================| 100%
## converged within 200 iterations
## Sampling phase
##
|
|=================================================================| 100%
## completed in 18.2566 seconds
##
## Running MCMC for K = 3
## Burn-in phase
##
|
|=================================================================| 100%
## converged within 200 iterations
## Sampling phase
##
|
|=================================================================| 100%
## completed in 22.5467 seconds
##
## Running MCMC for K = 4
## Burn-in phase
##
|
|=================================================================| 100%
## converged within 200 iterations
## Sampling phase
##
|
|=================================================================| 100%
## completed in 28.0845 seconds
##
## Running MCMC for K = 5
## Burn-in phase
##
|
|=================================================================| 100%
## converged within 200 iterations
## Sampling phase
##
|
|=================================================================| 100%
## completed in 32.6704 seconds
##
## Processing results
## Total run-time: 1.9 minutes
Notice that convergence times are much longer than under regular MCMC. This is because the MCMC is only deemed to have converged when every chain has converged, meaning we are only as strong as the weakest link. When running thermodynamic MCMC it is therefore important to keep an eye on convergence, and to increase the number of burn-in iterations if needed.
There are more moving parts in thermodynamic MCMC, meaning we have more things to check. First, we should perform the same diagnostic checks as for regular MCMC:
plot_loglike_dignostic(myproj, K = 3)
The plot_loglike_diagnostic()
function uses the cold rung by default (in this case the 10th rung), or we can use the rung
argument to produce plots for any given rung, for example:
plot_loglike_dignostic(myproj, K = 3, rung = 1)
Running get_ESS()
now prints out the effective sample size of every rung:
get_ESS(myproj, K = 3)
## rung1 rung2 rung3 rung4 rung5 rung6 rung7
## 782.7099 729.6338 1521.1626 1693.9167 1719.9875 895.9055 1416.8943
## rung8 rung9 rung10
## 1386.5878 1495.3028 1377.8000
We interpret ESS for hot chains the same way as for the cold chain - as the number of independent samples that we have obtained from the target distribution once autocorrelation has been accounted for. If we see small values of the ESS (less than one thousand as a rule of thumb) then we should be concerned that that particular rung has has not explored the space well, and we should repeat the analysis with a larger number of samples. For the sake of this tutorial we will load in the result of re-running this MCMC with samples = 1e5
, which took around 20 minutes.
A useful plotting function when working with multiple rungs is the plot_loglike()
function, which plots the 95% quantiles of the log-likelihood over every rung:
plot_loglike(myproj, K = 3)
This plot should be always-increasing from left to right, aside from small variations due to the random nature of MCMC. If any rungs stand out from the overall pattern then it is worth running plot_loglike_diagnostic()
on these particular rungs, as they may have mixed badly. As with all other checks, this should be performed on every explored value of \(K\).
The GTI method takes the log-likelihood values above and multiplies them by weights, leading to a GTI “path”. The area between this path and the zero line is our estimate of the log-evidence. We can visualise this path using the plot_GTI_path()
function:
plot_GTI_path(myproj, K = 3)
As our evidence estimate is derived from the area between the line and zero, it is important that this path is not too jagged. We are essentially approximating a smooth curve with a series of straight line segments, so if the straight lines cut the corners off the smooth curve then the area will be too large or too small, leading to a biased estimate. There are two ways that we can mitigate this:
GTI_pow
argumentIncreasing the number of rungs will obviously lead to a smoother path, and it will also help with MCMC mixing, but it comes at the cost of slowing down the MCMC. The GTI_pow
argument changes the curvature of the path by modifying the log-likelihood weights, with large values leading to a more concave path. Ideally we want to choose GTI_pow
such that the path is as straight as possible, as this will lead to smallest difference between the true curve and the discrete approximation. The plot above is slightly concave using the default value GTI_pow = 3
, but it is not pathalogically concave. If we were producing results for publication and had a computer sitting idle overnight then we could consider re-running the MCMC with more rungs and a lower GTI_pow
, but otherwise these results are fine.
Once we are happy with our log-likelihood estimates and our GTI path, we can use our log-evidence estimates to compare values of \(K\). The raw estimates can be plotted using the plot_logevidence_K()
function, which plots 95% credible intervals of the log-evidence for each value of \(K\):
plot_logevidence_K(myproj)
We can see immediately that the model favours \(K = 3\) in this case, which agrees with our simulation parameters. If we had seen large credible intervals at this stage then we could re-run the model with a larger number of samples
, but in this example the intervals are non-overlapping and the signal is clear so there is no need.
We can also use the plot_posterior_K()
function to plot the posterior probability of each value of \(K\), which is obtained by taking these raw estimates out of log space and applying an equal prior over \(K\):
plot_posterior_K(myproj)
This second plot is often easier to interpret than the first as it is in units of ordinary probability. In this case we can see that there is a >99% posterior probability that the data were drawn from \(K = 3\) subpopulations (which we know to be true). Again, if we saw large credible intervals at this stage then it would be worth re-running the MCMC with a larger number of samples
, but here there is no need.
One of the major advantages of the model evidence is that we can use it to compare wider evolutionary models, i.e. different parameter sets. Mathematically this is as simple as summing the evidence for each value of \(K\), weighted by the prior.
We can test this by creating a second parameter set that differs from the first in that in that we now estimate the error terms:
# create new parameter set
myproj <- new_set(myproj, name = "error model", estimate_error = TRUE)
myproj
## DATA:
## data format = biallelic
## samples = 100
## loci = 24
## pops = 3
## missing data = 0 of 2400 gene copies (0%)
##
## PARAMETER SETS:
## SET1: simple model
## * SET2: error model
##
## ACTIVE SET: SET2
## lambda = 1
## COI model = poisson
## COI max = 20
## estimate COI mean = TRUE
## estimate error = TRUE
## e1_max = 0.2
## e2_max = 0.2
We then need to run thermodynamic MCMC for this parameter set over the same range of \(K\) values. For the sake of this tutorial we will save time by loading results from file:
## Running MCMC for K = 1
## Burn-in phase
##
|
|====== | 10%
|
|============= | 20%
|
|==================== | 30%
|
|========================== | 40%
|
|================================ | 50%
|
|======================================= | 60%
|
|============================================== | 70%
|
|==================================================== | 80%
|
|========================================================== | 90%
|
|=================================================================| 100%
## Warning: convergence still not reached within 10 iterations
## Sampling phase
##
|
|====== | 10%
|
|============= | 20%
|
|==================== | 30%
|
|========================== | 40%
|
|================================ | 50%
|
|======================================= | 60%
|
|============================================== | 70%
|
|==================================================== | 80%
|
|========================================================== | 90%
|
|=================================================================| 100%
## completed in 0.0460471 seconds
##
## Running MCMC for K = 2
## Burn-in phase
##
|
|====== | 10%
|
|============= | 20%
|
|==================== | 30%
|
|========================== | 40%
|
|================================ | 50%
|
|======================================= | 60%
|
|============================================== | 70%
|
|==================================================== | 80%
|
|========================================================== | 90%
|
|=================================================================| 100%
## Warning: convergence still not reached within 10 iterations
## Sampling phase
##
|
|====== | 10%
|
|============= | 20%
|
|==================== | 30%
|
|========================== | 40%
|
|================================ | 50%
|
|======================================= | 60%
|
|============================================== | 70%
|
|==================================================== | 80%
|
|========================================================== | 90%
|
|=================================================================| 100%
## completed in 0.065241 seconds
##
## Running MCMC for K = 3
## Burn-in phase
##
|
|====== | 10%
|
|============= | 20%
|
|==================== | 30%
|
|========================== | 40%
|
|================================ | 50%
|
|======================================= | 60%
|
|============================================== | 70%
|
|==================================================== | 80%
|
|========================================================== | 90%
|
|=================================================================| 100%
## Warning: convergence still not reached within 10 iterations
## Sampling phase
##
|
|====== | 10%
|
|============= | 20%
|
|==================== | 30%
|
|========================== | 40%
|
|================================ | 50%
|
|======================================= | 60%
|
|============================================== | 70%
|
|==================================================== | 80%
|
|========================================================== | 90%
|
|=================================================================| 100%
## completed in 0.0808711 seconds
##
## Running MCMC for K = 4
## Burn-in phase
##
|
|====== | 10%
|
|============= | 20%
|
|==================== | 30%
|
|========================== | 40%
|
|================================ | 50%
|
|======================================= | 60%
|
|============================================== | 70%
|
|==================================================== | 80%
|
|========================================================== | 90%
|
|=================================================================| 100%
## Warning: convergence still not reached within 10 iterations
## Sampling phase
##
|
|====== | 10%
|
|============= | 20%
|
|==================== | 30%
|
|========================== | 40%
|
|================================ | 50%
|
|======================================= | 60%
|
|============================================== | 70%
|
|==================================================== | 80%
|
|========================================================== | 90%
|
|=================================================================| 100%
## completed in 0.0964479 seconds
##
## Running MCMC for K = 5
## Burn-in phase
##
|
|====== | 10%
|
|============= | 20%
|
|==================== | 30%
|
|========================== | 40%
|
|================================ | 50%
|
|======================================= | 60%
|
|============================================== | 70%
|
|==================================================== | 80%
|
|========================================================== | 90%
|
|=================================================================| 100%
## Warning: convergence still not reached within 10 iterations
## Sampling phase
##
|
|====== | 10%
|
|============= | 20%
|
|==================== | 30%
|
|========================== | 40%
|
|================================ | 50%
|
|======================================= | 60%
|
|============================================== | 70%
|
|==================================================== | 80%
|
|========================================================== | 90%
|
|=================================================================| 100%
## completed in 0.113186 seconds
# uncomment to run MCMC
#myproj <- run_mcmc(myproj, K = 1:5, burnin = 1e4, converge_test = 1e2, samples = 1e5, rungs = 10)
# plot diagnostics
plot_loglike_dignostic(myproj, K = 3)
# report ESS
get_ESS(myproj, K = 3)
## rung1 rung2 rung3 rung4 rung5 rung6 rung7
## 87878.445 84345.409 44904.486 16722.098 8783.592 2156.834 7503.971
## rung8 rung9 rung10
## 8848.234 8602.125 8910.079
Assuming we are happy with the MCMC behaviour we can go ahead and plot the posterior error estimates:
plot_e(myproj, K = 3)
We know from the simulation parameters that there are no errors in the data, meaning this extra flexibility of estimating the error is actually taking the model further away from the truth. Looking at the posterior credible intervals we can see that the model has struggled to arrive at precise estimates of the error rates, instead exploring the full prior range \([0,0.2]\) quite evenly.
We continue with the same analyses as in the previous section:
# use gridExtra package to arrange plots
library(gridExtra)
# produce plots and arrange in grid
plot1 <- plot_loglike(myproj, K = 3)
plot2 <- plot_GTI_path(myproj, K = 3)
plot3 <- plot_logevidence_K(myproj)
plot4 <- plot_posterior_K(myproj)
grid.arrange(plot1, plot2, plot3, plot4)
As with the previous model, the evidence is in favour of \(K = 3\) in this example.
Finally, we come to the issue of comparing parameter sets. Just as with the evidence estimates over \(K\), we can plot the log-evidence of a parameter set using the plot_logevidence_model()
function, or the posterior distribution over models assuming an equal prior using the plot_posterior_model()
function:
# produce evidence plots over parameter sets
plot1 <- plot_logevidence_model(myproj)
plot2 <- plot_posterior_model(myproj)
grid.arrange(plot1, plot2, nrow = 1)
Here we can see that the former, simpler model has a higher evidence than the more complex error estimation model. To understand this, we can imagine the MCMC exploring the parameter space under each model. In the simple model the error is fixed at the correct value of zero, while in the more complex model the MCMC has a finite chance of exploring different error values, all of which are sub-optimal in terms of the likelihood. So the likelihood will, on average, be higher under the simple model. In other words, in the complex model we have introduced additional flexibility that has not been compensated for by a commensurate increase in the likelihood, and so the model evidence naturally punishes the model for being over-fitted. In this example we would conclude that a zero-error model is a better fit to the data, and so we would most likely only report detailed results from this model.
It is clear from this tutorial that thermodynamic MCMC over multiple temperature rungs can take considerably longer to run than regular MCMC. Hence, the next tutorial descibes how to run MALECOT in parallel over multiple cores.