running-in-parallel.Rmd
This vignette demonstrates how to run MALECOT in parallel, making use of multiple cores if they are available. It covers:
This tutorial assumes some prior knowledge about MALECOT, so if you are completely new to the program we recommend working through the simpler bi-allelic tutorial first.
A typical MALECOT analysis involves running the same MCMC repeatedly over a range of values of \(K\). This falls into the class of embarrassingly parallel problems, as it requires no communication between the different MCMCs until they are all finished. It therefore makes sense to distribute MCMCs over different available cores where possible. Fortunately all the work of sending jobs to different cores has been done for us in the “parallel” package.
Ensure the “parallel” package is loaded:
# load the parallel package
library(parallel)
We will need some data to work with, as well as a project and simple parameter set:
# simulate data
mysim <- sim_data(data_format = "biallelic", n = 100, L = 24, K = 3)
# create a new project and bind data
myproj <- malecot_project()
myproj <- bind_data_biallelic(myproj, df = mysim$data, ID_col = 1, pop_col = 2)
# create parameter set
myproj <- new_set(myproj, name = "simple model")
Before running anything in parallel we need to know how many cores our machine has. You may know this number already, but if you don’t then the parallel package has a handy function for detecting the number of cores for you:
cores <- detectCores()
Next we make a cluster object, which creates multiple copies of R running in parallel over different cores. Here we are using all available cores, but if you want to hold some back for other intensive tasks then simply use a smaller number of cores when specifying this cluster.
cl <- makeCluster(cores)
We then run the usual run_mcmc()
function, this time passing in the cluster as an argument. This causes MALECOT to use a clusterApplyLB()
call rather than an ordinary lapply()
call over different values of \(K\). Each value of \(K\) is added to a queue over the specified number of cores - when the first job completes, the next job is placed on the node that has become free and this continues until all jobs are complete.
Note that output is supressed when running in parallel to avoid sending print commands to multiple cores, so you will not see the usual progress bars.
# run MCMC with cluster object
myproj <- run_mcmc(myproj, K = 1:5, burnin = 1e4, converge_test = 1e2, samples = 1e4, cluster = cl)
This should produce identical output to the equivalent MCMC run in serial, but will be much faster.
Finally, it is good practice to shut down the workers once we are finished:
stopCluster(cl)