Simulate biallelic data from a simple statistical model. Inputs include the complexity of infection (COI), population-level allele frequencies (PLAF) and some parameters dicating skew and error distributions. Outputs include the phased haplotypes and the un-phased read count and coverage data.

sim_biallelic(COI = 3, PLAF = runif(10, 0, 0.5), coverage = 100,
  prob_missing = 0.1, alpha = 1, overdispersion = 0, epsilon = 0)

Arguments

COI

complexity of infection.

PLAF

vector of population-level allele frequencies at each locus.

coverage

coverage at each locus. If a single value then the same coverage is applied over all loci.

prob_missing

the probabilty of each locus failing. If a locus fails then coverage is set to NA. Can be a vector of values over all loci, or a single value in which case the same value applies over all loci.

alpha

shape parameter of the symmetric Dirichlet prior on strain proportions.

overdispersion

the extent to which counts are over-dispersed relative to the binomial distribution. Counts are Beta-binomially distributed, with the beta distribution having shape parameters p/overdispersion and (1-p)/overdispersion.

epsilon

the probability of a single read being mis-called as the other allele. Applies in both directions.

Details

Simulated data are drawn from a simple statistical model:

  1. Strain proportions are drawn from a symmetric Dirichlet distribution with shape parameter alpha.

  2. Phased haplotypes are drawn at every locus, one for each COI. The allele at each locus is drawn from a Bernoulli distribution with probability given by the PLAF.

  3. The "true" within-sample allele frequency at every locus is obtained by multiplying haplotypes by their strain proportions, and summing over haplotypes. Errors are introduced through the equation $$wsaf_error = wsaf*(1-e) + (1-wsaf)*e$$where \(wsaf\) is the WSAF without error and \(e\) is the error parameter epsilon.

  4. Final read counts are drawn from a beta-binomial distribution with expectation \(w_error\). The raw number of draws is given by the coverage, and the skew of the distribution is given by the overdispersion parameter. If overdispersion = 0 then the distribution is binomial, rather than beta-binomial.