Given a table of $n$ variants $v_i$ with associated minor allele frequencies (MAF) $P(v_i) = p_i$, is it possible to model the distribution of the total number of variants a given individual is expected to carry?
The "ground-truth" distribution can be simulated by generating individuals and, for each one, iterating through the entire list of variants. For each variant, a draw from a uniform distribution determines whether the variant is present in the individual (one Bernoulli trial per variant), and the number of successful trials is that individual's total. A histogram of these totals across all individuals approximates the distribution.
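The simulation described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the repository's actual code: the function name `simulate_counts`, the seed, and the example MAF values are all assumptions made for the example.

```python
import random
from collections import Counter


def simulate_counts(mafs, n_individuals, seed=0):
    """Return the number of variants carried by each simulated individual.

    `mafs` is a list of minor allele frequencies p_i (hypothetical input).
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_individuals):
        # One Bernoulli trial per variant: the variant is present
        # when a uniform draw from [0, 1) falls below p_i.
        carried = sum(1 for p in mafs if rng.random() < p)
        counts.append(carried)
    return counts


# Hypothetical MAF values, for illustration only.
example_mafs = [0.01, 0.05, 0.2, 0.5, 0.3]
counts = simulate_counts(example_mafs, n_individuals=1000)

# Histogram: variant count -> number of individuals with that count.
histogram = Counter(counts)
```

The cost grows with the number of individuals times the number of variants, which is why a closed-form approximation is attractive for large tables.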
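Considering each variant $v_i$ as an independent Bernoulli trial with success probability $p_i$, the total number of variants an individual carries is a sum of $n$ independent, non-identically distributed Bernoulli variables.
As the Lyapunov condition is met, the Central Limit Theorem
(CLT) is applicable. The total number of variants any individual carries is
approximately normally distributed with mean $\mu = \sum_{i=1}^{n} p_i$ and variance $\sigma^2 = \sum_{i=1}^{n} p_i (1 - p_i)$.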
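The parameters of the normal approximation can be computed directly from the MAF table. The sketch below assumes the variants are independent Bernoulli trials, so the sum has mean $\sum_i p_i$ and variance $\sum_i p_i(1 - p_i)$. It also evaluates the Lyapunov ratio for $\delta = 1$, which should shrink as the table grows. The function names and example values are hypothetical and not taken from the repository.

```python
import math
from statistics import NormalDist


def normal_approximation(mafs):
    """Mean and variance of a sum of independent Bernoulli(p_i) variables."""
    mean = sum(mafs)
    variance = sum(p * (1.0 - p) for p in mafs)
    return mean, variance


def lyapunov_ratio(mafs):
    """Lyapunov ratio for delta = 1: sum_i E|X_i - p_i|^3 / s_n^3.

    For a Bernoulli(p) variable, E|X - p|^3 = p(1-p)((1-p)^2 + p^2).
    """
    third_moments = sum(p * (1 - p) * ((1 - p) ** 2 + p ** 2) for p in mafs)
    s_n = math.sqrt(sum(p * (1 - p) for p in mafs))
    return third_moments / s_n ** 3


# Hypothetical table: 100 variants, each with MAF 0.1.
mean, variance = normal_approximation([0.1] * 100)
approx = NormalDist(mu=mean, sigma=math.sqrt(variance))

# The ratio decreases as more variants are added.
ratio_small = lyapunov_ratio([0.1] * 100)
ratio_large = lyapunov_ratio([0.1] * 10000)
```

Since each third absolute moment is bounded by $p_i(1 - p_i)$, the ratio is at most $1/s_n$, so it tends to zero whenever the total variance keeps growing with $n$.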
The program runs two simulations: (1) the "ground-truth" simulation, with one
Bernoulli trial per variant for each individual, followed by a
Kolmogorov-Smirnov (KS) test for normality against the normal distribution
whose mean and variance are estimated from the simulated "ground-truth"
distribution, and (2) a normal distribution with the mean and variance from the
proposed solution above, followed by a KS-test for similarity against the
"ground-truth" distribution. The first KS-test checks whether the Central Limit
Theorem is applicable. The second KS-test checks whether the proposed solution
is similar to the "ground-truth". With a large enough sample size
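The two comparisons can be illustrated with a hand-written one-sample KS statistic. This is a sketch using only the standard library, not the repository's implementation: `ks_statistic`, the randomly generated MAFs, and the sample sizes are assumptions for the example, and only the statistic $D$ is computed (no p-value). In practice a library routine such as `scipy.stats.kstest` would be the usual choice.

```python
import random
from statistics import NormalDist, fmean, pvariance


def ks_statistic(sample, dist):
    """One-sample KS statistic: largest gap between the empirical CDF
    of `sample` and the CDF of `dist`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


rng = random.Random(1)
# Hypothetical MAF table: 500 variants with frequencies in [0.01, 0.5].
mafs = [rng.uniform(0.01, 0.5) for _ in range(500)]
# "Ground-truth" simulation: one Bernoulli trial per variant and individual.
counts = [sum(rng.random() < p for p in mafs) for _ in range(2000)]

# (1) Normal with moments estimated from the simulated counts.
empirical = NormalDist(fmean(counts), pvariance(counts) ** 0.5)
# (2) Normal with moments computed from the MAF table alone.
analytic = NormalDist(sum(mafs), sum(p * (1 - p) for p in mafs) ** 0.5)

d_clt = ks_statistic(counts, empirical)
d_model = ks_statistic(counts, analytic)
```

Because the simulated counts are integers while the normal distribution is continuous, $D$ cannot shrink all the way to zero: the empirical CDF always has steps that the normal CDF lacks.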
To retrieve ten thousand minor allele frequencies from Ensembl BioMart, use
python get_mafs.py 10000 --outfile mafs.tsv
To perform simulations with ten thousand patients, use
python simulate.py 10000 --mafs mafs.tsv