Epigen is a Python module that is used to generate genotype and/or phenotype data from an epistatic model for the phenotype. The output format is in plink format to allow for speed and easy integration with other tools.
Epigen is installed by:
pip install epigen
or
pip install --user epigen
if you want to install the module locally.
Epigen generally describes different models as follows. A mean phenotype level is defined for each genotype. If we have the following mean level model:
AA Aa aa
BB mu_AABB mu_AaBB mu_aaBB
Bb mu_AABb mu_AaBb mu_aaBb
bb mu_AAbb mu_Aabb mu_aabb
then it would be passed to the program in a single line by collapsing the matrix row-wise:
--mu mu_AABB mu_AaBB mu_aaBB mu_AABb mu_AaBb mu_aaBb mu_AAbb mu_Aabb mu_aabb
For example, if we have the following double dominant model for a binary trait:
AA Aa aa
BB 0.5 0.5 0.5
Bb 0.5 0.7 0.7
bb 0.5 0.7 0.7
we pass it to the program by:
--mu 0.5 0.5 0.5 0.5 0.7 0.7 0.5 0.7 0.7
It is not convenient if you have a regression model and want to convert it to this form. Because you need to evaluate the mean level for each cell, instead of just supplying the betas. Therefore, if you want to parameterize models by beta use the *-glm options instead of *-general options.
Case control data can be generated in two ways. Either we do not up-sample the cases and use the population distribution of the effects, or we up-sample.
In the first case, a phenotype file can be generated for a given genotype file (plink file) by:
epigen pheno-general --model binomial --mu 0.5 0.5 0.5 0.5 0.7 0.7 0.5 0.7 0.7 --out plink.pheno plink
This will select a random pair of variants, and generate the phenotype from these according to the model. The ratio of cases to controls is defined directly from the model.
In the second case, both genotypes and phenotypes are generated. The number of cases and controls are prespecified and the genotypes are generated conditional on these numbers. In this case, several pairs of variants can have exactly the same model to the phenotype, which makes this convenient for evaluating power or false positive rates of the test. Data is generated by the following command
epigen pair-general --model binomial --mu 0.5 0.5 0.5 0.5 0.7 0.7 0.5 0.7 0.7 --maf 0.3 0.3\
--sample-size 2000 3000 --npairs 100 --out plink
We get 2000 cases and 3000 controls, the two variants in each pair both have 0.3 in minor allele frequency, the resulting plink file will have 100 pairs (i.e. 200 variants), and plink.pairs will contain the names of these pairs.
In a third scenario it can sometimes be convenient to have some pairs use one model and some use another, to model a more realistic situation. In this case we first define a file models.txt that contains the number of pairs, if it is an interaction, and the model on each line.
> cat models.txt
10 1 0.5 0.5 0.5 0.5 0.7 0.7 0.5 0.7 0.7
20 0 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7
1000 0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
The above file contains 10 double dominant models, 20 pairs in which only one variant is associated, and 1000 pairs in which have no association. The data is then generated by
epigen pair-mixed --model-file models.txt --maf 0.3 0.3 --sample-size 2000 3000 --out plink
this command only supports binary phenotypes.