Skip to content

Generate plink and/or phenotype data for interaction (epistasis) models

Notifications You must be signed in to change notification settings

mfranberg/epigen

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A tool for generating genetic data under epistatic models

Epigen is a Python module that is used to generate genotype and/or phenotype data from an epistatic model for the phenotype. The output format is in plink format to allow for speed and easy integration with other tools.

Installing epigen

Epigen is installed by:

pip install epigen

or

pip install --user epigen

if you want to install the module locally.

Generating data

Epigen generally describes different models as follows. A mean phenotype level is defined for each genotype. If we have the following mean level model:

      AA      Aa      aa
BB  mu_AABB mu_AaBB mu_aaBB 
Bb  mu_AABb mu_AaBb mu_aaBb
bb  mu_AAbb mu_Aabb mu_aabb

then it would be passed to the program in a single line by collapsing the matrix row-wise:

--mu mu_AABB mu_AaBB mu_aaBB mu_AABb mu_AaBb mu_aaBb mu_AAbb mu_Aabb mu_aabb

For example, if we have the following double dominant model for a binary trait:

    AA  Aa  aa
BB  0.5 0.5 0.5 
Bb  0.5 0.7 0.7
bb  0.5 0.7 0.7

we pass it to the program by:

--mu 0.5 0.5 0.5 0.5 0.7 0.7 0.5 0.7 0.7

It is not convenient if you have a regression model and want to convert it to this form. Because you need to evaluate the mean level for each cell, instead of just supplying the betas. Therefore, if you want to parameterize models by beta use the *-glm options instead of *-general options.

Generating case/control genotypes and phenotypes

Case control data can be generated in two ways. Either we do not up-sample the cases and use the population distribution of the effects, or we up-sample.

In the first case, a phenotype file can be generated for a given genotype file (plink file) by:

epigen pheno-general --model binomial --mu 0.5 0.5 0.5 0.5 0.7 0.7 0.5 0.7 0.7 --out plink.pheno plink

This will select a random pair of variants, and generate the phenotype from these according to the model. The ratio of cases to controls is defined directly from the model.

In the second case, both genotypes and phenotypes are generated. The number of cases and controls are prespecified and the genotypes are generated conditional on these numbers. In this case, several pairs of variants can have exactly the same model to the phenotype, which makes this convenient for evaluating power or false positive rates of the test. Data is generated by the following command

epigen pair-general --model binomial --mu 0.5 0.5 0.5 0.5 0.7 0.7 0.5 0.7 0.7 --maf 0.3 0.3\
                   --sample-size 2000 3000 --npairs 100 --out plink

We get 2000 cases and 3000 controls, the two variants in each pair both have 0.3 in minor allele frequency, the resulting plink file will have 100 pairs (i.e. 200 variants), and plink.pairs will contain the names of these pairs.

In a third scenario it can sometimes be convenient to have some pairs use one model and some use another, to model a more realistic situation. In this case we first define a file models.txt that contains the number of pairs, if it is an interaction, and the model on each line.

> cat models.txt
10 1 0.5 0.5 0.5 0.5 0.7 0.7 0.5 0.7 0.7
20 0 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7
1000 0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

The above file contains 10 double dominant models, 20 pairs in which only one variant is associated, and 1000 pairs in which have no association. The data is then generated by

epigen pair-mixed --model-file models.txt --maf 0.3 0.3 --sample-size 2000 3000 --out plink

this command only supports binary phenotypes.

About

Generate plink and/or phenotype data for interaction (epistasis) models

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages