Run MLMM GWAS on large Setaria genotype file
- Initial filter and parsing of genotype file with vcftools
- Do a simple inital filter by phenotyped lines, missingness, maf, etc
- Do additional filtering for lines and MAF on the fly with R tools after splitting into batches
- Additional filtering to keep only high-quality SNPs
- LD prune high-quality SNP set for a set of non-redundant SNPs
All paths are relative to the ./src/ folder.
- Quick script that:
- Downloads and extracts 12.Setaria_598g_8.58M_withRef_imp_phased_maf0.01_FINAL.vcf.bz2 from dropbox into ../data/genotype
- This step is specific to the IR dataset and it generates a file to narrow down the genotype file to just phenotyped lines
- Contains code for QC analysis of traits
- Does a mixed model analysis to determine influence of different variables on traits.
- Calculates BLUEs and BLUPs of the traits based upon models selected by analysis in step 2.
- BLUEs are calculated from this equation:
Trait ~ Genotype:Treatment + (1|Awning) - BLUPs are calculated from this equation:
Trait ~ Treatment + (1|Awning) + (1|Genotype)
- New data file is written to:
../data/2.Setaria_IR_2016_datsetset_GWAS.BLUPsandBLUEs.csv
- Trait heritabilities are written to:
../results/2.setariaIR.H2.csv
- Set of lines present in the phenotype file for genotype screening is in:
../data/genotype/keepLines.txt
- Diagnostic plots are written to:
../results/2.setariaIR.BLUEsAndBlupsVsOriginal.pdf
- Uses vcftools to parse each vcf chromosome file:
- Remove minor alleles at 0.1 with --maf 0.1 --max-maf 0.9
- Make sure there are only (and at least) 2 alleles --min-alleles 2 --max-alleles 2
- Keeps only lines present in
../data/genotype/keepLines.txt - Writes SNP file out in 012 formatted matrix with prefix
3.from12.setaria.maf0.1.maxMissing0.1, rownames (individuals) are in the .012.indv and positions are in .012.pos
- Convert 012 VCF files to a R matrix in format for MLMM
- First it filters out SNPs with a large number of heterzygous calls (>25%)
- This filters 147,824 SNPs and leaves 4,420,660 SNPs remaining on chromosomes 1 to 9
- It also calculates MAF and missing numbers, this file was already imputed by Sujan and the vcftools filtered for MAF, so no additional filtering is done
- The filtered genotype file and SNP info file is written to
../data/genotype/4.FilteredGenotypeFile.MatrixFormat.noscaffold.hetFilter0.25.maf0.1.rda
- Filters out bad quality SNPs based on LD to neighboring SNPs
- After this step there are 1,997,780 SNPs remaining
- The assumption of this step is that neighboring SNPs (those within 2000 base pairs) should be correlated (r2 > 0.5).
- The first step of the filtering is to find a stretch of good SNPs defined as 2 or more SNPs in a row correlated at >0.9 r2
- If a SNP is within 2000 bp of a good set of SNPs and doesn't have a correlation of r2>0.5 with those SNPs it is considered a badly called SNP and removed
- If a SNP is not within 2000 bp of another SNP it is still checked for LD to next closest SNP, if <0.5 it is moved to a list of possible missed SNPs
- Diagnostic plots of before and after correlations to neighboring SNPs are writen to the
../resultsfolder. - Output genotype file is written to
../data/genotype/5.filteredSNPs.2kbDistThresh.0.5neighborLD.rda
- Filters out highly correlated SNPs to further narrow down genotype file and prevent redundant testing
- After this step there are 1,253,863 SNPs
- Final file has an average distance between SNPs of 314 bp and and average correlation between neighboring SNPs of 0.77.
- Tests pairs of SNPs, if correlation between two neighboring SNPs is r2>0.975 then only 1 SNP is kept
- Process is iterated so no two nieghboring SNPs have a r2 correlation >0.975.
- Diagnostic plots are written to
../results/6.FilteredHighSNP.ChromsomeWideNeighboringSNP.LD.subset.pdf - Output is written to
../data/genotype/6.filteredSNPs.noHighCorSNPs.2kbDistThresh.0.5neighborLD.0.975LDfilter.rda