SetariaGWASPipeline

Run MLMM GWAS on large Setaria genotype file

Initial filter and parsing of genotype file with vcftools

Do a simple initial filter by phenotyped lines, missingness, MAF, etc.
- Do additional filtering for lines and MAF on the fly with R tools after splitting into batches

Additional filtering to keep only high-quality SNPs
LD-prune the high-quality SNP set to obtain a set of non-redundant SNPs

Scripts

All paths are relative to the ./src/ folder.

1.getSetariaHapmap.bash

Quick script that:
1. Downloads and extracts 12.Setaria_598g_8.58M_withRef_imp_phased_maf0.01_FINAL.vcf.bz2 from Dropbox into ../data/genotype

2.setariaIRdataAnalysis.R

This step is specific to the IR dataset. It generates a file used to narrow the genotype file down to just the phenotyped lines.

Contains code for QC analysis of traits
Does a mixed model analysis to determine influence of different variables on traits.
Calculates BLUEs and BLUPs of the traits based on the models selected by the mixed model analysis.

BLUEs are calculated from this equation: Trait ~ Genotype:Treatment + (1|Awning)
BLUPs are calculated from this equation: Trait ~ Treatment + (1|Awning) + (1|Genotype)
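
As a reading aid, the two formulas above can be written out as linear mixed models. This is a sketch that assumes the formulas follow the usual R mixed-model formula notation, where `(1|X)` is a random intercept for `X`; the README does not name the R package, and the symbols below are introduced here for illustration only. Here $y_{ijk}$ is the trait value for genotype $i$ under treatment $j$ in awning $k$.

```latex
% BLUE model: Trait ~ Genotype:Treatment + (1|Awning)
% Fixed effect for each genotype-by-treatment combination, random awning intercept.
y_{ijk} = \mu + (GT)_{ij} + a_k + \varepsilon_{ijk},
\qquad a_k \sim \mathcal{N}(0, \sigma_a^2),
\quad \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma_\varepsilon^2)

% BLUP model: Trait ~ Treatment + (1|Awning) + (1|Genotype)
% Fixed treatment effect, random intercepts for awning and genotype.
y_{ijk} = \mu + T_j + a_k + g_i + \varepsilon_{ijk},
\qquad g_i \sim \mathcal{N}(0, \sigma_g^2),
\quad a_k \sim \mathcal{N}(0, \sigma_a^2),
\quad \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
```

Under this reading, the BLUEs are the estimated fixed effects $(GT)_{ij}$ and the BLUPs are the predicted random genotype effects $g_i$.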

New data file is written to:

../data/2.Setaria_IR_2016_datsetset_GWAS.BLUPsandBLUEs.csv

Trait heritabilities are written to:

../results/2.setariaIR.H2.csv

Set of lines present in the phenotype file for genotype screening is in:

../data/genotype/keepLines.txt

Diagnostic plots are written to:

../results/2.setariaIR.BLUEsAndBlupsVsOriginal.pdf

3.runVCFhapmapfilter.condor

Uses vcftools to parse each VCF chromosome file:
1. Removes SNPs with a minor allele frequency below 0.1 using --maf 0.1 --max-maf 0.9
2. Keeps only biallelic sites (exactly 2 alleles) using --min-alleles 2 --max-alleles 2
3. Keeps only lines present in ../data/genotype/keepLines.txt
4. Writes the SNP file out as a 012-formatted matrix with prefix 3.from12.setaria.maf0.1.maxMissing0.1; row names (individuals) are in the .012.indv file and positions are in the .012.pos file
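
The filters listed above map onto vcftools options roughly as follows. This is a minimal Python sketch that only assembles the argument list and does not run anything; it is not the contents of the .condor file. The function name and the example input path `chr1.vcf` are hypothetical. The output prefix mentions `maxMissing0.1`, but no missingness option is listed in the steps above, so none is included here.

```python
def build_vcftools_command(vcf_path, keep_file, out_prefix,
                           maf="0.1", max_maf="0.9"):
    """Return the vcftools argument list for one chromosome VCF.

    Nothing is executed; pass the list to subprocess.run() to actually run it.
    """
    return [
        "vcftools",
        "--vcf", str(vcf_path),      # uncompressed VCF for one chromosome
        "--maf", str(maf),           # drop sites with MAF below this value
        "--max-maf", str(max_maf),   # upper bound on the minor allele frequency
        "--min-alleles", "2",        # together with --max-alleles:
        "--max-alleles", "2",        # keep biallelic sites only
        "--keep", str(keep_file),    # one individual ID per line
        "--012",                     # write .012, .012.indv and .012.pos files
        "--out", str(out_prefix),    # prefix for the output files
    ]


if __name__ == "__main__":
    cmd = build_vcftools_command(
        "../data/genotype/chr1.vcf",           # hypothetical input name
        "../data/genotype/keepLines.txt",
        "../data/genotype/3.from12.setaria.maf0.1.maxMissing0.1",
    )
    print(" ".join(cmd))
```

Building the command as a list keeps each option next to its value, which makes it easy to check against the numbered steps.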

4.convertGenotypeFile.R

Converts the 012 files written by vcftools to an R matrix in the format for MLMM
1. First it filters out SNPs with a large proportion of heterozygous calls (>25%)
- This filters 147,824 SNPs and leaves 4,420,660 SNPs remaining on chromosomes 1 to 9
2. It also calculates MAF and missing-call counts. The file was already imputed by Sujan and filtered for MAF by vcftools, so no additional filtering is done
3. The filtered genotype file and SNP info file are written to ../data/genotype/4.FilteredGenotypeFile.MatrixFormat.noscaffold.hetFilter0.25.maf0.1.rda
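
To illustrate the steps above, here is a minimal Python sketch of reading vcftools 012 output and applying the heterozygosity filter. It is not the R script itself. The function names are hypothetical, and it assumes that the heterozygous fraction is computed over non-missing calls and that SNPs with no calls at all are dropped. In the 012 format each row is one individual, the first column is a row index, genotypes are coded 0, 1 or 2, and missing calls are -1.

```python
def read_012(prefix):
    """Load <prefix>.012, .012.indv and .012.pos written by vcftools --012."""
    with open(prefix + ".012.indv") as fh:
        individuals = [line.strip() for line in fh if line.strip()]
    with open(prefix + ".012.pos") as fh:
        positions = [(chrom, int(pos))
                     for chrom, pos in (line.split() for line in fh if line.strip())]
    matrix = []
    with open(prefix + ".012") as fh:
        for line in fh:
            fields = line.split()
            if fields:
                # Skip the leading row index; the rest are genotype codes.
                matrix.append([int(v) for v in fields[1:]])
    return individuals, positions, matrix


def het_filter(matrix, max_het=0.25):
    """Return column indices of SNPs with a heterozygous fraction <= max_het."""
    keep = []
    for j in range(len(matrix[0])):
        calls = [row[j] for row in matrix if row[j] != -1]
        if not calls:
            continue  # assumption: SNPs with no calls are dropped
        het = sum(1 for g in calls if g == 1) / len(calls)
        if het <= max_het:
            keep.append(j)
    return keep


def minor_allele_freq(matrix, j):
    """Minor allele frequency of SNP j over non-missing calls."""
    calls = [row[j] for row in matrix if row[j] != -1]
    p = sum(calls) / (2 * len(calls))
    return min(p, 1 - p)
```

A typical use would be `individuals, positions, matrix = read_012(prefix)` followed by `het_filter(matrix)` to get the SNP columns to retain.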

5.filterGenoforLowQualitySNPs.R

Filters out bad quality SNPs based on LD to neighboring SNPs
After this step there are 1,997,780 SNPs remaining
The assumption of this step is that neighboring SNPs (those within 2000 base pairs) should be correlated (r² > 0.5).
1. The first step of the filtering is to find a stretch of good SNPs defined as 2 or more SNPs in a row correlated at >0.9 r²
2. If a SNP is within 2000 bp of a good set of SNPs and doesn't have a correlation of r²>0.5 with those SNPs it is considered a badly called SNP and removed
3. If a SNP is not within 2000 bp of another SNP, it is still checked for LD to the next closest SNP; if r² < 0.5, it is moved to a list of possible missed SNPs
4. Diagnostic plots of before and after correlations to neighboring SNPs are written to the ../results folder.
5. Output genotype file is written to ../data/genotype/5.filteredSNPs.2kbDistThresh.0.5neighborLD.rda
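
The filtering rules above can be sketched as follows for a single chromosome. This is a simplified Python illustration, not the R script, and it rests on several assumptions: a "good" stretch is any pair of adjacent SNPs with r² > 0.9; a SNP near good SNPs is kept if its highest r² with any good SNP within 2000 bp exceeds 0.5; and SNPs that have neighbors within 2000 bp but no good SNPs nearby are kept, since the text does not cover that case. All function and variable names are hypothetical.

```python
def r2(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation is undefined, treat as 0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def classify_snps(positions, genotypes, window=2000, good_r2=0.9, min_r2=0.5):
    """Return (keep, removed, possible_missed) index lists.

    positions: sorted base-pair positions on one chromosome.
    genotypes: genotypes[i] is the 0/1/2 vector for SNP i.
    """
    n = len(positions)
    # Rule 1: adjacent SNPs correlated above good_r2 form a good stretch.
    good = set()
    for i in range(n - 1):
        if r2(genotypes[i], genotypes[i + 1]) > good_r2:
            good.update((i, i + 1))

    keep, removed, possible_missed = [], [], []
    for i in range(n):
        if i in good:
            keep.append(i)
            continue
        near_good = [j for j in good if abs(positions[j] - positions[i]) <= window]
        if near_good:
            # Rule 2: must correlate with the nearby good SNPs.
            best = max(r2(genotypes[i], genotypes[j]) for j in near_good)
            (keep if best > min_r2 else removed).append(i)
            continue
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < n]
        if not neighbors:
            keep.append(i)
            continue
        nearest = min(neighbors, key=lambda j: abs(positions[j] - positions[i]))
        isolated = abs(positions[nearest] - positions[i]) > window
        # Rule 3: isolated SNPs in low LD with their nearest SNP are set aside.
        if isolated and r2(genotypes[i], genotypes[nearest]) < min_r2:
            possible_missed.append(i)
        else:
            keep.append(i)
    return keep, removed, possible_missed
```

The possible missed SNPs are returned separately rather than discarded, so they can be inspected before deciding what to do with them.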

6.performLDfilteringOfSNPs.R

Filters out highly correlated SNPs to further narrow down genotype file and prevent redundant testing
After this step there are 1,253,863 SNPs
The final file has an average distance between SNPs of 314 bp and an average correlation between neighboring SNPs of 0.77.

Tests pairs of neighboring SNPs: if the correlation between two neighboring SNPs is r² > 0.975, only 1 SNP is kept.

The process is iterated so that no two neighboring SNPs have an r² correlation > 0.975.

Diagnostic plots are written to ../results/6.FilteredHighSNP.ChromsomeWideNeighboringSNP.LD.subset.pdf
Output is written to ../data/genotype/6.filteredSNPs.noHighCorSNPs.2kbDistThresh.0.5neighborLD.0.975LDfilter.rda
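
The pruning described in this step can be illustrated with a single greedy pass, shown here as a minimal Python sketch rather than the R script. It assumes that the first SNP of each highly correlated run is the one kept (the text does not say which of the two SNPs is retained); the function names are hypothetical. Because every SNP is compared against the most recently kept SNP, no two neighboring SNPs in the result exceed the threshold, which is the end state the iteration aims for.

```python
def r2(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation is undefined, treat as 0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, max_r2=0.975):
    """Return indices of SNPs kept after pruning neighboring SNPs in high LD.

    genotypes: genotypes[i] is the 0/1/2 vector for SNP i, in position order.
    """
    if not genotypes:
        return []
    kept = [0]
    for i in range(1, len(genotypes)):
        # Compare against the last kept SNP, which will be its neighbor
        # in the pruned set.
        if r2(genotypes[kept[-1]], genotypes[i]) <= max_r2:
            kept.append(i)
    return kept
```

For example, `ld_prune([a, a, a, b])` with two uncorrelated vectors `a` and `b` keeps the first copy of `a` and then `b`.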
