This pipeline performs comprehensive Genome-Wide Association Study (GWAS) analysis, including quality control, population stratification analysis, and association testing for case-control studies.
The pipeline requires PLINK binary format files (.bed, .bim, .fam). If you have VCF files, convert them first:
plink --vcf yourfile.vcf.gz --make-bed --out yourdata
For ~900K SNPs and ~2.6K individuals: 1-4 hours on a cluster with 16 CPUs and 64GB RAM. Smaller datasets run faster.
Yes, but for large datasets (>500K SNPs, >1K individuals), a cluster is recommended. Reduce the dataset or adjust memory settings for local runs.
In the .fam file, the phenotype (column 6) is coded as:
- 1 = Control/Unaffected
- 2 = Case/Affected
- 0 or -9 = Missing phenotype
Sex (column 5 of the .fam file) is coded as:
- 1 = Male
- 2 = Female
- 0 = Unknown
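A quick way to confirm the coding before a run is to tally the sex and phenotype columns. The sketch below assumes a standard six-column, whitespace-separated .fam file; the function name `summarize_fam` and the file name in the usage comment are illustrative and not part of the pipeline.

```python
from collections import Counter

# Codes as described above (column 5 = sex, column 6 = phenotype).
SEX_LABELS = {"1": "male", "2": "female", "0": "unknown"}
PHENO_LABELS = {"1": "control", "2": "case", "0": "missing", "-9": "missing"}


def summarize_fam(lines):
    """Count sex and phenotype codes in the lines of a PLINK .fam file."""
    sex_counts, pheno_counts = Counter(), Counter()
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6:
            raise ValueError(
                f"line {line_no}: expected 6 columns, found {len(fields)}"
            )
        sex, pheno = fields[4], fields[5]
        # Any code outside the tables above is tallied as "unexpected".
        sex_counts[SEX_LABELS.get(sex, "unexpected")] += 1
        pheno_counts[PHENO_LABELS.get(pheno, "unexpected")] += 1
    return dict(sex_counts), dict(pheno_counts)


# Usage (hypothetical file name):
# with open("yourdata.fam") as handle:
#     print(summarize_fam(handle))
```

A non-zero "unexpected" count usually means the file uses a different coding (for example 0/1 instead of 1/2 for the phenotype) and should be recoded before running the pipeline.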
Yes. Modify the pipeline to skip Steps 3-9, or run only the association analysis (Steps 13-14).
The pipeline works with any build (hg19/GRCh37, hg38/GRCh38), but ensure consistency across all files.
Cause: Overly stringent QC thresholds removed all SNPs. Solution: Lower the MAF threshold or raise the GENO threshold in the script.
Cause: All individuals failed the missingness or sex check. Solution: Check your .fam file coding and raise the MIND threshold (a higher value tolerates more missing genotypes per individual).
Normal: This occurs for Y chromosome SNPs in females or mitochondrial SNPs. Usually safe to ignore unless excessive.
Fixed in v1.0: The pipeline now uses Python-based sorting with fallbacks. Update to the latest version.
To compare how QC affects associations. Use QC'd results for final conclusions.
Different levels of population stratification correction. 3 PCs is standard; use 10 PCs if strong population structure exists.
Varies by trait and sample size. Many studies find 0-10 genome-wide significant hits (p<5e-8).
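To count genome-wide significant hits yourself, you can filter the association output on the P column. This is a minimal sketch that assumes a whitespace-separated PLINK 1.9 logistic output with a header containing CHR, SNP, BP and P (and TEST when covariates are used); the function name `significant_hits` is illustrative, and the actual output file name depends on your run.

```python
def significant_hits(lines, threshold=5e-8):
    """Return (SNP, CHR, BP, P) tuples with P below threshold, sorted by P."""
    rows = iter(lines)
    header = next(rows).split()
    col = {name: index for index, name in enumerate(header)}
    hits = []
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # With covariates, PLINK writes one row per model term;
        # keep only the additive SNP effect (TEST == ADD).
        if "TEST" in col and fields[col["TEST"]] != "ADD":
            continue
        try:
            p_value = float(fields[col["P"]])
        except ValueError:
            continue  # rows where the test failed carry "NA"
        if p_value < threshold:
            hits.append(
                (fields[col["SNP"]], fields[col["CHR"]],
                 int(fields[col["BP"]]), p_value)
            )
    return sorted(hits, key=lambda hit: hit[3])


# Usage (hypothetical path):
# with open("analysis_results/association/results.assoc.logistic") as handle:
#     for snp, chrom, bp, p in significant_hits(handle):
#         print(snp, chrom, bp, p)
```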
Genomic inflation factor. Values 1.0-1.05 are acceptable. >1.1 suggests population stratification or cryptic relatedness.
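For reference, λ is conventionally computed as the median observed 1-df chi-square statistic divided by its expected median under the null (about 0.4549). The sketch below recomputes it from a list of p-values using only the standard library; `genomic_inflation` is an illustrative name, and the pipeline's own calculation may differ in detail.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Estimate lambda (genomic inflation factor) from two-sided p-values."""
    std_normal = NormalDist()
    # A two-sided p-value maps to a 1-df chi-square statistic via z**2,
    # where z is the normal quantile at p / 2.
    chi2 = [
        std_normal.inv_cdf(p / 2.0) ** 2
        for p in p_values
        if 0.0 < p <= 1.0  # drop missing or invalid values
    ]
    if not chi2:
        raise ValueError("no usable p-values")
    return median(chi2) / CHI2_1DF_MEDIAN


# Usage: uniformly distributed p-values give lambda close to 1.
# uniform = [(i + 0.5) / 1000 for i in range(1000)]
# print(round(genomic_inflation(uniform), 3))
```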
Proportion of the genome shared identical-by-descent (PI_HAT in PLINK output). Values >0.185 suggest second-degree relatedness or closer (e.g., ~0.5 = full siblings or parent-offspring, ~0.25 = half-siblings).
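To list the pairs above that cutoff, you can scan the IBD output directly. This sketch assumes the whitespace-separated table PLINK writes for `--genome`, whose header includes IID1, IID2 and PI_HAT; the function name `related_pairs` and the file name in the usage comment are illustrative.

```python
def related_pairs(lines, pi_hat_cutoff=0.185):
    """Return (IID1, IID2, PI_HAT) for pairs above the cutoff, highest first."""
    rows = iter(lines)
    col = {name: index for index, name in enumerate(next(rows).split())}
    pairs = []
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pi_hat = float(fields[col["PI_HAT"]])
        if pi_hat > pi_hat_cutoff:
            pairs.append((fields[col["IID1"]], fields[col["IID2"]], pi_hat))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Usage (hypothetical file name):
# with open("analysis_results/qc/ibd.genome") as handle:
#     for iid1, iid2, pi_hat in related_pairs(handle):
#         print(iid1, iid2, pi_hat)
```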
Edit these variables in gwas_analysis_pipeline.sh (lines 38-48):
GENO_THRESHOLD=0.02
MIND_THRESHOLD=0.02
MAF_THRESHOLD=0.01
HWE_THRESHOLD=1e-6
Partially. The pipeline is optimized for case-control. For quantitative traits, modify Steps 13-14 to use --linear instead of --logistic.
Create a covariate file and add --covar yourfile.cov to the association commands in Steps 13-14.
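As one example of building such a file, the sketch below turns PCA output into a covariate table. It assumes PLINK 1.9's `.eigenvec` layout (FID, IID, then one column per PC, normally without a header) and writes the FID/IID header line that `--covar` accepts; the function name `eigenvec_to_covar` and the file names in the usage comment are illustrative.

```python
def eigenvec_to_covar(eigenvec_lines, n_pcs=3):
    """Turn PLINK .eigenvec rows into covariate-file lines with a header."""
    header = "FID IID " + " ".join(f"PC{i}" for i in range(1, n_pcs + 1))
    out = [header]
    for line in eigenvec_lines:
        fields = line.split()
        if not fields or fields[0] in ("FID", "#FID"):
            continue  # skip blank lines and an optional header row
        if len(fields) < 2 + n_pcs:
            raise ValueError(f"expected at least {2 + n_pcs} columns: {line!r}")
        # Keep FID, IID and the first n_pcs principal components.
        out.append(" ".join(fields[: 2 + n_pcs]))
    return out


# Usage (hypothetical file names):
# with open("pca.eigenvec") as src, open("yourfile.cov", "w") as dst:
#     dst.write("\n".join(eigenvec_to_covar(src, n_pcs=3)) + "\n")
```

Other covariates such as age or batch can be added as extra columns, provided every row keeps the same FID and IID as the genotype data.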
All results are in the analysis_results/ directory, with these subdirectories:
- qc/ - Quality control results
- association/ - Association test results
- plots/ - Visualization plots
- reports/ - Summary reports
Lists of SNPs to keep (.prune.in) and remove (.prune.out) after LD pruning for PCA.
Yes, the pipeline exports VCF, PED/MAP, and TPED/TFAM formats (Step 16).
Cause: Missing Python packages. Solution: Install required packages:
pip install --user pandas numpy matplotlib seaborn
Yes, edit generate_plots.py to change colors, sizes, DPI, etc.
Modify the manhattan_plot() function in generate_plots.py to filter by chromosome.
- Use more CPUs (increase --cpus-per-task in the SLURM header)
- Increase memory (increase --mem)
- Run on a subset first to test
- Skip export steps if not needed (Step 16)
- Reduce the dataset size
- Lower memory in SLURM header (but may cause failures)
- Process chromosomes separately
Not directly within the pipeline, but you can run separate pipelines per chromosome and combine results.
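Combining the per-chromosome outputs mostly means concatenating the tables while keeping a single header. A minimal sketch, assuming every per-chromosome result file has the same columns; the function name `combine_results` and the file naming pattern in the usage comment are hypothetical.

```python
def combine_results(per_chromosome_tables):
    """Concatenate result tables, keeping only the first header line."""
    combined = []
    header = None
    for table in per_chromosome_tables:
        rows = [line.rstrip("\n") for line in table if line.strip()]
        if not rows:
            continue  # empty file, nothing to add
        if header is None:
            header = rows[0].split()
            combined.append(rows[0])
        elif rows[0].split() != header:
            raise ValueError("tables have different columns")
        combined.extend(rows[1:])  # data rows only
    return combined


# Usage (hypothetical file naming pattern):
# tables = []
# for chrom in range(1, 23):
#     with open(f"chr{chrom}.assoc.logistic") as handle:
#         tables.append(handle.readlines())
# with open("all_chromosomes.assoc.logistic", "w") as out:
#     out.write("\n".join(combine_results(tables)) + "\n")
```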
The current pipeline doesn't include imputation. You can pre-impute your data with Michigan Imputation Server or IMPUTE2, then run the pipeline.
After running the pipeline on multiple cohorts, use METAL or GWAMA to meta-analyze the association results.
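To make the idea concrete, the sketch below shows a standard fixed-effect, inverse-variance-weighted combination for a single SNP. It illustrates the general technique rather than the exact behavior of METAL or GWAMA, and it assumes effect sizes are on the log-odds scale (ln OR) with their standard errors and the same effect allele in every cohort. The function name `fixed_effect_meta` is illustrative.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance-weighted fixed-effect meta-analysis for one SNP.

    Returns (pooled beta, pooled standard error, z-score, two-sided p-value).
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    # Each cohort is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z_score = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z_score) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z_score, p_value


# Usage: two cohorts with ln(OR) = 0.2 and SE = 0.1 each.
# beta, se, z, p = fixed_effect_meta([0.2, 0.2], [0.1, 0.1])
```

Dedicated tools additionally handle allele alignment, strand flips, and heterogeneity statistics, which is why they remain the recommended route for real analyses.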
Not directly. Use the association results with PRSice-2 or LDpred2 for PRS calculation.
Use tools like MAGMA or VEGAS2 with the association summary statistics from this pipeline.
- Check the log file in the logs/ directory
- Look at the error log (.err file)
- Ensure PLINK is loaded/available
- Check disk space and memory
- Check λ (lambda) - should be ~1.0
- Review Q-Q plot for deviation
- Verify phenotype coding in .fam file
- Check for population stratification (PCA plot)
- Review related individuals (IBD results)
- GitHub Issues: https://github.com/yourusername/AA-GWAS/issues
- Email: muhammad.muzammal@bs.qau.edu.pk
- Documentation: See README.md and QUICKSTART.md
Have a question not answered here? Please:
- Open a GitHub Discussion
- Or submit a PR to add it to this FAQ
Last Updated: December 11, 2025