A complete, end-to-end Nextflow DSL2 pipeline that takes a patient VCF, performs quality control, normalizes to a common genome build, imputes missing genotypes, computes Polygenic Risk Scores using GWAS summary statistics, stores results in a database, and generates publication-quality visualizations.
Why Polygenic Risk Scores matter:
Most common diseases β coronary artery disease, type 2 diabetes, breast cancer, Alzheimer's β aren't caused by a single gene. They're driven by the combined effects of thousands of common genetic variants, each contributing a tiny amount of risk. A Polygenic Risk Score (PRS) aggregates all of these small effects into a single number that estimates how much genetic risk a person carries.
The clinical impact is real:
- A person in the top 8% of PRS for coronary artery disease has the same risk as someone with a monogenic mutation in LDLR (familial hypercholesterolemia) β roughly 3x the population average
- Unlike family history, PRS is quantitative and actionable from birth
- The UK Biobank, Genomics England, and the NIH All of Us program are all integrating PRS into their research platforms
Why a pipeline is needed:
- Genotype data is massive: millions of variants Γ thousands of patients
- Different datasets use different genome builds (hg19 vs hg38) β coordinates must be harmonized
- Missing genotypes must be imputed using reference panels
- The scoring itself requires matching, aligning, and weighting thousands of variants
- Results need to be stored in a queryable database and visualized for clinical interpretation
Without a reproducible pipeline, every step is a manual, error-prone process. This project automates the entire workflow.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β PRS PIPELINE (Nextflow DSL2) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββ
βΌ βΌ βΌ
βββββββββββββ ββββββββββββββ βββββββββββββ
β Input β β GWAS β β Reference β
β VCF β β Summary β β Panel β
β (Patient) β β Stats β β (1000G) β
βββββββ¬ββββββ βββββββ¬βββββββ βββββββ¬ββββββ
β β β
βΌ β β
βββββββββββββ β β
β Step 1 β β β
β QC ββββΊ call rate, MAF, β β
β BCFtools β depth filtering β β
βββββββ¬ββββββ β β
β β β
βΌ β β
βββββββββββββ β β
β Step 2 β β β
β Normalize ββββΊ liftover hg19βhg38 β β
β CrossMap β + allele alignment β β
βββββββ¬ββββββ β β
β β β
βΌ β β
βββββββββββββ β β
β Step 3 βββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββ
β Impute ββββΊ fill missing calls
β Beagle β using LD patterns
βββββββ¬ββββββ
β β
βΌ β
βββββββββββββ β
β Step 4 βββββββββββββββββββββββββββ
β Score ββββΊ PRS = Ξ£(dosage Γ beta)
β Python β z-score normalization
βββββββ¬ββββββ
β
ββββββββββββββββββββββββ
βΌ βΌ
βββββββββββββ βββββββββββββ
β Step 5 β β Step 6 β
β Database β β Visualize β
β SQLite β β Python β
βββββββββββββ βββββββββββββ
β β
βΌ βΌ
βββββββββββββ βββββββββββββ
β 3 Tables β β 5 Plots β
β β’ samples β β β’ QC β
β β’ variantsβ β β’ Distrib β
β β’ qc_meta β β β’ Risk β
βββββββββββββ β β’ Manhtn β
β β’ Ranked β
βββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AWS Cloud β
β β
β ββββββββββββ ββββββββββββ βββββββββββββββββββββββββββββ β
β β S3 ββββββΊβ Lambda ββββββΊβ AWS Batch β β
β β Input β β Trigger β β β β
β β Bucket β β β β ββββββββββββββββββββββββ β β
β β β ββββββββββββ β β Nextflow Pipeline β β β
β β patient β β β β β β
β β .vcf β β β QC β Norm β Impute β β β
β ββββββββββββ β β β Score β DB β β β
β β ββββββββββββββββββββββββ β β
β β β β
β β EC2 On-Demand EC2 Spot β β
β β (scoring) (impute) β β
β ββββββββββββ¬βββββββββββββββββ β
β β β
β βββββββββββββββββββββ¬βββββββββββββββββββββ€ β
β βΌ βΌ βΌ β
β ββββββββββββ ββββββββββββββ ββββββββββββββ β
β β S3 β β RDS β β CloudWatch β β
β β Output β β PostgreSQL β β Dashboard β β
β β Bucket β β (or SQLite β β + Alarms β β
β β β β /DynamoDB)β β β β
β β results/ β ββββββββββββββ ββββββββββββββ β
β β figures/ β β
β ββββββββββββ ββββββββββββββ β
β β β AWS Budget β β
β β Lifecycle β $100/month β β
β β Rules β + alerts β β
β βΌ ββββββββββββββ β
β ββββββββββββ β
β β S3 β β
β β Glacier β βββ auto-archive after 90 days β
β β Archive β HIPAA: retain β₯ 6 years β
β ββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Key Design Decisions:
| Component | Choice | Why |
|---|---|---|
| Compute | AWS Batch | Auto-scales EC2 instances, shuts down when done β no idle costs |
| Spot Instances | Imputation step only | 60-90% cheaper; Nextflow retries on interruption |
| Storage | S3 + Lifecycle Rules | Standard β IA (30d) β Glacier (90d) β Deep Archive (365d) |
| Database | RDS PostgreSQL / DynamoDB | Variant weights in DynamoDB for O(1) lookups; scores in RDS for SQL queries |
| Trigger | S3 β Lambda | Event-driven: drop a VCF, pipeline runs automatically |
| Monitoring | CloudWatch + Budget | Dashboards for memory/CPU, $100/month budget with email alerts |
Not all GWAS variants should go into a PRS. The selection strategy matters:
P-value Thresholding (C+T Method):
- Include only variants below a significance threshold (e.g., p < 5Γ10β»βΈ for genome-wide significant, or p < 0.05 for more liberal)
- Simple but effective. This pipeline uses p-value filtering by default.
LD Clumping:
- Nearby variants are often correlated (linkage disequilibrium)
- Including both would double-count the same genetic signal
- PLINK
--clumpselects the most significant variant in each LD block - In production, I'd add a PLINK clumping step before scoring
Advanced Methods (not implemented, but worth knowing):
- LDpred2: Bayesian method that adjusts betas for LD structure
- PRS-CS: Uses continuous shrinkage priors on effect sizes
- These give better prediction than simple C+T but are computationally heavier
Normalization happens at two levels:
Different datasets are built on different reference genomes (hg19 vs hg38). If my patient data is on hg19 and the GWAS summary stats are on hg38, the positions won't match. CrossMap converts coordinates using a chain file that maps old positions to new positions.
A raw PRS of 0.47 means nothing by itself. It only has meaning relative to a population:
- A European with PRS = 0.47 might be at the 60th percentile in a European reference
- An African American with PRS = 0.47 might be at the 90th percentile in an African reference
Z-score normalization within ancestry groups fixes this:
z_score = (raw_prs - ancestry_mean) / ancestry_std
This pipeline computes z-scores across all input samples. In production with known ancestry labels, you'd normalize within each group separately.
This addresses the feedback about discussing archiving:
| Time After Pipeline Run | S3 Storage Class | Cost (per TB/month) | Access Time |
|---|---|---|---|
| 0β30 days | STANDARD | $23 | Instant |
| 30β90 days | STANDARD_IA | $12.50 | Instant |
| 90β365 days | GLACIER | $4 | 3-5 hours |
| 365+ days | DEEP_ARCHIVE | $1 | 12 hours |
Why this matters:
- Genomic data is large β a single whole-genome VCF is ~100GB
- HIPAA requires retaining patient data for β₯6 years
- Active results need fast access, but year-old runs can be archived
- Lifecycle rules automate this β no manual intervention needed
The lifecycle policy is defined in nextflow.config and conf/aws.config.
prs_pipeline/
βββ main.nf # Pipeline entry point (Nextflow DSL2)
βββ nextflow.config # Parameters, profiles, S3 lifecycle config
βββ README.md # This file
β
βββ modules/ # One Nextflow process per file
β βββ qc.nf # BCFtools-based QC filtering
β βββ normalize.nf # CrossMap liftover + batch normalization
β βββ impute.nf # Beagle genotype imputation
β βββ score.nf # PRS weighted sum calculation
β βββ database.nf # SQLite database storage
β βββ visualize.nf # Matplotlib visualization
β
βββ subworkflows/
β βββ prs_workflow.nf # Chains all modules: QC β Norm β Impute β Score β DB β Viz
β
βββ bin/ # Python scripts (auto-added to PATH in processes)
β βββ generate_test_data.py # Creates simulated VCF + GWAS stats for demo
β βββ qc_filter.py # QC: call rate, MAF, depth filtering
β βββ calculate_prs.py # Core PRS scoring engine
β βββ store_results.py # SQLite database operations
β βββ visualize_prs.py # 5 publication-quality plots
β
βββ conf/
β βββ resources.config # Per-process CPU/memory allocation
β βββ aws.config # AWS Batch, S3, CloudWatch configuration
β
βββ data/ # Test data (generated by generate_test_data.py)
β βββ sample_input.vcf # 500 variants Γ 20 patients (~5% missing)
β βββ gwas_summary_stats.tsv # Effect sizes + p-values for 500 variants
β βββ reference_panel.vcf # 50-sample reference for imputation
β βββ chain_file.chain # Placeholder liftover chain
β
βββ results/ # Pipeline outputs (generated by running the pipeline)
βββ qc/ # Filtered VCF + QC stats
βββ normalized/ # Build-normalized VCF
βββ imputed/ # Imputed VCF
βββ scores/ # PRS scores per patient
βββ database/ # SQLite database
βββ figures/ # 5 visualization PNGs
- Python 3.8+ with
matplotlibandnumpy - Nextflow 22.10+ (for running the full pipeline)
# 1. generate test data
cd prs_pipeline
python3 bin/generate_test_data.py
# 2. run QC
python3 bin/qc_filter.py \
--vcf data/sample_input.vcf \
--out-vcf results/qc/filtered.vcf \
--out-stats results/qc/qc_stats.tsv
# 3. calculate PRS (skipping normalize/impute for simplicity)
python3 bin/calculate_prs.py \
--vcf results/qc/filtered.vcf \
--gwas data/gwas_summary_stats.tsv \
--output results/scores/prs_scores.tsv
# 4. store in database
python3 bin/store_results.py \
--scores results/scores/prs_scores.tsv \
--gwas data/gwas_summary_stats.tsv \
--qc-stats results/qc/qc_stats.tsv \
--db results/database/prs_results.db
# 5. generate visualizations
python3 bin/visualize_prs.py \
--scores results/scores/prs_scores.tsv \
--qc-stats results/qc/qc_stats.tsv \
--gwas data/gwas_summary_stats.tsv \
--outdir results/figures# local execution
nextflow run main.nf -profile local
# docker execution (more reproducible)
nextflow run main.nf -profile docker
# on AWS
nextflow run main.nf -profile aws_batch \
--input_vcf s3://my-bucket/patient.vcf \
--outdir s3://my-bucket/results| Output | Location | Description |
|---|---|---|
| Filtered VCF | results/qc/filtered.vcf |
Variants passing QC thresholds |
| QC Report | results/qc/qc_stats.tsv |
Per-variant and per-sample QC metrics |
| Normalized VCF | results/normalized/normalized.vcf |
Genome buildβharmonized variants |
| Imputed VCF | results/imputed/imputed.vcf |
Missing genotypes filled in |
| PRS Scores | results/scores/prs_scores.tsv |
Per-patient raw score, z-score, percentile, risk category |
| Database | results/database/prs_results.db |
SQLite with samples, variants, qc_metrics tables |
| QC Summary Plot | results/figures/01_qc_summary.png |
Call rates and depth per sample |
| PRS Distribution | results/figures/02_prs_distribution.png |
Histogram with risk thresholds |
| Risk Stratification | results/figures/03_risk_stratification.png |
Patients per risk category |
| Variant Effects | results/figures/04_variant_effects.png |
Manhattan-style effect size plot |
| Score Comparison | results/figures/05_score_comparison.png |
Ranked lollipop chart |
-- patient PRS scores (what clinicians query)
SELECT sample_id, prs_zscore, percentile, risk_category
FROM samples
WHERE risk_category = 'HIGH_RISK';
-- most significant variants in the model
SELECT rsid, chromosome, beta, p_value
FROM variants
ORDER BY p_value ASC
LIMIT 10;
-- samples with questionable QC
SELECT s.sample_id, s.prs_zscore, q.call_rate
FROM samples s
JOIN qc_metrics q ON s.sample_id = q.sample_id
WHERE q.call_rate < 0.95;| Strategy | Tool | Details |
|---|---|---|
| Budget alerts | AWS Budgets | $100/month cap with email notifications |
| Spot instances | AWS Batch | 60-90% savings on imputation compute |
| Auto-shutdown | AWS Batch | EC2 instances terminate when pipeline finishes |
| Data archiving | S3 Lifecycle | Auto-transition to Glacier after 90 days |
| Monitoring | CloudWatch | CPU/memory dashboards + duration alarms |
Estimated cost per run (1M variants Γ 100 samples): ~$0.36 with spot instances.
| Tool | Purpose | Reference |
|---|---|---|
| Nextflow | Workflow orchestration | nextflow.io |
| BCFtools | VCF quality control | samtools.github.io/bcftools |
| CrossMap | Genome build liftover | crossmap.sourceforge.net |
| Beagle | Genotype imputation | faculty.washington.edu/browning/beagle |
| PGS Catalog | Published PRS models | pgscatalog.org |
| pgsc_calc | Reference PRS pipeline | github.com/PGScatalog/pgsc_calc |
| matplotlib | Visualization | matplotlib.org |
| SQLite | Local database | sqlite.org |
This project was built for BINFX 410 β Final Project.