This repository implements the Illumina whole-capsid sequencing approach for cultured poliovirus isolates and produces outputs comparable to those of conventional Global Polio Laboratory Network (GPLN) methods.
Associated manuscript: Comparative Evaluation of Whole-Capsid Next-Generation Sequencing and Traditional Methods in Post-Culture Polio Surveillance: Pakistan and Afghanistan Study (2021-2024).
| Term | Meaning |
|---|---|
| NGS | Next-generation sequencing |
| VP1 | Viral protein 1 |
| VP2 | Viral protein 2 |
| VP3 | Viral protein 3 |
| FASTQ | Sequencing-read file format with base calls and quality scores |
| BAM | Binary Alignment/Map file for aligned sequencing reads |
| QC | Quality control |
| TSV | Tab-separated values table |
| MAFFT | Multiple-sequence alignment program |
| IQ-TREE | Maximum-likelihood phylogenetic tree program |
This repository contains a single end-to-end pipeline that can reproduce the analyses reported in the manuscript. It covers the following components:
- Inputs
  - Paired-end FASTQ from cultured isolates prepared for whole-capsid sequencing on Illumina MiSeq (2x150 bp).
- Quality control and trimming
  - fastp for adapter and quality trimming with a configurable minimum base quality (default Q20) and minimum read length (default 50).
- Mapping and primer trimming
  - BWA-MEM mapping to references for WPV1 and Sabin strains.
  - BAMClipper to remove primer sequences when amplicon primers are used.
  - Picard MarkDuplicates to mark PCR duplicates.
- Coverage and consensus
  - BEDTools genomecov to compute depth profiles.
  - FreeBayes variant calling followed by BCFtools consensus with a depth mask (default minimum depth 10).
- Antigenic site analysis
  - Translate VP1, VP2, and VP3, summarize amino acid changes in the canonical antigenic sites Ag1, Ag2, and Ag3, and write a tidy TSV report per sample.
- Phylogeny and context
  - Fetch selected context references from GenBank.
  - MAFFT alignment of masked consensuses plus context.
  - IQ-TREE maximum-likelihood tree with 1000 ultrafast bootstraps. ModelFinder can be toggled on.
- Statistics
  - Optional scripts to compute McNemar's test and a paired t-test when you provide detection tables and per-site identities from Sanger vs NGS.
- Optional de novo assembly
  - SPAdes assembly and contig QC, with BLAST for sanity checking and for identifying the best reference when needed.
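The depth mask in the coverage and consensus step can be illustrated with a minimal Python sketch. This is not the repository's implementation, which relies on BCFtools consensus; the function name `mask_low_depth` and the toy sequence are hypothetical, and the default threshold of 10 simply mirrors the default minimum depth stated above.

```python
def mask_low_depth(consensus: str, depths: list[int], min_depth: int = 10) -> str:
    """Replace every base whose read depth is below min_depth with 'N'."""
    if len(consensus) != len(depths):
        raise ValueError("consensus and depth profile must have the same length")
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(consensus, depths)
    )


if __name__ == "__main__":
    # Toy data (hypothetical): positions 3 and 4 are covered by fewer than 10 reads.
    print(mask_low_depth("ACGTAC", [25, 12, 9, 0, 10, 40]))  # prints ACNNAC
```

Masking rather than filling low-depth positions with the reference base keeps unsupported sites from being read as genuine agreement with the reference in downstream alignment and antigenic-site tables.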
The workflow is implemented in Snakemake and can be run on a laptop or a small server. No clinical data are included.
Current software version: 1.0.3. See CHANGELOG.md and CITATION.cff for release metadata.
This repository presents an Illumina whole-capsid analysis path for poliovirus surveillance, from paired-end reads to masked consensus sequences, antigenic-site summaries, phylogeny, and optional statistical comparison with conventional methods. It is designed to be understandable to both bioinformaticians and public-health genomics supervisors.
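For the optional statistical comparison, McNemar's test works on paired detection results and uses only the discordant pairs. The sketch below assumes the exact (binomial) form of the test; the repository's scripts may use a different variant, and the function name `mcnemar_exact` and the counts are invented for illustration.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: samples detected by method A only (for example Sanger only)
    c: samples detected by method B only (for example NGS only)
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this extreme under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts: 1 sample positive by Sanger only, 7 by NGS only.
    print(f"{mcnemar_exact(1, 7):.4f}")  # prints 0.0703
```

The exact form is a reasonable choice when the number of discordant pairs is small, where the chi-square approximation is less reliable.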
```mermaid
flowchart LR
    A["Cultured isolate FASTQ"] --> B["QC and trimming"]
    B --> C["Reference mapping"]
    C --> D["Depth and variant review"]
    D --> E["Masked consensus"]
    E --> F["Antigenic-site table"]
    E --> G["Phylogeny"]
    F --> H["Public-health report"]
    G --> H
```
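The step from masked consensus to antigenic-site table can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the repository's script: both sequences are assumed to be in-frame coding regions of equal length with no indels, the names `translate` and `site_changes` are hypothetical, and the positions in the example are placeholders rather than real Ag1, Ag2, or Ag3 coordinates.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt: str) -> str:
    """Translate an in-frame sequence; codons with N or other symbols become X."""
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


def site_changes(ref_nt: str, sample_nt: str, sites: dict[str, list[int]]):
    """Return (site, position, ref_aa, sample_aa) for each changed residue.

    sites maps a site name to 1-based amino acid positions.
    Masked residues (X) are skipped because they carry no information.
    """
    ref_aa, sample_aa = translate(ref_nt), translate(sample_nt)
    rows = []
    for site, positions in sites.items():
        for pos in positions:
            r, s = ref_aa[pos - 1], sample_aa[pos - 1]
            if s != "X" and r != s:
                rows.append((site, pos, r, s))
    return rows


if __name__ == "__main__":
    # Toy sequences and a placeholder site definition (hypothetical).
    rows = site_changes("ATGGCTAAAGGT", "ATGGCTGAAGGT", {"demo_site": [2, 3]})
    for row in rows:
        print("\t".join(map(str, row)))  # one TSV line per change
```

Skipping masked residues keeps low-depth positions from appearing as amino acid substitutions in the per-sample TSV.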
- Linux, macOS, or WSL with Bash
- Python 3.10 or newer for lightweight script checks
- Conda or mamba for the full workflow environment
- Snakemake 7 or newer for the full pipeline
- System tools installed through conda: fastp, bwa, samtools, bcftools, freebayes, bedtools, picard, bamclipper, mafft, iqtree, spades, seqkit, entrez-direct
Set a contact email once per shell session for NCBI E-utilities. An optional API key raises the request rate limit.

```bash
export NCBI_EMAIL="you@example.com"
export NCBI_API_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxx" # optional
```

This check validates the Python utilities and example plotting without requiring private FASTQ files.
```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r env/requirements.txt
python -m compileall -q analysis/scripts
python analysis/scripts/example_qc_plot.py --in data-example/example_counts.tsv --out results-example/example_plot.png
```

The same check can be run with:
```bash
make test
```

No raw FASTQ files are included in this repository. Before running the full workflow, edit config/config.yaml or pass a separate config file with paths to local paired-end FASTQ files.
```bash
export NCBI_EMAIL="you@example.com"
mamba env create -f env/environment.yml # or: conda env create -f env/environment.yml
conda activate polio-capsid-env
# Dry run first (-n), then the real run on 4 cores.
snakemake -s workflow/Snakefile --configfile config/config.yaml -n -c 4
snakemake -s workflow/Snakefile --configfile config/config.yaml -c 4 --printshellcmds
```

The make wrappers accept THREADS, CONFIG, PYTHON, and SNAKEMAKE overrides:
```bash
make dry CONFIG=config/config.yaml THREADS=4
make run CONFIG=config/config.yaml THREADS=4
```

Edit config/config.yaml. Minimal example:
```yaml
pairs:
  - sample: PV1_001
    r1: data-private/PV1_001_R1.fastq.gz
    r2: data-private/PV1_001_R2.fastq.gz
references:
  wpv1: KY941935.1
  sabin1: V01150.1
  sabin2: AY184220.1
  sabin3: AY184221.1
context_acc:
  - KY941935.1
  - V01150.1
  - AY184220.1
  - AY184221.1
params:
  threads: 4
  min_len: 50
  min_qual: 20
  min_depth_consensus: 10
  use_model_finder: false
  iqtree_model: GTR+G+I
  bootstrap: 1000
```

- `results/consensus/<sample>.fa` - masked consensus per sample
- `results/consensus/all_consensus.fasta` - combined consensuses
- `results/aln/wg_alignment.fasta` - alignment used for phylogeny
- `results/iqtree/wg.treefile` - ML tree
- `results/coverage/<sample>.depth.txt` - per-base depth
- `results/mutations/<sample>_ag_sites.tsv` - amino acid changes in Ag1, Ag2, Ag3
- Optional de novo outputs under `results/spades/<sample>/` with QC tables and BLAST hits
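After a run, it can be useful to confirm that the outputs listed above exist for every configured sample. The following is a minimal sketch, not part of the repository: the helper name `missing_outputs` is hypothetical, the path templates are copied from the output list, and the optional SPAdes outputs are deliberately left out.

```python
from pathlib import Path

# Path templates taken from the output list; {sample} is filled in per sample.
PER_SAMPLE = [
    "results/consensus/{sample}.fa",
    "results/coverage/{sample}.depth.txt",
    "results/mutations/{sample}_ag_sites.tsv",
]
COMBINED = [
    "results/consensus/all_consensus.fasta",
    "results/aln/wg_alignment.fasta",
    "results/iqtree/wg.treefile",
]


def missing_outputs(samples, root="."):
    """Return the expected output paths that are not present under root."""
    root = Path(root)
    expected = list(COMBINED)
    expected += [p.format(sample=s) for s in samples for p in PER_SAMPLE]
    return [p for p in expected if not (root / p).is_file()]


if __name__ == "__main__":
    # PV1_001 is the sample name used in the minimal config example.
    for path in missing_outputs(["PV1_001"]):
        print("missing:", path)
```

Run from the repository root, the script prints one line per missing file and nothing when all expected outputs are in place.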
HPC-friendly Apptainer/Singularity support is available at containers/Apptainer.def. Build it from the repository root:
```bash
apptainer build containers/polio-capsid-ngs-analysis.sif containers/Apptainer.def
```

Use the image on systems where Apptainer or Singularity is preferred over Docker.
See DATA_GOVERNANCE.md for public-data, restricted-data, and sample-identifier handling rules.
- Software: Haider SA. Polio Whole-Capsid NGS Surveillance Workflow. Version 1.0.3. Zenodo. https://doi.org/10.5281/zenodo.20257888
- All-version software DOI: https://doi.org/10.5281/zenodo.20257522
- Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018.
- Li H. Aligning sequence reads with BWA-MEM. 2013. arXiv:1303.3997.
- Danecek P, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021.
- Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing (FreeBayes). 2012. arXiv:1207.3907.
- Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010.
- Picard Toolkit. Broad Institute. 2019.
- Au CH, Ho DN, Kwong A, et al. BAMClipper. Sci Rep. 2017.
- Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7. Mol Biol Evol. 2013.
- Minh BQ, et al. IQ-TREE 2. Mol Biol Evol. 2020.
- Bankevich A, et al. SPAdes. J Comput Biol. 2012.
- Cock PJ, et al. Biopython. Bioinformatics. 2009.
- Kans J. Entrez Programming Utilities Help. NCBI.
- Köster J, Rahmann S. Snakemake. Bioinformatics. 2012.
- Minor PD, Ferguson M, Evans DM, et al. Antigenic structure of polioviruses. J Gen Virol. 1986.
- Hogle J, Chow M, Filman D. Three-dimensional structure of poliovirus at 2.9 Å resolution. Science. 1985.
See CONTRIBUTING.md. Open an issue for questions. Do not commit restricted data.
MIT. See LICENSE for details.