A comprehensive Nextflow DSL2 pipeline for SARS-CoV-2 genomic surveillance using Oxford Nanopore Technologies (ONT) sequencing data.
This workflow was developed when the EPI2ME-lab ARTIC workflow was deprecated. It currently works for internal use, but it is not yet ready for external use without modifications on the user's system. Potential users should note the following issues:
Known Installation Issues
- Nextflow Conda environment build failures
- proovframe may fail to install automatically. The Conda environment created by Nextflow must be manually activated, and proovframe installed manually.
- The ARTIC component requires its models to be downloaded manually. This can be done with the command XXXXX, but only after activating the relevant Conda environment.
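To activate the environment manually, you first need its path. The sketch below lists candidate environments. It assumes Nextflow's default behaviour of storing Conda environments in a `conda` folder inside the work directory when `conda.cacheDir` is not set. The function name `list_nextflow_conda_envs` and the install command in the comments are illustrative assumptions, not part of this pipeline.

```shell
# List Conda environments that Nextflow created under its cache directory.
# Assumption: the cache is "<workDir>/conda" (Nextflow's default location
# when conda.cacheDir is not configured).
list_nextflow_conda_envs() {
    cache_dir="$1"
    if [ ! -d "$cache_dir" ]; then
        echo "No such directory: $cache_dir" >&2
        return 1
    fi
    for env_path in "$cache_dir"/*/; do
        # Every Conda environment contains a conda-meta/ directory.
        if [ -d "${env_path}conda-meta" ]; then
            printf '%s\n' "${env_path%/}"
        fi
    done
}

# Hypothetical interactive follow-up (assumes the Bioconda package is
# named "proovframe"):
#   list_nextflow_conda_envs path/to/workDir/conda
#   conda activate <one of the printed paths>
#   conda install -c conda-forge -c bioconda proovframe
```

If several environments are listed, the one used by the ProovFrame process has to be identified by inspecting its installed packages.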
Limitations
- Currently, only a single primer scheme is supported.
This pipeline processes ONT sequencing data to generate high-quality SARS-CoV-2 consensus genomes and performs downstream analyses including variant calling, lineage assignment, and coverage analysis. It's specifically designed for surveillance activities at Can Ruti Hospital.
- Quality Control: Guppyplex filtering and quality assessment
- Consensus Calling: ARTIC workflow for robust consensus genome generation
- Frameshift Correction: ProovFrame integration to correct sequencing-induced frameshifts
- Variant Analysis: Nextclade for clade assignment and mutation detection
- Lineage Assignment: Pangolin for PANGO lineage classification
- Coverage Analysis: Comprehensive coverage statistics and visualization
- Multi-sample Processing: Batch processing with sample metadata management
- Nextflow (≥ 22.04.0)
- Conda
- Required tools (typically containerized):
  - Guppy/Guppyplex
  - ARTIC workflow tools
  - ProovFrame
  - Nextclade
  - Pangolin
  - Coverage analysis tools
- Linux/macOS operating system
- Minimum 8 GB RAM (16+ GB recommended)
- 50+ GB free disk space for intermediate files
nextflow run HUGTiP-SARS-COV-2.nf/main.nf \
    --runID 'run001' \
    --outDir 'path/to/outDir' \
    --workDir 'path/to/workDir' \
    --dataDir 'path/dataDir' \
    --metadata 'path/sample_sheet.csv' \
    -profile conda_on

Your metadata CSV file should contain the following headers:
sampleID,barcode
Sample001,barcode01
Sample002,barcode02
Sample003,barcode03

The pipeline executes the following main steps:
- Guppyplex: Aggregates the demultiplexed ONT reads for each barcode and filters them by length
  - Quality control and read filtering
- ARTIC: Generates consensus sequences using the ARTIC workflow
  - Primer trimming and variant calling
- ProovFrame: Corrects frameshift errors introduced during sequencing
  - Maintains reading frame integrity
  - Requires the translated (protein) sequences of the reference genome
- Alignment: Aligns the corrected consensus sequences to the reference genome
- Nextclade: Performs clade assignment and mutation analysis
- Pangolin: Assigns PANGO lineages for epidemiological tracking
- Coverage: Calculates depth and breadth of coverage
  - Generates coverage plots and statistics
  - Produces a summary coverage report (coverage_mean.csv)
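To make "depth and breadth of coverage" concrete, here is a minimal sketch of how both figures can be derived from a per-position depth table. It is not the pipeline's own implementation. It assumes a three-column, tab-separated input (reference, position, depth) like the output of `samtools depth -a`, and a 20x threshold for breadth. The function name `coverage_summary` is hypothetical.

```shell
# Summarise one per-position depth table as "sample,mean_depth,breadth_pct".
# Assumption: three tab-separated columns (reference, position, depth),
# the layout written by `samtools depth -a`. The pipeline's own coverage
# files may use a different layout.
coverage_summary() {
    sample="$1"
    depth_file="$2"
    min_depth="${3:-20}"   # positions at or above this depth count as covered
    awk -F '\t' -v sample="$sample" -v min="$min_depth" '
        { total += $3; n++; if ($3 >= min) covered++ }
        END {
            if (n == 0) { print sample ",0.00,0.00"; exit }
            # Mean depth over all positions, and percent of positions covered.
            printf "%s,%.2f,%.2f\n", sample, total / n, 100 * covered / n
        }' "$depth_file"
}
```

Running the function once per sample and appending each line to a file would give a table similar in spirit to coverage_mean.csv, although the actual columns of that report are not specified here.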
results/
├── guppyplex/
│   └── [sample_id]/
│       └── filtered_reads.fastq
├── artic/
│   ├── consensus/
│   │   └── [sample_id].consensus.fasta
│   └── coverage/
│       └── [sample_id].coverage.txt
├── proovframe/
│   └── [sample_id]/
│       ├── corrected.fasta
│       └── corrections.tsv
├── alignment/
│   └── aligned_consensus.fasta
├── nextclade/
│   └── nextclade_results.tsv
├── pangolin/
│   └── pangolin_lineages.csv
├── coverage/
│   └── coverage_plots/
├── concatenated_consensus.fasta
└── coverage_mean.csv
- concatenated_consensus.fasta: All corrected consensus sequences in a single file
- coverage_mean.csv: Summary coverage statistics for all samples
- nextclade_results.tsv: Clade assignments and mutation profiles
- pangolin_lineages.csv: PANGO lineage assignments
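For a quick look at a run, the lineage file can be tallied from the command line. The sketch below assumes that pangolin_lineages.csv has a header row containing a column named `lineage` (as in Pangolin's standard report) and that no field contains a quoted comma. The function name `lineage_counts` is hypothetical.

```shell
# Count samples per lineage in a Pangolin-style CSV.
# Assumptions: a header row with a column named "lineage", and no quoted
# fields that contain commas.
lineage_counts() {
    awk -F ',' '
        NR == 1 {
            # Locate the lineage column by name rather than by position.
            for (i = 1; i <= NF; i++) if ($i == "lineage") col = i
            if (!col) { print "no lineage column found" > "/dev/stderr"; exit 1 }
            next
        }
        { count[$col]++ }
        END { for (l in count) print count[l] "\t" l }
    ' "$1" | LC_ALL=C sort -k1,1nr -k2,2
}
```

The output has one line per lineage, with the count first and the most frequent lineage at the top, for example from `lineage_counts results/pangolin/pangolin_lineages.csv`.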
Create a nextflow.config file for your environment:
profiles {
    docker {
        docker.enabled = true
        process {
            withName: 'guppyplex' {
                container = 'your-registry/guppyplex:latest'
            }
            withName: 'artic' {
                container = 'your-registry/artic:latest'
            }
            // Add other container configurations
        }
    }
    singularity {
        singularity.enabled = true
        // Singularity-specific configurations
    }
}

Adjust process resources based on your system:
process {
    withName: 'artic' {
        cpus = 4
        memory = '8 GB'
        time = '2h'
    }
    withName: 'coverage' {
        cpus = 2
        memory = '4 GB'
        time = '1h'
    }
}

- Missing metadata file:
  Error: Please provide a samplesheet XLSX file with --samplesheet
  Solution: Ensure the --metadata parameter points to a valid CSV file.
- Missing RunID:
  Error: Please provide a RunID using --runID
  Solution: Provide a numeric RunID with --runID 12345.
- Data directory not found:
  Error: Please provide full path to directory containing ONT results
  Solution: Verify the --dataDir path exists and contains ONT data.
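The three errors above can be caught before launching Nextflow with a small pre-flight check. This is a sketch rather than part of the pipeline. It assumes the sample sheet starts with exactly the `sampleID,barcode` header shown in this README, and the function name `preflight_check` is hypothetical.

```shell
# Pre-flight check for the three failure cases listed above.
# Assumption: the sample sheet is a CSV whose first line is exactly
# "sampleID,barcode", as in the example in this README.
preflight_check() {
    run_id="$1"
    data_dir="$2"
    metadata="$3"
    status=0
    if [ -z "$run_id" ]; then
        echo "runID is empty" >&2
        status=1
    fi
    if [ ! -d "$data_dir" ]; then
        echo "dataDir not found: $data_dir" >&2
        status=1
    fi
    if [ ! -f "$metadata" ]; then
        echo "metadata file not found: $metadata" >&2
        status=1
    else
        # Drop a trailing carriage return in case the file was saved on Windows.
        header=$(head -n 1 "$metadata" | tr -d '\r')
        if [ "$header" != "sampleID,barcode" ]; then
            echo "unexpected header: $header" >&2
            status=1
        fi
    fi
    return "$status"
}
```

A typical call would be `preflight_check run001 path/dataDir path/sample_sheet.csv && nextflow run ...`, so that the pipeline only starts when all three inputs look valid.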
If you use this pipeline in your research, please cite:
- Nextflow: Di Tommaso, P., et al. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316-319.
- ARTIC: Quick, J., et al. (2017). Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples. Nature Protocols, 12(6), 1261-1276. https://artic.network/about
- ProovFrame: Hackl, S., et al. ProovFrame: Correcting frameshift errors in viral genome assemblies.
- Nextclade: Aksamentov, I., et al. (2021). Nextclade: clade assignment, mutation calling and quality control for viral genomes. Journal of Open Source Software, 6(67), 3773.
- Pangolin: O'Toole, Á., et al. (2021). Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool. Virus Evolution, 7(2), veab064.
![Workflow diagram](/phesketh-igtp/HuGTiP-SARS-CoV-2.nf/raw/main/png/workflow.png)