phesketh-igtp/HuGTiP-SARS-CoV-2.nf

SARS-CoV-2 Surveillance Pipeline

A comprehensive Nextflow DSL2 pipeline for SARS-CoV-2 genomic surveillance using Oxford Nanopore Technologies (ONT) sequencing data.

This workflow was developed after the EPI2ME-lab ARTIC workflow was deprecated. It currently works for internal use, but it is not ready for external release without modification on the user's system. Potential users should note the following issues:

Known Installation Issues

  • Nextflow Conda environment build failures
    • proovframe may fail to install automatically. The Conda environment created by Nextflow must be manually activated, and proovframe installed manually.
    • The ARTIC component requires its models to be downloaded manually. This can be done with the command XXXXX, but only after activating the relevant Conda environment.
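
The manual proovframe fix above might look like the following sketch. It assumes Nextflow's default Conda cache location (a `conda` folder inside the work directory) and that proovframe is installed from the bioconda channel; both are assumptions, not details confirmed by this repository. `list_nextflow_conda_envs` is a hypothetical helper name.

```shell
# Hypothetical helper: list the Conda environments that Nextflow has built.
# Assumes the default cache location <workDir>/conda; pass another path if
# conda.cacheDir is set differently in your configuration.
list_nextflow_conda_envs() {
    cache_dir="${1:-work/conda}"
    find "$cache_dir" -mindepth 1 -maxdepth 1 -type d 2>/dev/null
}

# After picking the environment meant for proovframe (run by hand):
#   conda activate /full/path/to/workDir/conda/<env-dir>
#   conda install -c conda-forge -c bioconda proovframe   # channel is an assumption
```

Calling `list_nextflow_conda_envs path/to/workDir/conda` prints one directory per environment; which of them belongs to the proovframe process has to be checked manually.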

Limitations

  • Currently, only a single primer scheme is supported.

Overview

This pipeline processes ONT sequencing data to generate high-quality SARS-CoV-2 consensus genomes and performs downstream analyses including variant calling, lineage assignment, and coverage analysis. It's specifically designed for surveillance activities at Can Ruti Hospital.

Features

  • Quality Control: Guppyplex filtering and quality assessment
  • Consensus Calling: ARTIC workflow for robust consensus genome generation
  • Frameshift Correction: ProovFrame integration to correct sequencing-induced frameshifts
  • Variant Analysis: Nextclade for clade assignment and mutation detection
  • Lineage Assignment: Pangolin for PANGO lineage classification
  • Coverage Analysis: Comprehensive coverage statistics and visualization
  • Multi-sample Processing: Batch processing with sample metadata management

Requirements

Software Dependencies

  • Nextflow (≥ 22.04.0)
  • Conda
  • Required tools (typically containerized):
    • Guppy/Guppyplex
    • ARTIC workflow tools
    • ProovFrame
    • Nextclade
    • Pangolin
    • Coverage analysis tools

System Requirements

  • Linux/macOS operating system
  • Minimum 8 GB RAM (16+ GB recommended)
  • 50+ GB free disk space for intermediate files

Usage

Quick Start

nextflow run HUGTiP-SARS-COV-2.nf/main.nf \
   --runID 'run001' \
   --outDir 'path/to/outDir' \
   --workDir 'path/to/workDir' \
   --dataDir 'path/dataDir' \
   --metadata 'path/sample_sheet.csv' \
   -profile conda_on

Sample Metadata Format

Your metadata CSV file should contain the following headers:

sampleID,barcode
Sample001,barcode01
Sample002,barcode02
Sample003,barcode03
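
A sample sheet in this layout can be checked before launching a run. The function below is a minimal sketch, not part of the pipeline: `validate_sample_sheet` is a hypothetical name, and the rule that barcodes look like `barcode` plus two digits is an assumption based on the example rows.

```shell
# Hypothetical helper: validate a sample sheet before launching the pipeline.
# Assumes the two-column layout shown above and barcode names of the form
# "barcode" followed by two digits.
validate_sample_sheet() {
    sheet="$1"
    if [ ! -f "$sheet" ]; then
        echo "sample sheet not found: $sheet" >&2
        return 1
    fi
    # Strip Windows line endings, then check the header and rows in one pass.
    tr -d '\r' < "$sheet" | awk -F',' '
        NR == 1 {
            if ($0 != "sampleID,barcode") { print "unexpected header: " $0; bad = 1 }
            next
        }
        $0 == "" { next }    # ignore blank lines
        NF != 2 || $1 == "" { print "line " NR ": expected sampleID,barcode"; bad = 1; next }
        $2 !~ /^barcode[0-9][0-9]$/ { print "line " NR ": unexpected barcode " $2; bad = 1 }
        seen[$2]++ { print "line " NR ": duplicate barcode " $2; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' >&2
}
```

Running `validate_sample_sheet path/sample_sheet.csv && echo "sample sheet OK"` prints any problems to standard error and returns a non-zero status if one is found.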

Workflow Steps

The pipeline executes the following main steps:

1. Data Preparation

  • Guppyplex: Aggregates the already demultiplexed per-barcode reads and filters them by length
  • Quality control and read filtering

2. Consensus Generation

  • ARTIC: Generates consensus sequences using the ARTIC workflow
  • Primer trimming and variant calling

3. Quality Correction

  • ProovFrame: Corrects frameshift errors introduced during sequencing
  • Maintains reading frame integrity
  • Requires a protein translation of the reference genome to guide the correction

4. Sequence Analysis

  • Alignment: Aligns corrected consensus sequences with reference genome
  • Nextclade: Performs clade assignment and mutation analysis
  • Pangolin: Assigns PANGO lineages for epidemiological tracking

5. Coverage Analysis

  • Coverage: Calculates depth and breadth of coverage
  • Generates coverage plots and statistics
  • Produces summary coverage report (coverage_mean.csv)
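
To illustrate what depth and breadth of coverage mean, the sketch below computes both from a per-position depth table. It assumes three tab-separated columns (reference, position, depth), as written by `samtools depth -a`; the format of the pipeline's own coverage files is not documented here and may differ. The function name and the default threshold of 20 are illustrative choices.

```shell
# Sketch: mean depth and breadth of coverage from a per-position depth table.
# Assumes three tab-separated columns (reference, position, depth); the
# pipeline's own coverage files may use a different layout.
coverage_summary() {
    depth_file="$1"
    min_depth="${2:-20}"    # illustrative threshold, not taken from the pipeline
    awk -F'\t' -v min="$min_depth" '
        { total += $3; n++; if ($3 >= min) covered++ }
        END {
            if (n == 0) { print "no positions found" > "/dev/stderr"; exit 1 }
            # mean depth = summed depth / positions
            # breadth    = fraction of positions at or above the threshold
            printf "mean_depth=%.2f breadth=%.4f\n", total / n, covered / n
        }
    ' "$depth_file"
}
```

For four positions with depths 10, 30, 50 and 30 and a threshold of 20, the mean depth is 120 / 4 = 30 and the breadth is 3 / 4 = 0.75.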


Output Structure

results/
├── guppyplex/
│   └── [sample_id]/
│       └── filtered_reads.fastq
├── artic/
│   ├── consensus/
│   │   └── [sample_id].consensus.fasta
│   └── coverage/
│       └── [sample_id].coverage.txt
├── proovframe/
│   └── [sample_id]/
│       ├── corrected.fasta
│       └── corrections.tsv
├── alignment/
│   └── aligned_consensus.fasta
├── nextclade/
│   └── nextclade_results.tsv
├── pangolin/
│   └── pangolin_lineages.csv
├── coverage/
│   └── coverage_plots/
├── concatenated_consensus.fasta
└── coverage_mean.csv

Key Output Files

  • concatenated_consensus.fasta: All corrected consensus sequences in a single file
  • coverage_mean.csv: Summary coverage statistics for all samples
  • nextclade_results.tsv: Clade assignments and mutation profiles
  • pangolin_lineages.csv: PANGO lineage assignments
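
A quick sanity check on these outputs is to compare the number of records in concatenated_consensus.fasta with the number of samples in the sample sheet. The helpers below are a sketch with hypothetical names; they assume one FASTA record per sample and a single header row in the CSV.

```shell
# Sketch: count consensus records and sample-sheet rows for comparison.
count_fasta_records() {
    # Every FASTA record starts with a ">" header line.
    grep -c '^>' "$1"
}

count_samples() {
    # Data rows only: skip the header row and ignore blank lines.
    tail -n +2 "$1" | grep -c '[^[:space:]]'
}
```

If `count_fasta_records results/concatenated_consensus.fasta` and `count_samples path/sample_sheet.csv` disagree, some samples may not have produced a consensus; whether the pipeline writes a record for such samples is not stated here.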

Configuration

Profile Configuration

Create a nextflow.config file for your environment:

profiles {
    docker {
        docker.enabled = true
        process {
            withName: 'guppyplex' {
                container = 'your-registry/guppyplex:latest'
            }
            withName: 'artic' {
                container = 'your-registry/artic:latest'
            }
            // Add other container configurations
        }
    }
    
    singularity {
        singularity.enabled = true
        // Singularity-specific configurations
    }
}

Resource Configuration

Adjust process resources based on your system:

process {
    withName: 'artic' {
        cpus = 4
        memory = '8 GB'
        time = '2h'
    }
    
    withName: 'coverage' {
        cpus = 2
        memory = '4 GB'
        time = '1h'
    }
}
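
One way to apply such overrides without editing the pipeline's own configuration is to keep them in a separate file. The sketch below writes a hypothetical `custom.config`; the file name and the resource values are arbitrary, and the process name `artic` is taken from the example above rather than checked against the pipeline code.

```shell
# Write resource overrides to a separate file (the name is arbitrary).
cat > custom.config <<'EOF'
process {
    withName: 'artic' {
        cpus = 8
        memory = '16 GB'
    }
}
EOF
```

The file can then be supplied by appending Nextflow's `-c custom.config` option to the Quick Start command, which adds it to the configuration used for the run.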

Troubleshooting

Common Issues

  1. Missing metadata file:

    Error: Please provide a samplesheet XLSX file with --samplesheet
    

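    Solution: Ensure the --metadata parameter points to a valid CSV file. Although the message mentions an XLSX samplesheet and --samplesheet, the Quick Start command passes a CSV file via --metadata.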

  2. Missing RunID:

    Error: Please provide a RunID using --runID
    

    Solution: Provide a numeric RunID with --runID 12345.

  3. Data directory not found:

    Error: Please provide full path to directory containing ONT results
    

    Solution: Verify the --dataDir path exists and contains ONT data.
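
The three checks above can be run by hand before starting Nextflow. This is a minimal sketch with a hypothetical function name; it only mirrors the error conditions listed here and does not inspect the contents of the data directory.

```shell
# Hypothetical pre-flight check mirroring the three errors listed above.
preflight() {
    run_id="$1"; data_dir="$2"; metadata="$3"
    status=0
    [ -n "$run_id" ]   || { echo "missing run ID (--runID)" >&2; status=1; }
    [ -d "$data_dir" ] || { echo "data directory not found: $data_dir" >&2; status=1; }
    [ -f "$metadata" ] || { echo "metadata file not found: $metadata" >&2; status=1; }
    return "$status"
}
```

For example, `preflight 'run001' 'path/dataDir' 'path/sample_sheet.csv' && echo "inputs look fine"` reports every missing input at once instead of stopping at the first.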

Citation

If you use this pipeline in your research, please cite:

  1. Nextflow: Di Tommaso, P., et al. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316-319.
  2. ARTIC: Quick, J., et al. (2017). Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples. Nature Protocols, 12(6), 1261-1276. See also https://artic.network/about
  3. ProovFrame: Hackl, T., et al. (2021). proovframe: frameshift-correction for long-read (meta)genomics. bioRxiv.
  4. Nextclade: Aksamentov, I., et al. (2021). Nextclade: clade assignment, mutation calling and quality control for viral genomes. Journal of Open Source Software, 6(67), 3773.
  5. Pangolin: O'Toole, Á., et al. (2021). Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool. Virus Evolution, 7(2), veab064.

About

Nextflow workflow for analysis of SARS-CoV-2 data using the ARTIC workflow, correction of frameshifts with proovframe, and classification of SARS-CoV-2 lineages with Nextclade and Pangolin.
