A Nextflow pipeline for downloading, cleaning, assembling, and annotating raw paired-end Illumina reads. This workflow is optimized for bacterial genome annotation with Prokka.
This pipeline was created for the "Workflow" exercise for BIOL7210.
This pipeline comes with a small sample dataset (located in Nextflow_Genomics_Pipeline/data) for testing:
- Sample: SRR2584863, an E. coli isolate often used in tutorials
- Data Source: NCBI SRA
Test data was generated using the following commands:
gunzip -c SRR2584863_1.fastq.gz | head -n 40000 | gzip > test_R1.fastq.gz
gunzip -c SRR2584863_2.fastq.gz | head -n 40000 | gzip > test_R2.fastq.gz

- 📥 Auto-download reads from SRA or use paired datasets in the format *_R{1,2}.fastq.gz (located in the "/data" folder)
- 🧼 Quality filtering with fastp
- 🧬 Assembly with SPAdes (run in parallel with SeqKit)
- 🧬 Annotation with Prokka
- 📊 Sequence length filtering with SeqKit (run in parallel with SPAdes)
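Because the pipeline expects paired files named *_R{1,2}.fastq.gz, it can help to confirm that every R1 file has a mate before starting a run. The following is a minimal sketch, not part of the pipeline itself; `check_pairs` is a hypothetical helper name, and it only checks for R1 files that lack an R2 (not the reverse).

```shell
#!/bin/sh
# Sketch: report *_R1.fastq.gz files that have no matching *_R2.fastq.gz.
# check_pairs is a hypothetical helper, not something the pipeline provides.
check_pairs() {
    dir="$1"
    missing=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        # If the glob matched nothing, the literal pattern comes back; skip it.
        [ -e "$r1" ] || continue
        # Derive the expected mate name by swapping the _R1 suffix for _R2.
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing mate for: $r1" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every R1 file had a mate.
    [ "$missing" -eq 0 ]
}

# Usage (assuming reads live in ./data):
#   check_pairs data && echo "all pairs complete"
```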
To run this pipeline, you will need:
- Nextflow: v24.10.5
- Conda: v24.3.0
  ⚠️ Because of a bug in newer Conda versions (specifically v25.x), this pipeline uses the older, more stable Conda v24.3.0.
- Operating System: macOS (tested on macOS Sonoma 14.6.1)
- Architecture: x86_64
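Since the Conda version matters here, a quick pre-flight check can catch a v25.x install before the pipeline is launched. This is a rough sketch under the assumption that `conda --version` prints a line such as `conda 24.3.0`; `conda_major` and `check_conda` are hypothetical helper names, and the check only flags major version 25, the one named above.

```shell
#!/bin/sh
# Sketch: flag a Conda 25.x install before running the pipeline.
# conda_major and check_conda are hypothetical helpers for illustration.

# Extract the major version from a line like "conda 24.3.0" (prints "24").
conda_major() {
    echo "$1" | awk '{ split($NF, v, "."); print v[1] }'
}

# Return non-zero and print a warning when the major version is 25.
check_conda() {
    version_line="$1"
    major=$(conda_major "$version_line")
    if [ "$major" = "25" ]; then
        echo "warning: Conda 25.x detected; this pipeline expects 24.3.0"
        return 1
    fi
    echo "ok: $version_line"
}

# Usage (requires Conda on the PATH):
#   check_conda "$(conda --version)"
```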
First, clone the git repository with the following command:
git clone https://github.com/binfwizard/Nexflow_Genomics_Pipeline.git
cd Nexflow_Genomics_Pipeline
Next, create a Conda environment (using Conda 24.3.0) and install Nextflow:
conda create -n nf -c bioconda nextflow -y
conda activate nf
conda install conda=24.3.0
Finally, run the pipeline on the test dataset in one command:
nextflow run nextflow_pipeline.nf -profile conda --threads 8
Alternatively, the pipeline can automatically download and process SRA reads using the --srr_id flag:
nextflow run nextflow_pipeline.nf -profile conda --threads 8 --srr_id "SRR2584863"
Upon running the pipeline, all of the following tools will be automatically installed via Bioconda:
- fastp: v0.23.4
- SPAdes: v3.15.5
- SeqKit: v2.6.1
- Prokka: v1.14.6
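The --srr_id invocation shown earlier handles one accession per run. To process several accessions, it could be wrapped in a small loop like the sketch below. `run_accessions` and the `DRY_RUN` variable are hypothetical names introduced for illustration. Whether successive runs overwrite each other's results/ directory depends on how the pipeline publishes its outputs, which is not covered here, so treat this as a starting point only.

```shell
#!/bin/sh
# Sketch: launch the pipeline once per SRA accession.
# run_accessions and DRY_RUN are hypothetical; only the nextflow command
# line itself is taken from this README.
run_accessions() {
    for srr in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print the command instead of running it.
            echo "nextflow run nextflow_pipeline.nf -profile conda --threads 8 --srr_id $srr"
        else
            # Stop at the first accession that fails.
            nextflow run nextflow_pipeline.nf -profile conda --threads 8 --srr_id "$srr" || return 1
        fi
    done
}

# Usage: preview the commands first, then run them.
#   DRY_RUN=1 run_accessions SRR2584863
#   run_accessions SRR2584863
```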
Pipeline Output Structure:
├── results/
│   ├── fastp/
│   │   ├── cleaned_1.fastq.gz
│   │   ├── cleaned_2.fastq.gz
│   │   ├── fastp_report.html
│   │   ├── fastp_report.json
│   │   └── fastp.log
│   ├── spades/
│   │   ├── contigs.fasta
│   │   └── spades.log
│   ├── seqkit/
│   │   ├── filtered.fastq.gz
│   │   └── seqkit.log
│   └── prokka/
│       ├── annotation.gff
│       └── prokka.log
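After a run, the layout above can be checked with a short script. The sketch below tests that the main data files from the tree exist and are non-empty. `check_outputs` is a hypothetical helper, and skipping the .log files is a choice made here for brevity.

```shell
#!/bin/sh
# Sketch: confirm the expected output files exist and are non-empty.
# check_outputs is a hypothetical helper; file names follow the tree above.
check_outputs() {
    root="${1:-results}"
    status=0
    for f in \
        fastp/cleaned_1.fastq.gz fastp/cleaned_2.fastq.gz \
        fastp/fastp_report.html fastp/fastp_report.json \
        spades/contigs.fasta \
        seqkit/filtered.fastq.gz \
        prokka/annotation.gff
    do
        # -s is true only for files that exist and have a size above zero.
        if [ ! -s "$root/$f" ]; then
            echo "missing or empty: $root/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_outputs results && echo "all expected outputs present"
```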
