🧬 Genome Assembly and Annotation Pipeline

A Nextflow pipeline for downloading, cleaning, assembling, and annotating raw paired-end Illumina reads. This workflow is optimized for bacterial genome annotation with Prokka.

This pipeline was created for the "Workflow" exercise for BIOL7210.

📂 Test Data

This pipeline ships with a small sample dataset (in the repository's data folder) for testing:

Sample: SRR2584863, an E. coli isolate often used in tutorials
Data Source: NCBI SRA

Test data were generated by keeping the first 40,000 lines (10,000 reads, at four lines per FASTQ record) of each file:

gunzip -c SRR2584863_1.fastq.gz | head -n 40000 | gzip > test_R1.fastq.gz
gunzip -c SRR2584863_2.fastq.gz | head -n 40000 | gzip > test_R2.fastq.gz
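To confirm that a subset holds the expected number of reads, the lines can be counted and divided by four. The sketch below is only an illustration: the function name count_fastq_reads is hypothetical, and it assumes the standard four-line FASTQ layout with no wrapped sequence lines.

```shell
# Hypothetical helper: count reads in a gzipped FASTQ file.
# Assumes exactly four lines per record (header, sequence, "+", qualities).
count_fastq_reads() {
    local fastq_gz="$1"
    local lines
    # tr strips the padding that some wc implementations add
    lines=$(gunzip -c "$fastq_gz" | wc -l | tr -d ' ')
    echo $(( lines / 4 ))
}
```

For example, count_fastq_reads test_R1.fastq.gz and count_fastq_reads test_R2.fastq.gz should report the same number if both mates were truncated identically.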

🚀 Pipeline Features

📥 Auto-download reads from SRA, or use local paired-end files named *_R{1,2}.fastq.gz (located in the data folder)
🧼 Quality filtering with fastp
🧬 Assembly with SPAdes (run in parallel with SeqKit)
🧬 Annotation with Prokka
📊 Sequence length filtering with SeqKit (run in parallel with SPAdes)
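Local input depends on each sample having both an R1 and an R2 file that follow the *_R{1,2}.fastq.gz naming pattern. A minimal sketch of a pre-flight check is shown here; the function name list_read_pairs is hypothetical and is not part of the pipeline.

```shell
# Hypothetical helper: print the sample prefix of every complete R1/R2 pair.
# A pair is complete when <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz both exist.
list_read_pairs() {
    local data_dir="$1"
    local r1 prefix
    for r1 in "$data_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue            # the glob matched nothing
        prefix="${r1%_R1.fastq.gz}"         # strip the R1 suffix
        if [ -e "${prefix}_R2.fastq.gz" ]; then
            basename "$prefix"
        else
            echo "warning: no R2 mate for $r1" >&2
        fi
    done
}
```

Running list_read_pairs data on the bundled dataset would be expected to print test, given the test_R1.fastq.gz and test_R2.fastq.gz files described above.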

🛠️ Software Requirements

To run this pipeline, you will need:

Nextflow: v24.10.5
Conda: v24.3.0 ⚠️ Newer Conda releases (specifically v25.x) have a bug that affects this pipeline, so the older, more stable v24.3.0 is used instead
Operating System: macOS (tested on macOS Sonoma 14.6.1)
Architecture: x86_64

🏃‍♀️💨 Running the Pipeline

First, clone the git repository with the following command:

git clone https://github.com/binfwizard/Nexflow_Genomics_Pipeline.git
cd Nexflow_Genomics_Pipeline

Next, create a Conda environment containing Nextflow, activate it, and pin Conda to v24.3.0 inside it:

conda create -n nf -c bioconda nextflow -y
conda activate nf
conda install conda=24.3.0
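Because the pipeline is meant to avoid Conda v25.x, it can help to check the active version before launching. The following is a sketch under the assumption that conda --version prints a string of the form "conda 24.3.0"; the function names conda_major and conda_is_pre_25 are hypothetical.

```shell
# Hypothetical helper: extract the major version from a string
# such as "conda 24.3.0" (the assumed format of `conda --version`).
conda_major() {
    local version="${1##* }"    # drop everything up to the last space
    echo "${version%%.*}"       # keep the text before the first dot
}

# Succeed when the given version string is older than 25.x.
conda_is_pre_25() {
    [ "$(conda_major "$1")" -lt 25 ]
}
```

A possible use inside the activated environment: conda_is_pre_25 "$(conda --version)" || echo "Conda 25.x detected; pin conda=24.3.0 first".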

Finally, run the pipeline on the test dataset in one command:

nextflow run nextflow_pipeline.nf -profile conda --threads 8

Alternatively, the pipeline can automatically download and process SRA reads using the --srr_id flag:

nextflow run nextflow_pipeline.nf -profile conda --threads 8 --srr_id "SRR2584863"
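The two invocations above differ only in the optional --srr_id flag, so a small wrapper can assemble either one and reject a malformed accession before any download starts. This is a sketch, not part of the pipeline: is_run_accession and build_run_command are hypothetical names, and the pattern assumes run accessions of the form SRR, ERR, or DRR followed by at least six digits.

```shell
# Hypothetical helper: succeed if the argument looks like a run accession
# (assumed pattern: SRR/ERR/DRR followed by six or more digits).
is_run_accession() {
    printf '%s\n' "$1" | grep -Eq '^[SED]RR[0-9]{6,}$'
}

# Hypothetical helper: print (not execute) the nextflow command.
# With no accession, the command targets the bundled test dataset.
build_run_command() {
    local threads="$1" srr_id="${2:-}"
    local cmd="nextflow run nextflow_pipeline.nf -profile conda --threads $threads"
    if [ -n "$srr_id" ]; then
        is_run_accession "$srr_id" || { echo "invalid accession: $srr_id" >&2; return 1; }
        cmd="$cmd --srr_id \"$srr_id\""
    fi
    echo "$cmd"
}
```

Printing the command rather than running it keeps the helper safe to try; the output of build_run_command 8 SRR2584863 can be inspected and then pasted into the terminal.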

🔗 Dependencies

When the pipeline runs, the following tools are installed automatically from Bioconda:

fastp: v0.23.4
SPAdes: v3.15.5
SeqKit: v2.6.1
Prokka: v1.14.6

📝 Output Structure

A completed run produces the following directory tree:
├── results/
│   ├── fastp/
│   │   ├── cleaned_1.fastq.gz
│   │   ├── cleaned_2.fastq.gz
│   │   ├── fastp_report.html
│   │   ├── fastp_report.json
│   │   └── fastp.log
│   ├── spades/
│   │   ├── contigs.fasta
│   │   └── spades.log
│   ├── seqkit/
│   │   ├── filtered.fastq.gz
│   │   └── seqkit.log
│   └── prokka/
│       ├── annotation.gff
│       └── prokka.log
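
After a run, the tree above can serve as a checklist. The sketch below prints any listed file that is absent; the function name missing_outputs is hypothetical, and the file list is copied from the tree, so it would need adjusting if a run produces different names.

```shell
# Hypothetical helper: print every expected output file that is missing
# under the given results directory. The list mirrors the tree above.
missing_outputs() {
    local results_dir="$1"
    local rel
    for rel in \
        fastp/cleaned_1.fastq.gz fastp/cleaned_2.fastq.gz \
        fastp/fastp_report.html fastp/fastp_report.json fastp/fastp.log \
        spades/contigs.fasta spades/spades.log \
        seqkit/filtered.fastq.gz seqkit/seqkit.log \
        prokka/annotation.gff prokka/prokka.log; do
        [ -e "$results_dir/$rel" ] || echo "$rel"
    done
}
```

Calling missing_outputs results prints nothing when every file in the list is present.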
