Bio-Gen: Automated Bacterial Assembly & Annotation Pipeline

Important

🎓 Teaching or taking a course? Check out our Students Quick Start Guide for step-by-step lab instructions.

A robust, containerized pipeline for the de novo assembly, iterative polishing, and functional annotation of bacterial genomes. Built on Ubuntu 24.04 (Noble), Bio-Gen automates the modern microbiology workflow, transforming raw sequencing reads into polished, annotated genomes and interactive reports.

🎯 Why Bio-Gen?

Traditional genomics practical courses often require students to manually install dozens of bioinformatics tools, resolve dependency conflicts, and configure complex Unix environments. This process is error-prone and often overshadows the actual scientific goal: understanding genome assembly and analysis.

Bio-Gen eliminates this technical overhead. It provides a reproducible, containerized environment with single-command execution, allowing students and researchers to focus on biological interpretation instead of infrastructure.

🧬 Pipeline Overview

The pipeline integrates industry-standard tools into a seamless, automated sequence:

Raw Reads → FastQC (QC) → fastp (Trimming) → SPAdes (Assembly) → Filtering → Pilon (Polishing) → Prokka (Annotation) → MultiQC Report

Note: After trimming, all downstream steps automatically use the cleaned reads generated by fastp.
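
The polishing stage in the sequence above is iterative. A minimal sketch of what such a loop could look like follows; it is not taken from run_assembly.sh. The function names, the pilon_round_N file prefix, the five-round cap, and the convergence test (an empty Pilon .changes file) are assumptions made for illustration, as is the presence of a `pilon` wrapper on PATH.

```shell
# One polishing pass: map reads to the draft with BWA, then correct it with Pilon.
# Assumes bwa, samtools, and a `pilon` wrapper are on PATH.
polish_round() {
    local draft="$1" prefix="$2" r1="$3" r2="$4"
    bwa index "$draft" &&
    bwa mem "$draft" "$r1" "$r2" | samtools sort -o "$prefix.bam" - &&
    samtools index "$prefix.bam" &&
    pilon --genome "$draft" --frags "$prefix.bam" --output "$prefix" --changes
}

# Repeat polishing until a round makes no changes or the round cap is reached.
# Prints the path of the last polished FASTA.
polish_until_converged() {
    local draft="$1" r1="$2" r2="$3" max="${4:-5}"   # cap of 5 is an assumption
    local round=1 prefix
    while [ "$round" -le "$max" ]; do
        prefix="pilon_round_$round"
        polish_round "$draft" "$prefix" "$r1" "$r2" || return 1
        draft="$prefix.fasta"
        # Pilon's --changes flag writes <prefix>.changes; empty means no edits
        if [ ! -s "$prefix.changes" ]; then
            break
        fi
        round=$((round + 1))
    done
    echo "$draft"
}
```

Under these assumptions, `polish_until_converged draft.fasta trimmed_R1.fastq trimmed_R2.fastq` would print the name of the final polished file. The round cap guards against assemblies that keep oscillating between two states and never report zero changes.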

🛠️ Key Features

Modern Noble Base: Built on Ubuntu 24.04 LTS for long-term compatibility.
Intelligent Resource Management: Estimates available system memory and automatically adjusts tool settings to reduce the risk of out-of-memory crashes.
Adaptive Trimming: Uses fastp to automatically detect adapters and trim low-quality bases.
Iterative Accuracy: Polishes the genome using a BWA-Pilon loop until sequence convergence.
Analysis Ready: Generates standardized files optimized for SigmoID, IGV, or Artemis.
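
The resource-management feature above could be approximated as follows. This is a minimal sketch, not Bio-Gen's actual logic: the function name `estimate_spades_mem_gb`, the 20% headroom, and the 1 GB floor are assumptions for illustration. It reads `MemAvailable` from a /proc/meminfo-style file (Linux only) and prints a value that could be passed to SPAdes through its `-m` memory-limit option, which is given in GB.

```shell
# Hypothetical helper: derive a memory limit in GB from available RAM.
# The optional path argument exists only to make the function easy to test.
estimate_spades_mem_gb() {
    local meminfo="${1:-/proc/meminfo}"
    local avail_kb gb
    avail_kb=$(awk '/^MemAvailable:/ {print $2}' "$meminfo")
    # Treat a missing field as 0 so the arithmetic below still works
    avail_kb="${avail_kb:-0}"
    # Keep 20% headroom (assumed), then convert kB to whole GB
    gb=$(( avail_kb * 80 / 100 / 1024 / 1024 ))
    # Never report less than 1 GB (assumed floor)
    if [ "$gb" -lt 1 ]; then
        gb=1
    fi
    echo "$gb"
}
```

For example, `spades.py -m "$(estimate_spades_mem_gb)" ...` would cap SPAdes below the memory currently available. Inside a container the value reflects what the container can see, which may differ from the host total.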

⚙️ Requirements

Docker: Installation Guide
Docker Compose: Installation Guide
RAM: 8 GB minimum (16 GB or more recommended for large datasets)
OS: macOS, Linux, or Windows (via Docker Desktop)

📥 Installation

Ensure Docker and Docker Compose are installed, then run:

git clone https://github.com/pavelnovitsky/Bio-Gen.git
cd Bio-Gen
docker compose build

📈 Usage

Tip

New to bioinformatics? Follow our Step-by-Step Students Guide to complete your lab assignment easily.

⚡ Quick Start (Recommended)

Run the full pipeline on your data with a single command. This will automatically check inputs, trim reads, assemble the genome, polish the sequence, and annotate features:

./launch.sh -1 sample_R1.fastq -2 sample_R2.fastq

After completion, open the MultiQC HTML report in the reports/ subfolder of your run's timestamped directory under results/.

🧪 Quick Test (Validation)

Verify the environment using the built-in 4 MB test dataset (~2–5 minutes). This is the recommended first run to ensure everything is configured correctly:

./launch.sh --test

⏱️ Runtime Expectations

Test dataset: ~2–5 minutes.
Typical bacterial dataset: ~15–60 minutes (depending on available CPU and RAM).

If the pipeline is running without errors, just wait: this is normal processing time.

🧠 What to Focus On

You do not need to analyze raw tool logs to complete your analysis. Focus on:

The Annotated Genome (.gbk): explore genes and features.
The Final Assembly (.fasta): the final genome sequence.
The MultiQC Report: overall quality and statistics.

Logs are provided in the logs/ directory for technical debugging only.

📁 Results Directory

Each run produces a unique, timestamped folder in results/:

reports/: START HERE — Open the MultiQC HTML report for an interactive summary of the entire run.
annotation/: Contains the .gbk and .gff files for SigmoID or IGV.
pilon/: Contains the final polished consensus genome (final_genome.fasta).
trimmed/: Contains the cleaned FASTQ reads and adapter removal logs.
quast/: Detailed assembly quality metrics (N50, L50, GC content).
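
Because every run gets its own timestamped folder, it helps to locate the newest one quickly. The helper below is a sketch under stated assumptions: `latest_run_dir` is a hypothetical name, and since the exact timestamp format of the folder names is not documented here, it picks the most recently modified subdirectory instead of parsing names.

```shell
# Hypothetical helper: print the most recently modified run folder
# under results/ (or under the directory given as the first argument).
latest_run_dir() {
    local base="${1:-results}"
    # -d lists the directories themselves, -t sorts newest first;
    # errors are silenced so an empty results/ prints nothing
    ls -1dt "$base"/*/ 2>/dev/null | head -n 1
}
```

The printed path ends with a slash, so `ls "$(latest_run_dir)reports/"*.html` would list the MultiQC report of the latest run. Folder names containing newlines would confuse this approach, which should not matter for timestamp-style names.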

🎯 Final Outputs (What to Use)

After a successful run, focus on these primary files:

Annotated Genome: annotation/genome.gbk → Load into SigmoID.
Final Assembly: pilon/final_genome.fasta → The polished consensus sequence.
Interactive Summary: reports/*.html → The MultiQC report with all pipeline metrics.

⚠️ Common Issues

Files not found: Ensure your FASTQ files are located inside the project folder.
Docker not running: Make sure Docker Desktop is started.
Low memory errors: If the pipeline crashes during assembly, try reducing threads (e.g., -t 4).
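
The first two issues above can be checked before launching. The functions below are a minimal sketch, not part of launch.sh: the names `is_inside_project` and `docker_is_running` are hypothetical. The Docker check relies on `docker info` returning a non-zero exit status when the daemon cannot be reached.

```shell
# Succeeds if FILE exists and its directory resolves to a location
# inside PROJECT_DIR (defaults to the current directory).
is_inside_project() {
    local file="$1" project="${2:-$PWD}"
    local file_dir project_abs
    [ -f "$file" ] || return 1
    # Resolve both paths physically so symlinks do not fool the comparison
    file_dir=$(cd "$(dirname "$file")" && pwd -P) || return 1
    project_abs=$(cd "$project" && pwd -P) || return 1
    case "$file_dir/" in
        "$project_abs"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds if the Docker daemon answers.
docker_is_running() {
    docker info >/dev/null 2>&1
}
```

A typical use from the project folder would be `is_inside_project sample_R1.fastq || echo "Move sample_R1.fastq into the project folder"`, followed by `docker_is_running || echo "Start Docker first"`.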

⚖️ License & Acknowledgments

Distributed under the GNU GPLv3 License.

This pipeline utilizes expert-configured tool binaries provided by the State Public Health Bioinformatics (StaPH-B) consortium. We gratefully acknowledge their work in maintaining high-quality Docker builds.

🎓 Citations

Please cite the authors of the underlying tools:

SPAdes: Prjibelski et al., 2020.
Prokka: Seemann T, 2014.
Pilon: Walker et al., 2014.
fastp: Chen et al., 2018.
MultiQC: Ewels et al., 2016.
