Important: 🎓 Teaching or taking a course? Check out our Students Quick Start Guide for step-by-step lab instructions.
A robust, containerized pipeline for the de novo assembly, iterative polishing, and functional annotation of bacterial genomes. Built on Ubuntu 24.04 (Noble), Bio-Gen automates the modern microbiology workflow, transforming raw sequencing reads into polished, annotated genomes and interactive reports.
Traditional genomics practical courses often require students to manually install dozens of bioinformatics tools, resolve dependency conflicts, and configure complex Unix environments. This process is error-prone and often overshadows the actual scientific goal: understanding genome assembly and analysis.
Bio-Gen eliminates this technical overhead. It provides a reproducible, containerized environment with single-command execution, allowing students and researchers to focus on biological interpretation instead of infrastructure.
The pipeline integrates industry-standard tools into a seamless, automated sequence:
Raw Reads → FastQC (QC) → fastp (Trimming) → SPAdes (Assembly) → Filtering → Pilon (Polishing) → Prokka (Annotation) → MultiQC Report
Note: After trimming, all downstream steps automatically use the cleaned reads generated by fastp.
- Modern Noble Base: Built on Ubuntu 24.04 LTS for long-term compatibility.
- Intelligent Resource Management: Estimates available system memory and automatically adjusts tool usage to prevent crashes.
- Adaptive Trimming: Uses fastp to automatically detect adapters and trim low-quality bases.
- Iterative Accuracy: Polishes the genome using a BWA-Pilon loop until sequence convergence.
- Analysis Ready: Generates standardized files optimized for SigmoID, IGV, or Artemis.
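The "polish until convergence" idea from the list above can be sketched as a bounded loop. This is only an illustration, not Bio-Gen's actual code: the function names `polish_until_converged` and `demo_round` are hypothetical, and both the stopping rule (a round that reports zero changes) and the round cap are assumptions. In a real run, each round would realign the reads with BWA and run Pilon; here a stand-in function returns made-up change counts so the loop can be run on its own.

```shell
#!/bin/sh
# Hypothetical sketch of a polishing loop; not taken from Bio-Gen.

# polish_until_converged ROUND_CMD MAX_ROUNDS
# ROUND_CMD is called with the round number and must print how many
# changes that round made to the assembly.
polish_until_converged() {
    round_cmd=$1
    max_rounds=$2
    round=1
    while [ "$round" -le "$max_rounds" ]; do
        changes=$("$round_cmd" "$round")
        echo "round $round: $changes change(s)"
        # Assumed convergence rule: a round that changes nothing.
        if [ "$changes" -eq 0 ]; then
            echo "converged after $round round(s)"
            return 0
        fi
        round=$(( round + 1 ))
    done
    echo "stopped at the ${max_rounds}-round cap without converging"
    return 1
}

# Stand-in for a real BWA + Pilon round: invented change counts
# that shrink to zero by the third round.
demo_round() {
    case $1 in
        1) echo 12 ;;
        2) echo 3 ;;
        *) echo 0 ;;
    esac
}

polish_until_converged demo_round 5
```

The cap matters because a loop that only stops on "no changes" could run indefinitely if two rounds keep undoing each other's edits.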
- Docker: Installation Guide
- Docker Compose: Installation Guide
- RAM: 8 GB minimum (16 GB or more recommended for large datasets)
- OS: macOS, Linux, or Windows (via Docker Desktop)
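Before installing, the requirements above can be checked with a few lines of shell. The script below is a rough sketch and not part of Bio-Gen: the helper names (`have_cmd`, `kb_to_gb`, `ram_status`) are hypothetical, and the RAM check reads `/proc/meminfo`, so it only works on Linux. On macOS or Windows, the memory that matters is the amount allocated to Docker Desktop, which this script does not inspect.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of Bio-Gen itself.

# Succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Convert kB (as reported by /proc/meminfo) to GB, rounded to the nearest
# whole number, because an "8 GB" machine reports slightly less than 8 GiB.
kb_to_gb() {
    echo $(( ($1 + 524288) / 1048576 ))
}

# Classify a GB figure against the 8 GB minimum and 16 GB recommendation.
ram_status() {
    if [ "$1" -lt 8 ]; then
        echo "below minimum"
    elif [ "$1" -lt 16 ]; then
        echo "meets minimum"
    else
        echo "recommended"
    fi
}

if have_cmd docker; then
    echo "docker: found"
    if docker compose version >/dev/null 2>&1; then
        echo "docker compose: found"
    else
        echo "docker compose: missing"
    fi
else
    echo "docker: missing"
fi

# Linux only: report total RAM if /proc/meminfo is readable.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    if [ -n "$total_kb" ]; then
        total_gb=$(kb_to_gb "$total_kb")
        echo "RAM: ${total_gb} GB ($(ram_status "$total_gb"))"
    fi
fi
```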
Ensure you have Docker installed, then:
git clone https://github.com/pavelnovitsky/Bio-Gen.git
cd Bio-Gen
docker compose build
Tip: New to bioinformatics? Follow our Step-by-Step Students Guide to complete your lab assignment easily.
Run the full pipeline on your data with a single command. This will automatically check inputs, trim reads, assemble the genome, polish the sequence, and annotate features:
./launch.sh -1 sample_R1.fastq -2 sample_R2.fastq
After completion, open the MultiQC report in the results/ folder.
Verify the environment using the built-in 4MB test dataset (~2-5 min). This is the recommended first run to ensure everything is configured correctly:
./launch.sh --test
- Test dataset: ~2–5 minutes.
- Typical bacterial dataset: ~15–60 minutes (depending on available CPU and RAM). If the pipeline is running without errors, just wait; this is normal processing time.
You do not need to analyze raw tool logs to complete your analysis. Focus on:
- The Annotated Genome (.gbk): explore genes and features.
- The Final Assembly (.fasta): the final genome sequence.
- The MultiQC Report: overall quality and statistics.
Logs are provided in the logs/ directory only for technical debugging.
Each run produces a unique, timestamped folder in results/:
- reports/: START HERE. Open the MultiQC HTML report for an interactive summary of the entire run.
- annotation/: Contains the .gbk and .gff files for SigmoID or IGV.
- pilon/: Contains the final polished consensus genome (final_genome.fasta).
- trimmed/: Contains the cleaned FASTQ reads and adapter removal logs.
- quast/: Detailed assembly quality metrics (N50, L50, GC content).
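Because every run gets its own folder, it can help to locate the newest one from the command line. The sketch below is not part of Bio-Gen, and `latest_run` is a hypothetical helper. It assumes that run folders are named so that a plain alphabetical sort is also chronological (for example, a YYYYMMDD_HHMMSS-style timestamp); the exact naming scheme is not documented here and may differ.

```shell
#!/bin/sh
# Hypothetical helper; assumes folder names sort chronologically.

# Print the last subdirectory (by name) of the given results directory,
# or nothing if it has no subdirectories.
latest_run() {
    results_dir=$1
    for d in "$results_dir"/*/; do
        if [ -d "$d" ]; then
            printf '%s\n' "${d%/}"
        fi
    done | sort | tail -n 1
}

run=$(latest_run results)
if [ -n "$run" ]; then
    echo "Latest run: $run"
    # Report whether the main outputs are present.
    for f in "$run"/reports/*.html \
             "$run/annotation/genome.gbk" \
             "$run/pilon/final_genome.fasta"; do
        if [ -e "$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
        fi
    done
else
    echo "No runs found under results/"
fi
```

Run it from the project folder, where results/ lives.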
After a successful run, focus on these primary files:
- Annotated Genome: annotation/genome.gbk → Load into SigmoID.
- Final Assembly: pilon/final_genome.fasta → The polished consensus sequence.
- Interactive Summary: reports/*.html → The MultiQC report with all pipeline metrics.
- Files not found: Ensure your FASTQ files are located inside the project folder.
- Docker not running: Make sure Docker Desktop is started.
- Low memory errors: If the pipeline crashes during assembly, try reducing threads (e.g., -t 4).
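The "Files not found" case can be caught before launching. The following is a minimal sketch, not part of Bio-Gen: `inside_project` is a hypothetical helper that checks whether a file exists and sits somewhere under a given folder. The sample file names are the placeholders used in the run command above.

```shell
#!/bin/sh
# Hypothetical check; not part of Bio-Gen itself.

# inside_project FILE PROJECT_DIR
# Succeeds if FILE exists and its directory is PROJECT_DIR or a subfolder of it.
inside_project() {
    file=$1
    project_dir=$2
    [ -f "$file" ] || return 1
    # Resolve both locations to absolute, symlink-free paths.
    file_dir=$(cd "$(dirname "$file")" && pwd -P) || return 1
    root=$(cd "$project_dir" && pwd -P) || return 1
    # The trailing slash stops /a/proj2 from matching /a/proj.
    case "$file_dir/" in
        "$root"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Run from the project folder; replace the names with your own files.
for fq in sample_R1.fastq sample_R2.fastq; do
    if inside_project "$fq" .; then
        echo "ok: $fq"
    else
        echo "not found inside project folder: $fq"
    fi
done
```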
Distributed under the GNU GPLv3 License.
This pipeline utilizes expert-configured tool binaries provided by the State Public Health Bioinformatics (StaPH-B) consortium. We gratefully acknowledge their work in maintaining high-quality Docker builds.
Please cite the authors of the underlying tools:
- SPAdes: Prjibelski et al., 2020.
- Prokka: Seemann T, 2014.
- Pilon: Walker et al., 2014.
- fastp: Chen et al., 2018.
- MultiQC: Ewels et al., 2016.