This project provides a C++ and Python-based pipeline for reconstructing and evaluating genomes from colored, compacted de Bruijn graphs (CCDBGs) generated by Bifrost. This document outlines the core workflow and how to use the provided tools.
├── CMakeLists.txt # Build configuration for CMake
├── README.md # This file
├── build/ # Build output directory
├── graphs/ # Default location for generated graph files
├── references/ # Directory for reference genome files
├── results/ # Default location for outputs
│ ├── evaluation_reports/ # Detailed reports from QUAST/FastANI
│ ├── evaluation_tables/ # Summary .tsv evaluation reports
│ └── reconstructed_genomes/ # Reconstructed sequences in FASTA format
├── scripts/ # Helper and analysis scripts
│ ├── evaluate.py # Main evaluation script (QUAST & FastANI)
│ ├── makeDiagram.py # Parses log files to create diagrams
│ ├── scaffold_from_paf.py # Scaffolds contigs using a PAF file
│ └── visualize.py # Visualizes synteny from FastANI output
└── src/ # Source code
  ├── assemblers/ # Assembler strategy implementations
  ├── scaffolders/ # Scaffolder strategy implementations
  ├── main.cpp # Main executable for the reconstruction pipeline
  ├── graph_generator.cpp # Executable for building Bifrost graphs
  ├── genome_reconstructor.cpp # Executable to reconstruct from annotated paths
  ├── branch_counter.cpp # Executable to analyze graph complexity
  ├── PathAnnotator.hpp/.cpp # Class to annotate reference paths in the graph
  ├── TraversalState.hpp/.cpp # Custom data for tracking unitig state
  ├── ReconstructionCommon.hpp # Common type aliases and helper functions
  └── ReconstructionStrategy.hpp/.cpp # The strategy registry framework
First, clone this repository, then run the following steps from the repository's root directory.
git clone https://gitlab.rlp.net/mgawron/bachelorarbeit-bifrost-genome-reconstruction.git
Compile Bifrost with a maximum k-mer size of 256
git clone https://github.com/pmelsted/bifrost.git
mkdir bifrost/build && cd bifrost/build
cmake .. -DMAX_KMER_SIZE=256 -DCMAKE_POLICY_VERSION_MINIMUM=3.5 -DCMAKE_INSTALL_PREFIX=../../bifrost_build
make -j 4
make install
cd ../.. # return to root directory
Compile the main project
mkdir build
cd build
cmake ..
make
After a successful build, all assembly tools can be run from the build directory.
To run the evaluation scripts, first install their dependencies by executing the following commands from the project's root directory.
For x86_64:
git clone https://github.com/ablab/quast.git
cd quast && ./setup.py install
For ARM:
git clone https://github.com/megawron/quast.git
cd quast && ./setup.py install
Install the remaining dependencies.
conda env create -f environment.yml
conda activate bifrost-genome-reconstruction
The pipeline is composed of several command-line tools designed to be run in sequence.
This tool builds de Bruijn graphs for one or more k-mer sizes and can optionally attach custom annotation data to each unitig. Its output is the starting point for the reconstruction workflow.
Usage: ./graph_generator <base_name> [OPTIONS]
Arguments:
<base_name> Prefix for output graph files (e.g., 'graphs/salmonella').
Options:
-g, --genome <path> One or more input genome files (FASTA/FASTQ, gzipped).
(Required, can be specified multiple times)
-k, --kmer <size> One or more integer k-mer sizes for graph construction.
(Required, can be specified multiple times)
--annotate <type> (Optional) Annotate the graph with custom data.
Supported types: 'PathAnnotationData', 'TraversalState'.
Example:
./graph_generator graphs/salmonella -g ../testdata/salmonella/*.fasta.gz -k 91 127 255 --annotate TraversalState
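The later examples in this README refer to graphs such as salmonella_TraversalState_k91 and salmonella_PathAnnotationData_k51, which suggests that annotated graphs are named <base_name>_<annotation>_k<k>. The following Python sketch lists the prefixes one would expect under that assumption. The helper expected_graph_prefixes is hypothetical and not part of this project, and the naming of graphs built without --annotate is not covered.

```python
def expected_graph_prefixes(base_name, kmer_sizes, annotation):
    """Return the graph prefixes implied by the examples in this README.

    Assumption: annotated graphs are named <base_name>_<annotation>_k<k>.
    This pattern is inferred from the example commands, not from the tool's source.
    """
    return [f"{base_name}_{annotation}_k{k}" for k in kmer_sizes]


# Mirrors the example invocation above.
for prefix in expected_graph_prefixes("graphs/salmonella", [91, 127, 255], "TraversalState"):
    print(prefix)
```

Checking the actual contents of the output directory after a run is the reliable way to confirm the names.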
This is the main pipeline tool. It takes a pre-built graph and runs the selected assembler and scaffolder strategies to generate the final genome sequence(s).
Usage: ./bifrostasm [OPTIONS]
Required Options:
--assembler <name|*> Assembler strategy. Use '*' for all available.
--scaffolder <name|*|none> Scaffolding strategy. Use '*' for all, or 'none' to skip.
-g, --graph <path> Graph prefix or directory containing graphs.
-q, --query <name> Name of the genome/color to reconstruct.
-o, --output <dir> Base directory for reconstructed genome outputs.
Optional Options:
--output-contigs-dir <dir> Optional. Specify a directory for raw contigs.
-l, --log Enable detailed logging for strategies.
-t, --threads <num> Number of threads for graph loading.
-h, --help Print this help message.
Example:
./bifrostasm \
--assembler basic \
--scaffolder simple_concat \
-g ../graphs/salmonella_TraversalState_k91 \
-q SAL_AD6890AA_AS \
-o ./results/reconstructed_genomes/ \
--log
WARNING: The PathAnnotator used to annotate the graph does not currently work correctly, so this reconstruction method is intended for debugging purposes only.
This tool reconstructs a genome from a graph that has been pre-annotated with the correct path using a reference genome. It requires a graph that was generated using the --annotate PathAnnotationData flag.
./genome_reconstructor \
-g ./graphs/salmonella_PathAnnotationData_k51 \
-q SAL_AD6890AA_AS \
-o ./results/pa_reconstruction.fasta
This script automates the evaluation of your reconstructed genomes against a reference using QUAST and FastANI. It produces a concise summary report.
Usage: python scripts/evaluate.py [OPTIONS]
Required Options:
-r, --reference <path> Path to the reference genome FASTA file.
-q, --reconstructed <paths...> Paths to one or more reconstructed genome FASTA files.
-o, --output_dir <path> Directory to store all evaluation outputs.
Optional Options:
--mode <all|quast|fastani> Execution mode (Default: all).
--threads <num> Number of threads for QUAST/FastANI (Default: 8).
--timeout <seconds> Timeout for each external tool call.
--visualize Enable creation of synteny plots (requires FastANI).
--full_mode Save all intermediate files from QUAST and FastANI.
Example:
python scripts/evaluate.py \
-r ./references/SAL_AD6890AA_AS.fasta \
-q ./results/reconstructed_genomes/SAL_AD6890AA_AS/*.fasta.gz \
-o ./results/evaluation_tables/ \
--visualize
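When evaluating many reconstructions, it can help to assemble the evaluate.py call programmatically. The sketch below builds an argument list using only the options documented above. The function build_evaluate_command and the file names in the example are hypothetical.

```python
import shlex


def build_evaluate_command(reference, reconstructed, output_dir,
                           mode="all", threads=8, visualize=False):
    """Assemble an argv list for scripts/evaluate.py from the documented options."""
    if mode not in {"all", "quast", "fastani"}:
        raise ValueError(f"unsupported mode: {mode}")
    if not reconstructed:
        raise ValueError("at least one reconstructed genome is required")
    cmd = ["python", "scripts/evaluate.py",
           "-r", reference,
           "-q", *reconstructed,
           "-o", output_dir,
           "--mode", mode,
           "--threads", str(threads)]
    if visualize:
        cmd.append("--visualize")
    return cmd


# Hypothetical input files, used only to show the resulting command line.
cmd = build_evaluate_command(
    "references/SAL_AD6890AA_AS.fasta",
    ["results/a.fasta", "results/b.fasta"],
    "results/evaluation_tables/",
    visualize=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the project's root directory.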
- branch_counter: Analyzes the branching complexity (in- and out-degree) for each color in a graph, helping you understand its structure.
- genome_reconstructor: Reconstructs a genome sequence from a graph that was annotated with PathAnnotationData, which can be useful for debugging the annotation process.
- scripts/visualize.py: A helper script called by evaluate.py to create synteny plot images from FastANI alignment data.
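To illustrate what a per-color branching analysis might look like, here is a minimal Python sketch on a toy graph. It assumes that an edge counts for a color only when both of its endpoints carry that color, and that a unitig is "branching" when it has more than one such successor or predecessor. The actual definitions used by branch_counter may differ, and the function and data below are hypothetical.

```python
from collections import defaultdict


def branching_per_color(successors, colors):
    """Count, per color, the unitigs with more than one same-colored successor or predecessor."""
    out_deg = defaultdict(lambda: defaultdict(int))
    in_deg = defaultdict(lambda: defaultdict(int))
    for u, succs in successors.items():
        for v in succs:
            # An edge counts for a color only if both endpoints carry it (assumption).
            for c in colors.get(u, set()) & colors.get(v, set()):
                out_deg[c][u] += 1
                in_deg[c][v] += 1
    report = {}
    for c in sorted(set().union(*colors.values())):
        report[c] = {
            "out_branches": sum(1 for d in out_deg[c].values() if d > 1),
            "in_branches": sum(1 for d in in_deg[c].values() if d > 1),
        }
    return report


# Toy graph: u1 forks into u2 and u3, which both rejoin at u4.
successors = {"u1": ["u2", "u3"], "u2": ["u4"], "u3": ["u4"], "u4": []}
colors = {"u1": {"A", "B"}, "u2": {"A"}, "u3": {"A", "B"}, "u4": {"A", "B"}}
print(branching_per_color(successors, colors))
```

In this toy graph, color A sees the fork and the merge, while color B follows a single unbranched path through u3.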
For advanced use, you can choose specific assembler and scaffolder strategies.
Using TraversalState Annotation
basic: A fundamental strategy that extends contigs only along unambiguous paths of the target color. It terminates extension at forks or path ends, prioritizing certainty over contig length.
best_score_path: A greedy assembler that extends paths by selecting the successor that maximizes a scoring function. The score prioritizes target k-mer coverage, followed by total color count and unitig length as tie-breakers.
guided_linear: Forms linear chains by choosing the best successor at each step, where "best" is determined by target k-mer count and downstream ambiguity (successors with fewer successors of their own are preferred).
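The two ideas behind basic and best_score_path can be sketched in Python on a toy graph. This is an illustration only, not the C++ implementation: the dictionaries, function names, and the assumption that higher values are better for every score component are all hypothetical.

```python
def extend_unambiguous(start, successors, colors, target):
    """Follow target-colored successors for as long as exactly one exists (basic-style)."""
    path = [start]
    visited = {start}
    current = start
    while True:
        candidates = [v for v in successors.get(current, [])
                      if target in colors.get(v, set())]
        if len(candidates) != 1 or candidates[0] in visited:
            break  # fork, path end, or cycle: stop extending
        current = candidates[0]
        path.append(current)
        visited.add(current)
    return path


def best_successor(candidates, target_kmers, colors, lengths):
    """Pick a successor by (target k-mer coverage, color count, unitig length).

    Tuples compare element by element, so later entries act as tie-breakers.
    Assumes higher is better for each component.
    """
    return max(candidates, key=lambda v: (target_kmers[v], len(colors[v]), lengths[v]))


# Toy graph: a -> b, then b forks into c and d.
successors = {"a": ["b"], "b": ["c", "d"], "c": [], "d": []}
colors = {"a": {"T"}, "b": {"T"}, "c": {"T"}, "d": {"T", "X"}}

print(extend_unambiguous("a", successors, colors, "T"))
print(best_successor(["c", "d"], {"c": 10, "d": 10}, colors, {"c": 50, "d": 40}))
```

The first call stops at b because both successors carry the target color. The second call resolves the same fork greedily: c and d tie on target k-mer coverage, so the color count decides.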
Without Annotation
explore: When encountering a branch, this assembler performs a limited-depth, breadth-first search to evaluate the branching paths. It scores paths based on average k-mer coverage and length to resolve local complexities.
explore_hc: An extension of explore that is aware of high-coverage (HC) unitigs. It penalizes paths that enter or stay in HC regions, aiming to avoid repetitive elements.
beam_context: A beam search assembler that uses a rich heuristic score including target k-mer coverage, path length, coverage variance, and the traversal of high-coverage regions to guide the search.
beam_adaptive: A beam search assembler where parameters like beam width and exploration depth are adapted based on the graph's k-mer size. It uses a lookahead mechanism in its heuristic score to evaluate the quality of potential next steps.
beam_loop: An advanced beam search assembler that can detect and "unroll" simple tandem repeats based on k-mer coverage heuristics, allowing it to resolve and traverse short, repetitive loops.
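The beam-based assemblers share one core idea: keep only the few highest-scoring partial paths at each step. The Python sketch below shows that idea on a toy graph. It uses summed coverage as a stand-in score and does not model coverage variance, lookahead, adaptive parameters, or loop unrolling. All names and values are hypothetical.

```python
def beam_search(start, successors, coverage, beam_width=2, max_steps=10):
    """Return (score, path) for the best path found by a simple beam search.

    Score is the sum of per-unitig coverage along the path (a stand-in heuristic).
    """
    beam = [(coverage[start], [start])]
    best = beam[0]
    for _ in range(max_steps):
        expanded = []
        for score, path in beam:
            for nxt in successors.get(path[-1], []):
                if nxt in path:
                    continue  # never revisit a unitig; loops are not unrolled here
                expanded.append((score + coverage[nxt], path + [nxt]))
        if not expanded:
            break  # every path in the beam has reached a dead end
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_width]  # prune to the top candidates
        if beam[0][0] > best[0]:
            best = beam[0]
    return best


# Toy graph: the locally better branch (a) leads to a worse overall path.
successors = {"s": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
coverage = {"s": 1, "a": 5, "b": 4, "c": 1, "d": 9}

print(beam_search("s", successors, coverage, beam_width=1))
print(beam_search("s", successors, coverage, beam_width=2))
```

With a beam width of 1 the search degenerates to a greedy walk and commits to the branch through a. With a width of 2 it keeps the branch through b alive long enough to find the higher-scoring path.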
none: Skips the scaffolding step and outputs the raw contigs generated by the assembler.
simple_concat: Sorts the input contigs by ID and joins them into a single sequence, with a fixed number of 'N' characters between consecutive contigs.
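The simple_concat idea can be expressed in a few lines of Python. This is a sketch under stated assumptions: the gap length of 100 is a placeholder rather than the value used by the project, and contig IDs are sorted as plain strings.

```python
def simple_concat(contigs, gap_length=100):
    """Join contigs in ID order, separated by runs of 'N'.

    contigs maps contig ID -> sequence. gap_length is a placeholder value.
    """
    gap = "N" * gap_length
    return gap.join(seq for _, seq in sorted(contigs.items()))


# Short gap so the output is easy to read.
print(simple_concat({"contig_2": "GGTT", "contig_1": "ACGT"}, gap_length=3))
```

Note that string sorting places "contig_10" before "contig_2", so zero-padded or numeric IDs are needed if a numeric order matters.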