Genome Reconstruction from Bifrost Graphs

This project provides a C++ and Python-based pipeline for reconstructing and evaluating genomes from colored, compacted de Bruijn graphs (CCDBGs) generated by Bifrost. This document outlines the core workflow and how to use the provided tools.

Project Structure

├── CMakeLists.txt              # Build configuration for CMake
├── README.md                     # This file
├── build/                        # Build output directory
├── graphs/                       # Default location for generated graph files
├── references/                   # Directory for reference genome files
├── results/                      # Default location for outputs
│   ├── evaluation_reports/       # Detailed reports from QUAST/FastANI
│   ├── evaluation_tables/        # Summary .tsv evaluation reports
│   └── reconstructed_genomes/    # Reconstructed sequences in FASTA format
├── scripts/                      # Helper and analysis scripts
│   ├── evaluate.py               # Main evaluation script (QUAST & FastANI)
│   ├── makeDiagram.py            # Parses log files to create diagrams
│   ├── scaffold_from_paf.py      # Scaffolds contigs using a PAF file
│   └── visualize.py              # Visualizes synteny from FastANI output
└── src/                          # Source code
    ├── assemblers/               # Assembler strategy implementations
    ├── scaffolders/              # Scaffolder strategy implementations
    ├── main.cpp                  # Main executable for the reconstruction pipeline
    ├── graph_generator.cpp       # Executable for building Bifrost graphs
    ├── genome_reconstructor.cpp  # Executable to reconstruct from annotated paths
    ├── branch_counter.cpp        # Executable to analyze graph complexity
    ├── PathAnnotator.hpp/.cpp    # Class to annotate reference paths in the graph
    ├── TraversalState.hpp/.cpp   # Custom data for tracking unitig state
    ├── ReconstructionCommon.hpp  # Common type aliases and helper functions
    └── ReconstructionStrategy.hpp/.cpp # The strategy registry framework

Installation

First, clone this repository, then run the steps below from the repository's root directory.

git clone https://gitlab.rlp.net/mgawron/bachelorarbeit-bifrost-genome-reconstruction.git

1. Bifrost

Compile Bifrost with MAX_KMER_SIZE=256, which supports k-mer sizes up to 255:

git clone https://github.com/pmelsted/bifrost.git
mkdir bifrost/build && cd bifrost/build
cmake .. -DMAX_KMER_SIZE=256 -DCMAKE_POLICY_VERSION_MINIMUM=3.5 -DCMAKE_INSTALL_PREFIX=../../bifrost_build
make -j 4
make install
cd ../..      # return to root directory

2. Project

Compile the main project:

mkdir build
cd build
cmake ..
make

After a successful build, all assembly tools can be run from the build directory.

The evaluation scripts additionally require QUAST and the Conda environment described in the next two steps. Run those installation commands from the project's root directory.

3. QUAST

For x86_64:

git clone https://github.com/ablab/quast.git
cd quast && ./setup.py install

For ARM:

git clone https://github.com/megawron/quast.git
cd quast && ./setup.py install

4. Conda Environment

Install the remaining dependencies.

conda env create -f environment.yml
conda activate bifrost-genome-reconstruction



Documentation

The pipeline is composed of several command-line tools designed to be run in sequence.

Graph Generation (graph_generator)

This tool builds one de Bruijn graph per requested k-mer size and can optionally annotate the unitigs with custom data. Its output is the starting point for the reconstruction workflow.

Usage: ./graph_generator <base_name> [OPTIONS]

Arguments:
  <base_name>         Prefix for output graph files (e.g., 'graphs/salmonella').

Options:
  -g, --genome <path>   One or more input genome files (FASTA/FASTQ, gzipped).
                        (Required, can be specified multiple times)
  -k, --kmer <size>     One or more integer k-mer sizes for graph construction.
                        (Required, can be specified multiple times)
  --annotate <type>     (Optional) Annotate the graph with custom data.
                        Supported types: 'PathAnnotationData', 'TraversalState'.

Example:

./graph_generator graphs/salmonella -g ../testdata/salmonella/*.fasta.gz -k 91 127 255 --annotate TraversalState
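When graphs are built for many genomes or k-mer sizes, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of the project: the helper names are hypothetical, the flags are taken from the usage text above, and the output naming scheme is only inferred from example paths such as graphs/salmonella_TraversalState_k91.

```python
import shlex

def graph_generator_cmd(base_name, genomes, kmers, annotate=None):
    """Build an argument list for ./graph_generator from the documented options."""
    cmd = ["./graph_generator", base_name, "-g", *genomes, "-k", *(str(k) for k in kmers)]
    if annotate:
        cmd += ["--annotate", annotate]
    return cmd

def annotated_graph_prefix(base_name, annotate, k):
    # ASSUMPTION: naming inferred from the example paths in this README;
    # check the files that graph_generator actually writes.
    return f"{base_name}_{annotate}_k{k}"

kmers = [91, 127, 255]
cmd = graph_generator_cmd(
    "graphs/salmonella",
    ["a.fasta.gz", "b.fasta.gz"],  # placeholder input files
    kmers,
    annotate="TraversalState",
)
print(shlex.join(cmd))
for k in kmers:
    print(annotated_graph_prefix("graphs/salmonella", "TraversalState", k))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the build directory.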

De-Novo Genome Reconstruction (bifrostasm)

This is the main pipeline tool. It takes a pre-built graph and runs the selected assembler and scaffolder strategies to generate the final genome sequence(s).

Usage: ./bifrostasm [OPTIONS]

Required Options:
  --assembler <name|*>      Assembler strategy. Use '*' for all available.
  --scaffolder <name|*|none>  Scaffolding strategy. Use '*' for all, or 'none' to skip.
  -g, --graph <path>        Graph prefix or directory containing graphs.
  -q, --query <name>        Name of the genome/color to reconstruct.
  -o, --output <dir>        Base directory for reconstructed genome outputs.

Optional Options:
  --output-contigs-dir <dir>  Optional. Specify a directory for raw contigs.
  -l, --log                   Enable detailed logging for strategies.
  -t, --threads <num>         Number of threads for graph loading.
  -h, --help                  Print this help message.

Example:

./bifrostasm \
    --assembler basic \
    --scaffolder simple_concat \
    -g ../graphs/salmonella_TraversalState_k91 \
    -q SAL_AD6890AA_AS \
    -o ./results/reconstructed_genomes/ \
    --log

Reference-Guided Reconstruction (genome_reconstructor)

WARNING: The PathAnnotator used to annotate the graph does not currently work correctly, so this reconstruction mode is intended for debugging purposes only.

This tool reconstructs a genome from a graph that has been pre-annotated with the correct path using a reference genome. It requires a graph that was generated using the --annotate PathAnnotationData flag.

./genome_reconstructor \
    -g ./graphs/salmonella_PathAnnotationData_k51 \
    -q SAL_AD6890AA_AS \
    -o ./results/pa_reconstruction.fasta

Evaluation (evaluate.py)

This script automates the evaluation of your reconstructed genomes against a reference using QUAST and FastANI. It produces a concise summary report.

Usage: python scripts/evaluate.py [OPTIONS]

Required Options:
  -r, --reference <path>        Path to the reference genome FASTA file.
  -q, --reconstructed <paths...>  Paths to one or more reconstructed genome FASTA files.
  -o, --output_dir <path>       Directory to store all evaluation outputs.

Optional Options:
  --mode <all|quast|fastani>    Execution mode (Default: all).
  --threads <num>               Number of threads for QUAST/FastANI (Default: 8).
  --timeout <seconds>           Timeout for each external tool call.
  --visualize                   Enable creation of synteny plots (requires FastANI).
  --full_mode                   Save all intermediate files from QUAST and FastANI.

Example:

python scripts/evaluate.py \
    -r ./references/SAL_AD6890AA_AS.fasta \
    -q ./results/reconstructed_genomes/SAL_AD6890AA_AS/*.fasta.gz \
    -o ./results/evaluation_tables/ \
    --visualize

Auxiliary Tools

  • branch_counter: Analyzes the branching complexity (in- and out-degree) for each color in a graph, helping you understand its structure.
  • genome_reconstructor: Reconstructs a genome sequence from a graph that was annotated with PathAnnotationData, which can be useful for debugging the annotation process.
  • scripts/visualize.py: A helper script called by evaluate.py to create synteny plot images from FastANI alignment data.
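To illustrate what per-color branching means for branch_counter, here is a small sketch on a toy graph. It is not the tool's implementation: the data layout (a dict of color sets plus an edge list) and the rule that an edge belongs to a color only if both endpoints carry it are assumptions made for this example.

```python
from collections import defaultdict

def branching_per_color(unitigs, edges):
    """Count unitigs with more than one successor/predecessor, per color.

    unitigs: {unitig_id: set of colors}, edges: iterable of (src, dst).
    """
    out_deg = defaultdict(lambda: defaultdict(int))
    in_deg = defaultdict(lambda: defaultdict(int))
    for src, dst in edges:
        # ASSUMPTION: an edge counts for a color only if both ends carry it.
        for color in unitigs[src] & unitigs[dst]:
            out_deg[color][src] += 1
            in_deg[color][dst] += 1
    colors = set().union(*unitigs.values())
    return {
        color: {
            "out_branches": sum(1 for d in out_deg[color].values() if d > 1),
            "in_branches": sum(1 for d in in_deg[color].values() if d > 1),
        }
        for color in colors
    }

# A bubble u1 -> {u2, u3} -> u4 where each color takes a different arm.
unitigs = {"u1": {"A", "B"}, "u2": {"A"}, "u3": {"B"}, "u4": {"A", "B"}}
edges = [("u1", "u2"), ("u1", "u3"), ("u2", "u4"), ("u3", "u4")]
print(branching_per_color(unitigs, edges))
```

In this toy graph the combined graph contains a bubble, yet each individual color follows a linear path, which is the kind of structure a per-color analysis makes visible.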

Reference: Available Strategies

For advanced use, you can choose specific assembler and scaffolder strategies.

Assembler Strategies (--assembler)

Using TraversalState Annotation

basic: A fundamental strategy that extends contigs only along unambiguous paths of the target color. It terminates extension at forks or path ends, prioritizing certainty over contig length.
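The idea behind basic can be sketched on a toy graph as follows. This is an illustration under simplifying assumptions, not the project's C++ code: the graph is given as plain successor and predecessor dicts already restricted to the target color, and the path is a list of unitig IDs rather than a sequence.

```python
def extend_unambiguous(start, succ, pred):
    """Extend a path from `start` while the next step is unambiguous."""
    path = [start]
    seen = {start}
    current = start
    while True:
        options = succ.get(current, [])
        if len(options) != 1:
            break  # fork (several successors) or path end (none)
        candidate = options[0]
        # Stop if another path merges into the candidate or a cycle closes.
        if len(pred.get(candidate, [])) != 1 or candidate in seen:
            break
        path.append(candidate)
        seen.add(candidate)
        current = candidate
    return path

# a -> b -> c, then c forks into d and e
succ = {"a": ["b"], "b": ["c"], "c": ["d", "e"]}
pred = {"b": ["a"], "c": ["b"], "d": ["c"], "e": ["c"]}
print(extend_unambiguous("a", succ, pred))
```

The contig stops at the fork after c, which reflects the trade-off described above: shorter contigs in exchange for certainty.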

best_score_path: A greedy assembler that extends paths by selecting the successor that maximizes a scoring function. The score prioritizes target k-mer coverage, followed by total color count and unitig length as tie-breakers.
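A lexicographic score of this kind can be expressed as a tuple comparison. The sketch below is hypothetical: the field names are invented for the example, and it assumes that larger values win for all three criteria, which the description above does not state explicitly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    target_kmers: int   # k-mers of this unitig that carry the target color
    total_kmers: int    # all k-mers of this unitig
    color_count: int    # number of colors present on the unitig
    length: int         # unitig length in bases

def score(c):
    """Tuple score: coverage first, then color count, then length."""
    coverage = c.target_kmers / c.total_kmers if c.total_kmers else 0.0
    # ASSUMPTION: higher is better for every component.
    return (coverage, c.color_count, c.length)

def pick_successor(candidates):
    """Greedy choice: the candidate with the maximal score, or None."""
    return max(candidates, key=score) if candidates else None

candidates = [
    Candidate("u7", target_kmers=50, total_kmers=50, color_count=2, length=140),
    Candidate("u8", target_kmers=50, total_kmers=50, color_count=3, length=120),
    Candidate("u9", target_kmers=10, total_kmers=50, color_count=9, length=900),
]
print(pick_successor(candidates).name)
```

Because Python compares tuples element by element, later components only matter when the earlier ones tie, which matches the tie-breaker wording.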

guided_linear: Forms linear chains by choosing the "best" successor at each step. "Best" is determined first by target k-mer count and then by downstream ambiguity, preferring successors that have fewer successors of their own.
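As a rough sketch of this chaining rule on a toy graph (an illustration only; the function name, the dict-based graph, and the step limit are assumptions, not the project's implementation):

```python
def guided_chain(start, succ, target_kmers, max_steps=1000):
    """Follow the locally best successor until none is left."""
    chain = [start]
    visited = {start}
    current = start
    for _ in range(max_steps):
        options = [u for u in succ.get(current, []) if u not in visited]
        if not options:
            break
        # More target k-mers first; on ties, fewer onward successors.
        current = max(
            options,
            key=lambda u: (target_kmers.get(u, 0), -len(succ.get(u, []))),
        )
        chain.append(current)
        visited.add(current)
    return chain

succ = {"a": ["b", "c"], "b": ["d", "e"], "c": ["d"], "d": []}
target_kmers = {"b": 10, "c": 10, "d": 5, "e": 1}
print(guided_chain("a", succ, target_kmers))
```

Here b and c tie on target k-mer count, so the chain continues through c, whose single successor makes the next step less ambiguous.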

Without Annotation

explore: When encountering a branch, this assembler performs a limited-depth, breadth-first search to evaluate the branching paths. It scores paths based on average k-mer coverage and length to resolve local complexities.
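The limited-depth search can be sketched as follows. This is a simplified illustration, not the project's code: coverage is modeled as the per-unitig fraction of k-mers carrying the target color, unitig lengths are assumed positive, and ranking paths by (average coverage, total length) is an assumed scoring rule.

```python
from collections import deque

def explore_branch(branch, succ, coverage, length, max_depth=3):
    """Return the best path leaving `branch`, found by a depth-limited BFS."""
    best_path, best_score = None, None
    queue = deque([s] for s in succ.get(branch, []))
    while queue:
        path = queue.popleft()
        onward = [u for u in succ.get(path[-1], []) if u not in path and u != branch]
        if len(path) < max_depth and onward:
            queue.extend(path + [u] for u in onward)
            continue
        # The path is complete: depth limit reached or dead end.
        total_len = sum(length[u] for u in path)
        avg_cov = sum(coverage[u] * length[u] for u in path) / total_len
        candidate_score = (avg_cov, total_len)  # ASSUMPTION: lexicographic ranking
        if best_score is None or candidate_score > best_score:
            best_path, best_score = path, candidate_score
    return best_path

succ = {"x": ["p", "q"], "p": ["p2"], "q": ["q2"]}
coverage = {"p": 0.9, "p2": 0.2, "q": 0.6, "q2": 0.8}
length = {"p": 100, "p2": 100, "q": 100, "q2": 100}
print(explore_branch("x", succ, coverage, length, max_depth=2))
print(explore_branch("x", succ, coverage, length, max_depth=1))
```

With a depth of 1 the search prefers p, but looking one unitig further shows that the q branch has better coverage overall.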

explore_hc: An extension of explore that is aware of high-coverage (HC) unitigs. It penalizes paths that enter or stay in HC regions, aiming to avoid repetitive elements.

beam_context: A beam search assembler that uses a rich heuristic score including target k-mer coverage, path length, coverage variance, and the traversal of high-coverage regions to guide the search.
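For readers unfamiliar with beam search, the following generic sketch shows the core loop on a toy graph. It is not the beam_context implementation: the per-step score is supplied by the caller and stands in for the richer heuristic described above, and all names are hypothetical.

```python
def beam_search(start, succ, step_score, beam_width=2, max_steps=10):
    """Keep the `beam_width` best partial paths at every step."""
    beam = [(0.0, [start])]
    for _ in range(max_steps):
        expanded = []
        for total, path in beam:
            options = [u for u in succ.get(path[-1], []) if u not in path]
            if not options:
                expanded.append((total, path))  # finished paths stay in contention
                continue
            for u in options:
                expanded.append((total + step_score(u), path + [u]))
        new_beam = sorted(expanded, key=lambda item: item[0], reverse=True)[:beam_width]
        if new_beam == beam:
            break  # nothing could be extended any further
        beam = new_beam
    return max(beam, key=lambda item: item[0])[1]

succ = {"s": ["a", "b"], "a": ["a2"], "b": ["b2"], "a2": [], "b2": []}
cov = {"a": 0.9, "a2": 0.1, "b": 0.7, "b2": 0.8}
print(beam_search("s", succ, cov.get, beam_width=2))
print(beam_search("s", succ, cov.get, beam_width=1))
```

A beam width of 1 degenerates into a greedy search and commits to a early, while a width of 2 keeps the b branch alive long enough to discover that it scores better overall.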

beam_adaptive: A beam search assembler where parameters like beam width and exploration depth are adapted based on the graph's k-mer size. It uses a lookahead mechanism in its heuristic score to evaluate the quality of potential next steps.

beam_loop: An advanced beam search assembler that can detect and "unroll" simple tandem repeats based on k-mer coverage heuristics, allowing it to resolve and traverse short, repetitive loops.
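One common coverage heuristic for unrolling a loop is to compare the coverage of the repeated unitigs with that of the flanking unique unitigs. The sketch below illustrates that idea only; the ratio-based estimate, the cap on copies, and the function names are assumptions and may differ from what beam_loop actually does.

```python
def estimate_copies(loop_coverage, flank_coverage, max_copies=10):
    """Estimate how many times a loop is traversed from a coverage ratio."""
    if flank_coverage <= 0:
        return 1  # no usable signal: traverse the loop once
    copies = round(loop_coverage / flank_coverage)
    return max(1, min(copies, max_copies))

def unroll(entry, loop, exit_, copies):
    """Build a linear unitig path that walks the loop `copies` times."""
    return [entry] + loop * copies + [exit_]

# A repeat unitig "r" with three times the coverage of its flanks.
copies = estimate_copies(loop_coverage=90, flank_coverage=30)
print(unroll("in", ["r"], "out", copies))
```

The cap on copies keeps a noisy coverage estimate from producing an arbitrarily long unrolled path.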

Scaffolder Strategies (--scaffolder)

none: Skips the scaffolding step and outputs the raw contigs generated by the assembler.

simple_concat: Sorts the input contigs by ID and joins them into a single sequence, separating consecutive contigs with a fixed-length run of 'N' characters.
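In Python terms, the concatenation amounts to something like the sketch below. The gap length of 100 is a placeholder rather than the value used by the project, and the IDs are sorted as strings, which is an assumption about the ordering.

```python
def simple_concat(contigs, gap_length=100):
    """Join contigs in ID order, separated by a run of 'N' characters.

    contigs: {contig_id: sequence}. gap_length is a hypothetical default.
    """
    gap = "N" * gap_length
    return gap.join(seq for _, seq in sorted(contigs.items()))

print(simple_concat({"c2": "GG", "c1": "AC"}, gap_length=3))
```

Note that string sorting places "c10" before "c2"; zero-padded or numeric IDs avoid that surprise.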