Roadmap

Adam Novak edited this page Jan 6, 2026 · 78 revisions

This document sets out the high-level tasks which the vg development team hopes to accomplish in the next few versions of vg and beyond.

Roadmap 2026

Themes

  • Off-linear-reference-backbone pangenome-aware DeepVariant (DV), plus supporting vg ecosystem libraries and docs
  • Re-optimize short-read Giraffe so that it works as well as DRAGEN
  • Long-read Giraffe supplementary alignments, for assembly mapping
  • API stability guarantees for vg libraries (like htslib, 1-2 releases per year requiring a code change)
  • Index changes only in major releases, dev vs. maintenance branch separation
  • Completely new paradigm for pangenome variant calling
  • Engineering tasks we actually do need to do
  • Integrated indel realignment in vg

Tasks

  • Drop pinchesAndCacti and sonlib (Adam)
    • Drop Cactus-library-based snarl finder
  • User-facing, under-test Giraffe docs for HPRC graphs (Faith)
  • Removal of non-chaining Giraffe codepath
    • Paired-end support for chaining Giraffe codepath

Wishlist

These are things we would like to do eventually.

Engineering

Documentation and Tutorials

  • Actual tutorials and examples for using libvgio/libbdsg for DV use cases (see pysam's progressively increasing examples)
    • Better developer documentation (model: pysam docs) for vg library
    • Links from libvgio Doxygen section to Protobuf-derived doc pages
  • Example tutorial, under test, for idiomatic Giraffe on long-node rGFA with long reads and auto-chopping (autoindex probably already does the work), and close #3126
    • Queries (vg find) in rGFA space

Usability

  • No cruft in vg index #3144
  • Front-end the autoindex system with a sensible CLI for single indexes in vg index.
    • Clear input vs. output distinction.
    • 3 levels of command?
    • Deprecation path for old commands

Libraries and APIs

  • Improved HandleGraph API
    • Abstract away node boundaries
    • View all sequence as C++17 string_views instead of sequence-owning strings
    • O(1) reverse complement DNAStringView
  • Releases for vg libraries (libhandlegraph, libbdsg, libvgio)
    • Enhanced release acceptance testing so a release means production quality, like we would do for a paper. Proven to be able to run some number of genomes. (Tag paper-validated hashes as release?)
    • API stability guarantees for vg libraries (like htslib, 1-2 releases per year requiring a code change)
  • Libraries free of cryptic error messages when the user or inputs are wrong (Faith made some progress)
  • More utilities/use cases in vg libraries (libhandlegraph/libvgio)
    • Query the graph (for nodes/edges)
    • Given a GRCh38 location, retrieve the reads aligned near it
    • Get read attributes and know what they mean
  • Python bindings for libhandlegraph algorithms
    • Are they the right algorithms?
  • Figure out whether we actually need a libsnarls (low priority)
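
As a rough illustration of the "O(1) reverse complement DNAStringView" item above, here is a minimal sketch of what such a view could look like. The class name is taken from the bullet, but every member shown here (size, operator[], reverse_complement, str) is an assumption made for illustration and is not an existing libhandlegraph API. The idea is that the view stores a non-owning std::string_view plus an orientation flag, so reverse complementing only flips the flag and bases are complemented lazily on access.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch only: not part of libhandlegraph or vg.
class DNAStringView {
public:
    explicit DNAStringView(std::string_view forward, bool reversed = false)
        : data_(forward), reversed_(reversed) {}

    std::size_t size() const { return data_.size(); }

    // Base at position i, read in this view's own orientation.
    char operator[](std::size_t i) const {
        assert(i < data_.size());
        return reversed_ ? complement(data_[data_.size() - 1 - i]) : data_[i];
    }

    // O(1): flips the orientation flag and copies no sequence.
    DNAStringView reverse_complement() const {
        return DNAStringView(data_, !reversed_);
    }

    // O(n): materialize an owning string only when a caller really needs one.
    std::string str() const {
        std::string out;
        out.reserve(size());
        for (std::size_t i = 0; i < size(); ++i) {
            out.push_back((*this)[i]);
        }
        return out;
    }

private:
    // Complement of a single base; anything unrecognized becomes 'N'.
    static char complement(char c) {
        switch (c) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            case 'a': return 't';
            case 'c': return 'g';
            case 'g': return 'c';
            case 't': return 'a';
            default:  return 'N';
        }
    }

    std::string_view data_;  // Non-owning: the backing sequence must outlive the view.
    bool reversed_;          // True if the view reads the reverse complement strand.
};
```

With this sketch, DNAStringView(seq).reverse_complement().str() on "GATTACA" yields "TGTAATC" without touching the backing string until str() is called. The trade-off is that the backing storage must outlive every view of it, which is the same lifetime constraint that any string_view-based HandleGraph API would carry.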

I/O

  • Accept gzipped GFA if practical (can't mmap)
  • Replace GAM with GAF overall
    • Default everything to GAF instead of GAM
    • mpGAF
  • Eliminate VPKG framing if possible in favor of magic numbers everywhere
    • Resolve ensuing questions about GAM format. Just use GAF?
    • Handle things like GFA that don't have magic numbers and need to manually sniff the file
  • Just save from the object; no more save_handle_graph
  • Magic format registration for libvgio magic numbers for loading
  • Depend on libvgio in libbdsg to do the IO there and pick the right handle graph implementation, instead of having the format detection in vg itself where nobody else can use it.
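
The magic-number items above could be sketched roughly as follows. This is a hedged illustration, not the libvgio design: FormatRegistry, register_magic, register_sniffer, detect, make_example_registry, and the "EXMP" magic are all hypothetical names invented here. The only format facts relied on are that gzip streams begin with the bytes 0x1f 0x8b and that GFA is tab-separated text whose lines begin with a one-letter record type such as H, S, L, P, or W. Formats with a magic number register a byte prefix; formats without one, like GFA, register a content-sniffing function instead.

```cpp
#include <functional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical sketch only: not the libvgio registration API.
class FormatRegistry {
public:
    using Sniffer = std::function<bool(std::string_view)>;

    // Register a format identified by a fixed byte prefix (magic number).
    void register_magic(std::string name, std::string magic) {
        register_sniffer(std::move(name), [magic = std::move(magic)](std::string_view head) {
            return head.substr(0, magic.size()) == std::string_view(magic);
        });
    }

    // Register a format with no magic number that must be sniffed manually.
    void register_sniffer(std::string name, Sniffer matches) {
        entries_.push_back({std::move(name), std::move(matches)});
    }

    // Return the first registered format whose check accepts the leading
    // bytes of the file, or "unknown" if none does.
    std::string detect(std::string_view head) const {
        for (const auto& entry : entries_) {
            if (entry.matches(head)) {
                return entry.name;
            }
        }
        return "unknown";
    }

private:
    struct Entry {
        std::string name;
        Sniffer matches;
    };
    std::vector<Entry> entries_;
};

// Heuristic GFA check: a known one-letter record type followed by a tab.
inline bool looks_like_gfa(std::string_view head) {
    return head.size() >= 2 && head[1] == '\t' &&
           std::string_view("HSLPW").find(head[0]) != std::string_view::npos;
}

// Build a registry with a few example formats. "EXMP" is a made-up magic.
inline FormatRegistry make_example_registry() {
    FormatRegistry registry;
    registry.register_magic("gzip", "\x1f\x8b");
    registry.register_magic("example-binary", "EXMP");
    registry.register_sniffer("gfa", looks_like_gfa);
    return registry;
}
```

A loader would read the first few bytes of the input, pass them to detect, and dispatch to the matching handle graph implementation. Keeping the registry in a shared library rather than in vg itself is what would let other tools reuse the same detection logic.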

Project Infrastructure

  • Use Rust implementations of GBZ, etc. in vg
  • CMake-ify the main vg build
  • Remove sdsl-lite AKA SDSL2 to ship Apache-licensed binaries
  • Delete old snarl manager (except as a backward-compatible load?) and use snarl-only DI2 everywhere (Glenn did a little)
  • Eliminate vg::VG
    • Move everything that only it can do into other components
  • Replace Protobuf internal formats with faster ones (path_t, etc.)
  • Revise ID assignment logic to allow deterministic node breaking
  • Eliminate old systems and their associated submodules, or factor them out into their own projects
    • vg vectorize could be its own project
      • Update vg vectorize to modern, system Vowpal Wabbit
      • Or pull it out into its own submodule and remove Vowpal Wabbit dependency from vg
    • vg genotype
    • vg srpe

Research

Alignment

  • Giraffe optimizations/presets for Ultima single-end short reads
  • Adoption of the multipath alignment paradigm as the default
  • Graph-to-graph mapping, possibly with pgvf

Variant Calling

  • Completely new paradigm for pangenome variant calling so we don't need population-specific reference hacks
    • Variant calling against rGFA references, allowing new variants all over
    • Think about PanGenie fitting in here
    • Will need to involve non-independent read mapping
    • Solve CNVs with EM
  • vg deconstruct and Beagle to impute genotypes into partially-mapped and called data, as a PanGenie alternative (Erik, Andrea)
  • Implementation of an HHGA-like machine learning based variant caller
  • Integration of variant calling and assembly polishing processes
  • Use of MCMC techniques in the genotyper with multipath alignments
  • Prune the zoo of TraversalFinders, and expose the useful ones to Python

Visualization

  • Browser-free tube map
  • Better tube map handling of edge cases
    • No haplotypes on a node
    • Starting on a rare haplotype
