Roadmap

Adam Novak edited this page Dec 15, 2025 · 78 revisions

This document sets out the high-level tasks which the vg development team hopes to accomplish in the next few versions of vg and beyond.

Roadmap 2026 Themes

  • Off-linear-reference-backbone pangenome-aware DV (and supporting vg ecosystem libraries and docs)
  • Make short-read Giraffe work as well as DRAGEN with re-optimization
  • Long-read Giraffe supplementary alignments, for assembly mapping
  • API stability guarantees for vg libraries (like htslib, 1-2 releases per year requiring a code change)
  • Index changes only in major releases, dev vs. maintenance branch separation
  • Completely new paradigm for pangenome variant calling
  • Engineering tasks we actually do need to do
  • Integrated indel realignment in vg

Old 2023 Roadmap

These are the things we hoped to achieve on several planning horizons circa 2023:

End of Spring (6/23)

  • Actual tutorials and examples for using libvgio/libbdsg for DV use cases (see pysam's progressively increasing examples) (~2 weeks good, ~1/day bad) (Adam)
    • Better developer documentation (model: pysam docs) for the vg libraries
    • Links from libvgio Doxygen section to Protobuf-derived doc pages
  • Support DV use cases in vg libraries (libhandlegraph/libvgio) (Adam)
    • Query the graph (for nodes/edges)
    • GRCh38 location -> retrieve aligned reads near there
    • Get read attributes and know what they mean
  • Drop pinchesAndCacti and sonlib
    • Drop Cactus-library-based snarl finder (Adam)
  • Figure out whether we actually need a libsnarls (Adam, Xian, Jordan) (low priority)
  • Giraffe optimizations/presets for Ultima single-end short reads (A student)

End of Summer (9/23)

  • Giraffe pretty good on long reads
  • Delete at least one index each from vg index #3144 (Adam, Jordan)
    • GCSA to its own command
  • Eliminate intermediate Alignment as surject output, go graph Alignment -> BAM record(s) to make supplementary alignments work (Jordan, Aleksis)
    • Generalize spliced alignment code to let vg surject handle long deletions vs. target path (and generate 30000D CIGARs)
    • Let surject generate supplementary alignments for e.g. mappings over inversions
  • Haplotype sampling to modified GBZ based on k-mers should be good (Jouni)
  • User-facing, under-test Giraffe docs for HPRC graphs (Faith)
  • Delete old snarl manager (except as a backward-compatible load?) and use snarl-only DI2 everywhere (Glenn did a little)
  • No cruft in vg index #3144 (Adam, Jordan)

End of Fall (12/23)

  • Releases for vg libraries (libhandlegraph, libbdsg, libvgio)
    • Enhanced release acceptance testing so a release means production quality, like we would do for a paper. Proven to be able to run some number of genomes. (Tag paper-validated hashes as release?)
    • API stability guarantees for vg libraries (like htslib, 1-2 releases per year requiring a code change)
  • Libraries free of cryptic error messages when the user or inputs are wrong (Faith made some progress)
  • Just front-end the autoindex system with a sensible CLI for single indexes in vg index. How do we deprecate things? 3 levels of command? Clear input vs. output distinction.
  • Use memory-mapped graphs (Adam) (memory-mapping in DI2 produced weird slowness on some network filesystems) (Ignore)
    • For tube map, to enable interactive whole-genome use (Future data vis enthusiast)
    • For Giraffe
  • Giraffe actually competitive on long reads

Later

  • Example tutorial under test for idiomatic Giraffe on long-node rGFA on long reads with auto-chopping (autoindex already does the work probably) and close #3126
    • Queries (vg find) in rGFA space
  • Completely new paradigm for pangenome variant calling so we don't need population-specific reference hacks
    • Variant calling against rGFA references, allowing new variants all over
    • Think about PanGenie fitting in here
    • Will need to involve non-independent read mapping
    • Solve CNVs with EM (Jordan)

Wishlist

These are things we would like to do eventually.

  • Eliminate vg::VG (Jordan)

    • Move everything that only it can do into other components
  • Default everything to GAF instead of GAM

    • mpGAF (Jordan, Jonas)
    • Also pgvf (Graph to graph)
    • Calls and snarls in one of these?
  • Python bindings for libhandlegraph algorithms

    • Are they the right algorithms?
  • Use of MCMC techniques in the genotyper with multipath alignments

  • vg deconstruct and Beagle to impute genotypes into partially-mapped and called data, as a PanGenie alternative (Erik, Andrea)

  • Alignment

    • Adoption of the multipath alignment paradigm as the default
    • Graph-to-graph mapping (Xian)
  • Variant Calling

    • Implementation of an HHGA-like machine learning based variant caller
    • Integration of variant calling and assembly polishing processes
    • Prune the zoo of TraversalFinders, and expose the useful ones to Python
  • Visualization

    • Browser-free tube map
    • Better tube map handling of edge cases
      • No haplotypes on a node
      • Starting on a rare haplotype
  • Infrastructure

    • Destructively modernize and unify IO
      • Eliminate VPKG framing if possible in favor of magic numbers everywhere
        • Resolve ensuing questions about GAM format
          • Just use GAF?
        • Handle things like GFA that need to manually sniff
      • Just save from the object; no more save_handle_graph
      • Magic format registration for libvgio magic numbers for loading
      • Depend on libvgio in libbdsg to do the IO there and pick the right handle graph implementation
    • Replace Protobuf internal formats with faster ones
    • Revision of ID assignment logic to allow deterministic node breaking
    • Accept gzipped GFA if practical (can't mmap)
    • Improved HandleGraph API
      • Abstract away node boundaries
      • View all sequence as C++17 string_views instead of sequence-owning strings
      • O(1) reverse complement DNAStringView
    • CMake-ify the main vg build
    • Eliminate old systems and their associated submodules, or factor them out into their own projects
      • vg vectorize could be its own project
        • Update vg vectorize to modern, system Vowpal Wabbit
        • Or pull it out into its own submodule and remove the Vowpal Wabbit dependency from vg
      • Eliminate RocksDB from vg; everybody using vg map uses GCSA indexes now.
      • vg genotype
      • vg srpe
    • More cross-language support
      • Interoperate with Rust handle graph users/providers
      • Interoperate with Java handle graph users/providers
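
To make the "O(1) reverse complement DNAStringView" item under "Improved HandleGraph API" concrete, here is a minimal C++17 sketch of one way such a view might work. The class name is taken from the bullet above, but every member shown here is a hypothetical illustration, not an existing vg or libhandlegraph API. The idea is that reverse-complementing only flips an orientation flag on a non-owning `std::string_view`, and complementing happens lazily on access.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch; not an existing vg or libhandlegraph type.
// A non-owning view of DNA that can be reverse-complemented without copying.
class DNAStringView {
public:
    explicit DNAStringView(std::string_view bases, bool reversed = false)
        : bases_(bases), reversed_(reversed) {}

    std::size_t size() const { return bases_.size(); }

    // Base at position i, read in this view's orientation.
    char operator[](std::size_t i) const {
        assert(i < bases_.size());
        return reversed_ ? complement(bases_[bases_.size() - 1 - i]) : bases_[i];
    }

    // O(1): flips the orientation flag and copies no sequence.
    DNAStringView reverse_complement() const {
        return DNAStringView(bases_, !reversed_);
    }

    // O(n): materialize an owning string only when one is really needed.
    std::string str() const {
        std::string out;
        out.reserve(size());
        for (std::size_t i = 0; i < size(); ++i) {
            out.push_back((*this)[i]);
        }
        return out;
    }

private:
    // Assumption for this sketch: uppercase ACGT only; anything else maps to N.
    static char complement(char c) {
        switch (c) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            default:  return 'N';
        }
    }

    std::string_view bases_;
    bool reversed_;
};
```

Because the view does not own its sequence, the underlying storage has to outlive every view made from it. That lifetime question is presumably part of what an API built on `string_view`s would need to settle.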
