Roadmap
Adam Novak edited this page Dec 15, 2025 · 78 revisions
This document sets out the high-level tasks which the vg development team hopes to accomplish in the next few versions of vg and beyond.
- Off-linear-reference-backbone pangenome-aware DV (and supporting vg ecosystem libraries and docs)
- Make short-read Giraffe work as well as DRAGEN with re-optimization
- Long read Giraffe supplementary alignments, for assembly mapping
- API stability guarantees for vg libraries (like htslib, 1-2 releases per year requiring a code change)
- Index changes only in major releases, dev vs. maintenance branch separation
- Completely new paradigm for pangenome variant calling
- Engineering tasks we actually do need to do
- Integrated indel realignment in vg
These are the things we hoped to achieve on several planning horizons circa 2023:
- Actual tutorials and examples for using libvgio/libbdsg for DV use cases (see pysam's progressively increasing examples) (~2 weeks good, ~1/day bad) (Adam)
- Better developer documentation (model: pysam docs) for vg library
- Links from libvgio Doxygen section to Protobuf-derived doc pages
- Support DV use cases in vg libraries (libhandlegraph/libvgio) (Adam)
- Query the graph (for nodes/edges)
- GRCh38 location -> retrieve aligned reads near there
- Get read attributes and know what they mean
- Drop pinchesAndCacti and sonlib
- Drop Cactus-library-based snarl finder (Adam)
- Figure out if we need a `libsnarls` actually (Adam, Xian, Jordan) (low priority)
- Giraffe optimizations/presets for Ultima single-end short reads (A student)
- Giraffe pretty good on long reads
- Delete at least one index each from `vg index` #3144 (Adam, Jordan)
  - GCSA to its own command
- Eliminate intermediate `Alignment` as surject output, go graph `Alignment` -> BAM record(s) to make supplementary alignments work (Jordan, Aleksis)
  - Generalize spliced alignment code to let `vg surject` handle long deletions vs. target path (and generate `30000D` CIGARs)
  - Let surject generate supplementary alignments for e.g. mappings over inversions
- Haplotype sampling to modified GBZ based on k-mers should be good (Jouni)
- User-facing, under-test Giraffe docs for HPRC graphs (Faith)
- Delete old snarl manager (except as a backward-compatible load?) and use snarl-only DI2 everywhere (Glenn did a little)
- No cruft in `vg index` #3144 (Adam, Jordan)
- Releases for vg libraries (libhandlegraph, libbdsg, libvgio)
- Enhanced release acceptance testing so a release means production quality, like we would do for a paper. Proven to be able to run some number of genomes. (Tag paper-validated hashes as release?)
- API stability guarantees for vg libraries (like htslib, 1-2 releases per year requiring a code change)
- Libraries free of cryptic error messages when the user or inputs are wrong (Faith made some progress)
- Just front-end the autoindex system with a sensible CLI for single indexes in `vg index`. How do we deprecate things? 3 levels of command? Clear input vs. output distinction.
- Use memory-mapped graphs (Adam) (memory-mapping in DI2 produced weird slowness on some network filesystems) (Ignore)
- For tube map, to enable interactive whole-genome use (Future data vis enthusiast)
- For Giraffe
- Giraffe actually competitive on long reads
- Example tutorial under test for idiomatic Giraffe on long-node rGFA on long reads with auto-chopping (autoindex already does the work probably) and close #3126
- Queries (`vg find`) in rGFA space
- Completely new paradigm for pangenome variant calling so we don't need population-specific reference hacks
- Variant calling against rGFA references, allowing new variants all over
- Think about PanGenie fitting in here
- Will need to involve non-independent read mapping
- Solve CNVs with EM (Jordan)
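
The surject item above mentions `30000D` CIGARs for long deletions against the target path. As a rough illustration of what such a record implies, here is a minimal C++ sketch (not vg code) that counts the reference bases a SAM CIGAR string consumes. The function name `cigar_reference_length` and the example CIGAR are hypothetical; the set of reference-consuming operations (M, D, N, =, X) follows the SAM specification.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of vg: count the reference bases
// consumed by a SAM CIGAR string.
std::uint64_t cigar_reference_length(const std::string& cigar) {
    std::uint64_t total = 0;
    std::uint64_t run = 0;
    bool have_digits = false;
    for (char c : cigar) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            // Accumulate the decimal run length that precedes an operation.
            run = run * 10 + static_cast<std::uint64_t>(c - '0');
            have_digits = true;
            continue;
        }
        if (!have_digits) {
            throw std::invalid_argument("CIGAR operation without a length");
        }
        switch (c) {
            // These operations advance along the reference.
            case 'M': case 'D': case 'N': case '=': case 'X':
                total += run;
                break;
            // These operations do not advance along the reference.
            case 'I': case 'S': case 'H': case 'P':
                break;
            default:
                throw std::invalid_argument("unknown CIGAR operation");
        }
        run = 0;
        have_digits = false;
    }
    if (have_digits) {
        throw std::invalid_argument("CIGAR ends with a length but no operation");
    }
    return total;
}

// Usage sketch: a read split around a 30 kb deletion, written as one record,
// spans 40,000 reference bases while containing only 10,000 read bases.
inline void example_usage() {
    assert(cigar_reference_length("5000M30000D5000M") == 40000);
}
```

A single record with a very long `D` run keeps the alignment in one piece; the alternative raised in the list is to emit supplementary alignments instead.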
These are things we would like to do eventually.
- Eliminate `vg::VG` (Jordan)
  - Steal all the things only it can do away from it
- Default everything to GAF instead of GAM
  - mpGAF (Jordan, Jonas)
  - Also pgvf (Graph to graph)
  - Calls and snarls in one of these?
- Python bindings for `libhandlegraph` algorithms
  - Are they the right algorithms?
- Use of MCMC techniques in the genotyper with multipath alignments
- vg deconstruct and Beagle to impute genotypes into partially-mapped and called data, as a PanGenie alternative (Erik, Andrea)
- Alignment
  - Adoption of the multipath alignment paradigm as the default
  - Graph-to-graph mapping (Xian)
- Variant Calling
  - Implementation of an HHGA-like machine learning based variant caller
  - Integration of variant calling and assembly polishing processes
  - Prune the zoo of TraversalFinders, and expose the useful ones to Python
- Visualization
  - Browser-free tube map
  - Better tube map handling of edge cases
    - No haplotypes on a node
    - Starting on a rare haplotype
- Infrastructure
  - Destructively modernize and unify IO
    - Eliminate VPKG framing if possible in favor of magic numbers everywhere
      - Resolve ensuing questions about GAM format
        - Just use GAF?
      - Handle things like GFA that need to manually sniff
    - Just save from the object; no more `save_handle_graph`
    - Magic format registration for `libvgio` magic numbers for loading
    - Depend on `libvgio` in `libbdsg` to do the IO there and pick the right handle graph implementation
  - Replace Protobuf internal formats with faster ones
  - Revision of ID assignment logic to allow deterministic node breaking
  - Accept gzipped GFA if practical (can't mmap)
  - Improved HandleGraph API
    - Abstract away node boundaries
    - View all sequence as C++17 string_views instead of sequence-owning strings
    - O(1) reverse complement DNAStringView
  - CMake-ify the main vg build
  - Eliminate old systems and their associated submodules, or factor them out into their own projects
    - `vg vectorize` could be its own project
      - Update `vg vectorize` to modern, system Vowpal Wabbit
      - Or pull it out into its own submodule and remove Vowpal Wabbit dependency from vg
    - Eliminate RocksDB from vg; everybody using `vg map` uses GCSA indexes now.
    - `vg genotype`
    - `vg srpe`
  - More cross-language support
    - Interoperate with Rust handle graph users/providers
    - Interoperate with Java handle graph users/providers
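
The HandleGraph items above ask for non-owning C++17 `string_view` sequences and an O(1) reverse complement `DNAStringView`. One way this could work is sketched below, assuming a view that stores the forward-strand bytes plus an orientation flag. This is an illustration only, not an existing vg or libhandlegraph class; every member name here is hypothetical, and anything other than uppercase A/C/G/T is simplified to `N` when complemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch: a non-owning view of DNA that can be flipped to its
// reverse complement without copying or rewriting any bases.
class DNAStringView {
public:
    DNAStringView(std::string_view forward, bool is_reverse = false)
        : forward_(forward), is_reverse_(is_reverse) {}

    std::size_t size() const { return forward_.size(); }

    // O(1): same underlying bytes, opposite orientation flag.
    DNAStringView reverse_complement() const {
        return DNAStringView(forward_, !is_reverse_);
    }

    // Base at position i in the view's current orientation.
    char operator[](std::size_t i) const {
        assert(i < forward_.size());
        if (!is_reverse_) {
            return forward_[i];
        }
        return complement(forward_[forward_.size() - 1 - i]);
    }

    // O(n): materialize an owning string only when a caller really needs one.
    std::string str() const {
        std::string out;
        out.reserve(size());
        for (std::size_t i = 0; i < size(); ++i) {
            out.push_back((*this)[i]);
        }
        return out;
    }

private:
    // Simplification: only uppercase A/C/G/T are complemented; all else is N.
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            default:  return 'N';
        }
    }

    std::string_view forward_;  // Does not own the bytes it points at.
    bool is_reverse_;
};
```

The trade-off in this design is that flipping orientation is free, while each base read on the reverse strand pays for an index computation and a complement lookup. As with any `string_view`, the backing storage must outlive the view.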