Graph representations and refget pangenomes #103

@nsheff

Today we discussed how the refget pangenome standard should handle graphs. Graphs could be:

  • Inherent to the pangenome object
  • A separate, top-level attribute of the pangenome, but not inherent
  • Not included at all in the standard

Context

  • A pangenome can be defined as a set of haplotypes.
  • A graph is a data structure that can be built on top of that set.
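To make the "set of haplotypes" view concrete, here is a minimal sketch of how a set-based pangenome could get a content-derived identifier that does not depend on the order in which haplotypes are listed. The `sha512t24u` function follows the GA4GH digest scheme used by refget (SHA-512, truncated to 24 bytes, base64url-encoded). Everything else is an assumption for illustration: `pangenome_digest`, the choice to sort and JSON-encode the member digests, and the toy haplotype strings are hypothetical and are not a proposed digest model.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (GA4GH digest scheme)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def pangenome_digest(haplotype_digests):
    """Hypothetical: digest a pangenome as an unordered set of haplotype digests."""
    # Deduplicate and sort so the result does not depend on input order.
    members = sorted(set(haplotype_digests))
    # Compact JSON array of strings as the (assumed) canonical serialization.
    canonical = json.dumps(members, separators=(",", ":")).encode("utf-8")
    return sha512t24u(canonical)


# Placeholder haplotype identifiers, derived here from toy strings.
hap_a = sha512t24u(b"ACGT")
hap_b = sha512t24u(b"ACGTT")

d1 = pangenome_digest([hap_a, hap_b])
d2 = pangenome_digest([hap_b, hap_a])
print(d1 == d2)  # same set, different order
```

Because the input is only the member identifiers, two parties holding the same haplotypes arrive at the same identifier regardless of tooling, which is the property that is hard to guarantee for graphs.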

Pro graph inclusion:

  • Graphs are what many users will actually consume and analyze. Standardization here would improve reproducibility.

Con graph inclusion:

  • Graphs are derived data and will be tool- and parameter-specific. This parallels linear reference genomes, where the community standardizes on the underlying sequence data, not on derived indexes (e.g., those built by BWA or Bowtie). The same set of haplotypes can yield different graphs across implementations, so in practice it is much harder to define content-derived identifiers that do for graphs what refget does for sequences and seqcols.

Position

I propose a two-phase approach: first standardize pangenomes, then standardize graphs.

Phase 1: Pangenomes. To start, graphs should not be included in the core pangenome representation at all. Standardizing pangenomes as sets of haplotypes is already a major step forward; it is achievable now and would provide immediate benefit. Individual graph files can then reference the pangenome identifier in their headers or metadata, which maintains reproducibility without making graphs part of the digest model.
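As a rough sketch of what "reference the pangenome identifier in headers" could look like, the example below writes and reads a GFA header line carrying the identifier as an optional `TAG:TYPE:VALUE` field. GFA header lines start with `H` and accept optional fields of that shape; the tag name `pg`, both function names, and the placeholder identifier are assumptions made up for this illustration and are not defined by GFA or refget.

```python
from typing import Optional


def gfa_header(pangenome_id: str, version: str = "1.1") -> str:
    """Build a GFA header line that points at a pangenome identifier."""
    # "pg" is a hypothetical user-defined tag name, not part of any spec.
    return f"H\tVN:Z:{version}\tpg:Z:{pangenome_id}"


def read_pangenome_id(gfa_text: str) -> Optional[str]:
    """Return the value of the hypothetical pg tag from the first header that has it."""
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "H":
            continue  # only header lines are inspected
        for field in fields[1:]:
            parts = field.split(":", 2)  # TAG:TYPE:VALUE
            if len(parts) == 3 and parts[0] == "pg":
                return parts[2]
    return None


# Placeholder identifier; a real one would come from the pangenome standard.
pangenome_id = "PANGENOME_DIGEST_PLACEHOLDER"

gfa_text = "\n".join([
    gfa_header(pangenome_id),
    "S\ts1\tACGT",
    "S\ts2\tTT",
    "L\ts1\t+\ts2\t+\t0M",
])

print(read_pangenome_id(gfa_text))
```

With this arrangement the graph file stays tool-specific, while the header records exactly which haplotype set it was built from.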

Phase 2: Graphs. In phase 2 (long-term), we can explore standardizing graph formats and defining digests for graph structures as a separate project. This may or may not prove feasible.

If we can't do phase 1, we almost certainly can't do phase 2, so let's just accomplish the easier one and see where we get.
