Dealing with pangenome multi-level names #104

@nsheff

Setting aside issue #103 about whether to include graphs, say we start with a standard for pangenomes alone, without their derived graphs.

A central design question is how names factor into pangenome identifiers.
This is much the same question we asked about sequence collections, but it is more complicated here because there are multiple layers of naming:

  • Sequence names (e.g., chrX vs. X)
  • Collection names (e.g., genome build identifiers)

That raises two questions:

  • If the same sequences are labeled differently (chrX vs. X), should that affect the identifier?
  • If the same genomes are labeled differently (genome1 vs. hap1), should that affect the identifier?

Options

One option is to construct identifiers only from the sequences array, ignoring names altogether. Another is to make names part of the digest. With two levels of names, there are four combinations:

  1. Sequence names + collection names + sequences (the most complete digest)
  2. Sequence names + sequences, ignoring collection names
  3. Collection names + sequences, ignoring sequence names
  4. Sequences only, ignoring both types of names
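
To make the four options concrete, here is a rough Python sketch that computes one digest per option for a toy pangenome. Nothing here is settled: the input layout (collection name mapped to a list of name/sequence pairs), the attribute names `colnames`, `seqnames`, and `sequences`, and the function names are all hypothetical, and sorted-key compact JSON is only a stand-in for a proper canonical JSON encoding. The digest function is the truncated SHA-512 form (24 bytes, base64url) used for GA4GH digests.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    # SHA-512, truncated to the first 24 bytes, then base64url-encoded.
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def digest_obj(obj) -> str:
    # Assumption: sorted-key compact JSON stands in for canonical JSON.
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return sha512t24u(text.encode("utf-8"))


def candidate_digests(pangenome: dict) -> dict:
    # Hypothetical layout: collection name -> list of (seqname, sequence) pairs.
    colnames = list(pangenome)
    seqnames = [[name for name, _ in seqs] for seqs in pangenome.values()]
    sequences = [
        [sha512t24u(seq.encode("ascii")) for _, seq in seqs]
        for seqs in pangenome.values()
    ]
    return {
        "option1_full": digest_obj(
            {"colnames": colnames, "seqnames": seqnames, "sequences": sequences}
        ),
        "option2_no_colnames": digest_obj(
            {"seqnames": seqnames, "sequences": sequences}
        ),
        "option3_no_seqnames": digest_obj(
            {"colnames": colnames, "sequences": sequences}
        ),
        "option4_sequences_only": digest_obj({"sequences": sequences}),
    }


# Same sequences throughout; only the labels change.
a = candidate_digests({"genome1": [("chrX", "ACGT")], "genome2": [("chrX", "ACGA")]})
b = candidate_digests({"genome1": [("X", "ACGT")], "genome2": [("X", "ACGA")]})  # seqnames relabeled
c = candidate_digests({"hap1": [("chrX", "ACGT")], "hap2": [("chrX", "ACGA")]})  # collections relabeled
```

In this sketch, `a` and `b` agree on options 3 and 4 but differ on 1 and 2, while `a` and `c` agree on options 2 and 4 but differ on 1 and 3. Only option 1 distinguishes all three.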

One question is: if the names differ, does that affect the resulting graph? If it does, then it should definitely affect the identifier.
If it doesn't, maybe it should still affect the identifier anyway, because I can imagine an analysis where the names of the sequences, or the names of the genomes, matter.

As with sequence collections, it may be useful to separate these concerns:

  • Additional optional or recommended level-1 identifiers that each capture identity at only some of the levels, whichever are useful.
  • As with refget seqcol, these ancillary digests act like pre-indexed search keys. The fact that many are possible is not inherently a problem; it may even be useful. The real question is which digests are practically valuable for algorithm developers and tool builders.
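
As a rough illustration of the "pre-indexed search key" idea, the sketch below groups pangenomes by a names-free ancillary digest, so that anything sharing the same sequences can be found with one lookup regardless of labels. The data layout, `build_index`, and the use of plain SHA-256 over sorted-key JSON are assumptions for illustration only, not the digest algorithm a standard would specify.

```python
import hashlib
import json
from collections import defaultdict


def digest(obj) -> str:
    # Stand-in digest: SHA-256 over sorted-key JSON (an assumption, not the spec).
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()


def build_index(pangenomes: list) -> dict:
    """Map a sequences-only ancillary digest to the full digests that share it."""
    index = defaultdict(list)
    for pg in pangenomes:
        full = digest(pg)  # base identifier: names at both levels included
        # Ancillary key: sequences only, dropping both levels of names.
        key = digest([[seq for _, seq in seqs] for seqs in pg.values()])
        index[key].append(full)
    return dict(index)


# Hypothetical layout: collection name -> list of [seqname, sequence] pairs.
a = {"genome1": [["chrX", "ACGT"]]}
b = {"hap1": [["X", "ACGT"]]}        # same sequence as a, different labels
c = {"genome1": [["chrX", "TTTT"]]}  # same labels as a, different sequence
index = build_index([a, b, c])
```

Here `a` and `b` land under the same key while keeping distinct full identifiers, and `c` gets a key of its own.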

Proposal

I suggest we follow the same logic we followed for sequence collections: the base identifier includes everything, meaning the sequences plus the names at both levels.
If we don't do this, a use case that depends on names at either level will not be able to use refget-pangenomes.
In contrast, if we do this, any use case that needs names at only one level, or at neither, can still work by using ancillary attributes.

Another benefit is that we can simply reuse the existing logic, concepts, and even code we've written for sequence collections.

Open design choices

  • What do you think of the proposal that the base model treats names at both levels as inherent?
  • Which optional digests should be defined to maximize utility?
