Description
Setting aside issue #103 about whether to include graphs, say we start with a standard for pangenomes alone, without their derived graphs.
A central design question is how names factor into pangenome identifiers.
This is not so different from the question we asked about sequence collections, but it is more complicated here because there are multiple layers of naming:
- Sequence names (e.g., `chrX` vs. `X`)
- Collection names (e.g., genome build identifiers)

This raises two questions:
- If the same sequences are labeled differently (`chrX` vs. `X`), should that affect the identifier?
- If the same genomes are labeled differently (`genome1` vs. `hap1`), should that affect the identifier?
Options
One option is to construct identifiers only from the `sequences` array, ignoring names altogether. Another is to make names part of the digest. With two layers of names, there are four combinations:
- Seqnames + colnames + sequences (the most complete digest)
- Seqnames + sequences, ignoring collection names
- Sequences + collection names, ignoring seqnames
- Sequences only, ignoring both types of names
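To make the four combinations concrete, here is a minimal sketch, not a proposed implementation. It assumes a truncated SHA-512 digest (24 bytes, base64url) in the style of GA4GH `sha512t24u`, and it approximates canonical JSON with sorted keys and compact separators. The pangenome layout, the `collection_name` key, the function names, and the `SQ.aaa`-style sequence digests are all hypothetical.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    # SHA-512, truncated to the first 24 bytes, base64url-encoded.
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical(obj) -> bytes:
    # Assumption: sorted keys + compact separators stand in for canonical JSON.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def digest_variants(pangenome):
    """pangenome: list of (collection_name, [(seq_name, seq_digest), ...])."""

    def view(use_colnames: bool, use_seqnames: bool) -> str:
        entries = []
        for colname, members in pangenome:
            entry = {"sequences": [seq for _, seq in members]}
            if use_seqnames:
                entry["names"] = [name for name, _ in members]
            if use_colnames:
                entry["collection_name"] = colname
            entries.append(entry)
        return sha512t24u(canonical(entries))

    return {
        "seqnames+colnames+sequences": view(True, True),
        "seqnames+sequences": view(False, True),
        "colnames+sequences": view(True, False),
        "sequences": view(False, False),
    }


# Same sequence content, different labels at both levels (placeholder digests).
a = [("genome1", [("chrX", "SQ.aaa"), ("chr1", "SQ.bbb")])]
b = [("hap1", [("X", "SQ.aaa"), ("1", "SQ.bbb")])]
# Same sequence names as `a`, different collection name.
c = [("hap1", [("chrX", "SQ.aaa"), ("chr1", "SQ.bbb")])]

va, vb, vc = digest_variants(a), digest_variants(b), digest_variants(c)
print(va["sequences"] == vb["sequences"])                    # names ignored
print(va["seqnames+sequences"] == vc["seqnames+sequences"])  # collection names ignored
```

In this sketch, `a` and `b` agree only on the sequences-only digest, while `a` and `c` also agree on the digest that ignores collection names. Which of these variants are worth standardizing is exactly the open question.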
One question is whether differing names affect the resulting graph. If they do, names should definitely affect the identifier.
If they don't, names may still belong in the identifier, because I can imagine an analysis where the names of the sequences, or the names of the genomes, matter.
As with sequence collections, it may be useful to separate these concerns:
- Additional optional or recommended level-1 identifiers that capture identity at different levels, whichever are useful.
- As with refget seqcol, these ancillary digests act like pre-indexed search keys. The fact that many are possible is not inherently a problem; it may even be useful. The real question is which digests are practically valuable for algorithm developers and tool builders.
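The "pre-indexed search key" idea can be sketched as a lookup table from an ancillary digest to the base digests that share it. This is an illustration under assumptions, not part of any spec: the digest helper, the pangenome layout (a list of dicts with `name`, `names`, and `sequences`), and the function names are hypothetical, and only one ancillary digest (sequences only) is shown.

```python
import base64
import hashlib
import json
from collections import defaultdict


def digest(obj) -> str:
    # Assumption: truncated SHA-512 over compact, key-sorted JSON.
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(hashlib.sha512(payload).digest()[:24]).decode("ascii")


def base_digest(pangenome) -> str:
    # Base identifier: names at both levels plus the sequences.
    return digest(pangenome)


def sequences_only_digest(pangenome) -> str:
    # Ancillary identifier: drop collection names and sequence names.
    return digest([{"sequences": col["sequences"]} for col in pangenome])


def build_index(pangenomes):
    """Map each ancillary digest to the base digests that share it."""
    index = defaultdict(list)
    for pg in pangenomes:
        index[sequences_only_digest(pg)].append(base_digest(pg))
    return dict(index)


# Two labelings of the same content (placeholder sequence digests).
pg1 = [{"name": "genome1", "names": ["chrX"], "sequences": ["SQ.aaa"]}]
pg2 = [{"name": "hap1", "names": ["X"], "sequences": ["SQ.aaa"]}]

index = build_index([pg1, pg2])
matches = index[sequences_only_digest(pg1)]
print(len(matches))  # both labelings are found under one ancillary key
```

A tool that only cares about sequence content would query by the ancillary key and get every differently labeled pangenome back, while the base digest still distinguishes them.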
Proposal
I suggest we follow the same logic we followed for sequence collections: the base scenario includes everything, meaning names at both levels plus the sequences.
If we don't do this, a use case that uses names at one level or the other will not be able to use refget-pangenomes.
In contrast, if we do this, any use case that only needs one or the other or neither will still be able to work, using ancillary attributes.
Another benefit is that we can simply reuse the existing logic, concepts, and even code we've written for sequence collections.
Open design choices
- What do you think of this proposal that the base model treats names at both levels as inherent to identity?
- Which optional digests should be defined to maximize utility?