Dealing with pangenome multi-level names

Setting aside issue #103 about whether to include graphs, say we start with a standard for pangenomes alone, without their derived graphs.

A central design question is how *names* factor into pangenome identifiers.
This is much the same question we asked about sequence collections, except that it is now more complicated because *there are multiple layers of naming*:

* **Sequence names** (e.g., `chrX` vs. `X`)
* **Collection names** (e.g., genome build identifiers)

Each layer raises its own question:

* If the same sequences are labeled differently (`chrX` vs. `X`), should that affect the identifier?
* If the same genomes are labeled differently (`genome1` vs. `hap1`), should that affect the identifier?

## Options

One option is to construct identifiers only from the `sequences` array, ignoring names altogether. Another is to make names part of the digest. With two levels of names, there are four possible combinations:

1. Sequence names + collection names + sequences (the most complete digest)
2. Sequence names + sequences, ignoring collection names
3. Collection names + sequences, ignoring sequence names
4. Sequences only, ignoring both types of names
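
To make the four options concrete, here is a minimal sketch of how each digest could be computed for a toy pangenome. Everything in it is an assumption for illustration: the function names, the `(collection_name, [(sequence_name, sequence_digest), ...])` input shape, the order-sensitive treatment of both levels, and the use of `json.dumps` as a stand-in for a proper canonical JSON serialization. It is not a proposed algorithm.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def digest_obj(obj) -> str:
    # Assumption: compact, key-sorted JSON stands in for real canonicalization.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return sha512t24u(canonical.encode("utf-8"))


def pangenome_digests(pangenome):
    """Compute one digest per option.

    `pangenome` is a list of (collection_name, [(seq_name, seq_digest), ...]).
    Order is preserved at both levels (an assumption of this sketch).
    """
    return {
        # Option 1: sequence names + collection names + sequences
        "all": digest_obj(
            [[col, [[n, s] for n, s in seqs]] for col, seqs in pangenome]
        ),
        # Option 2: sequence names + sequences
        "seqnames_sequences": digest_obj(
            [[[n, s] for n, s in seqs] for _, seqs in pangenome]
        ),
        # Option 3: collection names + sequences
        "colnames_sequences": digest_obj(
            [[col, [s for _, s in seqs]] for col, seqs in pangenome]
        ),
        # Option 4: sequences only
        "sequences_only": digest_obj(
            [[s for _, s in seqs] for _, seqs in pangenome]
        ),
    }


# Hypothetical data: the sequence digests are placeholders, not real refget digests.
base = [
    ("genome1", [("chrX", "SQ.aaa"), ("chr1", "SQ.bbb")]),
    ("genome2", [("chrX", "SQ.ccc")]),
]
renamed_sequences = [
    ("genome1", [("X", "SQ.aaa"), ("1", "SQ.bbb")]),
    ("genome2", [("X", "SQ.ccc")]),
]

before = pangenome_digests(base)
after = pangenome_digests(renamed_sequences)
for option in before:
    print(option, "changed" if before[option] != after[option] else "unchanged")
```

In this sketch, renaming `chrX` to `X` changes only the two digests that include sequence names, and renaming `genome1` to `hap1` would change only the two that include collection names.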

One question is whether differing names affect the resulting graph. If they do, then names should definitely affect the identifier.
If they don't, maybe names should still affect the identifier anyway, because I can imagine an analysis where the names of the sequences, or the names of the genomes, matter.


As with sequence collections, it may be useful to separate these concerns:

* Additional optional or recommended level-1 identifiers can each capture identity at a different level, whichever are useful.
* As with refget seqcol, these ancillary digests act like pre-indexed search keys. The fact that many are possible is not inherently a problem; it may even be useful. The real question is which digests are practically valuable for algorithm developers and tool builders.

### Proposal

I suggest we follow the same logic we followed for sequence collections: the base scenario includes everything, so names at both levels are part of the base identifier.
If we don't do this, a use case that depends on names at either level will not be able to use refget-pangenomes.
In contrast, if we do this, any use case that needs names at only one level, or at neither, will still work by using ancillary attributes.

Another benefit is that it means we can simply reuse all the existing logic, concepts, and even code we've written for the sequence collections.
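
As a rough illustration of that reuse, the sketch below applies the seqcol-style two-step digest (digest each attribute array, then digest the object of those digests) once per genome and once more at the pangenome level. The pangenome attribute names `names` and `collections`, the ancillary `sorted_collections` digest, and the `json.dumps` stand-in for canonical JSON are all assumptions made for this example, not part of any existing specification.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canon(obj) -> bytes:
    # Assumption: compact, key-sorted JSON stands in for real canonicalization.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def level0_digest(attributes: dict) -> str:
    """Digest each attribute array, then digest the object of those digests."""
    level1 = {name: sha512t24u(canon(values)) for name, values in attributes.items()}
    return sha512t24u(canon(level1))


def seqcol_digest(names, lengths, sequences) -> str:
    """Digest of one genome, treated as a sequence collection."""
    return level0_digest({"names": names, "lengths": lengths, "sequences": sequences})


def pangenome_digest(collection_names, collections) -> str:
    """Base digest: the same logic one level up (hypothetical attribute names)."""
    return level0_digest({"names": collection_names, "collections": collections})


def sorted_collections_digest(collections) -> str:
    """Hypothetical ancillary digest that ignores collection names and order."""
    return sha512t24u(canon(sorted(collections)))


# Placeholder sequence digests, for illustration only.
g1 = seqcol_digest(["chrX", "chr1"], [8, 4], ["SQ.aaa", "SQ.bbb"])
g2 = seqcol_digest(["chrX"], [6], ["SQ.ccc"])

print(pangenome_digest(["genome1", "genome2"], [g1, g2]))
print(sorted_collections_digest([g1, g2]))
```

With this layout, relabeling `genome1` as `hap1` changes the base digest but leaves the ancillary digest untouched, which is the behavior a names-agnostic use case would look for.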

### Open design choices

* What do you think of the proposal that the base model treats names at both levels as inherent to the identifier?
* Which optional digests should be defined to maximize utility?



