Context
The aspiration to add new species on demand without requiring a data corpus migration is blocked by the current design for detecting and generating unique gene names. Multiple solutions were reviewed. There is consensus to make a breaking change in schema 6.0 (Q2) and collaborate with Workstream (Nik) for potential impact on CXG conversion and Gene Expression.
Review @joyceyan's Handling of duplicate gene names
See conclusion of sci-data-eng thread.
5.3 (Q1) - continue the status quo
- Detect duplicate feature_names across gene references
- Continue to make duplicate feature_name(s) unique by suffixing the gene_name from the GTF with _{ENSEMBL ID}
- Require a gene migration for re-labeling due to new collisions
- Blocks the aspiration to add new species via schema patches
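
The sketch below illustrates why the status quo forces a migration. It is only an assumption-laden illustration, not the actual pipeline: `status_quo_labels`, the input shape, and all IDs and gene names are hypothetical, and it assumes duplicates are detected by exact string match.

```python
from collections import Counter


def status_quo_labels(genes):
    """Map ENSEMBL ID -> feature_name, suffixing names that collide.

    `genes` is a list of (ensembl_id, gene_name) pairs pooled across all
    gene references (a hypothetical input shape for this sketch).
    """
    counts = Counter(name for _, name in genes)
    return {
        ensembl_id: f"{name}_{ensembl_id}" if counts[name] > 1 else name
        for ensembl_id, name in genes
    }


# Made-up IDs and names, for illustration only.
existing = [("ENSG00000000001", "ABC1"), ("ENSG00000000002", "XYZ2")]
before = status_quo_labels(existing)

# A newly added species happens to reuse the name "XYZ2".
new_species = [("ENSNEW00000000009", "XYZ2")]
after = status_quo_labels(existing + new_species)

# Labels of already-published genes that changed: these need a migration.
relabeled = {
    fid: (before[fid], after[fid]) for fid in before if before[fid] != after[fid]
}
print(relabeled)
```

Adding one reference changes the label of a gene that was already in the corpus, which is the re-labeling that a schema patch alone cannot deliver.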
6.0 (Q2) - require downstream consumers to implement their own policies to disambiguate duplicate feature_names
- feature_id continues to be the unique identifier
- feature_name will be the gene_name straight from the GTF. For new species, the gene_name is sometimes missing from the GTF; in those cases, feature_name will be the ENSEMBL ID.
- Requires discovery by (or for) downstream consumers like CXG conversion and Gene Expression. For example, CXG conversion sets the var.index to feature_name, which will no longer be unique. As another example, Gene Expression may want to detect duplicate gene names and rewrite its version of feature_name to be feature_name (ENSEMBL Identifier).
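
A minimal sketch of the 6.0 rules above, from the point of view of a downstream consumer. The function names `feature_name_6_0` and `display_names` and all sample IDs are hypothetical; the disambiguation policy shown is just the "feature_name (ENSEMBL Identifier)" example mentioned for Gene Expression, not a decided implementation.

```python
from collections import Counter


def feature_name_6_0(feature_id, gene_name):
    """Proposed 6.0 rule: gene_name straight from the GTF, else the ENSEMBL ID."""
    return gene_name if gene_name else feature_id


def display_names(features):
    """One possible consumer policy for duplicate feature_names.

    `features` is a list of (feature_id, feature_name) pairs. The result is
    keyed by feature_id, which stays the unique identifier.
    """
    counts = Counter(name for _, name in features)
    return {
        fid: f"{name} ({fid})" if counts[name] > 1 else name
        for fid, name in features
    }


# Made-up records: two genes share a name, one has no gene_name in the GTF.
raw = [
    ("ENSG00000000001", "ABC1"),
    ("ENSNEW00000000002", "ABC1"),
    ("ENSNEW00000000003", None),
]
features = [(fid, feature_name_6_0(fid, name)) for fid, name in raw]
labels = display_names(features)
print(labels)
```

Because the labels are computed on the consumer side, a new collision only changes what that consumer displays; the stored feature_name values are untouched.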
The requirements for feature_name must also be updated based on @jychien's analysis of other cases.
See #sci-data-eng:
- Organisms where all genes have an entry in gene_name: human, mouse, covid. All other organisms have at least 1 gene where there is not an associated gene_name.
- Organisms that have Ensembl Version in gene_id: human, mouse. (example: gene_id "ENSG00000186092.7"; gene_type "protein_coding"; gene_name "OR4F5";…)
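
For reference, the two cases above can be handled when reading the GTF attributes column roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the regex only covers simple `key "value";` attributes, and it assumes a consumer wants the unversioned ID, which the thread does not state. The second sample line uses a made-up gene_id.

```python
import re

# Matches GTF attributes of the form: key "value";
_ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')
# A trailing ".<digits>" is treated as the Ensembl version (assumption).
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def parse_gtf_attributes(attribute_field):
    """Parse the attributes column of one GTF line into a dict.

    Repeated keys keep only the last value, which is fine for
    gene_id and gene_name.
    """
    return dict(_ATTRIBUTE.findall(attribute_field))


def gene_id_and_name(attribute_field):
    """Return (unversioned gene_id, gene_name or None)."""
    attrs = parse_gtf_attributes(attribute_field)
    gene_id = _VERSION_SUFFIX.sub("", attrs["gene_id"])
    return gene_id, attrs.get("gene_name")


with_version = 'gene_id "ENSG00000186092.7"; gene_type "protein_coding"; gene_name "OR4F5";'
without_name = 'gene_id "GENE0001"; gene_biotype "protein_coding";'  # made-up ID

print(gene_id_and_name(with_version))
print(gene_id_and_name(without_name))
```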