Breaking Change: update the requirements for feature_name annotation #1254

Open
@brianraymor

Context

The aspiration to add new species on demand without requiring a data corpus migration is blocked by the current design for detecting and generating unique gene names. Multiple solutions were reviewed. There is consensus to make a breaking change in schema 6.0 (Q2) and to collaborate with Workstream (Nik) on the potential impact on CXG conversion and Gene Expression.

Review @joyceyan's Handling of duplicate gene names

See conclusion of sci-data-eng thread.

5.3 (Q1) - continue the status quo

  • Detect duplicate feature_names across gene references
  • Continue to make duplicate feature_name(s) unique by suffixing the gene_name from the GTF with _{ENSEMBL ID}
  • Require a gene migration for re-labeling due to new collisions
  • Blocks the aspiration to add new species via schema patches
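
To illustrate why the status quo forces a migration, here is a minimal sketch of the suffixing rule. It assumes that every gene sharing a duplicated gene_name receives the `_{ENSEMBL ID}` suffix; the function name and the gene identifiers are hypothetical and are not taken from the schema tooling.

```python
from collections import Counter

def suffix_duplicates(genes):
    """Map each ENSEMBL ID to a unique feature_name.

    genes: list of (ensembl_id, gene_name) pairs pooled across all gene references.
    Assumption: every member of a colliding group is suffixed, not only the later ones.
    """
    counts = Counter(name for _, name in genes)
    return {
        gene_id: f"{name}_{gene_id}" if counts[name] > 1 else name
        for gene_id, name in genes
    }

# Hypothetical identifiers, for illustration only.
existing = [("ENSG00000000001", "ABC1")]
with_new_species = existing + [("ENSXXXG00000000002", "ABC1")]

before = suffix_duplicates(existing)
after = suffix_duplicates(with_new_species)
```

Under this assumption, the label for the existing gene changes from `ABC1` to `ABC1_ENSG00000000001` once the new species introduces a collision, which is the re-labeling that requires a gene migration.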

6.0 (Q2) - require downstream consumers to implement their own policies to disambiguate duplicate feature_names

  • feature_id continues to be the unique identifier
  • feature_name will be the gene_name straight from the GTF. For new species, there are also cases where the gene_name is missing from the GTF. For these, feature_name will be the ENSEMBL ID.
  • Requires discovery by (or for) downstream consumers like CXG conversion and Gene Expression. For example, CXG conversion changes the var.index to feature_name, which will no longer be unique. As another example, Gene Expression may want to detect duplicate gene names and rewrite its version of feature_name to be feature_name (ENSEMBL Identifier).
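
The 6.0 proposal above could be sketched as follows. The first function applies the fallback to the ENSEMBL ID when gene_name is missing; the second shows one possible downstream policy in the spirit of the Gene Expression example. Both function names are hypothetical, and the exact display format a consumer chooses is an assumption.

```python
from collections import Counter

def feature_name_6_0(feature_id, gene_name):
    """Return gene_name as given in the GTF, or the ENSEMBL ID if it is missing or empty."""
    return gene_name if gene_name else feature_id

def disambiguate_for_display(features):
    """Hypothetical consumer-side policy.

    features: list of (feature_id, feature_name) pairs.
    Duplicated names are rewritten as "feature_name (feature_id)"; unique names are kept.
    """
    counts = Counter(name for _, name in features)
    return [
        f"{name} ({fid})" if counts[name] > 1 else name
        for fid, name in features
    ]

# Hypothetical identifiers, for illustration only.
raw = [
    ("ENSG00000000001", "ABC1"),
    ("ENSXXXG00000000002", "ABC1"),
    ("ENSXXXG00000000003", None),  # gene_name missing from the GTF
]
features = [(fid, feature_name_6_0(fid, name)) for fid, name in raw]
labels = disambiguate_for_display(features)
```

Because feature_id stays unique, a consumer that needs a unique index can always fall back to it; the rewrite above only affects the label the consumer shows.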

The requirements for feature_name must also be updated based on @jychien's analysis of other cases.

See #sci-data-eng:

  • Organisms where all genes have an entry in gene_name: human, mouse, covid. All other organisms have at least one gene without an associated gene_name.

  • Organisms that have Ensembl Version in gene_id: human, mouse. (example: gene_id "ENSG00000186092.7"; gene_type "protein_coding"; gene_name "OR4F5";…)
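
As a rough sketch of how the two observations above show up when reading a GTF, the snippet below parses an attribute field like the one in the example and separates the version from gene_id. The helper names are hypothetical, and whether the version should be stripped is an assumption rather than something this issue decides. Identifiers that carry an extra suffix after the version are not handled.

```python
import re

# Matches key "value" pairs in a GTF attribute field.
ATTR = re.compile(r'(\w+) "([^"]*)"')

def parse_attributes(attr_field):
    """Return GTF attributes as a dict; if a key repeats, the last value wins."""
    return dict(ATTR.findall(attr_field))

def strip_version(gene_id):
    """Drop everything from the first '.' onward; unversioned IDs pass through unchanged."""
    return gene_id.split(".", 1)[0]

attrs = parse_attributes(
    'gene_id "ENSG00000186092.7"; gene_type "protein_coding"; gene_name "OR4F5";'
)
unversioned_id = strip_version(attrs["gene_id"])
has_gene_name = "gene_name" in attrs
```

For organisms outside human, mouse, and covid, `has_gene_name` can be false for some genes, which is the case where feature_name would fall back to the ENSEMBL ID.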


Labels

  • 6.0: Next major CELLxGENE schema version
  • drafting: drafting schema requirements
  • schema: CELLxGENE Discover dataset schema