Breaking Change: update the requirements for feature_name annotation

# Context

_The aspiration to add new species on demand without requiring a data corpus migration is blocked by the current design for detecting and generating unique gene names. Multiple solutions were reviewed. There is consensus to make a breaking change in schema 6.0 (Q2) and collaborate with Workstream (Nik) for potential impact on CXG conversion and Gene Expression._

Review @joyceyan's [Handling of duplicate gene names](https://docs.google.com/document/d/19KgSJJSFl-X0nyZI8nTdUG0fVAJC84FluReGQV7diMI/edit?tab=t.0#heading=h.mtvk8lfekkv6)

See conclusion of [**sci-data-eng**](https://czi-sci.slack.com/archives/C07AV4NU9D2/p1739397305294269?thread_ts=1738349859.811719&cid=C07AV4NU9D2) thread.

**5.3 (Q1) - continue the status quo**

- Detect duplicate feature_names across gene references
- Continue to make duplicate feature_name(s) unique by suffixing the gene_name from the GTF with _{ENSEMBL ID}
- Require a gene migration for re-labeling due to new collisions
- Blocks the aspiration to add new species via schema patches
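
A minimal sketch of the status-quo policy, to show why a new gene reference can force a migration. The function name, the input shape (a list of `(feature_id, gene_name)` pairs pooled across gene references), and the identifiers are hypothetical; this is not the actual schema tooling.

```python
from collections import Counter


def suffix_duplicate_names(genes):
    """Map feature_id -> feature_name under the 5.3 policy (illustrative only).

    genes: list of (feature_id, gene_name) pairs pooled across all gene references.
    A gene_name that occurs more than once is suffixed with _{feature_id}.
    """
    counts = Counter(name for _, name in genes)
    return {
        feature_id: f"{name}_{feature_id}" if counts[name] > 1 else name
        for feature_id, name in genes
    }


# Hypothetical identifiers, not real Ensembl entries.
current = [("ENSG0001", "ABC1"), ("ENSMUSG0002", "ABC1"), ("ENSG0003", "XYZ2")]
before = suffix_duplicate_names(current)

# Adding a species whose GTF also contains "XYZ2" changes the label of ENSG0003.
after = suffix_duplicate_names(current + [("ENSNEW0004", "XYZ2")])
print(before["ENSG0003"], "->", after["ENSG0003"])
```

In this sketch the label for `ENSG0003` depends on which other references exist, which is why a new collision requires re-labeling data that has already been annotated.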

**6.0 (Q2) - require downstream consumers to implement their own policies to disambiguate duplicate feature_names**
- feature_id continues to be the unique identifier
- feature_name will be the gene_name taken directly from the GTF. For some new species, the gene_name is missing from the GTF for certain genes; for these, feature_name will be the ENSEMBL ID.
- Requires discovery by (or for) downstream consumers like CXG conversion and Gene Expression. For example, CXG conversion changes the var.index to feature_name, which will no longer be unique. As another example, Gene Expression may want to detect duplicate gene names and rewrite its version of feature_name to be feature_name (ENSEMBL Identifier).
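
As a sketch of what a downstream policy could look like, the snippet below rewrites only duplicated names to the `feature_name (ENSEMBL Identifier)` form mentioned above. The function name and the sample identifiers are hypothetical, and this is one possible policy rather than what Gene Expression or CXG conversion actually implements.

```python
from collections import Counter


def disambiguate_for_display(feature_ids, feature_names):
    """Return display names that are unique as long as feature_ids are unique.

    Names that occur once are kept as-is; duplicated names become
    "feature_name (feature_id)".
    """
    counts = Counter(feature_names)
    return [
        f"{name} ({feature_id})" if counts[name] > 1 else name
        for feature_id, name in zip(feature_ids, feature_names)
    ]


# Hypothetical identifiers, not real Ensembl entries.
ids = ["ENSG0001", "ENSG0002", "ENSG0003"]
names = ["ABC1", "ABC1", "XYZ2"]
display = disambiguate_for_display(ids, names)
print(display)
```

Because the rewrite happens in the consumer, the labels stored in the corpus do not change when a new species introduces a collision.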


The requirements for `feature_name` must also be updated based on @jychien's analysis of other cases.

See [**#sci-data-eng**](https://czi-sci.slack.com/archives/C07AV4NU9D2/p1739239034985149?thread_ts=1738885717.723039&cid=C07AV4NU9D2):
- Organisms where all genes have an entry in gene_name: human, mouse, covid. All other organisms have at least one gene without an associated gene_name.

- Organisms that have Ensembl Version in gene_id: human, mouse. (example: `gene_id "ENSG00000186092.7"; gene_type "protein_coding"; gene_name "OR4F5"; …`)
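
To make the two observations concrete, here is a small sketch that reads the attribute column of a GTF record, reports whether `gene_name` is present, and separates the version from a versioned `gene_id`. The helper names are hypothetical, the regular expression only handles simple `key "value";` pairs, and whether the version should be stripped at all is an assumption that this issue does not settle.

```python
import re

# Matches pairs such as: gene_id "ENSG00000186092.7";
_ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def parse_gtf_attributes(attribute_column):
    """Parse the attribute column of one GTF record into a dict.

    Repeated keys (for example several `tag` entries) keep only the last value.
    """
    return dict(_ATTRIBUTE.findall(attribute_column))


def split_version(gene_id):
    """Split "ENSG00000186092.7" into ("ENSG00000186092", "7").

    Returns (gene_id, None) when there is no "." in the identifier.
    """
    stable_id, dot, version = gene_id.partition(".")
    return (stable_id, version) if dot else (gene_id, None)


record = 'gene_id "ENSG00000186092.7"; gene_type "protein_coding"; gene_name "OR4F5";'
attrs = parse_gtf_attributes(record)
print(attrs.get("gene_name"), split_version(attrs["gene_id"]))

# A record without gene_name, as found in organisms other than human, mouse, covid.
# "GENE0001" is a made-up identifier.
unnamed = parse_gtf_attributes('gene_id "GENE0001"; gene_type "protein_coding";')
print(unnamed.get("gene_name"))
```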

---

# Design

### feature_name

<table><tbody>
    <tr>
      <th>Key</th>
      <td>feature_name</td>
    </tr>
    <tr>
      <th>Annotator</th>
      <td>CELLxGENE Discover MUST annotate.</td>
    </tr>
    <tr>
      <th>Value</th>
        <td><code>str</code>. If the <code>feature_biotype</code> is <code>"spike-in"</code> then this MUST be the ERCC Spike-In identifier appended with <code>" (spike-in control)"</code>.<br><br>If the <code>feature_biotype</code> is <code>"gene"</code> and a <code>gene_name</code> attribute is assigned to the <code>var.index</code> feature identifier in its corresponding gene reference, this MUST be the value of the <code>gene_name</code>. If a <code>gene_name</code> attribute is not assigned, then this MUST default to the <code>var.index</code> feature identifier. 
        </td>
    </tr>
</tbody></table>
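
The rule in the table can be sketched as a small function. The function name, the `gene_names` lookup (a mapping from feature identifier to the `gene_name` in its gene reference), and the sample identifiers are assumptions made for illustration, as is treating an empty `gene_name` as "not assigned"; the table does not cover that case.

```python
SPIKE_IN_SUFFIX = " (spike-in control)"


def annotate_feature_name(feature_id, feature_biotype, gene_names):
    """Derive feature_name for one var.index entry (illustrative sketch).

    feature_id: the var.index feature identifier.
    feature_biotype: "gene" or "spike-in".
    gene_names: dict mapping feature identifiers to gene_name values taken
        from the corresponding gene reference; identifiers without a
        gene_name are absent from the dict.
    """
    if feature_biotype == "spike-in":
        # The ERCC Spike-In identifier with the fixed suffix appended.
        return feature_id + SPIKE_IN_SUFFIX
    if feature_biotype == "gene":
        # Use gene_name when assigned; otherwise fall back to the identifier.
        # Assumption: an empty gene_name is treated as not assigned.
        return gene_names.get(feature_id) or feature_id
    raise ValueError(f"unexpected feature_biotype: {feature_biotype!r}")


# Hypothetical gene reference content.
reference = {"ENSG0001": "ABC1", "ENSG0002": "ABC1"}

print(annotate_feature_name("ENSG0001", "gene", reference))
print(annotate_feature_name("ENSNEW0004", "gene", reference))
print(annotate_feature_name("ERCC-00002", "spike-in", reference))
```

Note that `ENSG0001` and `ENSG0002` both receive `ABC1` here: under this design `feature_name` is not guaranteed to be unique, and `feature_id` remains the unique identifier.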

---


Breaking Change: update the requirements for feature_name annotation #1254