cellxgene-schema CLI must update validation for feature_is_filtered #137

Open
@pablo-gar

Description

Design (@brianraymor and @jahilton)

See the Design in "The requirements for feature_is_filtered must be clarified".

Context

@jychien mentioned:

I'm currently thinking about what can be done to make sure that feature_is_filtered is being filled in correctly. The specific case I am thinking about is when the contributor has actually filtered genes from the normalized matrix as compared to the raw, but has not filled out feature_is_filtered accordingly, leaving it incorrectly set to False. After chatting with @pgarcia-nieto offline a little, it is understandable that a complete check of the raw.X vs X matrices would be too computationally expensive. So, I was wondering if it would be manageable to have a check to make sure that the number of features with all-zero expression values is the same for raw.X and X.

Feature request
In the validator:

  1. Count the number of columns (genes) that contain only 0s in adata.X and, separately, in adata.raw.X.
  2. If the two counts differ, raise a warning.
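
The two steps above could be sketched roughly as follows. This is only an illustration, not the validator's actual code: `count_all_zero_columns` and `zero_feature_warning` are hypothetical names, and the sketch assumes the matrices are NumPy arrays or SciPy sparse matrices with cells as rows and genes as columns.

```python
import numpy as np


def count_all_zero_columns(matrix):
    """Return how many columns (genes) contain no nonzero value."""
    # (matrix != 0) works for NumPy arrays and SciPy sparse matrices alike;
    # summing over axis 0 gives the number of nonzero entries per column.
    nonzero_per_column = np.asarray((matrix != 0).sum(axis=0)).ravel()
    return int((nonzero_per_column == 0).sum())


def zero_feature_warning(x, raw_x):
    """Hypothetical helper: return a warning message, or None if the counts match."""
    n_x = count_all_zero_columns(x)
    n_raw = count_all_zero_columns(raw_x)
    if n_x != n_raw:
        return (
            f"X has {n_x} all-zero features but raw.X has {n_raw}; "
            "check that feature_is_filtered is set correctly."
        )
    return None
```

In a validator this would presumably be called as `zero_feature_warning(adata.X, adata.raw.X)`. Only the counts are compared, not the identities of the all-zero genes, which keeps the check to a single pass over each matrix.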

Why
A data contributor may have filtered out genes in X but did not flag them in adata.var["feature_is_filtered"]

Risks of new feature

  1. May be computationally expensive, as this will iterate over all genes in both X and raw.X
  2. The meaning of 0 in X is ambiguous -- it could mean the gene was filtered out or it could be a direct output of normalization
