Description
Design (@brianraymor and @jahilton)
See the Design in The requirements for feature_is_filtered must be clarified
Context
@jychien mentioned:
I'm currently thinking about what can be done to make sure that feature_is_filtered is being filled in correctly. The specific case I am thinking about is when the contributor has actually filtered genes from the normalized matrix as compared to the raw, but has not filled out feature_is_filtered accordingly, so it is incorrectly set to False. After chatting with @pgarcia-nieto offline a little, it is understandable that a complete check of the raw.X vs X matrices would be too computationally expensive. So, I was wondering if it would be manageable to have a check to make sure that the number of features which have all-zero expression values is the same for raw.X and X.
Feature request
In the validator:
- Count the number of columns (genes) that are filled with 0s in both adata.X and adata.raw.X.
- If the counts differ, raise a warning.
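A minimal sketch of the check described above, not the validator's actual implementation. The function names `count_all_zero_columns` and `check_all_zero_gene_counts` are hypothetical, and the matrices are passed in directly rather than read from an AnnData object. It assumes cells are rows and genes are columns; the `!= 0` comparison is used because it should also work on scipy.sparse matrices without densifying them.

```python
import warnings

import numpy as np


def count_all_zero_columns(matrix):
    """Return the number of columns (genes) whose values are all 0."""
    # Count non-zero entries per column; np.asarray(...).ravel() flattens
    # the result whether it comes back as an ndarray or a 1 x n matrix.
    nonzero_per_column = np.asarray((matrix != 0).sum(axis=0)).ravel()
    return int((nonzero_per_column == 0).sum())


def check_all_zero_gene_counts(x, raw_x):
    """Warn if X and raw.X disagree on the number of all-zero genes.

    Hypothetical helper: returns the pair (count in X, count in raw.X).
    """
    n_x = count_all_zero_columns(x)
    n_raw = count_all_zero_columns(raw_x)
    if n_x != n_raw:
        warnings.warn(
            f"X has {n_x} all-zero genes but raw.X has {n_raw}; "
            "genes may have been filtered without setting feature_is_filtered."
        )
    return n_x, n_raw


# Toy data: gene 1 is all zeros in both matrices, gene 2 only in X.
raw_x = np.array([[1, 0, 3], [2, 0, 1]])
x = np.array([[0.5, 0.0, 0.0], [1.2, 0.0, 0.0]])
counts = check_all_zero_gene_counts(x, raw_x)  # emits the warning
```

In a validator this would presumably be called with adata.X and adata.raw.X; comparing only the two counts keeps the check cheap but cannot say which genes differ.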
Why
A data contributor may have filtered out genes in X without flagging them in adata.var["feature_is_filtered"].
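To illustrate the situation, here is a hedged sketch that lists genes which are all zeros in X, have counts in raw.X, and are not flagged. This per-gene comparison is not what the issue proposes (it is only an illustration of the failure mode), and the function name `unflagged_filtered_genes` is hypothetical. The flags are passed as a plain boolean sequence standing in for adata.var["feature_is_filtered"].

```python
import numpy as np


def unflagged_filtered_genes(x, raw_x, feature_is_filtered):
    """Return indices of genes that look filtered in X but are not flagged."""
    # A gene looks filtered when it is all zeros in X ...
    x_all_zero = np.asarray((x != 0).sum(axis=0)).ravel() == 0
    # ... while it has at least one non-zero value in raw.X.
    raw_has_counts = np.asarray((raw_x != 0).sum(axis=0)).ravel() > 0
    flagged = np.asarray(feature_is_filtered, dtype=bool)
    return np.flatnonzero(x_all_zero & raw_has_counts & ~flagged)


# Toy data: genes 2 and 3 were zeroed out in X, but only gene 3 is flagged.
raw_x = np.array([[1, 0, 3, 4], [2, 0, 1, 5]])
x = np.array([[0.5, 0.0, 0.0, 0.0], [1.2, 0.0, 0.0, 0.0]])
flags = [False, False, False, True]
suspect = unflagged_filtered_genes(x, raw_x, flags)
```

Gene 1 is all zeros in both matrices, so it is not reported: a gene with no counts in raw.X gives no evidence of filtering.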
Risks of new feature
- May be computationally expensive, as this will iterate over all genes in both X and raw.X
- The meaning of 0 in X is ambiguous: it could mean the gene was filtered out, or it could be a direct output of normalization