Description
Please describe your wishes and possible alternatives to achieve the desired result.
Long-standing issue @felix0097 raised a while ago: we should begin a spec "change" process to allow writing out the `indices` of a CSR matrix (or CSC, although it is less valuable there) with the max value checked and the dtype automatically set to the smallest one that fits. Dtypes are not specified anywhere in https://anndata.readthedocs.io/en/latest/fileformat-prose.html#sparse-array-specification-v0-1-0, so while this isn't a breaking change, it could potentially complicate things downstream.
In other words, the values in `indices` of a CSR matrix are regularly less than `max(uint16)` (because we often don't have more than 30000 or so genes) but are often written as `{u}int32/64`, so allowing users to write data optimized for this fact without breaking downstream pipelines is in our interest. The process for this would be:
- Add a setting to allow this behavior via `anndata.settings.downcast_indices_in_sparse = True` or similar
  - I would guess the behavior would be "take the max of the incoming `indices` optionally and then write out as the minimum needed dtype"
- Write tests within `anndata` that ensure reading this data back in doesn't break cupy/scipy sparse
- Release with this setting as `False`
- Potentially in the future set to `True`
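To make the "take the max, then write the minimum needed dtype" step concrete, here is a rough sketch using only NumPy. The helper names `minimal_index_dtype` and `downcast_indices` are hypothetical and not part of `anndata`; choosing an unsigned dtype is also an assumption, since the spec does not say which dtypes are allowed.

```python
import numpy as np


def minimal_index_dtype(indices: np.ndarray) -> np.dtype:
    """Smallest unsigned integer dtype that can hold every value in `indices`.

    Hypothetical helper for illustration only, not an anndata API.
    """
    if indices.size and indices.min() < 0:
        raise ValueError("sparse indices must be non-negative")
    # An empty array has no max, so fall back to 0 (fits in uint8).
    max_value = int(indices.max()) if indices.size else 0
    # For a non-negative Python int, NumPy returns the smallest unsigned dtype.
    return np.min_scalar_type(max_value)


def downcast_indices(indices: np.ndarray) -> np.ndarray:
    """Return `indices` cast to the smallest dtype that fits its values."""
    return indices.astype(minimal_index_dtype(indices), copy=False)


# Column indices for a matrix with roughly 30000 genes, stored as int64.
indices = np.array([0, 5, 29_999, 12], dtype=np.int64)
small = downcast_indices(indices)
print(small.dtype, indices.nbytes, small.nbytes)
```

In this toy case the array shrinks from 8 bytes per entry to 2, which is the io saving the setting is after.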
The downside/complication has traditionally been scipy sparse's handling of differing `indptr` + `indices` dtypes, but I think that is a manageable problem if we limit ourselves to just io here. I imagine the performance would still be better given the reduced io, even if the data has to be re-upcast into int32 for this compatibility issue.
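As a sketch of what the re-upcast on read could look like, the snippet below brings `indptr` and `indices` to one shared signed dtype before they are handed to a sparse constructor. The function name `harmonize_index_dtypes` is hypothetical, and the int32-else-int64 rule is an assumption about what downstream libraries expect, not something the spec states.

```python
import numpy as np


def harmonize_index_dtypes(indptr: np.ndarray, indices: np.ndarray):
    """Cast `indptr` and `indices` to one shared signed integer dtype.

    Hypothetical read-side helper: uses int32 when every value fits,
    otherwise int64, so both arrays end up with the same dtype.
    """
    int32_max = np.iinfo(np.int32).max
    # `initial=0` keeps max() well defined for empty arrays.
    largest = max(int(indptr.max(initial=0)), int(indices.max(initial=0)))
    target = np.int32 if largest <= int32_max else np.int64
    return indptr.astype(target, copy=False), indices.astype(target, copy=False)


# On disk: indptr kept as int64, indices downcast to uint16.
indptr_on_disk = np.array([0, 2, 4], dtype=np.int64)
indices_on_disk = np.array([0, 29_999, 5, 12], dtype=np.uint16)

indptr, indices = harmonize_index_dtypes(indptr_on_disk, indices_on_disk)
print(indptr.dtype, indices.dtype)
```

The cast happens in memory after the smaller arrays have been read, so the io saving is kept while downstream code sees matching dtypes.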
cc @lazappi @ivirshup @keller-mark: happy to produce some dummy data for y'all to test with. Let me know if you are aware of any downsides here or if you have comments!