Downcast `indices` for CSR matrices if possible on-disk

### Please describe your wishes and possible alternatives to achieve the desired result.


This is a long-standing issue that @felix0097 raised a while ago. We should begin a spec "change" process that allows writing out the `indices` of a CSR matrix (or CSC, although it is less valuable there) with the max value checked and the dtype automatically set to the smallest one that fits. dtypes are not specified anywhere in https://anndata.readthedocs.io/en/latest/fileformat-prose.html#sparse-array-specification-v0-1-0, so while this isn't a breaking change, it could complicate things downstream.

In other words, the values in `indices` of a CSR matrix are regularly less than `max(uint16)` (because we often don't have more than 30000 or so genes) but are often written as `{u}int32/64`. Allowing users to write data optimized for this fact without breaking downstream pipelines is in our interest. The process for this would be:

1. Add a setting to allow this behavior via `anndata.settings.downcast_indices_in_sparse = True` or similar. I would guess the behavior would be "optionally take the max of the incoming `indices` and then write them out as the minimum needed dtype"
2. Write tests within `anndata` that ensure reading this data back in doesn't break cupy/scipy sparse
3. Release with this setting as `False`
4. Potentially set it to `True` in the future
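
To make step 1 concrete, here is a minimal sketch of the "take the max, then pick the minimum dtype" behavior. The helper names `minimal_index_dtype` and `downcast_indices_for_write` are hypothetical and not part of `anndata`; the sketch only uses `numpy.min_scalar_type` to choose the dtype and leaves `indptr` and `data` untouched.

```python
import numpy as np
import scipy.sparse as sp


def minimal_index_dtype(indices: np.ndarray) -> np.dtype:
    """Smallest unsigned integer dtype that can hold every value in `indices`."""
    if indices.size == 0:
        # Nothing to scan; any integer dtype works for an empty array.
        return np.dtype(np.uint8)
    # min_scalar_type returns an unsigned dtype for non-negative Python ints.
    return np.min_scalar_type(int(indices.max()))


def downcast_indices_for_write(mat: sp.csr_matrix) -> dict:
    """Return the three CSR arrays, with `indices` narrowed for on-disk storage."""
    target = minimal_index_dtype(mat.indices)
    return {
        "data": mat.data,
        "indices": mat.indices.astype(target, copy=False),
        "indptr": mat.indptr,  # hypothetical choice: leave indptr as-is
    }


# Two rows ("cells") over 30,000 columns ("genes"), built by hand.
data = np.ones(4, dtype=np.float32)
indices = np.array([0, 29_999, 5, 1_000], dtype=np.int32)
indptr = np.array([0, 2, 4], dtype=np.int32)
mat = sp.csr_matrix((data, indices, indptr), shape=(2, 30_000))

arrays = downcast_indices_for_write(mat)
print(mat.indices.dtype, "->", arrays["indices"].dtype)
```

Scanning `indices` for its max costs one pass over the array; using the matrix's column count (`mat.shape[1] - 1`) as the bound instead would avoid that pass, at the cost of sometimes choosing a wider dtype than strictly needed.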

The downside/complication has traditionally been scipy sparse's handling of differing `indptr` + `indices` dtypes, but I think that is a manageable problem if we limit ourselves to just io here. I imagine the performance would be better given the reduced io, even if the data has to be re-upcast to int32 for this compatibility issue.
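
A rough sketch of what the read side could look like if `indices` comes back from disk narrower than `indptr`. The function name `read_csr_with_upcast` is hypothetical, and the choice of int32 as the floor is an assumption based on the int32 mentioned above; scipy may already normalize index dtypes in its constructor, so the explicit cast mainly makes the intent visible.

```python
import numpy as np
import scipy.sparse as sp


def read_csr_with_upcast(data, indices, indptr, shape):
    """Rebuild a CSR matrix from arrays whose `indices` may be narrower than `indptr`."""
    # Assumption: never go below int32, and otherwise follow indptr's dtype.
    index_dtype = np.promote_types(indptr.dtype, np.int32)
    return sp.csr_matrix(
        (data, indices.astype(index_dtype), indptr.astype(index_dtype)),
        shape=shape,
    )


# Arrays as they might come off disk: uint16 indices, int32 indptr.
disk_data = np.ones(4, dtype=np.float32)
disk_indices = np.array([0, 29_999, 5, 1_000], dtype=np.uint16)
disk_indptr = np.array([0, 2, 4], dtype=np.int32)

restored = read_csr_with_upcast(disk_data, disk_indices, disk_indptr, (2, 30_000))
print(restored.indices.dtype, restored.indptr.dtype)
```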

cc @lazappi @ivirshup @keller-mark, happy to produce some dummy data for y'all to test this. Let me know if you are aware of any downsides here or if you have comments!

Downcast `indices` for CSR matrices if possible on-disk #2153
