Downcast indices for CSR matrices if possible on-disk #2153

@ilan-gold

Please describe your wishes and possible alternatives to achieve the desired result.

This is a long-standing issue that @felix0097 raised a while ago. We should begin a spec "change" process to allow writing out the indices of a CSR matrix (or CSC, although it is less valuable there) with the max value checked and the dtype automatically set to the smallest one that fits. Dtypes are not specified anywhere in https://anndata.readthedocs.io/en/latest/fileformat-prose.html#sparse-array-specification-v0-1-0, so while this isn't a breaking change, it could potentially complicate things downstream.

In other words, the values in the indices of a CSR matrix are regularly less than max(uint16) (because we often don't have more than 30000 or so genes) but are often written as {u}int32/64. Allowing users to write data optimized for this fact without breaking downstream pipelines is in our interest. The process for this would be:

  1. Add a setting to allow this behavior via anndata.settings.downcast_indices_in_sparse = True or similar. I would guess the behavior would be "optionally take the max of the incoming indices and then write them out as the minimum needed dtype"
  2. Write tests within anndata that ensure reading this data back in doesn't break cupy/scipy sparse
  3. Release with this setting as False
  4. Potentially in the future set to True
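
As a rough illustration of the behavior described in step 1, here is a minimal sketch using only NumPy. The helper names `minimal_index_dtype` and `downcast_indices` are hypothetical and not part of anndata, and the choice of unsigned dtypes is an assumption based on the max(uint16) comparison above.

```python
import numpy as np


def minimal_index_dtype(max_value: int) -> np.dtype:
    """Smallest unsigned integer dtype able to hold max_value (hypothetical helper)."""
    for candidate in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_value <= np.iinfo(candidate).max:
            return np.dtype(candidate)
    raise ValueError(f"{max_value} does not fit in any unsigned integer dtype")


def downcast_indices(indices: np.ndarray) -> np.ndarray:
    """Cast an index array to the smallest unsigned dtype that holds its max."""
    if indices.size == 0:
        # Nothing to inspect, so the smallest dtype is always safe.
        return indices.astype(np.uint8)
    if int(indices.min()) < 0:
        raise ValueError("indices must be non-negative")
    return indices.astype(minimal_index_dtype(int(indices.max())))


# Column indices for a matrix with 30000 genes, stored as int64.
indices = np.array([0, 5, 29999], dtype=np.int64)
downcast = downcast_indices(indices)
bytes_before = indices.nbytes
bytes_after = downcast.nbytes
```

Here the indices fit in uint16, so the uncompressed array shrinks from 8 bytes to 2 bytes per stored value. Any on-disk compression would change the actual savings.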

The downside/complication has traditionally been scipy sparse's handling of differing indptr and indices dtypes, but I think that is a manageable problem if we limit ourselves to just io here. I imagine the performance would be better given the reduced io, even if the data has to be re-upcast into int32 for this compatibility issue.
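
One way the re-upcast on read could look, as a sketch only: the function `upcast_index_arrays` is a hypothetical name, and this is not how anndata currently reads sparse data. It assumes the goal is for indices and indptr to share a single signed dtype (int32 when everything fits, int64 otherwise) before they are handed to scipy.

```python
import numpy as np


def upcast_index_arrays(indices: np.ndarray, indptr: np.ndarray):
    """Return (indices, indptr) cast to one shared signed dtype (hypothetical helper)."""
    int32_max = np.iinfo(np.int32).max
    max_index = int(indices.max()) if indices.size else 0
    max_ptr = int(indptr.max()) if indptr.size else 0
    # Use int32 only if both arrays fit; otherwise fall back to int64.
    target = np.int32 if max(max_index, max_ptr) <= int32_max else np.int64
    return indices.astype(target), indptr.astype(target)


# Arrays as they might come off disk: downcast indices, wider indptr.
indices_on_disk = np.array([0, 2, 29999, 1], dtype=np.uint16)
indptr_on_disk = np.array([0, 2, 4], dtype=np.int64)

indices_mem, indptr_mem = upcast_index_arrays(indices_on_disk, indptr_on_disk)
```

The resulting arrays could then be passed to `scipy.sparse.csr_matrix((data, indices_mem, indptr_mem), shape=...)` with matching dtypes.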

cc @lazappi @ivirshup @keller-mark, happy to produce some dummy data for y'all to test this. Let me know if you are aware of any downsides here or if you have comments!
