Anndata and zarr #2145

@Amitbergman

Question

Hey all,

I have a few AnnData datasets with sparse CSR X matrices (each with ~10M cells and 40K genes, and a sparsity of about 5%).

I want to be able to quickly load whole rows from these datasets (for example, given a query, load all rows that match a condition on the obs table).
Currently I convert each AnnData object to TileDB, but I recently came across the Zarr file format, and specifically the Zarr v3 support in anndata.
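
To make the access pattern concrete, here is a minimal pure-Python sketch of what I mean by loading whole rows based on an obs condition. The toy matrix, the `obs_cell_type` column, and the `select_rows_csr` helper are all made up for illustration; the real data would go through anndata rather than plain lists.

```python
def select_rows_csr(indptr, indices, data, row_ids):
    """Gather the given rows of a CSR matrix into a new CSR triple."""
    out_indptr = [0]
    out_indices = []
    out_data = []
    for r in row_ids:
        # Each row is one contiguous slice of `indices` and `data`.
        start, end = indptr[r], indptr[r + 1]
        out_indices.extend(indices[start:end])
        out_data.extend(data[start:end])
        out_indptr.append(len(out_data))
    return out_indptr, out_indices, out_data


# Toy 4 cells x 5 genes matrix (hypothetical values):
#   row 0: [0, 1, 0, 0, 2]
#   row 1: [0, 0, 0, 0, 0]
#   row 2: [3, 0, 0, 4, 0]
#   row 3: [0, 0, 5, 0, 0]
indptr = [0, 2, 2, 4, 5]
indices = [1, 4, 0, 3, 2]
data = [1.0, 2.0, 3.0, 4.0, 5.0]

# Hypothetical obs column used as the query condition.
obs_cell_type = ["T", "B", "T", "NK"]
row_ids = [i for i, ct in enumerate(obs_cell_type) if ct == "T"]

sub_indptr, sub_indices, sub_data = select_rows_csr(indptr, indices, data, row_ids)
```

Each selected row maps to one contiguous range of `data` and `indices`, so a query touching scattered rows turns into many small range reads. That is why I suspect the chunk size matters a lot for this workload.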

I have a few questions regarding Zarr:

  1. Would Zarr v3 be a good fit for our use case? Should I expect an improvement over TileDB?
  2. Are there guidelines on which codec and chunk sizes to use?
  3. Are there guidelines on how to benefit from concurrency? I see Dask used together with Zarr in many places.

Thanks!
