make write_zarr respect the public chunks parameter when the input is sparse #2415

@mkarikom

Deliberate parameterization of the output zarr geometry should be possible when X is sparse.

On huge sparse arrays, the chunks parameter appears to have no effect. Based on this block in io.write_zarr, chunks is only forwarded when X is not a sparse matrix; for sparse input it is silently ignored and the write falls back to auto-sharding:

    def callback(
        write_func, store, elem_name: str, elem, *, dataset_kwargs, iospec
    ) -> None:
        if (
            chunks is not None
            and not isinstance(elem, sparse.spmatrix)
            and elem_name.lstrip("/") == "X"
        ):
            dataset_kwargs = dict(dataset_kwargs, chunks=chunks)
        write_func(store, elem_name, elem, dataset_kwargs=dataset_kwargs)

The problem is that, empirically, auto-sharding often produces tens of millions of inodes when huge sparse arrays are written.
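
For a rough sense of scale, here is a back-of-the-envelope sketch of how the file count grows with chunk length. It assumes the sparse matrix is stored CSR-style as three 1-D arrays (data, indices, indptr) and that each chunk, or each shard when sharding is on, becomes one file. The function name, the matrix size, and the chunk lengths are all hypothetical and are not anndata's defaults; metadata files are ignored.

```python
import math


def estimate_chunk_files(nnz, n_rows, chunk_len, chunks_per_shard=1):
    """Rough number of chunk/shard files for a CSR matrix on disk.

    Assumes three 1-D arrays (data, indices, indptr), one file per chunk,
    or one file per shard when chunks_per_shard > 1.
    """

    def n_files(length):
        n_chunks = math.ceil(length / chunk_len)
        return math.ceil(n_chunks / chunks_per_shard)

    # data and indices each hold nnz entries; indptr holds n_rows + 1 entries.
    return 2 * n_files(nnz) + n_files(n_rows + 1)


# Hypothetical matrix: 50 billion nonzeros across 10 million rows.
nnz, n_rows = 50_000_000_000, 10_000_000

small_chunks = estimate_chunk_files(nnz, n_rows, chunk_len=10_000)
large_chunks = estimate_chunk_files(nnz, n_rows, chunk_len=10_000_000)

print(f"chunk_len=10_000:     {small_chunks:,} files")
print(f"chunk_len=10_000_000: {large_chunks:,} files")
```

Under these assumptions the file count scales inversely with the chunk length, which is why being able to set it explicitly matters on filesystems with inode quotas.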

It appears that even when auto-sharding is turned off, we still can't set the chunk geometry explicitly for sparse stores (let alone the shard factor).

What is the reason for deliberately dropping the public chunks argument for sparse input?

Auto-sharding is great, but it should not be the only way to write sparse arrays. Users may want artifacts with fewer chunks because of any number of real-world constraints, such as HPC inode quotas or write speed.
Right now the only workaround seems to be to let anndata write whatever store geometry it guesses, and then re-export the arrays manually with zarr-python API calls, effectively writing the entire artifact twice.
