Deliberate parameterization of the output zarr geometry should be possible when X is sparse.
On huge sparse arrays, the chunks parameter appears to be ignored and the write falls back to auto-sharding instead. The cause seems to be this block in io.write_zarr:
def callback(
    write_func, store, elem_name: str, elem, *, dataset_kwargs, iospec
) -> None:
    if (
        chunks is not None
        and not isinstance(elem, sparse.spmatrix)
        and elem_name.lstrip("/") == "X"
    ):
        dataset_kwargs = dict(dataset_kwargs, chunks=chunks)
    write_func(store, elem_name, elem, dataset_kwargs=dataset_kwargs)
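To make the effect of that condition concrete, here is a minimal stand-alone sketch of the same gating logic. `SparseStandIn`, `DenseStandIn` and `resolve_dataset_kwargs` are hypothetical names I made up for illustration; they are not part of anndata, and the stand-in class only plays the role of `sparse.spmatrix`.

```python
class SparseStandIn:
    """Hypothetical placeholder playing the role of sparse.spmatrix."""


class DenseStandIn:
    """Hypothetical placeholder playing the role of a dense array."""


def resolve_dataset_kwargs(elem_name, elem, *, chunks, dataset_kwargs):
    # Same three-part condition as in the excerpt above: chunks is only
    # forwarded when it is set, the element is not sparse, and it is X.
    if (
        chunks is not None
        and not isinstance(elem, SparseStandIn)
        and elem_name.lstrip("/") == "X"
    ):
        return dict(dataset_kwargs, chunks=chunks)
    return dataset_kwargs


user_chunks = (1000, 1000)
dense_kwargs = resolve_dataset_kwargs(
    "/X", DenseStandIn(), chunks=user_chunks, dataset_kwargs={}
)
sparse_kwargs = resolve_dataset_kwargs(
    "/X", SparseStandIn(), chunks=user_chunks, dataset_kwargs={}
)
print(dense_kwargs)   # chunks is forwarded for a dense X
print(sparse_kwargs)  # chunks is silently dropped for a sparse X
```

So for a sparse X the user-supplied chunks never reaches the writer, and no warning is raised.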
The issue is that, empirically, auto-sharding often produces tens of millions of inodes when huge sparse arrays are written.
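A rough back-of-the-envelope sketch of why the file count explodes. It assumes the sparse matrix is stored as a group of three 1-D arrays (data, indices, indptr), that each chunk (or each shard, when sharding is on) becomes one file in a directory store, and that all three arrays share one chunk length. The function name, the matrix size and the chunk lengths are all hypothetical illustration values, not anndata's actual defaults, and metadata files are ignored.

```python
import math


def estimate_chunk_files(nnz, n_rows, chunk_len, chunks_per_shard=1):
    """Estimate the number of chunk/shard files for a CSR-style group."""

    def files(n_elems):
        n_chunks = math.ceil(n_elems / chunk_len)
        # With sharding, several chunks are packed into one file.
        return math.ceil(n_chunks / chunks_per_shard)

    return {
        "data": files(nnz),
        "indices": files(nnz),
        "indptr": files(n_rows + 1),
    }


# Hypothetical matrix: 10 million rows, 50 billion stored values.
small_chunks = estimate_chunk_files(
    nnz=50_000_000_000, n_rows=10_000_000, chunk_len=10_000
)
large_chunks = estimate_chunk_files(
    nnz=50_000_000_000, n_rows=10_000_000, chunk_len=10_000_000
)
print(sum(small_chunks.values()))  # 10001001 files
print(sum(large_chunks.values()))  # 10002 files
```

Under these assumptions, a 1000x larger chunk length cuts the file count by roughly three orders of magnitude, which is exactly the knob that is currently out of reach for sparse X.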
It appears that even when auto-sharding is turned off, we still can't set the chunk geometry explicitly for sparse stores (let alone the shard factor).
What is the reason for deliberately dropping the public-facing chunks argument?
Auto-sharding is great, but it should not be the only way to write sparse arrays. Users may need artifacts with fewer chunks because of real-world constraints such as HPC inode quotas or write speed.
Right now the only workaround seems to be letting anndata write whatever store geometry it guesses, then re-exporting the result manually with zarr-python API calls, which effectively writes the entire artifact twice.