feat: sensible chunk defaults for dask

### Please describe your wishes and possible alternatives to achieve the desired result.


#2033 at least makes it so that we are no longer creating arrays with very small chunks unnecessarily but from a performance perspective, mapping dask chunks to chunks on disk is not a good idea.  Especially with the codec pipeline in zarr being parallelized, the need for dask workers to be responsible for speeding up io is lessened and thus cramming more parallel reads within a worker is probably sensible.  I imagine the same is true for hdf5 even if it isn't parallel i.e., doing more/bigger reads per-dask-chunk.

See https://gist.github.com/lgarrison/ef2922fdb8ca78e8753c39758afedb56 for a related example of this issue where having few python process workers but utilizing the rust-based parallelization of `zarrs` improves throughput.

See https://github.com/zarrs/zarr_benchmarks/blob/9679f36ca795cce65adc603ae41147324208d3d9/scripts/zarr_dask_python_benchmark_roundtrip.py#L22 for using `shards` as the chunk size for dask (since shards contain chunks) as a starting point for implementing better defaults.

In general some todos:

- [ ] Use shards when applicable for zarr
- [ ] Create benchmarks for unsharded data in hdf5 and zarr to find the virtual vs. on-disk chunk size performance tradeoff
- [ ] Write some sort of general-purpose function that can be reused across zarr/hdf5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: sensible chunk defaults for dask #2036

Please describe your wishes and possible alternatives to achieve the desired result.

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

feat: sensible chunk defaults for dask #2036

Description

Please describe your wishes and possible alternatives to achieve the desired result.

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions