Skip to content

feat: sensible chunk defaults for dask #2036

@ilan-gold

Description

@ilan-gold

Please describe your wishes and possible alternatives to achieve the desired result.

#2033 at least makes it so that we are no longer creating arrays with very small chunks unnecessarily but from a performance perspective, mapping dask chunks to chunks on disk is not a good idea. Especially with the codec pipeline in zarr being parallelized, the need for dask workers to be responsible for speeding up io is lessened and thus cramming more parallel reads within a worker is probably sensible. I imagine the same is true for hdf5 even if it isn't parallel i.e., doing more/bigger reads per-dask-chunk.

See https://gist.github.com/lgarrison/ef2922fdb8ca78e8753c39758afedb56 for a related example of this issue where having few python process workers but utilizing the rust-based parallelization of zarrs improves throughput.

See https://github.com/zarrs/zarr_benchmarks/blob/9679f36ca795cce65adc603ae41147324208d3d9/scripts/zarr_dask_python_benchmark_roundtrip.py#L22 for using shards as the chunk size for dask (since shards contain chunks) as a starting point for implementing better defaults.

In general some todos:

  • Use shards when applicable for zarr
  • Create benchmarks for unsharded data in hdf5 and zarr to find the virtual vs. on-disk chunk size performance tradeoff
  • Write some sort of general-purpose function that can be reused across zarr/hdf5

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions