Description
Describe the issue:
I have been observing a mild but consistent increase in memory usage when storing large arrays to Zarr. I have reproduced this both with s3fs (my real use case) and with a dummy "Dev Null" store of my own devising.
I would expect to be able to store arrays of essentially unbounded size in a streaming fashion. Instead, this memory leak means that I will eventually run out of memory.
I'm aware that the root of this issue may be hard to diagnose. The ultimate cause may lie upstream, in Zarr, where I am a maintainer. I'm very happy to work with the developers here to isolate and resolve the underlying issue. 🙏
I do not know if this issue occurs with other schedulers, as I don't know how to diagnose memory usage as conveniently as with distributed.
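(A crude way to check the threaded scheduler might be to sample the process RSS from a background thread while the store runs. Below is an untested sketch along those lines; it reuses `data` and `store` from the MCVE that follows, and assumes `psutil` is available. I have not actually run it.)

```python
# Sketch only (untested): sample process RSS from a background thread while
# computing with the threaded scheduler instead of distributed.
# `data` and `store` are the same objects defined in the MCVE below.
import threading
import time

import dask
import psutil

rss_samples = []
stop = threading.Event()

def sample_rss(interval=0.5):
    proc = psutil.Process()
    while not stop.is_set():
        rss_samples.append(proc.memory_info().rss)
        time.sleep(interval)

sampler = threading.Thread(target=sample_rss)
sampler.start()
try:
    with dask.config.set(scheduler="threads"):
        data.to_zarr(store, compressor=None)
finally:
    stop.set()
    sampler.join()

print(f"RSS grew by ~{(rss_samples[-1] - rss_samples[0]) / 2**20:.0f} MiB")
```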
Minimal Complete Verifiable Example:
```python
import dask
import dask.array as da
from dask.distributed import LocalCluster
from distributed.diagnostics import MemorySampler
import zarr


class DevNullStore(dict):
    """Dummy store in which data just vanishes."""

    def __setitem__(self, key, value):
        # only store attributes
        if key == '.zarray':
            super().__setitem__(key, value)


shape = 10_000_000
chunks = 1_000
data = da.zeros(shape, chunks=chunks)

cluster = LocalCluster(
    n_workers=8, threads_per_worker=1, host="*",
)
ms = MemorySampler()
store = zarr.storage.KVStore(DevNullStore())

with cluster.get_client() as client:
    with ms.sample("nullstore-zarr_v2-compute_true"):
        data.to_zarr(store, compressor=None)

ms.plot(align=True)
```
As you can see, memory usage increases steadily over the course of the computation. If you make the array larger, the trend continues.
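(To put a rough number on it, the sampled series can also be pulled out of the `MemorySampler` as a DataFrame rather than eyeballing the plot. Something like the following, assuming `ms` is the sampler from the MCVE above:)

```python
# Quantify the growth from the MemorySampler instead of reading it off the plot.
df = ms.to_pandas(align=True)
series = df["nullstore-zarr_v2-compute_true"].dropna()
growth_mib = (series.iloc[-1] - series.iloc[0]) / 2**20
print(f"Cluster memory grew by ~{growth_mib:.0f} MiB over the run")
```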
Anything else we need to know?:
A likely objection to this issue might be "this is an artifact of your funky `DevNullStore`". However, the same behavior can be reproduced (albeit much more slowly) with s3fs. This example requires write access to S3:
```python
import s3fs
import uuid

# replace with a bucket you can write to
s3_url = "s3://earthmover-rechunk-tmp/memory-leak-test"


def make_s3_store():
    url = f'{s3_url}/{uuid.uuid4().hex}'
    store = zarr.storage.FSStore(url=url)
    return store


with cluster.get_client() as client:
    client.restart()
    with ms.sample("s3fs-zarr_v2-compute_true"):
        data.to_zarr(make_s3_store(), compressor=None)

ms.plot(align=True)
```
As you can see, the overall magnitude of the memory leak is similar: about 200 MB. This suggests that the leak is independent of the underlying store.
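(In case it helps narrow things down, one could also check where the extra memory ends up on a per-worker basis. Below is a sketch I have not verified; it assumes `psutil` is installed in the worker environment and reuses `cluster`, `data`, and `make_s3_store` from above.)

```python
def worker_rss():
    # resident set size of the worker process, in bytes
    import psutil
    return psutil.Process().memory_info().rss

with cluster.get_client() as client:
    before = client.run(worker_rss)
    data.to_zarr(make_s3_store(), compressor=None)
    after = client.run(worker_rss)

for addr in before:
    print(addr, f"{(after[addr] - before[addr]) / 2**20:.1f} MiB")
```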
Environment:
- Dask version: 2023.7.0
- Zarr version: 2.16.1
- s3fs version: 2023.9.0
- Python version: 3.10.12
- Operating System: Linux
- Install method (conda, pip, source): conda + pip
cc @crusaderky, who helped us with an earlier iteration of this problem.