
PetaByte-scale virtual ref performance benchmark #401

@TomNicholas

Description


I tried committing a PetaByte's worth of virtual references into Icechunk from VirtualiZarr.

It took about 16 minutes for a local store, but it crashed after consuming >100GB of RAM when trying to write the same references to an S3 store. Writing 0.1PB of virtual references to an S3 store took 2 minutes and 16GB of RAM.

Notebook here - I successfully ran the test with the local store locally on my Mac, but for some reason trying to write the same data to the S3 store blew my local memory, and then also blew the memory of a 128GB Coiled notebook too 🤯

This represents about the largest dataset someone might conceivably want to create virtual references to and commit in one go, e.g. if someone wanted to generate virtual references to all of ERA5 and commit them to an icechunk store. My test here has 100 million chunks; for comparison, the [C]Worthy OAE Atlas dataset is ~5 million chunks.
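As a back-of-envelope sanity check of these numbers (the per-chunk and per-reference figures below are my own arithmetic, not measurements):

```python
# Quick arithmetic on the scale described above (all figures approximate).
PB = 10**15
total_bytes = 1 * PB            # one petabyte of virtual data
n_chunks = 100_000_000          # 100 million chunk references

chunk_size = total_bytes / n_chunks
print(f"average chunk size: {chunk_size / 10**6:.0f} MB")    # ~10 MB per chunk

in_memory = 3.2e9               # VirtualiZarr's in-memory manifest footprint
print(f"bytes per reference in memory: {in_memory / n_chunks:.0f}")  # ~32 bytes
```

So each reference costs roughly 32 bytes in memory, which makes the >100GB RAM blow-up and the 16GB on-disk manifests look like large multiples of the minimum.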

I have not yet attempted to tackle the problem of actually generating references from thousands of netCDF files and combining them onto one worker here (xref zarr-developers/VirtualiZarr#7 and zarr-developers/VirtualiZarr#123) - I just manually created fake manifests that point at imaginary data and committed those to icechunk. So this test represents only the final step of a real virtual ingestion workflow.

So far no effort has gone into optimizing this on the VirtualiZarr side - we currently have a serial loop over every reference in every manifest, calling store.set_virtual_ref millions of times, so this should be viewed as the worst case to improve upon.
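The worst-case write path looks roughly like this sketch; the `Store` class here is a stand-in for the real icechunk store (whose `set_virtual_ref` takes a similar key/location/offset/length signature), and the manifest layout is simplified:

```python
# Sketch of the current serial write path: one set_virtual_ref call per
# chunk reference, in a plain Python loop. Store is a mock, not icechunk.
class Store:
    def __init__(self):
        self.refs = {}

    def set_virtual_ref(self, key, location, offset, length):
        self.refs[key] = (location, offset, length)

def write_manifest_serially(store, array_name, manifest):
    # manifest: mapping of chunk index -> (path, offset, length)
    for chunk_key, (path, offset, length) in manifest.items():
        store.set_virtual_ref(f"{array_name}/c/{chunk_key}", path, offset, length)

store = Store()
manifest = {f"{i}": (f"s3://bucket/file_{i}.nc", 0, 10_000_000) for i in range(4)}
write_manifest_serially(store, "temperature", manifest)
print(len(store.refs))  # 4
```

With 100 million chunks, this loop means 100 million individual method calls, which is where the "worst case to improve upon" framing comes from.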

As my example has 100 10TB virtual variables, we may be able to get a big speedup in the vds.virtualizarr.to_icechunk() step immediately just by writing each variable to icechunk asynchronously within virtualizarr. However, that step only takes half the overall time; store.commit() takes the other half, and that's entirely on icechunk. When experimenting, the time taken to write and commit seemed to be proportional to the manifest size, which would make sense.
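The per-variable concurrency idea could look something like the following sketch, using one task per variable and `asyncio.gather`; the store and write helper here are mocks, not the actual VirtualiZarr/Icechunk internals:

```python
import asyncio

# Hypothetical sketch: write each virtual variable's references as its own
# task, gathered concurrently, instead of one big serial loop.
async def write_variable(store, name, manifest):
    for chunk_key, ref in manifest.items():
        store[f"{name}/c/{chunk_key}"] = ref
        await asyncio.sleep(0)  # yield so other variables' writers interleave

async def write_all_variables(store, variables):
    # One concurrent task per variable (100 tasks in the issue's example).
    await asyncio.gather(
        *(write_variable(store, name, m) for name, m in variables.items())
    )

store = {}
variables = {
    f"var{v}": {str(i): ("s3://b/f.nc", 0, 1) for i in range(3)} for v in range(5)
}
asyncio.run(write_all_variables(store, variables))
print(len(store))  # 15
```

Note this only parallelizes the write step; it does nothing for the commit half of the time.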

The RAM usage when writing to the S3 store is odd - watching the indicator in the Coiled notebook, it used 81.5GB of RAM to write the references into the store, then died trying to commit them 😕 I don't know why (a) the RAM usage would be any larger than when writing to the local store or (b) it would ever need to get much larger than 3.2GB, which is what it takes virtualizarr to store the references in memory.

Note also that the manifests in the local store take up 16GB on disk, which also seems large. (I could presumably have written out the references by converting the numpy arrays inside the ChunkManifest objects to .npz format and used <=3.2GB on disk.)
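To illustrate the .npz idea: a manifest's offsets and lengths are plain integer numpy arrays, so they round-trip compactly through `np.savez_compressed`. The array names below are illustrative, not VirtualiZarr's actual internal layout:

```python
import numpy as np

# Sketch of storing manifest arrays as a compressed .npz archive.
# Field names (paths/offsets/lengths) are illustrative only.
n = 1_000
offsets = np.arange(n, dtype=np.uint64) * 10_000_000
lengths = np.full(n, 10_000_000, dtype=np.uint64)
paths = np.array([f"s3://bucket/file_{i % 10}.nc" for i in range(n)])

np.savez_compressed("manifest.npz", paths=paths, offsets=offsets, lengths=lengths)

# Round-trip: the arrays come back intact.
with np.load("manifest.npz") as m:
    assert (m["offsets"] == offsets).all()
    assert (m["lengths"] == lengths).all()
```

The regular structure (repeated paths, monotonically increasing offsets, constant lengths) is exactly what compresses well, which is why the 16GB on-disk figure seems high relative to the ~3.2GB in-memory footprint.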

xref zarr-developers/VirtualiZarr#104
