Description
The reason I made this package is to handle one particularly challenging use case - the [C]Worthy mCDR Atlas - which I still haven't managed to do. Once it's done I plan to write a blog post about it, and maybe add it as a usage example to this repository.
This dataset has some characteristics that make it really challenging to kerchunk/virtualize[^1]:
- It's ~50TB compressed on-disk,
- It has ~500,000 netCDF files(!), each with about 40 variables,
- The largest variables are 3-dimensional, and require concatenation along an additional 3 dimensions, so the resulting variables are 6-dimensional,
- It requires merging in lower-dimensional variables too, not just concatenation,
- It has time encoding on some coordinates.
This dataset is therefore comparable to some of the largest datasets already available in Zarr (at least in terms of the number of chunks and variables, if not on-disk size), and is very similar to the pathological case described in #104:

> 24MB per array means that even a really big store with 100 variables, each with a million chunks, still only takes up 2.4GB in memory - i.e. your xarray "virtual" dataset would be ~2.4GB to represent the entire store.
If we can virtualize this we should be able to virtualize most things 💪
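As a back-of-the-envelope check of that memory estimate, here's a small sketch of the arithmetic, assuming the 3-numpy-arrays-per-manifest representation from #107 and an illustrative ~24 bytes of metadata per chunk (the exact per-chunk cost depends on how paths are stored):

```python
# Rough memory estimate for an in-memory chunk manifest (illustrative numbers only).
n_variables = 100
n_chunks_per_variable = 1_000_000

# Assumed per-chunk cost: uint64 byte offset + uint64 byte length
# + an 8-byte index/pointer into a table of file paths.
bytes_per_chunk = 8 + 8 + 8

per_array_bytes = n_chunks_per_variable * bytes_per_chunk  # ~24 MB per array
total_bytes = n_variables * per_array_bytes                # ~2.4 GB for the whole store

print(f"{per_array_bytes / 1e6:.0f} MB per array, {total_bytes / 1e9:.1f} GB total")
```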
Getting this done requires many features to be implemented (a rough end-to-end sketch follows this list):
- Everything in the MVP of this package (Initial release checklist #2),
- Good performance (Performance roadmap #104), including:
  - During reference generation (so might need Non-kerchunk backend for HDF5/netcdf4 files. #87),
  - For the in-memory representation (Use 3 numpy arrays for manifest internally #107),
  - Writing out to kerchunk parquet format (Write to parquet #110),
  - Writing out to icechunk (PetaByte-scale virtual ref performance benchmark earth-mover/icechunk#401),
- Effective parallelization of byte range generation across the many files (Parallelization via dask #7), possibly using Serverless parallelization of reference generation #123,
- Handling of variables with time encoding (Option to interpret variables using cftime #117 / Decoding `cftime_variables` #122),
- Not dropping non-dimension coordinates (Identify non dimension coords #156),
- "Inlining" of some variables for performance when reading (else all the 3 concatenation coordinates will have chunks of length 1) (Inline loaded variables into kerchunk references #73)
- Optionally using `combine_by_coords` to handle the 3-dimensional concatenation, which would require Inferring concatenation order from coordinate data values #18,
- Possibly sub-selection into uncompressed auxiliary data (which has a longer time dimension that I only need part of), which requires:
  - Choosing arbitrary chunks for some uncompressed data (Arbitrary chunking of uncompressed files (e.g. netCDF3) #86),
  - Indexing aligned with chunks (Support indexing by slicing along chunk boundaries? #51),
- A way to get the reference files onto S3, either via:
  - Write zarr stores with manifest.json files to non-local storage #46,
  - or generating on HPC, changing the paths to the corresponding S3 URLs using Rewrite paths in a manifest #130, and moving the altered reference files to the cloud manually,
  - or just using an icechunk S3 store (e.g. as done in PetaByte-scale virtual ref performance benchmark earth-mover/icechunk#401).
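To make the above concrete, here's a rough, hedged sketch of what the end-to-end workflow could look like once those pieces exist. It assumes VirtualiZarr's `open_virtual_dataset` and `.virtualize.to_kerchunk()` together with xarray's `combine_nested`; the variable names, dimension names, file paths, and keyword arguments are placeholders for illustration, not a working recipe for the Atlas:

```python
# Hypothetical end-to-end sketch (not a working recipe): generate references for
# each netCDF file, combine along the 3 extra concatenation dimensions,
# then write the virtual dataset out as kerchunk references.
import xarray as xr
from virtualizarr import open_virtual_dataset

def virtualize_one(path: str) -> xr.Dataset:
    # Loading the small coordinate variables into memory ("inlining") avoids
    # thousands of length-1 chunks along the new concatenation dimensions.
    return open_virtual_dataset(path, loadable_variables=["time"])

# `paths` would be a triply-nested list mirroring the 3 concatenation dimensions;
# in practice each call would be dispatched in parallel via dask or a serverless
# framework (#7 / #123) rather than a serial loop.
paths = [[["run_000/file_000.nc"]]]  # placeholder structure
virtual_datasets = [
    [[virtualize_one(p) for p in inner] for inner in outer] for outer in paths
]

combined = xr.combine_nested(
    virtual_datasets,
    concat_dim=["scenario", "member", "init_time"],  # placeholder dimension names
    coords="minimal",
    compat="override",
)

# Write the combined references out, e.g. as a kerchunk JSON sidecar file.
combined.virtualize.to_kerchunk("combined_refs.json", format="json")
```

Whether the final write target is kerchunk JSON, kerchunk parquet, or an icechunk store is exactly the open question in the list above.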
Additionally, once zarr-python actually understands some kind of chunk manifest, I also want to go back and create an actual zarr store for this dataset. That will require:
- The zarr-python storage transformer + chunk manifest stuff to actually be implemented (Manifest storage transformer zarr-specs#287), EDIT: We can just use Icechunk now
- Ideally not having to re-do the reference generation, instead using Open on-disk kerchunk references as a virtual dataset #118 before calling `.virtualize.to_zarr()` (rough sketch below),
- The ability to write out selected variables as normal compressed zarr arrays on-disk ("Inlining" data when writing references to disk #62 (comment)).
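And a hedged sketch of that re-use path: it assumes #118 ends up allowing existing kerchunk reference files to be opened back up as a virtual dataset (the `filetype="kerchunk"` argument is a guess at the eventual API), and that `.virtualize.to_zarr()` exists as named above; neither is implemented yet.

```python
# Hypothetical sketch of re-using previously generated references instead of
# re-scanning ~500,000 netCDF files (assumes #118 and a working .virtualize.to_zarr()).
from virtualizarr import open_virtual_dataset

# Open the already-generated kerchunk references as a virtual dataset
# (assumed API: treating the reference file itself as a virtualizable input).
vds = open_virtual_dataset("combined_refs.json", filetype="kerchunk")

# Write an actual zarr store whose chunk keys point back at the original
# netCDF bytes via a chunk manifest / storage transformer.
vds.virtualize.to_zarr("atlas.zarr")
```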
[^1]: In fact, pretty much the only ways in which this dataset could be worse are if it had differences in encoding between netCDF files, variable-length chunks, or netCDF groups, but thankfully it has none of those 😅