Generalizing the ChunkManifest class to hold native chunks as well as virtual refs would unlock several features.
It's needed for:
- Writing an IcechunkParser (Reading virtual references back out into VirtualiZarr Manifests earth-mover/icechunk#104)
- Concatenating virtual and non-virtual data in memory (Icechunk already supports doing this on-disk via append_dim)
- ManifestStore.to_icechunk()/kerchunk(), and thereby making xarray an optional dependency (Make xarray an optional dependency? #521)
The problem is that the current implementation of ChunkManifest uses a clever trick: it's just 3 numpy arrays in a trenchcoat. This gives us loads of stuff for free:
- Efficient contiguous in-memory representation
- Efficient handling of variable-length strings (via the NumPy 2 StringDType)
- Efficient functions for iterating over every element
- Efficient multi-dimensional concat/stack functions for merging chunk manifests
- No unusual or non-python dependencies
But I don't know how to keep that design and also store non-virtual chunks in arbitrary locations within those arrays.
Some alternatives:
- Numpy object array (almost certainly very inefficient)
- Keep the 3 numpy arrays, but have an auxiliary buffer in which to store non-virtual chunks in memory (proposed on another issue by @maxrjones)
- Write our own in-memory chunk manifest class in rust