Generalize ChunkManifest to hold native chunks as well as virtual refs #851

@TomNicholas

Generalizing the ChunkManifest class to hold native chunks as well as virtual refs would unlock several features.

It's needed for:

  1. Writing an IcechunkParser (Reading virtual references back out into VirtualiZarr Manifests earth-mover/icechunk#104)
  2. Concatenating virtual and non-virtual data in memory (Icechunk already supports doing this on-disk via append_dim)
  3. ManifestStore.to_icechunk()/kerchunk(), and thereby making xarray an optional dependency (Make xarray an optional dependency? #521)

The problem is that the current implementation of ChunkManifest uses a clever trick: it's just 3 numpy arrays in a trenchcoat. This gives us loads of stuff for free:

  • Efficient contiguous in-memory representation
  • Efficient handling of variable-length strings (via the numpy 2 dtype)
  • Efficient functions for iterating over every element
  • Efficient multi-dimensional concat/stack functions for merging chunk manifests
  • No unusual or non-python dependencies
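For anyone less familiar with the current design, here is a minimal sketch of the "3 numpy arrays" idea. The array names (`paths`, `offsets`, `lengths`) and the `concat_manifests` helper are illustrative assumptions, not the actual ChunkManifest internals, and the `object` fallback is only there so the sketch still runs on numpy 1.x.

```python
import numpy as np

# numpy >= 2.0 ships a variable-length string dtype; the object fallback
# is only so this sketch also runs on older numpy.
try:
    STR = np.dtypes.StringDType()
except AttributeError:
    STR = object

# One entry per chunk in a 1x2 chunk grid: where the bytes live,
# where they start, and how long they are.
paths = np.array([["s3://bucket/a.nc", "s3://bucket/a.nc"]], dtype=STR)
offsets = np.array([[0, 100]], dtype=np.uint64)
lengths = np.array([[100, 100]], dtype=np.uint64)


def concat_manifests(a, b, axis):
    """Hypothetical helper: merge two (paths, offsets, lengths) triples
    by concatenating each array along the same chunk-grid axis."""
    return tuple(np.concatenate([x, y], axis=axis) for x, y in zip(a, b))


first = (paths, offsets, lengths)
second = (
    np.array([["s3://bucket/b.nc", "s3://bucket/b.nc"]], dtype=STR),
    offsets.copy(),
    lengths.copy(),
)

merged = concat_manifests(first, second, axis=0)
print(merged[0].shape)  # (2, 2)
```

Because every field is a plain ndarray of the same shape, concat/stack reduces to the corresponding numpy call on each array, which is what makes the design so cheap to maintain.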

But I don't know how to keep that design and also store non-virtual chunks in arbitrary locations within those arrays.

Some alternatives:

  • Numpy object array (almost certainly very inefficient)
  • Keep the 3 numpy arrays, but have an auxiliary buffer in which to store non-virtual chunks in memory (proposed on another issue by @maxrjones)
  • Write our own in-memory chunk manifest class in rust
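To make the second alternative a bit more concrete, here is a rough sketch of what an auxiliary buffer could look like. Everything here is hypothetical: the `HybridManifest` name, the dict keyed by chunk-grid index, and the choice to blank out the path for native chunks are assumptions for illustration, not a proposed API.

```python
from dataclasses import dataclass, field

import numpy as np

# numpy >= 2.0 variable-length strings; object fallback only so the
# sketch runs on older numpy.
try:
    STR = np.dtypes.StringDType()
except AttributeError:
    STR = object


@dataclass
class HybridManifest:
    """Hypothetical: three arrays for virtual refs, plus a side buffer
    holding the bytes of native (non-virtual) chunks."""

    paths: np.ndarray
    offsets: np.ndarray
    lengths: np.ndarray
    # Assumption: native chunks are keyed by their chunk-grid index.
    native: dict = field(default_factory=dict)

    def set_native(self, index, data: bytes) -> None:
        # Store the bytes in the side buffer and blank the virtual ref.
        self.native[index] = data
        self.paths[index] = ""
        self.offsets[index] = 0
        self.lengths[index] = len(data)

    def get(self, index):
        # Native chunks come back as bytes, virtual ones as a
        # (path, offset, length) reference.
        if index in self.native:
            return self.native[index]
        return (
            str(self.paths[index]),
            int(self.offsets[index]),
            int(self.lengths[index]),
        )


manifest = HybridManifest(
    paths=np.array([["s3://bucket/a.nc", "s3://bucket/a.nc"]], dtype=STR),
    offsets=np.array([[0, 100]], dtype=np.uint64),
    lengths=np.array([[100, 100]], dtype=np.uint64),
)
manifest.set_native((0, 1), b"abc")
```

One cost of this layout is that concat/stack is no longer just a numpy call: the keys in the side buffer would also have to be shifted to their new chunk-grid positions.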
