Reading virtual references back out into VirtualiZarr Manifests #104

@TomNicholas

I would like to be able to read virtual references back out from an icechunk store into VirtualiZarr ManifestArray objects.

Note this issue is the icechunk equivalent of zarr-developers/VirtualiZarr#118, which is about reading kerchunk references into ManifestArray objects.

The main use case is appending new data to an existing store (see zarr-developers/VirtualiZarr#21 (comment)), so that when new data arrives (e.g. a new grib file with today's weather data), I can add an updated snapshot with just something like:

import virtualizarr as vz
import xarray as xr

# avoids re-extracting all the metadata from all the past grib files, so should be quick
existing_vds = vz.open_virtual_dataset(icechunkstore, reader='icechunk')

new_vds = vz.open_virtual_dataset('todays_weather.grib', reader='grib')

updated_vds = xr.concat([existing_vds, new_vds], dim='time')

# commit new snapshot that includes today's data
# requires https://github.com/earth-mover/icechunk/issues/103
updated_vds.virtualize.to_icechunk(icechunkstore)
icechunkstore.commit('<todays-date>')

In order to implement that Icechunk reader for virtualizarr I would need some API for getting all virtual (and non-virtual) references for a snapshot back out of the Icechunk store, ideally as a vz.ManifestArray or something I can cheaply coerce to one (see ChunkManifest.from_arrays()).
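To make the shape of this request concrete, here is a minimal sketch of the coercion step, assuming Icechunk could hand back one record per chunk. The `VirtualRef` record and the `refs_to_manifest_entries` helper are hypothetical names invented for illustration and are not part of either library; the output mirrors the `{chunk_key: {path, offset, length}}` dictionary layout that a chunk manifest is built from.

```python
from typing import Iterable, NamedTuple


class VirtualRef(NamedTuple):
    """Hypothetical record for one chunk reference read out of a snapshot."""
    index: tuple[int, ...]  # position in the chunk grid, e.g. (0, 1)
    path: str               # location of the file holding the chunk bytes
    offset: int             # byte offset of the chunk within that file
    length: int             # byte length of the chunk


def refs_to_manifest_entries(refs: Iterable[VirtualRef]) -> dict[str, dict]:
    """Build a {"0.1": {"path", "offset", "length"}} mapping from records."""
    entries: dict[str, dict] = {}
    for ref in refs:
        # Chunk keys join the grid indices with ".", e.g. (0, 1) -> "0.1".
        key = ".".join(str(i) for i in ref.index)
        if key in entries:
            raise ValueError(f"duplicate chunk key {key!r}")
        entries[key] = {
            "path": ref.path,
            "offset": ref.offset,
            "length": ref.length,
        }
    return entries


# Made-up example references; the paths and byte ranges are illustrative only.
refs = [
    VirtualRef((0, 0), "s3://bucket/day1.grib", 100, 50),
    VirtualRef((0, 1), "s3://bucket/day1.grib", 150, 50),
]
entries = refs_to_manifest_entries(refs)
```

With VirtualiZarr installed, a dictionary like `entries` should be usable as `ChunkManifest(entries=entries)`. An API that returned numpy arrays of paths, offsets, and lengths directly would likely be cheaper for large stores, which is why `ChunkManifest.from_arrays()` is the preferred target.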

Writing the updated references as a new snapshot also requires #103.

(I guess the .virtualize.to_icechunk method might also need to know to do array.resize in this example; see the Append example in this notebook.)
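As a rough illustration of the bookkeeping such an append would involve, the sketch below computes the resized array shape and shifts the chunk keys of the new data so they land after the existing chunks. Both helper names are hypothetical, and the sketch assumes the existing array length along the append axis is an exact multiple of the chunk length; it does not describe how Icechunk or VirtualiZarr actually implement this.

```python
def appended_shape(existing_shape, new_shape, axis):
    """Shape the array would need to be resized to before appending."""
    if len(existing_shape) != len(new_shape):
        raise ValueError("arrays must have the same number of dimensions")
    for i, (old, new) in enumerate(zip(existing_shape, new_shape)):
        if i != axis and old != new:
            raise ValueError(f"shapes differ along non-append axis {i}")
    out = list(existing_shape)
    out[axis] += new_shape[axis]
    return tuple(out)


def shift_chunk_keys(entries, axis, n_existing_chunks):
    """Offset chunk keys along `axis` so new chunks follow the existing ones."""
    shifted = {}
    for key, entry in entries.items():
        index = [int(part) for part in key.split(".")]
        index[axis] += n_existing_chunks
        shifted[".".join(str(i) for i in index)] = entry
    return shifted


# Made-up example: 10 existing time steps, one chunk per time step.
existing_shape, chunk_shape = (10, 4), (1, 4)
todays_shape = (1, 4)
todays_entries = {"0.0": {"path": "todays_weather.grib", "offset": 0, "length": 16}}

# Assumes existing_shape[0] is an exact multiple of chunk_shape[0].
n_existing = existing_shape[0] // chunk_shape[0]
new_shape = appended_shape(existing_shape, todays_shape, axis=0)
new_entries = shift_chunk_keys(todays_entries, axis=0, n_existing_chunks=n_existing)
```

Here `new_shape` is what would be passed to a resize call, and `new_entries` holds only the references that need writing; the existing chunks stay untouched.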

Running the snippet above as a cron job or event-driven serverless function should go a long way towards making ingestion of regularly updated data archives easier. (cc @mpiannucci)

This feature might also be useful to allow using icechunk as a serialization format during large serverless reductions (xref zarr-developers/VirtualiZarr#123).

cc @paraseba

Labels: enhancement ✨ (New feature or request), virtual references 👻 (Involves virtual kerchunk/virtualizarr chunk references)
