Reading virtual references back out into VirtualiZarr Manifests

I would like to be able to read virtual references back out from an icechunk store into VirtualiZarr `ManifestArray` objects.

Note this issue is the icechunk equivalent of https://github.com/zarr-developers/VirtualiZarr/issues/118, which is about reading kerchunk references into `ManifestArray` objects.

The main use case is appending new data to an existing store (see https://github.com/zarr-developers/VirtualiZarr/issues/21#issuecomment-2114001053), so that when some new data arrives (e.g. a new grib file with today's weather data), I can add an updated snapshot just with something like:

```python
import virtualizarr as vz
import xarray as xr

# avoids re-extracting all the metadata from all the past grib files, so should be quick
existing_vds = vz.open_virtual_dataset(icechunkstore, reader='icechunk')

new_vds = vz.open_virtual_dataset('todays_weather.grib', reader='grib')

updated_vds = xr.concat([existing_vds, new_vds], dim='time')

# commit new snapshot that includes today's data
# requires https://github.com/earth-mover/icechunk/issues/103
updated_vds.virtualize.to_icechunk(icechunkstore)
icechunkstore.commit('<todays-date>')
```

In order to implement that Icechunk reader for virtualizarr I would need some API for getting all virtual (and non-virtual) references for a snapshot back out of the Icechunk store, ideally as a `vz.ManifestArray` or something I can cheaply coerce to one (see [`ChunkManifest.from_arrays()`](https://github.com/zarr-developers/VirtualiZarr/blob/47a5e8702e44f71bb355bcba0ff6214fe6d09d83/virtualizarr/manifests/manifest.py#L142)).
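To make the request a bit more concrete, here is a rough sketch of the kind of coercion I have in mind. `VirtualRef` and `refs_to_manifest_entries` are hypothetical names (neither exists in Icechunk or VirtualiZarr), and the record fields are my assumption about what a per-chunk reference would need to carry. The output is a plain dict keyed by dot-separated chunk index, which as far as I understand is the layout `ChunkManifest` accepts for its entries.

```python
from typing import NamedTuple


class VirtualRef(NamedTuple):
    """Hypothetical record for one chunk reference read out of a snapshot."""

    index: tuple[int, ...]  # position of the chunk in the chunk grid
    path: str  # location of the file that holds the chunk bytes
    offset: int  # byte offset of the chunk within that file
    length: int  # number of bytes in the chunk


def refs_to_manifest_entries(refs: list[VirtualRef]) -> dict[str, dict]:
    """Build a {"0.0": {"path", "offset", "length"}} mapping from references."""
    entries = {}
    for ref in refs:
        # Chunk keys are the grid indices joined by dots, e.g. (1, 0) -> "1.0"
        key = ".".join(str(i) for i in ref.index)
        entries[key] = {"path": ref.path, "offset": ref.offset, "length": ref.length}
    return entries


# Made-up references standing in for whatever the store would return
refs = [
    VirtualRef((0, 0), "s3://bucket/day1.grib", 100, 50),
    VirtualRef((1, 0), "s3://bucket/day2.grib", 100, 50),
]
entries = refs_to_manifest_entries(refs)
```

Anything that hands back the path, offset, and length for every chunk index of an array would be enough for me to do this step on the VirtualiZarr side.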

Writing the updated references as a new snapshot also requires https://github.com/earth-mover/icechunk/issues/103.

(I guess the `.virtualize.to_icechunk` method might also need to know to do `array.resize` in this example; see the Append example in [this notebook](https://github.com/earth-mover/icechunk/blob/main/icechunk-python/notebooks/demo-dummy-data.ipynb).)
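As a sketch of what the append amounts to at the manifest level (plain Python, not an existing API; `append_entries` and `appended_shape` are hypothetical helpers, and the dict layout is an assumption): the new chunk keys get shifted along the append axis, and the array shape grows along that axis, which is the shape a resize would need to be given.

```python
def append_entries(existing: dict, new: dict, axis: int, n_existing_chunks: int) -> dict:
    """Shift the chunk keys of `new` so they follow `existing` along `axis`."""
    combined = dict(existing)
    for key, entry in new.items():
        index = [int(i) for i in key.split(".")]
        index[axis] += n_existing_chunks  # move past the chunks already stored
        combined[".".join(str(i) for i in index)] = entry
    return combined


def appended_shape(shape: tuple, new_shape: tuple, axis: int) -> tuple:
    """Array shape after appending `new_shape` to `shape` along `axis`."""
    grown = list(shape)
    grown[axis] += new_shape[axis]
    return tuple(grown)


# Two days already in the store, one chunk per day along the time axis (axis 0)
existing_entries = {
    "0.0": {"path": "day1.grib", "offset": 0, "length": 10},
    "1.0": {"path": "day2.grib", "offset": 0, "length": 10},
}
# Today's file, whose single chunk is numbered from zero on its own
todays_entries = {"0.0": {"path": "day3.grib", "offset": 0, "length": 10}}

updated_entries = append_entries(existing_entries, todays_entries, axis=0, n_existing_chunks=2)
updated_shape = appended_shape((2, 100), (1, 100), axis=0)
```

This assumes the existing data ends exactly on a chunk boundary along the append axis; a partial final chunk would need more care.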

Running the snippet above as a cron job or event-driven serverless function should go a long way towards making ingestion of regularly updated data archives easier. (cc @mpiannucci)

This feature might also be useful to allow using icechunk as a serialization format during large serverless reductions (xref https://github.com/zarr-developers/VirtualiZarr/issues/123).

cc @paraseba

Reading virtual references back out into VirtualiZarr Manifests #104
