Replies: 1 comment
-
Update: PR #1272 got merged and will be released tomorrow. It addresses the first part of this workflow, and now I think we can partially implement the second part. For now, this method can take the URL of the virtual store and a collection record so it knows how to configure the auth credentials for each driver.
graph LR
%% Input Section
subgraph Parsers ["Input Parsers (Source Formats)"]
direction TB
A[GRIB]
B[COGs]
C[HDF5 / NetCDF4]
D[DMRPP]
E[Kerchunk JSON / Parquet]
F[Icechunk]
end
%% Central Logic
G{{"VirtualiZarr Engine<br/>(ManifestArray)"}}
%% Output Section
subgraph Manifests ["Serialized Formats (Target Manifests)"]
direction TB
H[Kerchunk JSON]
I[Kerchunk Parquet]
J[Icechunk]
end
%% Connections
A --> G
B --> G
C --> G
D --> G
E --> G
F --> G
G --> H
G --> I
G --> J
%% Styling with standardized, easily readable colors
%% Replacing Magenta with a standard process Orange (#FF9F43)
style G fill:#FF9F43,stroke:#333,stroke-width:2px,color:white
%% Keeping the readable light blue storage classification
classDef storage fill:#e1f5fe,stroke:#01579b,stroke-width:1px,color:#01579b
class A,B,C,D,E,F,H,I,J storage
IMO a good path will be to use dmrpp as input and generate Kerchunk Parquet or Icechunk for our virtual stores (for those collections that have dmrpp). If a collection has no dmrpp we can still
So this will be the temporary method signature until we have a canonical way of finding this info in CMR.
# in region icechunk
vds = ea.open_virtual_store("s3://some-bucket/dataset.icechunk", collection_result)
# out of region kerchunk
vds = ea.open_virtual_store("https://some-url/dataset.parquet")
The one blocker was that VirtualiZarr embeds coordinates in the virtual store and then doesn't know how to read them back as virtual. This is being addressed by zarr-developers/VirtualiZarr#938; in the meantime we can use the skip-coords trick from zarr-developers/VirtualiZarr#489 (comment).
For Icechunk, the way to update a store is described here: https://github.com/zarr-developers/VirtualiZarr/blob/main/examples/V1/append/noaa-cdr-sst.ipynb (although it is a V1 example, the method still holds).
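To make the "URL plus collection record" idea a bit more concrete, here is a rough sketch of how such a method could decide which driver to use and whether it needs credentials. This is not the earthaccess implementation: `plan_virtual_store`, `StorePlan`, and the suffix conventions (`.icechunk`, `.parquet`, `.json`) are all hypothetical, and the rule that `s3://` URLs need a collection record while `https://` URLs do not is only an assumption drawn from the two calls above.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class StorePlan:
    """Hypothetical summary of how a virtual store would be opened."""
    driver: str                 # "icechunk" or "kerchunk"
    protocol: str               # "s3" (in region) or "https" (out of region)
    needs_s3_credentials: bool  # True when a collection record must supply auth


def plan_virtual_store(url: str, collection: Optional[dict] = None) -> StorePlan:
    """Pick a driver from the URL suffix and an auth mode from the URL scheme.

    Assumption: the store format can be inferred from the path suffix.
    """
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")

    if path.endswith(".icechunk"):
        driver = "icechunk"
    elif path.endswith((".parquet", ".json")):
        driver = "kerchunk"
    else:
        raise ValueError(f"cannot infer a virtual store format from {url!r}")

    if parsed.scheme not in ("s3", "https"):
        raise ValueError(f"unsupported protocol {parsed.scheme!r}")

    # Assumption: direct S3 access is what requires collection-specific credentials.
    needs_credentials = parsed.scheme == "s3"
    if needs_credentials and collection is None:
        raise ValueError("an s3:// store needs a collection record to look up credentials")

    return StorePlan(driver, parsed.scheme, needs_credentials)


# Mirrors the two calls above; the collection record here is a placeholder dict.
in_region = plan_virtual_store("s3://some-bucket/dataset.icechunk", {"placeholder": True})
out_of_region = plan_virtual_store("https://some-url/dataset.parquet")
```

Keeping the decision separate from the actual open call would make it easy to swap the suffix heuristic for a CMR lookup once there is a canonical place for this info.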
-
Currently earthaccess supports creating virtual datasets via a rarely used Kerchunk API and via dmrpp with VirtualiZarr. Kerchunk has in many ways been superseded by VirtualiZarr, but since we only support dmrpp, we are limited to datasets with OPeNDAP support. I'll be working on a PR that will enable support for all formats, including the ones Kerchunk supports now.
At the end we'll have this to create a VDS:
Once we have this in place and VirtualiZarr supports inline references, we should eventually be able to just open a virtual data store and keep appending. This could be done using the VDS URI, or via Julius's PR to integrate it with a collection-level result.
We still need zarr-developers/VirtualiZarr#794 (or zarr-developers/VirtualiZarr#938) to be merged.