Gathering some notes on how best to read multiple ICESat-2 ATL11 files (basically a point cloud) in a user-friendly way, with metadata preserved!
TLDR: Be able to do xr.open_mfdataset("ATL11_*.h5", engine="zarr", ...).
Inspired by the blog post "Cloud-Performant NetCDF4/HDF5 Reading with the Zarr Library". Zarr is an amazing project, and I really like the .zmetadata JSON file, which can be opened with a text editor and tells you stuff about the data. The dream would be to read HDF5 files in an out-of-core manner with Zarr-like speed/abilities (through the .zmetadata pointer).
Jupyter notebook demo can be found at https://github.com/rsignell-usgs/hurricane-ike-water-levels/blob/master/coawst_3ways.ipynb. See also discussion thread at zarr-developers/zarr-python#535 on "Using the Zarr library to read HDF5".
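To give a feel for why the .zmetadata file is so handy, here is a minimal sketch that reads one with nothing but Python's json module. The file content is a hand-written stand-in following the Zarr v2 consolidated metadata layout; the variable name `h_corr`, its shape, chunks and attributes are made up for illustration and are not taken from a real ATL11 store. `summarize_zmetadata` is a hypothetical helper, not part of Zarr.

```python
import json
import tempfile
from pathlib import Path

# Hand-written stand-in for a consolidated Zarr v2 metadata file.
# ASSUMPTION: the array name, shape, chunks and attributes are illustrative only.
example_zmetadata = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "h_corr/.zarray": {
            "shape": [1000, 6],
            "chunks": [250, 6],
            "dtype": "<f4",
            "compressor": None,
            "fill_value": "NaN",
            "filters": None,
            "order": "C",
            "zarr_format": 2,
        },
        "h_corr/.zattrs": {"units": "meters", "long_name": "corrected height"},
    },
}


def summarize_zmetadata(path):
    """Return {array_name: shape/chunks/dtype/attrs} for every array in a .zmetadata file."""
    entries = json.loads(Path(path).read_text())["metadata"]
    summary = {}
    for key, value in entries.items():
        # Array metadata lives under "<array name>/.zarray"
        if key.endswith("/.zarray"):
            name = key[: -len("/.zarray")]
            summary[name] = {
                "shape": value["shape"],
                "chunks": value["chunks"],
                "dtype": value["dtype"],
                # User attributes (units etc.) live under "<array name>/.zattrs"
                "attrs": entries.get(f"{name}/.zattrs", {}),
            }
    return summary


# Write the stand-in file to a temporary directory, then read it back
with tempfile.TemporaryDirectory() as tmpdir:
    zmetadata_path = Path(tmpdir) / ".zmetadata"
    zmetadata_path.write_text(json.dumps(example_zmetadata, indent=4))
    summary = summarize_zmetadata(zmetadata_path)

for name, info in summary.items():
    print(name, info["shape"], info["dtype"], info["attrs"])
```

The point is that shapes, chunking and attributes are all available without touching any chunk data, which is the kind of cheap lookup I'd like to have for HDF5 files too.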
The main hurdles to get through all depend on upstream changes, and there are two 'separate' parts:
- Reading a single HDF5 dataset via Zarr
  - On xarray - Need chunk_store argument to use Zarr to read HDF5 - wait for Allow chunk_store argument when opening Zarr datasets pydata/xarray#3804
  - On zarr (though it's partly a dependency problem for me to work out)
    - Upgrading numcodecs from 0.6.3 to 0.6.4 (⬆️ Bump numcodecs from 0.6.3 to 0.6.4 #11 ✔️) and to 0.7.2 (⬆️ Bump numcodecs from 0.7.1 to 0.7.2 #160 ✔️) required to use Zarr 2.4.0 or newer (24b6917 ✔️), but fails due to compilation issues - wait for Build wheels on github actions zarr-developers/numcodecs#224 or use conda-forge's numcodecs package.
    - Specifically, use this hdf5 branch of Zarr to read HDF5 files via Zarr (TODO!).
- Reading multiple Zarr-like files via xarray/intake:
  - On xarray - Streamline opening multiple zarr files via xr.open_mfdataset - wait for Xarray open_mfdataset with engine Zarr pydata/xarray#4187 / xarray.open_mzar: open multiple zarr files (in parallel) pydata/xarray#4003
  - On intake-xarray - intake.open_ndzarr will break with the above ☝️ - wait for xarray.open_zarr to be deprecated intake/intake-xarray#70
The current situation is that I do an HDF5 -> Zarr conversion, and read from that. It would be nice to stick to the original HDF5 data source (though I might need to flatten the nested ICESat-2 ATL11 data structure). Note that I'm not necessarily after raw speed; I just prefer readability (i.e. having xarray's wonderful annotated metadata).
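On the flattening idea, a rough sketch of what I have in mind is below: turn nested HDF5-style paths into flat variable names that could sit side by side in one flat dataset. This is only one possible naming scheme, `flatten_group_paths` is a hypothetical helper, and the paths are an illustrative ATL11-like layout rather than a listing from a real file.

```python
def flatten_group_paths(paths, sep="_"):
    """Map nested HDF5-style paths to flat variable names, refusing name collisions."""
    flat = {}
    for path in paths:
        # "/pt1/ref_surf/dem_h" -> "pt1_ref_surf_dem_h"
        name = sep.join(part for part in path.split("/") if part)
        if name in flat:
            raise ValueError(
                f"{path!r} and {flat[name]!r} both flatten to {name!r}"
            )
        flat[name] = path
    return flat


# ASSUMPTION: illustrative ATL11-like group/dataset paths, not read from a real file
nested_paths = [
    "/pt1/h_corr",
    "/pt1/ref_surf/dem_h",
    "/pt2/h_corr",
    "/pt2/ref_surf/dem_h",
]

flat_names = flatten_group_paths(nested_paths)
for flat, original in flat_names.items():
    print(f"{original} -> {flat}")
```

Keeping the original path as the dictionary value means the mapping can be reversed later, and the collision check guards against two different groups silently overwriting each other once the hierarchy is gone.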
Other open Issues/Pull Requests:
- zarr-developers/zarr-python#321 - Provide offset for memory mapping / contiguous layout
- zarr-developers/zarr-python#377 - RFC: Optionally support memory-mapping DirectoryStore values
- zarr-developers/zarr-python#556 - File Chunk Store
- zarr-developers/zarr-python#546 - POC: Add FSStore
Blog posts:
- https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935
- https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314
- https://medium.com/informatics-lab/arrays-on-the-fly-9509b6b2e46a
- https://medium.com/pangeo/xpublish-ff788f900bbf
You can tell I had way too many tabs open on my browser 😆