[Exploration]: Ad-hoc testing of Kerchunk engine compatibility with xCDAT I/O and APIs #812

@tomvothecoder

Description

Is your feature request related to a problem?

xCDAT’s open_dataset() extends xarray.open_dataset() and inherits its engine logic. With Xarray Kerchunk support (docs), we need to verify that xCDAT behaves correctly when opening Kerchunk reference JSONs. xCDAT can perform additional operations on top of Xarray (e.g., generating missing bounds), so it’s important to confirm that Kerchunk-backed datasets preserve metadata integrity and behave identically to traditional NetCDF inputs.

This work is related to the exploration of more efficient access to and analysis of the 1PB CMIP archive at NERSC, where Kerchunk could greatly improve performance by enabling pre-indexing of NetCDF and HDF5 data through lightweight JSON reference files without converting them to Zarr (potentially costly in terms of storage and compute).

Ad-hoc testing will verify that xCDAT supports Kerchunk-based analysis of large climate archives and integrates seamlessly into CMIP data workflows.

Describe the solution you'd like

Perform ad-hoc manual testing to confirm that xCDAT works seamlessly with Kerchunk without modifying any code or adding a formal test suite. The goal is to confirm functional parity between Kerchunk-backed datasets and traditional NetCDF I/O through exploratory testing and to document any issues found.

Key areas to check manually:

  1. Basic Open Behavior

    • Can open single-file and multi-file Kerchunk JSONs.
    • Dataset contents match the same data opened via NetCDF.
  2. Metadata and CF Handling

    • CF axes (time, lat, lon, lev) are detected correctly.
    • Time decoding, bounds variables, and attributes are preserved.
  3. xCDAT Functionality

    • Temporal and spatial operations
    • Horizontal and vertical regridding
  4. Performance and Stability

    • Lazy loading works as expected (no data read on open).
    • No Dask graph errors or performance regressions.
    • Compare open and compute performance against native NetCDF I/O.
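The parity portion of the checks above could be scripted along these lines. This is a sketch; `check_parity` is a hypothetical helper written for this exploration, not part of xCDAT or xarray:

```python
import xarray as xr


def check_parity(ds_kerchunk: xr.Dataset, ds_netcdf: xr.Dataset) -> None:
    """Assert a Kerchunk-backed dataset matches its native-NetCDF twin."""
    # Same variable and coordinate names on both sides
    assert set(ds_kerchunk.variables) == set(ds_netcdf.variables)
    # Identical values, dims, coords, and attributes.
    # Note: this loads data, so slice to a small subset for large archives.
    xr.testing.assert_identical(ds_kerchunk, ds_netcdf)
```

The idea would be to open the same data twice (once with `engine="kerchunk"`, once via native NetCDF) and pass both datasets to this helper; any metadata drift surfaces as an assertion error.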

Describe alternatives you've considered

No response

Additional context

Example Code

import xcdat

# Example: Open a Kerchunk reference JSON as a Dataset
# Replace with your own Kerchunk reference file (e.g., created from CMIP data at NERSC)
json_path = "cmip6_historical_Amon_hus_850hPa_kerchunk.json"

# xcdat.open_dataset() extends xarray.open_dataset() and supports passing engine="kerchunk"
ds = xcdat.open_dataset(json_path, engine="kerchunk")

# Inspect dataset structure
print(ds)

# Confirm CF metadata and axes are detected correctly (cf_xarray accessor)
print(ds.cf)

# Try a basic temporal operation to test compatibility
# ("hus" is the data variable implied by the filename above)
ds_mean = ds.temporal.average("hus")
print(ds_mean)

# Optional: compare with native NetCDF open
# native = xcdat.open_mfdataset("/path/to/original.nc")
# print(ds.identical(native))

What is Kerchunk?

“Kerchunk is a library that provides a unified way to represent a variety of chunked, compressed data formats (e.g. NetCDF, HDF5, GRIB), allowing efficient access to the data from traditional file systems or cloud object storage.” ¹

“Instead of creating a new copy of the dataset in the Zarr spec/format, Kerchunk reads through the data archive and extracts the byte range and compression information of each chunk, then writes that information to a ‘virtual Zarr store’ using a JSON or Parquet ‘reference file’.” ²
