Refactor MultiZarrToZarr into multiple functions #377

Open
@TomNicholas

Problem

MultiZarrToZarr is extremely powerful but rather hard to use.

This is important: kerchunk has been transformative, so we increasingly recommend it as the best way to ingest large amounts of data into the pangeo ecosystem's tools. However, that means we should make sure the kerchunk user experience is smooth, so that new users don't get stuck early on.

Part of the problem is that this one MultiZarrToZarr function can do many different things. Contrast with xarray - when combining multiple datasets into one, xarray takes some care to distinguish between a few common cases/concepts (we even have a glossary):

  1. Concatenation along a single existing dimension. Achieved by xr.concat where dim is a str
  2. Concatenation along a single new dimension (optionally providing new coordinates to use along that new dimension). Achieved by xr.concat where dim is a set of values
  3. Merging of multiple variables which already share dimensions, first aligned according to their coordinates. Achieved by xr.merge
  4. "Combining" by order given, which means some ordered combination of concatenation along one or more dimensions and/or merging. Achieved by xr.combine_nested
  5. "Combining" by coordinate order, which again means some ordered combination of concatenation along one or more dimensions and/or merging, but the order is specified by information in the datasets' coordinates. Achieved by xr.combine_by_coords

In kerchunk it seems that the recommended way to handle operations resembling all 5 of these cases is through MultiZarrToZarr. It also cannot currently easily handle certain types of multi-dimensional concatenation.

Suggestion

Break up MultiZarrToZarr by defining a set of functions similar to xarray's merge/concat/combine/unify_chunks that consume and produce VirtualZarrStore objects (EDIT: see #375).
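To make the idea concrete, here is a minimal sketch of what one such single-purpose function might look like for case 1 (concatenation along an existing dimension). Everything in it is an assumption for illustration: `concat_refs` is a hypothetical name, not an existing kerchunk function, and a plain dict of `"var/chunk_index" -> [url, offset, length]` entries plus a JSON `.zarray` entry stands in for the proposed VirtualZarrStore. It only handles inputs with identical chunk shapes and no partial chunks along the concatenation axis.

```python
import json


def concat_refs(stores, var, axis=0):
    """Hypothetical helper: concatenate one array's chunk references
    along an existing axis by renumbering chunk keys."""
    if not stores:
        raise ValueError("need at least one reference set")
    prefix = f"{var}/"
    first = json.loads(stores[0][prefix + ".zarray"])
    out = {}
    chunk_offset = 0  # chunks already placed along `axis`
    total = 0         # combined array length along `axis`
    for refs in stores:
        meta = json.loads(refs[prefix + ".zarray"])
        if meta["chunks"] != first["chunks"]:
            raise ValueError("chunk shapes must match in this sketch")
        size, step = meta["shape"][axis], meta["chunks"][axis]
        if size % step:
            raise ValueError("partial chunks along the concat axis are not handled")
        for key, ref in refs.items():
            name = key[len(prefix):]
            # Skip other variables and metadata keys such as ".zarray".
            if not key.startswith(prefix) or name.startswith("."):
                continue
            index = [int(i) for i in name.split(".")]
            index[axis] += chunk_offset  # shift into the combined array
            out[prefix + ".".join(str(i) for i in index)] = ref
        chunk_offset += size // step
        total += size
    new_meta = dict(first)
    new_meta["shape"] = list(first["shape"])
    new_meta["shape"][axis] = total
    out[prefix + ".zarray"] = json.dumps(new_meta)
    return out


# Two made-up reference sets, each holding a 4-element array in 2 chunks.
meta = json.dumps({"shape": [4], "chunks": [2]})
a = {"t/.zarray": meta, "t/0": ["a.nc", 0, 16], "t/1": ["a.nc", 16, 16]}
b = {"t/.zarray": meta, "t/0": ["b.nc", 0, 16], "t/1": ["b.nc", 16, 16]}
combined = concat_refs([a, b], "t")
```

The point of the sketch is the shape of the API rather than the details: the function takes reference sets in, returns a reference set out, and does exactly one kind of combination, so merge and multi-dimensional combine could be layered on top of it.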

Advantages

  • We can replace/deprecate the heavily overloaded and unintuitive coo_map kwarg (it has 10 possible input types!). Perhaps simply giving an ordered list of coordinate values would be sufficient, and we could just make it easier for the user to extract the values they want from the VirtualZarrStore objects they want to concatenate.
  • If users need to do something really unusual they can more easily break their problem up into concatenating each array separately (e.g. for concatenating on staggered grids)
  • Might generalise to later ZEPs more easily (e.g. understanding variable-length chunks, cc @ivirshup, see Concatenate arrays with varchunks #374)
  • Can think of as a refactoring to move some pangeo-forge functionality upstream, reducing redundancy. We shouldn't have 3 completely different designs for multidimensional concatenation in adjacent libraries in the stack.
  • These new functions would be more useful as basic primitives for parallelization frameworks to call (e.g. doing tree reduction via dask, beam, or cubed), rather than trying to wrap calls to those frameworks within kerchunk (like kerchunk.combine.auto_dask does).
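The last point can be illustrated with a small sketch of a tree reduction over a pairwise combine function. This is not how kerchunk.combine.auto_dask works; `tree_reduce` is a hypothetical helper, and list concatenation stands in for a real combine step. The only assumption about the combine function is that it is associative and order-preserving, which is what a concat-style primitive would provide.

```python
def tree_reduce(combine, items):
    """Hypothetical helper: reduce `items` pairwise, level by level.

    The calls to `combine` within one level are independent of each other,
    so a framework such as dask, beam, or cubed could run them in parallel.
    """
    items = list(items)
    if not items:
        raise ValueError("nothing to combine")
    while len(items) > 1:
        next_level = []
        for i in range(0, len(items), 2):
            if i + 1 < len(items):
                next_level.append(combine(items[i], items[i + 1]))
            else:
                # Odd one out is carried up to the next level unchanged.
                next_level.append(items[i])
        items = next_level
    return items[0]


# Stand-in for per-file reference sets: each "store" is a list of chunk names.
per_file = [["f0"], ["f1"], ["f2"], ["f3"], ["f4"]]
result = tree_reduce(lambda left, right: left + right, per_file)
```

Because the scheduling lives entirely outside the combine function, the same primitive could be driven serially, as here, or handed to whichever parallel framework the user already has.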

Questions

  • How close can these functions be to xarray's versions of merge/concat/combine? And what can we learn from the design decisions in pangeo-forge-recipes FilePattern? (@cisaacstern @rabernat)
  • How close are kerchunk's existing combine.merge_vars and combine.concatenate_arrays functions to providing this functionality? If the answer is "pretty close", then how much of this issue could be solved via documentation?
