Description
Problem
`MultiZarrToZarr` is extremely powerful but rather hard to use.
This is important: kerchunk has been transformative, so we increasingly recommend it as the best way to ingest large amounts of data into the pangeo ecosystem's tools. However, that means we should make sure the kerchunk user experience is smooth, so that new users don't get stuck early on.
Part of the problem is that this one `MultiZarrToZarr` function can do many different things. Contrast with xarray: when combining multiple datasets into one, xarray takes some care to distinguish between a few common cases/concepts (we even have a glossary):
- Concatenation along a single existing dimension. Achieved by `xr.concat` where `dim` is a str.
- Concatenation along a single new dimension (optionally providing new coordinates to use along that new dimension). Achieved by `xr.concat` where `dim` is a set of values.
- Merging of multiple variables which already share dimensions, first aligned according to their coordinates. Achieved by `xr.merge`.
- "Combining" by order given, which means some ordered combination of concatenation along one or more dimensions and/or merging. Achieved by `xr.combine_nested`.
- "Combining" by coordinate order, which again means some ordered combination of concatenation along one or more dimensions and/or merging, but the order is specified by information in the datasets' coordinates. Achieved by `xr.combine_by_coords`.
In kerchunk it seems that the recommended way to handle operations resembling all 5 of these cases is through `MultiZarrToZarr`. It also cannot currently easily handle certain types of multi-dimensional concatenation.
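For reference, here is a small sketch of how the five xarray cases listed above look in code. The toy datasets, the `make` helper and the variable names (`temp`, `salt`, `run`) are made up for illustration; only the `xr.*` calls are real xarray API.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make(name, t0):
    """Toy dataset (illustrative only): one variable on a (time, x) grid."""
    return xr.Dataset(
        {name: (("time", "x"), np.zeros((2, 3)))},
        coords={"time": [t0, t0 + 1], "x": [0, 1, 2]},
    )


a = make("temp", 0)  # time = [0, 1]
b = make("temp", 2)  # time = [2, 3]

# 1. Concatenate along an existing dimension: dim is a str.
along_time = xr.concat([a, b], dim="time")

# 2. Concatenate along a new dimension: dim carries the new coordinate values.
along_run = xr.concat([a, a], dim=pd.Index([10, 20], name="run"))

# 3. Merge different variables that share dimensions.
merged = xr.merge([make("temp", 0), make("salt", 0)])

# 4. Combine in the order given.
nested = xr.combine_nested([a, b], concat_dim="time")

# 5. Combine in the order implied by the coordinates (input order is ignored).
by_coords = xr.combine_by_coords([b, a])
```

Each case has its own entry point, so the call site already says which kind of combination is intended.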
Suggestion
Break up `MultiZarrToZarr` by defining a set of functions similar to xarray's `merge`/`concat`/`combine`/`unify_chunks` that consume and produce `VirtualZarrStore` objects (EDIT: see #375).
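To make the suggestion a bit more concrete, here is a rough sketch of what two such primitives could look like. Everything in it is hypothetical: `VirtualArray` is a toy stand-in for one array inside a `VirtualZarrStore`, the `concat` and `merge` signatures are invented, and none of this is existing kerchunk API. It also assumes regular, complete chunks and ignores chunk shapes, dtypes and coordinates. The only thing borrowed from kerchunk is the idea that a chunk reference is a (url, offset, length) triple.

```python
from dataclasses import dataclass

# A chunk reference in the style of kerchunk references: (url, byte offset, length).
ChunkRef = tuple[str, int, int]
ChunkKey = tuple[int, ...]


@dataclass(frozen=True)
class VirtualArray:
    """Hypothetical stand-in for one array of a VirtualZarrStore."""
    dims: tuple[str, ...]
    chunk_grid: tuple[int, ...]  # number of chunks along each dimension
    refs: dict[ChunkKey, ChunkRef]


def concat(arrays: list[VirtualArray], dim: str) -> VirtualArray:
    """Concatenate along an existing dimension by shifting chunk keys."""
    first = arrays[0]
    axis = first.dims.index(dim)
    refs: dict[ChunkKey, ChunkRef] = {}
    offset = 0
    for arr in arrays:
        same_other_axes = all(
            n == m
            for i, (n, m) in enumerate(zip(arr.chunk_grid, first.chunk_grid))
            if i != axis
        )
        if arr.dims != first.dims or not same_other_axes:
            raise ValueError(f"arrays must match on every dimension except {dim!r}")
        for key, ref in arr.refs.items():
            # Only the chunk index along the concatenation axis changes.
            refs[key[:axis] + (key[axis] + offset,) + key[axis + 1:]] = ref
        offset += arr.chunk_grid[axis]
    grid = first.chunk_grid[:axis] + (offset,) + first.chunk_grid[axis + 1:]
    return VirtualArray(first.dims, grid, refs)


def merge(stores: list[dict[str, VirtualArray]]) -> dict[str, VirtualArray]:
    """Merge stores holding different variables; refuse silent overwrites."""
    merged: dict[str, VirtualArray] = {}
    for store in stores:
        for name, arr in store.items():
            if name in merged:
                raise ValueError(f"variable {name!r} appears in more than one store")
            merged[name] = arr
    return merged


jan = VirtualArray(("time", "x"), (2, 1),
                   {(0, 0): ("jan.nc", 0, 100), (1, 0): ("jan.nc", 100, 100)})
feb = VirtualArray(("time", "x"), (1, 1), {(0, 0): ("feb.nc", 0, 100)})
both = concat([jan, feb], dim="time")
```

The point of the sketch is that concatenation of references is just key arithmetic, so each function can stay small and take and return the same kind of object.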
Advantages
- We can replace/deprecate the heavily overloaded and unintuitive `coo_map` kwarg (it has 10 possible input types!). Perhaps simply giving an ordered list of coordinate values would be sufficient, and we could just make it easier for the user to extract the values they want from the `VirtualZarrStore` objects they want to concatenate.
- If users need to do something really unusual they can more easily break their problem up into concatenating each array separately (e.g. for concatenating on staggered grids).
- Might generalise to later ZEPs more easily (e.g. understanding variable-length chunks, cc @ivirshup, see Concatenate arrays with varchunks #374)
- Can think of as a refactoring to move some pangeo-forge functionality upstream, reducing redundancy. We shouldn't have 3 completely different designs for multidimensional concatenation in adjacent libraries in the stack.
- These new functions would be more useful as basic primitives for parallelization frameworks to call (e.g. doing tree reduction via dask, beam, or cubed), rather than trying to wrap calls to those frameworks within kerchunk (like `kerchunk.combine.auto_dask` does).
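As an illustration of the last point, a generic tree reduction only needs an associative, order-preserving two-argument combine function. The sketch below is hypothetical: `tree_reduce` is not part of kerchunk, and a standard-library thread pool stands in for dask, beam, or cubed.

```python
from concurrent.futures import ThreadPoolExecutor


def tree_reduce(combine, items, executor=None):
    """Pairwise-combine items in rounds, preserving their order.

    `combine` must be associative. Pass an executor to run each round's
    pairs concurrently; otherwise they are combined serially.
    """
    items = list(items)
    if not items:
        raise ValueError("nothing to combine")
    while len(items) > 1:
        pairs = [(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
        # With an odd count, the last item waits for the next round.
        leftover = [items[-1]] if len(items) % 2 else []
        if executor is None:
            combined = [combine(a, b) for a, b in pairs]
        else:
            combined = list(executor.map(lambda pair: combine(*pair), pairs))
        items = combined + leftover
    return items[0]


# Toy combine step: list concatenation stands in for concatenating references.
parts = [[0], [1], [2], [3], [4]]
serial = tree_reduce(lambda a, b: a + b, parts)
with ThreadPoolExecutor(max_workers=2) as pool:
    threaded = tree_reduce(lambda a, b: a + b, parts, executor=pool)
```

With small two-argument primitives, the choice of scheduler stays entirely outside the combining logic.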
Questions
- How close can these functions be to xarray's version of `merge`/`concat`/`combine`? And what can we learn from the design decisions in pangeo-forge-recipes `FilePattern`? (@cisaacstern @rabernat)
- How close are kerchunk's existing `combine.merge_vars` and `combine.concatenate_arrays` functions to providing this functionality? If the answer is "pretty close", then how much of this issue could be solved via documentation?