Option to preserve child metadata during consolidation at parent level #3289

@emmanuelmathot

Summary

The current consolidate_metadata() function clears the consolidated metadata of child groups when consolidating at the parent level. This prevents hierarchical metadata organization in which both parent and child groups need consolidated metadata for different access patterns. We propose adding an option to preserve existing child consolidated metadata during parent-level consolidation.

Background

When working with complex hierarchical Zarr stores (such as the new EOPF Sentinel data), consolidated metadata is needed at multiple levels:

  1. Root level consolidation: For accessing the entire store structure
  2. Group level consolidation: For efficient access to specific resolution groups or data subsets

Currently, consolidating metadata at the parent level removes any existing consolidated metadata from child groups, forcing users to choose between parent-level and child-level consolidation.

Use Case: EOPF Sentinel

Data Structure

store.zarr/
├── measurements/
│   └── reflectance/
│       ├── r10m/           # 10m resolution group
│       │   ├── 0/          # Native resolution
│       │   ├── 1/          # Overview level 1
│       │   └── 2/          # Overview level 2
│       ├── r20m/           # 20m resolution group
│       │   ├── 0/
│       │   ├── 1/
│       │   └── 2/
│       └── r60m/           # 60m resolution group
│           ├── 0/
│           ├── 1/
│           └── 2/
└── quality/
    └── l1c_quicklook/
        └── r10m/
            ├── 0/
            ├── 1/
            └── 2/

Required Metadata Consolidation Levels

  1. Resolution Group Level (r10m/, r20m/, r60m/):

    • Needed for efficient access to all overview levels within a resolution
    • Enables fast enumeration of available zoom levels
    • Required for GeoZarr multiscales metadata organization
  2. Root Level (store.zarr/):

    • Needed for discovering all available resolution groups
    • Enables efficient store-wide operations
    • Required for complete store metadata access
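Because both levels are needed, the order of consolidation matters: child groups first, then the root. A minimal sketch of deriving that order from a list of group paths; the helper name consolidation_order is hypothetical and not part of zarr:

```python
from pathlib import PurePosixPath


def consolidation_order(group_paths):
    """Sort group paths so that deeper groups come before their ancestors."""
    # Deeper paths have more components; ties are broken alphabetically.
    return sorted(group_paths, key=lambda p: (-len(PurePosixPath(p).parts), p))


targets = [
    "store.zarr",
    "store.zarr/measurements/reflectance/r60m",
    "store.zarr/measurements/reflectance/r10m",
    "store.zarr/measurements/reflectance/r20m",
]
order = consolidation_order(targets)
for path in order:
    print(path)
```

The root is listed last, so any per-group consolidation happens before the store-wide pass.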

Current Problem

import zarr

# Step 1: Consolidate at resolution group level (works fine)
zarr.consolidate_metadata("store.zarr/measurements/reflectance/r10m")
zarr.consolidate_metadata("store.zarr/measurements/reflectance/r20m") 
zarr.consolidate_metadata("store.zarr/measurements/reflectance/r60m")

# Step 2: Consolidate at root level (removes child consolidated metadata!)
zarr.consolidate_metadata("store.zarr")

# Result: Resolution groups no longer have consolidated metadata
# This breaks efficient access patterns for resolution-specific operations

Current Workaround

We've had to implement a custom consolidation function that preserves child metadata by removing the following lines:

# While consolidating, we want to be explicit about when child groups
# are empty by inserting an empty dict for consolidated_metadata.metadata
for k, v in members_metadata.items():
    if isinstance(v, GroupMetadata) and v.consolidated_metadata is None:
        v = dataclasses.replace(v, consolidated_metadata=ConsolidatedMetadata(metadata={}))
        members_metadata[k] = v
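To make the effect of that loop concrete, here is a toy model of it in isolation. The GroupMetadata, ConsolidatedMetadata and ArrayMetadata classes below are simplified stand-ins written for this example, not zarr's real classes, and the function name fill_empty_consolidated is hypothetical:

```python
import dataclasses
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ConsolidatedMetadata:  # stand-in, not zarr's class
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class GroupMetadata:  # stand-in, not zarr's class
    consolidated_metadata: Optional[ConsolidatedMetadata] = None


@dataclass(frozen=True)
class ArrayMetadata:  # stand-in, not zarr's class
    shape: tuple = ()


def fill_empty_consolidated(members_metadata):
    """Give every group without consolidated metadata an explicit empty entry."""
    filled = dict(members_metadata)
    for k, v in members_metadata.items():
        if isinstance(v, GroupMetadata) and v.consolidated_metadata is None:
            filled[k] = dataclasses.replace(
                v, consolidated_metadata=ConsolidatedMetadata(metadata={})
            )
    return filled


members = {
    "r10m": GroupMetadata(),  # no consolidated metadata loaded
    "r10m/0": ArrayMetadata(shape=(10, 10)),
    "r20m": GroupMetadata(ConsolidatedMetadata({"0": ArrayMetadata((5, 5))})),
}
filled = fill_empty_consolidated(members)
```

In this model, a group whose consolidated metadata is None ends up with an empty mapping, while a group that already carries consolidated metadata and all arrays are left untouched.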

Proposed Solution

Add an optional parameter to consolidate_metadata() to preserve existing child consolidated metadata:

def consolidate_metadata(
    store: StoreLike,
    path: str | None = None,
    zarr_format: ZarrFormat | None = None,
    preserve_child_consolidated: bool = False,  # New parameter
) -> Group:
    """
    Consolidate the metadata of all nodes in a hierarchy.
    
    Parameters
    ----------
    preserve_child_consolidated : bool, default False
        If True, preserve existing consolidated metadata in child groups
        instead of clearing it during parent consolidation.
    """

Behavior with preserve_child_consolidated=True:

  1. Collect metadata normally from all descendants
  2. Check for existing consolidated metadata in child groups
  3. Preserve child consolidated metadata in the consolidated structure
  4. Allow both parent and child to have consolidated metadata simultaneously
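The steps above can be sketched on a toy hierarchy of plain dicts. This is only an illustration of the intended semantics, not zarr's implementation: the node layout ("meta", "consolidated", "children") and the helper names are made up for the example, and the default branch models the clearing behavior as described in this issue.

```python
import copy


def array():
    return {"meta": {"node_type": "array"}}


def group(children):
    return {"meta": {"node_type": "group"}, "consolidated": None, "children": children}


def collect(node, prefix=""):
    """Step 1: flatten the metadata of all descendants into {path: metadata}."""
    flat = {}
    for name, child in node.get("children", {}).items():
        path = f"{prefix}{name}"
        flat[path] = child["meta"]
        flat.update(collect(child, prefix=f"{path}/"))
    return flat


def clear_children(node):
    """Drop consolidated metadata from every descendant group."""
    for child in node.get("children", {}).values():
        if "consolidated" in child:
            child["consolidated"] = None
        clear_children(child)


def consolidate(node, preserve_child_consolidated=False):
    result = copy.deepcopy(node)  # leave the input untouched
    result["consolidated"] = collect(result)
    if not preserve_child_consolidated:
        clear_children(result)  # behavior reported in this issue
    # Steps 2-4: with the flag set, existing child entries are simply kept.
    return result


root = group({"r10m": group({"0": array(), "1": array()})})
root["children"]["r10m"] = consolidate(root["children"]["r10m"])

default = consolidate(root)
preserved = consolidate(root, preserve_child_consolidated=True)
```

With the flag set, the child group keeps its own two-entry mapping while the root gains a mapping covering r10m, r10m/0 and r10m/1.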

Example Usage:

# Step 1: Consolidate at resolution group level
zarr.consolidate_metadata("store.zarr/measurements/reflectance/r10m")
zarr.consolidate_metadata("store.zarr/measurements/reflectance/r20m")
zarr.consolidate_metadata("store.zarr/measurements/reflectance/r60m")

# Step 2: Consolidate at root level while preserving child metadata
zarr.consolidate_metadata("store.zarr", preserve_child_consolidated=True)

# Result: Both root and resolution groups have consolidated metadata!
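One way to check the result on disk is sketched below. It assumes a Zarr v3 store on the local filesystem in which consolidated metadata is stored inline in each group's zarr.json under a "consolidated_metadata" key; for other formats or store backends the check would differ. The helper names are hypothetical.

```python
import json
from pathlib import Path


def has_consolidated_metadata(group_dir):
    """Return True if <group_dir>/zarr.json has a non-null consolidated_metadata entry."""
    meta_file = Path(group_dir) / "zarr.json"
    if not meta_file.is_file():
        return False
    doc = json.loads(meta_file.read_text())
    return doc.get("consolidated_metadata") is not None


def report(store_root, group_paths):
    """Map each group path (relative to the store root) to its check result."""
    return {p: has_consolidated_metadata(Path(store_root) / p) for p in group_paths}
```

For example, report("store.zarr", ["", "measurements/reflectance/r10m"]) would return one boolean for the root and one for the r10m group.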

Benefits

  1. Hierarchical Access Patterns: Enables efficient access at multiple levels
  2. Backward Compatibility: Default behavior unchanged (preserve_child_consolidated=False)
  3. Complex Data Structures: Better support for scientific and Earth observation data
  4. Performance: Avoids need to re-consolidate child groups after parent consolidation
  5. Standards Compliance: Enables better compliance with specifications like GeoZarr that expect metadata at multiple levels

Questions for Maintainers

  1. Conceptual Design: Is the current behavior of clearing child metadata an intentional design choice, or could it be enhanced?

  2. Implementation Approach: Would you prefer:

    • A new parameter to existing function (as proposed above)
    • A separate function (e.g., consolidate_metadata_hierarchical())
    • A different approach entirely?
  3. Metadata Structure: How should the consolidated metadata structure represent preserved child consolidated metadata? Should it:

    • Keep child consolidated metadata as-is in their respective paths?
    • Include references to child consolidated metadata in parent metadata?
    • Use a different organizational approach?
  4. Performance Considerations: Are there performance implications we should consider when preserving child metadata during parent consolidation?

Related spec issue: zarr-developers/zarr-specs#309

cc @TomAugspurger @maxrjones @d-v-b @vincentsarago
