Description
Summary
The current consolidate_metadata() function clears the consolidated metadata of child groups when consolidating at the parent level. This prevents hierarchical metadata organization, where both parent and child groups need consolidated metadata for different access patterns. We propose adding an option to preserve existing child consolidated metadata during parent-level consolidation.
Background
When working with complex hierarchical Zarr stores (such as the new EOPF Sentinel stores), consolidated metadata is needed at multiple levels:
- Root level consolidation: For accessing the entire store structure
- Group level consolidation: For efficient access to specific resolution groups or data subsets
Currently, consolidating metadata at the parent level removes any existing consolidated metadata from child groups, forcing users to choose between parent-level or child-level consolidation but not both.
Use Case: EOPF Sentinel
Data Structure
store.zarr/
├── measurements/
│   └── reflectance/
│       ├── r10m/    # 10m resolution group
│       │   ├── 0/   # Native resolution
│       │   ├── 1/   # Overview level 1
│       │   └── 2/   # Overview level 2
│       ├── r20m/    # 20m resolution group
│       │   ├── 0/
│       │   ├── 1/
│       │   └── 2/
│       └── r60m/    # 60m resolution group
│           ├── 0/
│           ├── 1/
│           └── 2/
└── quality/
    └── l1c_quicklook/
        └── r10m/
            ├── 0/
            ├── 1/
            └── 2/
Required Metadata Consolidation Levels
- Resolution Group Level (r10m/, r20m/, r60m/):
  - Needed for efficient access to all overview levels within a resolution
  - Enables fast enumeration of available zoom levels
  - Required for GeoZarr multiscales metadata organization
- Root Level (store.zarr/):
  - Needed for discovering all available resolution groups
  - Enables efficient store-wide operations
  - Required for complete store metadata access
Current Problem
import zarr
# Step 1: Consolidate at resolution group level (works fine)
zarr.consolidate_metadata("store.zarr/measurements/reflectance/r10m")
zarr.consolidate_metadata("store.zarr/measurements/reflectance/r20m")
zarr.consolidate_metadata("store.zarr/measurements/reflectance/r60m")
# Step 2: Consolidate at root level (removes child consolidated metadata!)
zarr.consolidate_metadata("store.zarr")
# Result: Resolution groups no longer have consolidated metadata
# This breaks efficient access patterns for resolution-specific operations
Current Workaround
We've had to implement a custom consolidation function that manually preserves child metadata by removing the lines
zarr-python/src/zarr/api/asynchronous.py
Lines 220 to 225 in 9dc744d
# While consolidating, we want to be explicit about when child groups
# are empty by inserting an empty dict for consolidated_metadata.metadata
for k, v in members_metadata.items():
    if isinstance(v, GroupMetadata) and v.consolidated_metadata is None:
        v = dataclasses.replace(v, consolidated_metadata=ConsolidatedMetadata(metadata={}))
        members_metadata[k] = v
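To make the effect of the quoted loop easier to see in isolation, here is a minimal sketch that runs the same logic on stand-in dataclasses. The `GroupMetadata` and `ConsolidatedMetadata` classes below are simplified, hypothetical stand-ins and not zarr's real classes, and `fill_empty_children` is a name invented for this illustration.

```python
import dataclasses
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ConsolidatedMetadata:  # stand-in, NOT zarr's class
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class GroupMetadata:  # stand-in, NOT zarr's class
    consolidated_metadata: Optional[ConsolidatedMetadata] = None


def fill_empty_children(members_metadata: dict) -> dict:
    """Apply the quoted loop's logic to a copy of the mapping (hypothetical helper)."""
    result = dict(members_metadata)
    for k, v in members_metadata.items():
        # Groups without consolidated metadata get an explicit empty one.
        if isinstance(v, GroupMetadata) and v.consolidated_metadata is None:
            result[k] = dataclasses.replace(
                v, consolidated_metadata=ConsolidatedMetadata(metadata={})
            )
    return result


members = {
    "r10m": GroupMetadata(ConsolidatedMetadata({"0": "array-meta"})),
    "r20m": GroupMetadata(),  # no consolidated metadata yet
}
filled = fill_empty_children(members)
```

In this simplified model, `r20m` ends up with an empty consolidated-metadata entry, while `r10m` keeps the entry it already had.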
Proposed Solution
Add an optional parameter to consolidate_metadata() to preserve existing child consolidated metadata:
def consolidate_metadata(
    store: StoreLike,
    path: str | None = None,
    zarr_format: ZarrFormat | None = None,
    preserve_child_consolidated: bool = False,  # New parameter
) -> Group:
    """
    Consolidate the metadata of all nodes in a hierarchy.

    Parameters
    ----------
    preserve_child_consolidated : bool, default False
        If True, preserve existing consolidated metadata in child groups
        instead of clearing it during parent consolidation.
    """
Behavior with preserve_child_consolidated=True:
- Collect metadata normally from all descendants
- Check for existing consolidated metadata in child groups
- Preserve child consolidated metadata in the consolidated structure
- Allow both parent and child to have consolidated metadata simultaneously
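The steps above can be sketched with a toy model built on plain dictionaries. This is only an illustration under stated assumptions: the node layout (`attrs`, `children`, `consolidated` keys) and the `consolidate` function are hypothetical, and none of it reflects how zarr-python stores or consolidates metadata internally.

```python
from copy import deepcopy


def consolidate(node: dict, preserve_child_consolidated: bool = False) -> dict:
    """Toy model: return a copy of `node` with a flat 'consolidated' mapping."""
    result = deepcopy(node)  # leave the caller's tree untouched
    flat = {}

    def walk(current: dict, prefix: str) -> None:
        for name, child in current.get("children", {}).items():
            path = prefix + name
            # Step 1: collect metadata from every descendant.
            flat[path] = dict(child.get("attrs", {}))
            # Steps 2-3: detect existing child consolidated metadata and
            # keep it only when the caller asked for preservation.
            if "consolidated" in child and not preserve_child_consolidated:
                del child["consolidated"]
            walk(child, path + "/")

    walk(result, "")
    # Step 4: the parent gets its own consolidated view either way.
    result["consolidated"] = flat
    return result


# Consolidate a resolution group first, then the root.
r10m = consolidate({"attrs": {"res": 10}, "children": {"0": {"attrs": {}}, "1": {"attrs": {}}}})
root = {"attrs": {}, "children": {"r10m": r10m}}

cleared = consolidate(root)  # models the current default
kept = consolidate(root, preserve_child_consolidated=True)  # models the proposal
```

With the flag set, both the root and `r10m` carry a `consolidated` entry in the toy tree; without it, only the root does.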
Example Usage:
# Step 1: Consolidate at resolution group level
zarr.consolidate_metadata("store.zarr/measurements/reflectance/r10m")
zarr.consolidate_metadata("store.zarr/measurements/reflectance/r20m")
zarr.consolidate_metadata("store.zarr/measurements/reflectance/r60m")
# Step 2: Consolidate at root level while preserving child metadata
zarr.consolidate_metadata("store.zarr", preserve_child_consolidated=True)
# Result: Both root and resolution groups have consolidated metadata!
Benefits
- Hierarchical Access Patterns: Enables efficient access at multiple levels
- Backward Compatibility: Default behavior unchanged (preserve_child_consolidated=False)
- Complex Data Structures: Better support for scientific and Earth observation data
- Performance: Avoids need to re-consolidate child groups after parent consolidation
- Standards Compliance: Enables better compliance with specifications like GeoZarr that expect metadata at multiple levels
Questions for Maintainers
- Conceptual Design: Is the current behavior of clearing child metadata an intentional design choice, or could it be enhanced?
- Implementation Approach: Would you prefer:
  - A new parameter to the existing function (as proposed above)
  - A separate function (e.g., consolidate_metadata_hierarchical())
  - A different approach entirely?
- Metadata Structure: How should the consolidated metadata structure represent preserved child consolidated metadata? Should it:
  - Keep child consolidated metadata as-is in their respective paths?
  - Include references to child consolidated metadata in parent metadata?
  - Use a different organizational approach?
- Performance Considerations: Are there performance implications we should consider when preserving child metadata during parent consolidation?
Related spec issue: zarr-developers/zarr-specs#309