Description
Both RFC5 (ping @jo-mueller @bogovicj) and RFC8 (ping @normanrz) allow references to other zarr objects: either nodes or metadata fields. We should probably settle on one representation for these which covers the necessary use cases.
Zarr node reference use cases, from simple to complicated:
- reference to zarr node (group or array) in the same hierarchy, supporting only relative paths with no upward traversal
- as above, plus upward traversal
⚠️ sub-hierarchies can be mounted/symlinked, so depending on how a node is accessed, different clients may have different trees to traverse
- as above, plus absolute paths
⚠️ the root node can only be discovered by looking for zarr.json in ancestor directories (which has the same problems as upward traversal above), and finding one missing is not sufficient because intermediate groups with no metadata don't need a zarr.json, so you always have to go to the root of the store (which might mean you hit additional authentication problems)
- as above, plus references to external stores
⚠️ there is no canonical way to represent URLs to external stores; the closest are the conventions used by fsspec/object_store
⚠️ the client would need to maintain a registry of parameters for accessing these external stores, including out-of-band authentication if required
⚠️ these are fragile to changes in location, e.g.
  - from a pre-publication local location to a published repository
  - from one institution to another if a research group migrates
  - when AWS kills the Open Data program
This may be best handled at the zarr level (ping @d-v-b), but encoding external store references in the Zarr spec may mean encoding more stores (currently filesystem is the only store in the spec).
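To make the first three levels concrete, here is a minimal sketch of resolving a node reference within a single hierarchy. The function name `resolve_node_path` and the `allow_upward` / `allow_absolute` flags are hypothetical, not taken from either RFC. It assumes node paths are `/`-separated and expressed relative to whatever the client considers the hierarchy root, which is exactly the thing the mounting/symlinking caveat above says clients may disagree on.

```python
def resolve_node_path(
    current: str,
    ref: str,
    *,
    allow_upward: bool = False,
    allow_absolute: bool = False,
) -> str:
    """Resolve `ref` against the node at `current` (hypothetical helper).

    Paths are '/'-separated and relative to the hierarchy root as seen by
    this client; the root itself is the empty string.
    """
    if ref.startswith("/"):
        # Level 3: absolute paths start again from the root.
        if not allow_absolute:
            raise ValueError("absolute paths are not allowed")
        parts = []
    else:
        # Levels 1 and 2: start from the referring node.
        parts = [p for p in current.split("/") if p]

    for segment in ref.split("/"):
        if segment in ("", "."):
            continue
        if segment == "..":
            # Level 2: upward traversal, only if explicitly enabled.
            if not allow_upward:
                raise ValueError("upward traversal is not allowed")
            if not parts:
                raise ValueError("reference escapes the hierarchy root")
            parts.pop()
        else:
            parts.append(segment)
    return "/".join(parts)


print(resolve_node_path("a/b", "./c"))
print(resolve_node_path("a/b", "../c", allow_upward=True))
print(resolve_node_path("a/b", "/x/y", allow_absolute=True))
```

With both flags left off, only the simplest level is accepted, and anything containing `..` or a leading `/` is rejected rather than silently resolved.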
Metadata reference use cases, from simple to complicated, on top of the levels above:
- access to arbitrary metadata e.g. in an attributes or extension field
- we can use JSON Pointer for this
⚠️ need to determine whether the starting point is the zarr.json root object "", the "/attributes" object, the "/attributes/ome" object etc. The first presupposes that zarr v3 will never change (out of our control); the last leads to shorter paths and allows access to pre-v0.5 fields.
⚠️ JSON Pointer addresses array members only by index, and so is sensitive to changes in order
⚠️ deserialised metadata may not have the same shape as the JSON form (e.g. a reader may always parse an array of objects with a name field into a map indexed by name), making arbitrary JSON queries difficult
- access to context-sensitive OME metadata, e.g. where the field containing the reference expects a coordinate system, it would be nice to shortcut the reference so it only looks at the coordinateSystems field of the referenced OME metadata.
- this also makes it more robust to OME-Zarr spec updates
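As an illustration of the starting-point and ordering caveats, here is a small sketch of JSON Pointer (RFC 6901) evaluation written out by hand so that it has no dependencies. The `resolve_pointer` name and the example metadata (in particular the shape of `coordinateSystems`) are made up for this sketch and are not meant to match any RFC exactly.

```python
def resolve_pointer(document, pointer: str):
    """Evaluate an RFC 6901 JSON Pointer against parsed JSON (sketch)."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for raw in pointer[1:].split("/"):
        # Unescape '~1' before '~0', as RFC 6901 requires.
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            # Array members are addressable by index only.
            if not token.isdigit() or (token != "0" and token.startswith("0")):
                raise KeyError(f"invalid array index: {token!r}")
            current = current[int(token)]
        else:
            current = current[token]
    return current


# Hypothetical zarr.json content, for illustration only.
zarr_json = {
    "zarr_format": 3,
    "node_type": "group",
    "attributes": {
        "ome": {"coordinateSystems": [{"name": "physical"}, {"name": "pixel"}]}
    },
}

# The same item, addressed from two different starting points.
from_root = resolve_pointer(zarr_json, "/attributes/ome/coordinateSystems/0")
from_ome = resolve_pointer(zarr_json["attributes"]["ome"], "/coordinateSystems/0")
print(from_root == from_ome)

# Reordering the array silently changes what the same pointer refers to.
zarr_json["attributes"]["ome"]["coordinateSystems"].reverse()
print(resolve_pointer(zarr_json, "/attributes/ome/coordinateSystems/0"))
```

The pointer string has to change whenever the agreed starting point changes, and it keeps "working" after a reorder while pointing at the wrong item, which is the argument for name-based or context-sensitive lookup.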
It may be possible to pack all of this into a single string by abusing the fragment and query portions, but it would probably be more clear and flexible to use an object representation, possibly allowing a string for the simplest case (referring to a zarr node in the same hierarchy without upward traversal i.e. a path starting with ./ and not containing ..).
So, I'd suggest something like an object with fields:
- node: string which MUST be interpreted relative to the store root, if given, or the current node, and possibly should not support upwards traversal (or maybe we just hand the users their own footgun, with a warning label). Omission means the root of the store (if given) or this node if not.
- store: optional URL (something something fsspec conventions) to some zarr node
  - ideally this would be something like a PURL where an admin can update the redirect if the data moves, although PURL only accepts specific schemes so isn't something we could use directly. OME-PURL? 😬
  - the same mechanism used for retrieving different stores for different targets can be used to resolve particular PURLs to local paths pre-publication. N.B. we would probably want to make some recommendations about caching redirects
- pointer: optional JSON Pointer string applied to the zarr attributes object. The presence of the pointer field marks this as a reference to a metadata item.
Possibly, we may also want a type field to distinguish whether the reference is pointing to a "group", "array", or "attribute". Then we could extend the types to e.g. a "coordinateSystem" or "transformation", which would do the shortcutting mentioned above and could have different fields to refer to the object by name instead of index.
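A rough sketch of what the type-based shortcut could look like on the reader side. Everything here is hypothetical: the `resolve_typed_reference` function, the `name` field on the reference object, and the assumption that the referenced OME metadata has a `coordinateSystems` array of objects with a `name` field.

```python
def resolve_typed_reference(ome_metadata: dict, reference: dict) -> dict:
    """Look up a typed reference by name rather than by index (sketch)."""
    # Hypothetical mapping from reference type to the OME field it searches.
    fields_by_type = {"coordinateSystem": "coordinateSystems"}
    ref_type = reference["type"]
    if ref_type not in fields_by_type:
        raise ValueError(f"unsupported reference type: {ref_type!r}")
    candidates = ome_metadata.get(fields_by_type[ref_type], [])
    matches = [item for item in candidates if item.get("name") == reference["name"]]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one {ref_type} named {reference['name']!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


# Hypothetical OME metadata of the referenced node.
ome = {"coordinateSystems": [{"name": "physical"}, {"name": "pixel"}]}
ref = {"type": "coordinateSystem", "name": "pixel"}

print(resolve_typed_reference(ome, ref))
ome["coordinateSystems"].reverse()  # reordering no longer matters
print(resolve_typed_reference(ome, ref))
```

Because the lookup goes through the type rather than a literal path, a spec update that moves or renames the containing field would only need a change to the mapping, not to every stored reference.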
If we were to allow a uuid or @id in every OME-Zarr metadata type, then we could have a targetId field here to sanity-check that we got the right one.
A field containing a reference could then either be the object, or for the simple case a string, which is interpreted as {"node": "that_string"}.
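A minimal sketch of how a reader might normalise the two forms, assuming the node, store and pointer fields suggested above. The helper names `normalize_reference` and `is_metadata_reference` are made up for illustration, and other keys are passed through untouched since the full set of fields is still open.

```python
def normalize_reference(value) -> dict:
    """Turn the string shorthand or the object form into one dict shape."""
    if isinstance(value, str):
        # Shorthand: a bare string is a node path in the same hierarchy.
        return {"node": value}
    if isinstance(value, dict):
        for key in ("node", "store", "pointer"):
            if key in value and not isinstance(value[key], str):
                raise TypeError(f"{key!r} must be a string")
        return dict(value)
    raise TypeError("a reference must be a string or an object")


def is_metadata_reference(reference: dict) -> bool:
    """The presence of a pointer marks a reference to a metadata item."""
    return "pointer" in reference


print(normalize_reference("./labels/cells"))
print(is_metadata_reference(normalize_reference({"node": "./raw", "pointer": "/ome"})))
```

Normalising at parse time means the rest of a reader only ever deals with the object form.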
This also opens the question of whether we would support references in any field (e.g. an array of coordinateSystems which are a mixture of inlined and referenced systems), not just fields which must be references. To which I say 🤷