Canonical way of representing paths to other objects #383

@clbarnes

Both RFC5 (ping @jo-mueller @bogovicj) and RFC8 (ping @normanrz) need to refer to other zarr objects: either nodes or metadata fields. We should probably settle on one representation for these which covers the necessary use cases.

Zarr node reference use cases, from simple to complicated:

  1. reference to zarr node (group or array) in the same hierarchy, supporting only relative paths with no upward traversal
  2. as above, plus upward traversal
  • ⚠️ sub-hierarchies can be mounted/symlinked, so depending on how a node is accessed, different clients may have different trees to traverse
  3. as above, plus absolute paths
  • ⚠️ the root node can only be discovered by looking for zarr.json in ancestor directories, which has the same problems as upward traversal above. Finding a directory without one is not a sufficient stopping condition, because intermediate groups with no metadata don't need a zarr.json, so you always have to go to the root of the store (which might mean you hit additional authentication problems)
  4. as above, plus references to external stores
  • ⚠️ there is no canonical way to represent URLs to external stores; the closest are the conventions used by fsspec/object_store
  • ⚠️ the client would need to maintain a registry of parameters for accessing these external stores, including out-of-band authentication if required
  • ⚠️ these are fragile to changes in location, e.g.
    • from a pre-publication local location to a published repository
    • from one institution to another if a research group migrates
    • when AWS kills the Open Data program
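
To make levels 1 and 2 concrete, here is a minimal sketch of resolving a relative reference against the current node's path, with upward traversal switched off by default. The function name `resolve_node_path` and its `allow_upward` flag are hypothetical, not part of any spec or library, and paths are assumed to be `/`-separated keys relative to whatever root this particular client can see.

```python
def resolve_node_path(current: str, ref: str, allow_upward: bool = False) -> str:
    """Resolve `ref` relative to the node at `current` (hypothetical helper).

    Both arguments are '/'-separated paths; the result is relative to the
    root of the hierarchy as seen by this client.
    """
    if ref.startswith("/"):
        # Level 3 (absolute paths) is deliberately out of scope here.
        raise ValueError("absolute paths are not supported in this sketch")

    stack = [part for part in current.split("/") if part]
    for part in ref.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            if not allow_upward:
                raise ValueError("upward traversal is not allowed")
            if not stack:
                # The client cannot see above the root it opened, which is
                # the mounting/symlinking problem flagged above.
                raise ValueError("reference escapes the root known to this client")
            stack.pop()
        else:
            stack.append(part)
    return "/".join(stack)


print(resolve_node_path("a/b", "./c/d"))
print(resolve_node_path("a/b", "../c", allow_upward=True))
```

Note that the "escapes the root" error depends on where the client opened the hierarchy, so the same reference can resolve for one client and fail for another.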

This may be best handled at the zarr level (ping @d-v-b), but encoding external store references in the Zarr spec may mean encoding more stores (currently filesystem is the only store in the spec).

Metadata reference use cases, from simple to complicated, on top of the levels above:

  1. access to arbitrary metadata e.g. in an attributes or extension field
  • we can use JSON Pointer for this
    • JSONPath, JMESPath, and jq (or jaq) are alternative languages of increasing complexity for accessing JSON: multiple strings could resolve to the same location, and they allow editing the returned data. JSON Pointer is simply a deterministic reference to a specific location in a JSON document.
  • ⚠️ need to determine whether the starting point is the zarr.json root object "", the "/attributes" object, the "/attributes/ome" object, etc. The first presupposes that zarr v3 will never change (out of our control); the last leads to shorter paths and allows access to pre-v0.5 fields.
  • ⚠️ JSON Pointer addresses array members only by index, and so is sensitive to changes in order
  • ⚠️ deserialised metadata may not have the same shape as the JSON form (e.g. a reader may always parse an array of objects with a name field into a map indexed by name), making arbitrary JSON queries difficult
  2. access to context-sensitive OME metadata, e.g. where the field containing the reference expects a coordinate system, it would be nice to shortcut the reference so it only looks at the coordinateSystems field of the referenced OME metadata.
  • this also makes it more robust to OME-Zarr spec updates
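
As an illustration of the JSON Pointer option and its order sensitivity, here is a small hand-rolled resolver following RFC 6901. The helper name `resolve_json_pointer` and the example metadata (including the coordinate system names) are made up for this sketch; a real implementation would more likely use an existing JSON Pointer library.

```python
def resolve_json_pointer(document, pointer: str):
    """Resolve an RFC 6901 JSON Pointer against parsed JSON (hypothetical helper)."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")

    target = document
    for token in pointer[1:].split("/"):
        # Unescape in the order required by RFC 6901: '~1' first, then '~0'.
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(target, list):
            valid = token.isascii() and token.isdigit()
            if not valid or (len(token) > 1 and token[0] == "0"):
                raise ValueError(f"invalid array index: {token!r}")
            target = target[int(token)]
        else:
            target = target[token]
    return target


# Made-up example metadata, starting from the zarr.json root object.
zarr_json = {
    "attributes": {
        "ome": {"coordinateSystems": [{"name": "physical"}, {"name": "pixel"}]}
    }
}

# The same location, addressed from two of the candidate starting points.
from_root = resolve_json_pointer(zarr_json, "/attributes/ome/coordinateSystems/0/name")
from_ome = resolve_json_pointer(zarr_json["attributes"]["ome"], "/coordinateSystems/0/name")
print(from_root, from_ome)

# Reordering the array silently changes what the pointer refers to.
zarr_json["attributes"]["ome"]["coordinateSystems"].reverse()
print(resolve_json_pointer(zarr_json, "/attributes/ome/coordinateSystems/0/name"))
```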

It may be possible to pack all of this into a single string by abusing the fragment and query portions of a URL, but it would probably be clearer and more flexible to use an object representation, possibly allowing a string for the simplest case (referring to a zarr node in the same hierarchy without upward traversal, i.e. a path starting with ./ and not containing ..).

So, I'd suggest something like an object with fields:

  • node: string which MUST be interpreted relative to the store root if store is given, or relative to the current node otherwise. It possibly should not support upward traversal (or maybe we just hand the users their own footgun, with a warning label). Omission means the root of the store if store is given, or this node if not.
  • store: optional URL (something something fsspec conventions) to some zarr node
    • ideally this would be something like a PURL where an admin can update the redirect if the data moves, although PURL only accepts specific schemes so isn't something we could use directly. OME-PURL? 😬
    • the same mechanism used for retrieving different stores for different targets can be used to resolve particular PURLs to local paths pre-publication. N.B. we would probably want to make some recommendations about caching redirects
  • pointer: optional JSON Pointer string applied to the zarr attributes object. The presence of the pointer field marks this as a reference to a metadata item.

Possibly, we may also want a type field to distinguish whether the reference is pointing to a "group", "array", or "attribute". Then we could extend the types to e.g. a "coordinateSystem" or "transformation", which would do the shortcutting mentioned above and could have different fields to refer to the object by name instead of index.

If we were to allow a uuid or @id in every OME-Zarr metadata type, then we could have a targetId field here to sanity-check that we got the right one.

A field containing a reference could then either be the object, or for the simple case a string, which is interpreted as {"node": "that_string"}.
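
A rough sketch of how a reader might normalise such a field, assuming only the three fields proposed above (`node`, `store`, `pointer`). The `Reference` class and `parse_reference` function are hypothetical names for illustration, and the optional `type` and `targetId` fields are left out.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Reference:
    """Normalised form of a reference field (hypothetical type)."""

    node: Optional[str] = None     # path to a zarr node; None means "root of store, or this node"
    store: Optional[str] = None    # optional URL of an external store
    pointer: Optional[str] = None  # optional JSON Pointer into the attributes

    @property
    def is_metadata_reference(self) -> bool:
        # The presence of `pointer` marks this as a reference to a metadata item.
        return self.pointer is not None


def parse_reference(value: Union[str, dict]) -> Reference:
    """Accept either the shorthand string or the full object form."""
    if isinstance(value, str):
        # Shorthand: a bare string is interpreted as {"node": "that_string"}.
        return Reference(node=value)
    if isinstance(value, dict):
        unknown = set(value) - {"node", "store", "pointer"}
        if unknown:
            raise ValueError(f"unknown reference fields: {sorted(unknown)}")
        return Reference(**value)
    raise TypeError("a reference must be a string or an object")


print(parse_reference("./labels/cells"))
print(parse_reference({"node": "./other", "pointer": "/ome/coordinateSystems/0"}))
```

Normalising both forms to one type up front means downstream code only has to handle the object case.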

This also opens the question of whether we would support references in any field (e.g. an array of coordinateSystems which are a mixture of inlined and referenced systems), not just fields which must be references. To which I say 🤷
