@TomAugspurger TomAugspurger commented Jun 2, 2025

This adds a pair of JSON Schema documents to the repository: one for Array metadata and one for Group metadata.

For those unfamiliar with json-schema, it's a language for validating JSON documents. You write schemas (in JSON) and tools can validate JSON objects ("instances") against that schema. For example, the following Group would be flagged as invalid, because it lacks a zarr_format field:

{
    "node_type": "group",
    "attributes": {
        "spam": "ham",
        "eggs": 42
    }
} 

The check-jsonschema tool can be used to validate this, but there are many alternative tools that could be used:

❯ check-jsonschema --schemafile json-schema/group.json examples/example-group/zarr.json
Schema validation errors were encountered.
  examples/example-group/zarr.json::$: 'zarr_format' is a required property

Note that this only validates metadata stored within the zarr.json objects. It has no bearing on the actual data in the chunk files.
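
To make the `required` check concrete, here is a minimal Python sketch that reproduces just that one rule by hand. It is not a JSON Schema validator; `REQUIRED_GROUP_KEYS` and `missing_required` are names made up for this illustration, and the list of required keys is an assumption rather than a copy of json-schema/group.json.

```python
import json

# Assumption for illustration: the keys a group's zarr.json must contain.
# The authoritative list is whatever json-schema/group.json declares.
REQUIRED_GROUP_KEYS = ("zarr_format", "node_type")


def missing_required(instance, required=REQUIRED_GROUP_KEYS):
    """Return the required keys that are absent from a metadata object."""
    return [key for key in required if key not in instance]


group = json.loads("""
{
    "node_type": "group",
    "attributes": {"spam": "ham", "eggs": 42}
}
""")

for key in missing_required(group):
    # Mirrors the wording of the check-jsonschema message above.
    print(f"$: {key!r} is a required property")
```

A real validator applies every keyword in the schema, so this only shows the shape of a single rule.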

In addition to the schemas, I've included the metadata for a few examples, and have validated them against the json schema.

This is motivated by zarr-developers/geozarr-spec#72. geozarr can define its own json schema for the additional properties it adds.

@@ -0,0 +1,639 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://zarr-specs.readthedocs.io/v3/json-schema/array.json",
Author

We'll need to find a suitable URI for the schema itself. Ideally we can publish the schema on tags to this repository through CI/CD.

STAC uses https://schemas.stacspec.org//item-spec/json-schema/.json, for example https://schemas.stacspec.org/v1.1.0/item-spec/json-schema/item.json.

It is good if the URI is resolvable. If you do not plan to keep your own domain 'forever', you can also use the https://schemas.opengis.net/ prefix. As you'll see there, not all JSON schemas have an $id (XML namespaces were more disciplined in this respect), and those that do don't always use the host, e.g. https://schemas.opengis.net/os-geojson/1.0/example-1-eo-collections.json

"name": {
"type": "string",
"not": {
"anyOf": [
Author

This is a common pattern for all the extension objects (data_type, codec, etc.). We're using the oneOf keyword to ensure that the data type, say, matches exactly one data type definition.

To ensure that a data type like "bool" doesn't match against both the core bool data type and an extension data type, we need to prohibit extension types from shadowing a core data type.
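
A rough illustration of why the exclusion is needed, written as plain Python predicates instead of real schema branches. `CORE_DATA_TYPES` is an illustrative subset and the function names are hypothetical; the point is only that `oneOf` requires exactly one branch to match.

```python
# Illustrative subset of core data type names, not the full list.
CORE_DATA_TYPES = frozenset({"bool", "int8", "float32"})


def matches_core(name):
    """Stand-in for the core data type branches of the oneOf."""
    return name in CORE_DATA_TYPES


def matches_extension(name, excluded):
    """Stand-in for the extension branch: any string not explicitly excluded."""
    return isinstance(name, str) and name not in excluded


def one_of(name, excluded):
    """oneOf semantics: valid only when exactly one branch matches."""
    return matches_core(name) + matches_extension(name, excluded) == 1


# Without exclusions, "bool" matches both branches and is rejected.
print(one_of("bool", excluded=frozenset()))
# With the core names excluded, "bool" matches the core branch only.
print(one_of("bool", excluded=CORE_DATA_TYPES))
```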

@@ -0,0 +1,58 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://zarr-specs.readthedocs.io/v3/json-schema/group.json",
Author

Also needs a permanent URI.

Contributor

d-v-b commented Jun 2, 2025

this is awesome work tom!

Contributor

jbms commented Jun 2, 2025

This is great work.

It does seem rather unfortunate to have to list all of the ids defined in the core spec redundantly in order to exclude them as valid extension names.

One idea would be to just pull in all of the schemas from zarr-extensions automatically (e.g. via a program that generates the schema), and disallow unknown IDs in the schema.

We could update zarr-extensions to include separate schemas for the core ids also. That way almost everything could be pulled in just from zarr-extensions.

Author

TomAugspurger commented Jun 3, 2025

> It does seem rather unfortunate to have to list all of the ids defined in the core spec redundantly in order to exclude them as valid extension names.

My natural preference is for simple / dumb solutions. In this case I'm probably fine with repeating the names since CI should immediately fail if we add some new core object but forget to update the list of fields. I think it's impossible for these to get out of sync.

> We could update zarr-extensions to include separate schemas for the core ids also.

I wasn't aware of the zarr-extensions repo until after I submitted this PR. My initial preference would be to keep the JSON schema in the same repository otherwise it's (even more) likely to fall out of date as the spec evolves.

But it'd be good to figure out some way to share what's already been done there (CI / tooling maybe?) with what's proposed here. I'll take a closer look when I get a chance.

"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://zarr-specs.readthedocs.io/v3/json-schema/group.json",
"title": "Zarr v3 Group Metadata Schema",
"description": "JSON Schema for Zarr v3 Group metadata documents.",

Have you considered factoring out the common part? See https://json-schema.org/blog/posts/modelling-inheritance
It will be useful for further schemas, such as a CF profile.

It looks like the array.json schema could use a top-level $ref to this file to avoid replicating definitions. A resolvable URI that can be changed later would indeed help here.
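
As a sketch of what a cross-file `$ref` involves, the snippet below resolves a reference against a small in-memory registry keyed by `$id`. The `common.json` file, its `$defs/attributes` entry, and `resolve_ref` are hypothetical; a real validator does this lookup itself, which is why a stable, resolvable `$id` matters.

```python
from urllib.parse import unquote, urldefrag, urljoin

BASE = "https://zarr-specs.readthedocs.io/v3/json-schema/"

# Hypothetical shared schema holding definitions used by both array and group.
REGISTRY = {
    BASE + "common.json": {
        "$id": BASE + "common.json",
        "$defs": {"attributes": {"type": "object"}},
    },
}


def resolve_ref(ref, base_uri, registry=REGISTRY):
    """Resolve a $ref (relative URI plus JSON Pointer fragment) to a subschema."""
    uri, fragment = urldefrag(urljoin(base_uri, ref))
    node = registry[uri]
    # Walk the JSON Pointer; "~1" and "~0" are its escapes for "/" and "~".
    for token in unquote(fragment).split("/")[1:]:
        node = node[token.replace("~1", "/").replace("~0", "~")]
    return node


# array.json can point at the shared definition instead of copying it.
print(resolve_ref("common.json#/$defs/attributes", BASE + "array.json"))
```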

Contributor

jbms commented Jun 4, 2025

> > It does seem rather unfortunate to have to list all of the ids defined in the core spec redundantly in order to exclude them as valid extension names.

> My natural preference is for simple / dumb solutions. In this case I'm probably fine with repeating the names since CI should immediately fail if we add some new core object but forget to update the list of fields. I think it's impossible for these to get out of sync.

If something is missing from the list of exclusions then it will just also validate as an extension, meaning the configuration doesn't get checked.

Additionally, if you make a typo in an identifier it will also just be considered an extension and validate successfully.
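
Both failure modes can be sketched with plain Python predicates standing in for the schema branches. This is not the PR's schema: the function names are hypothetical, and the allowed separators ("/" and ".") are taken from the Zarr v3 default chunk key encoding.

```python
def matches_default(encoding):
    """Core branch: name is "default" and the separator is "/" or "."."""
    separator = encoding.get("configuration", {}).get("separator", "/")
    return encoding.get("name") == "default" and separator in ("/", ".")


def matches_extension(encoding, excluded):
    """Extension branch: any name not excluded; configuration is unchecked."""
    name = encoding.get("name")
    return isinstance(name, str) and name not in excluded


def validates(encoding, excluded):
    """oneOf semantics: exactly one branch must match."""
    return matches_default(encoding) + matches_extension(encoding, excluded) == 1


bad_config = {"name": "default", "configuration": {"separator": "x"}}
typo = {"name": "defualt", "configuration": {"separator": "/"}}

# Complete exclusion list: the bad configuration is rejected.
print(validates(bad_config, excluded={"default", "v2"}))
# "default" forgotten: the extension branch accepts it, configuration unchecked.
print(validates(bad_config, excluded={"v2"}))
# A misspelled name is simply treated as an extension.
print(validates(typo, excluded={"default", "v2"}))
```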

> > We could update zarr-extensions to include separate schemas for the core ids also.

> I wasn't aware of the zarr-extensions repo until after I submitted this PR. My initial preference would be to keep the JSON schema in the same repository otherwise it's (even more) likely to fall out of date as the spec evolves.

Putting the separate schemas in this repo instead would also be fine, or zarr-extensions could even be merged into this repo.

> But it'd be good to figure out some way to share what's already been done there (CI / tooling maybe?) with what's proposed here. I'll take a closer look when I get a chance.

That repo basically has the complement of what you have here represented as a schema.

@TomAugspurger
Author

> If something is missing from the list of exclusions then it will just also validate as an extension, meaning the configuration doesn't get checked.

Mmm here's what I had in mind: With a diff like this that "forgets" to add default to the list of exclusions:

❯ git diff
diff --git a/json-schema/array.json b/json-schema/array.json
index c9d2085..f69dd1c 100644
--- a/json-schema/array.json
+++ b/json-schema/array.json
@@ -559,7 +559,6 @@
                             "type": "string",
                             "not": {
                                 "enum": [
-                                    "default",
                                     "v2"
                                 ]
                             }

We get an error, because the instance matches both the default and extension chunk key encodings:

❯ check-jsonschema --schemafile json-schema/array.json examples/air_temperature.zarr/air/zarr.json --verbose
Schema validation errors were encountered.
  examples/air_temperature.zarr/air/zarr.json::$.chunk_key_encoding: {'name': 'default', 'configuration': {'separator': '/'}} is valid under each of {'$ref': '#/$defs/extension_chunk_key_encoding'}, {'$ref': '#/$defs/default_chunk_key_encoding'}

But if I forget to add v2 instead, then this passes. Maybe that's what you were saying? I guess if we have 100% coverage of the core spec then we'd be OK...

What do you think about a tool that checks that we didn't forget any keys, rather than generating the json-schema files? That sounds pretty straightforward to write and run in CI.
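
Such a check could look roughly like the sketch below. The `$defs` names follow the error message above, but the inline toy schema, the use of `const` for core names, and the helper names are assumptions; the real json-schema/array.json may be laid out differently.

```python
SUFFIX = "_chunk_key_encoding"

# Toy schema for illustration; "v2" was deliberately left out of the exclusions.
schema = {
    "$defs": {
        "default" + SUFFIX: {"properties": {"name": {"const": "default"}}},
        "v2" + SUFFIX: {"properties": {"name": {"const": "v2"}}},
        "extension" + SUFFIX: {
            "properties": {
                "name": {"type": "string", "not": {"enum": ["default"]}}
            }
        },
    }
}


def core_names(schema):
    """Collect the `const` name of every non-extension definition."""
    names = set()
    for key, definition in schema["$defs"].items():
        if key.endswith(SUFFIX) and not key.startswith("extension"):
            names.add(definition["properties"]["name"]["const"])
    return names


def forgotten_exclusions(schema):
    """Return core names missing from the extension's exclusion list."""
    extension = schema["$defs"]["extension" + SUFFIX]
    excluded = set(extension["properties"]["name"]["not"]["enum"])
    return sorted(core_names(schema) - excluded)


missing = forgotten_exclusions(schema)
if missing:
    print("missing from exclusion list:", ", ".join(missing))
```

In CI, the script would exit with a non-zero status whenever the list is non-empty.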

> Additionally, if you make a typo in an identifier it will also just be considered an extension and validate successfully.

Yeah, that seems like a problem... But the core schema can't know anything about the extension schemas, I think. I'm not an expert on json-schema, but perhaps this is why STAC includes a stac_extensions array in its core spec, and STAC-specific tools know to load all of the json-schemas at those URIs and validate the document against each (xref #316).
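
A sketch of that STAC-style flow, under heavy assumptions: Zarr metadata has no such field today, so `extension_schemas` below is hypothetical, and validators are modeled as plain functions returning error strings instead of real JSON Schema validators.

```python
def validate_with_extensions(document, core_validator, extension_validators):
    """Run the core check, then one check per extension the document declares."""
    errors = list(core_validator(document))
    # Hypothetical field, modeled on STAC's `stac_extensions` array of URIs.
    for uri in document.get("extension_schemas", []):
        validator = extension_validators.get(uri)
        if validator is None:
            errors.append(f"unknown extension schema: {uri}")
        else:
            errors.extend(validator(document))
    return errors


def core(document):
    """Toy core check: only looks for zarr_format."""
    return [] if "zarr_format" in document else ["'zarr_format' is required"]


def geo(document):
    """Toy extension check with a made-up attribute name."""
    ok = "crs" in document.get("attributes", {})
    return [] if ok else ["'crs' attribute is required"]


EXTENSIONS = {"https://example.com/geo.json": geo}  # placeholder URI

doc = {
    "zarr_format": 3,
    "node_type": "group",
    "extension_schemas": ["https://example.com/geo.json"],
    "attributes": {},
}
print(validate_with_extensions(doc, core, EXTENSIONS))
```

With this design the core schema stays unaware of extensions, and a misspelled extension URI surfaces as an error instead of passing silently.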

@joshmoore
Member

Thanks, @TomAugspurger! Happy to help get the permanent, resolvable URI. (My instinct is to put all of this under a v3/ directory.)
