Added json schema #347

TomAugspurger · 2025-06-02T02:23:09Z

This adds a pair of JSON Schema documents to the repository: one for Array metadata and one for Group metadata.

For those unfamiliar with json-schema, it's a language for validating JSON documents. You write schemas (in JSON) and tools can validate JSON objects ("instances") against that schema. For example, the following Group would be flagged as invalid, because it lacks a zarr_format field:

{
    "node_type": "group",
    "attributes": {
        "spam": "ham",
        "eggs": 42
    }
}

The check-jsonschema tool can be used to validate this, but there are many alternative tools that could be used:

❯ check-jsonschema --schemafile json-schema/group.json examples/example-group/zarr.json
Schema validation errors were encountered.
  examples/example-group/zarr.json::$: 'zarr_format' is a required property

Note that this only validates metadata stored within the zarr.json objects. It has no bearing on the actual data in the chunk files.
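For readers who want to see the idea without installing a validator, here is a minimal sketch in plain Python. It mimics only the `required` keyword; `GROUP_SCHEMA` is a hypothetical, heavily reduced stand-in for json-schema/group.json, and `missing_required` is an illustrative helper, not part of any tool.

```python
import json

# Hypothetical, heavily reduced stand-in for json-schema/group.json.
GROUP_SCHEMA = {
    "type": "object",
    "required": ["zarr_format", "node_type"],
}


def missing_required(schema: dict, instance: dict) -> list[str]:
    """Return the names listed under `required` that the instance lacks."""
    return [key for key in schema.get("required", []) if key not in instance]


# The invalid Group document from the description above.
group = json.loads("""
{
    "node_type": "group",
    "attributes": {"spam": "ham", "eggs": 42}
}
""")

errors = [
    f"'{key}' is a required property"
    for key in missing_required(GROUP_SCHEMA, group)
]
print(errors)  # ["'zarr_format' is a required property"]
```

A real validator covers far more than `required` (types, enums, nested definitions), so this only illustrates what the error above is reporting.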

In addition to the schemas, I've included the metadata for a few examples, and have validated them against the json schema.

This is motivated by zarr-developers/geozarr-spec#72. geozarr can define its own json schema for the additional properties it adds.

TomAugspurger · 2025-06-02T02:25:34Z

json-schema/array.json

@@ -0,0 +1,639 @@
+{
+    "$schema": "https://json-schema.org/draft/2020-12/schema",
+    "$id": "https://zarr-specs.readthedocs.io/v3/json-schema/array.json",


We'll need to find a suitable URI for the schema itself. Ideally we can publish the schema on tags to this repository through CI/CD.

STAC uses https://schemas.stacspec.org//item-spec/json-schema/.json, for example https://schemas.stacspec.org/v1.1.0/item-spec/json-schema/item.json.

It would be good if that URI were resolvable. If you do not plan to keep your own domain 'forever', you can also use the prefix https://schemas.opengis.net/. As you'll see there, not all JSON schemas have an $id (XML namespaces were more disciplined here), and those that do don't always use the host, e.g. https://schemas.opengis.net/os-geojson/1.0/example-1-eo-collections.json

TomAugspurger · 2025-06-02T02:32:56Z

json-schema/array.json

+                        "name": {
+                            "type": "string",
+                            "not": {
+                                "anyOf": [


This is a common pattern for all the extension objects (data_type, codec, etc.). We're using the oneOf keyword to ensure that the data type, say, matches exactly one data type definition.

To ensure that a data type like "bool" doesn't match against both the core bool data type and an extension data type, we need to prohibit extension types from shadowing a core data type.
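A rough sketch of why the exclusion is needed, written as plain Python rather than JSON Schema. The functions are illustrative stand-ins for the schema branches, and `CORE_DATA_TYPES` is only a small subset chosen for the example.

```python
# Illustrative subset of core data type names, not the full core list.
CORE_DATA_TYPES = {"bool", "int8", "float32"}


def matches_core(name: str) -> bool:
    """Stand-in for the core data type definitions."""
    return name in CORE_DATA_TYPES


def matches_extension(name: str, excluded: set[str]) -> bool:
    """Stand-in for the extension definition: any name not in `excluded`."""
    return name not in excluded


def one_of(name: str, excluded: set[str]) -> bool:
    """`oneOf` semantics: valid only if exactly one branch matches."""
    hits = [matches_core(name), matches_extension(name, excluded)]
    return sum(hits) == 1


# Without the exclusion, "bool" matches both branches, so oneOf rejects it.
print(one_of("bool", excluded=set()))                      # False
# With the exclusion, "bool" matches only the core branch.
print(one_of("bool", excluded=CORE_DATA_TYPES))            # True
# An unrelated name still validates through the extension branch.
print(one_of("my_custom_type", excluded=CORE_DATA_TYPES))  # True
```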

TomAugspurger · 2025-06-02T02:33:23Z

json-schema/group.json

@@ -0,0 +1,58 @@
+{
+    "$schema": "https://json-schema.org/draft/2020-12/schema",
+    "$id": "https://zarr-specs.readthedocs.io/v3/json-schema/group.json",


Also needs a permanent URI.

d-v-b · 2025-06-02T09:37:29Z

this is awesome work tom!

jbms · 2025-06-02T13:28:42Z

This is great work.

It does seem rather unfortunate to have to list all of the ids defined in the core spec redundantly in order to exclude them as valid extension names.

One idea would be to just pull in all of the schemas from zarr-extensions automatically (e.g. via a program that generates the schema), and disallow in the schema unknown IDs.

We could update zarr-extensions to include separate schemas for the core ids also. That way almost everything could be pulled in just from zarr-extensions.

TomAugspurger · 2025-06-03T23:45:14Z

It does seem rather unfortunate to have to list all of the ids defined in the core spec redundantly in order to exclude them as valid extension names.

My natural preference is for simple / dumb solutions. In this case I'm probably fine with repeating the names since CI should immediately fail if we add some new core object but forget to update the list of fields. I think it's impossible for these to get out of sync.

We could update zarr-extensions to include separate schemas for the core ids also.

I wasn't aware of the zarr-extensions repo until after I submitted this PR. My initial preference would be to keep the JSON schema in the same repository otherwise it's (even more) likely to fall out of date as the spec evolves.

But it'd be good to figure out some way to share what's already been done there (CI / tooling maybe?) with what's proposed here. I'll take a closer look when I get a chance.

pzaborowski · 2025-06-04T20:37:02Z

json-schema/group.json

+    "$schema": "https://json-schema.org/draft/2020-12/schema",
+    "$id": "https://zarr-specs.readthedocs.io/v3/json-schema/group.json",
+    "title": "Zarr v3 Group Metadata Schema",
+    "description": "JSON Schema for Zarr v3 Group metadata documents.",


Have you considered factoring out the common part? https://json-schema.org/blog/posts/modelling-inheritance
That would be useful for further schemas, such as a CF profile.

It looks like the array.json schema could use a top-level $ref to this file to avoid replicating definitions. A resolvable URI that can be changed later would indeed help here.

jbms · 2025-06-04T21:20:20Z

It does seem rather unfortunate to have to list all of the ids defined in the core spec redundantly in order to exclude them as valid extension names.

My natural preference is for simple / dumb solutions. In this case I'm probably fine with repeating the names since CI should immediately fail if we add some new core object but forget to update the list of fields. I think it's impossible for these to get out of sync.

If something is missing from the list of exclusions then it will just also validate as an extension, meaning the configuration doesn't get checked.

Additionally, if you make a typo in an identifier it will also just be considered an extension and validate successfully.
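To make the typo concern concrete, here is a small sketch contrasting the current "anything unknown is an extension" behaviour with a closed list of known IDs. Everything in it is hypothetical: `KNOWN`, `open_world`, and `closed_world` are illustrative names, and the configuration check is a toy.

```python
def check_default_config(config: dict) -> bool:
    """Toy check for the `default` chunk key encoding configuration."""
    return config.get("separator", "/") in {"/", "."}


# Hypothetical registry mapping known names to their configuration checks.
KNOWN = {"default": check_default_config}


def open_world(encoding: dict) -> bool:
    """Known names are checked; any other name falls through to the catch-all."""
    checker = KNOWN.get(encoding["name"])
    if checker is not None:
        return checker(encoding.get("configuration", {}))
    return True  # treated as an extension: configuration is not inspected


def closed_world(encoding: dict) -> bool:
    """Only known names are accepted, and their configuration is always checked."""
    checker = KNOWN.get(encoding["name"])
    return checker is not None and checker(encoding.get("configuration", {}))


typo = {"name": "defualt", "configuration": {"separator": "!"}}
print(open_world(typo))    # True: the typo passes as an "extension"
print(closed_world(typo))  # False: unknown ID is rejected
```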

We could update zarr-extensions to include separate schemas for the core ids also.

I wasn't aware of the zarr-extensions repo until after I submitted this PR. My initial preference would be to keep the JSON schema in the same repository otherwise it's (even more) likely to fall out of date as the spec evolves.

Putting the separate schemas in this repo instead would also be fine, or zarr-extensions could even be merged into this repo.

But it'd be good to figure out some way to share what's already been done there (CI / tooling maybe?) with what's proposed here. I'll take a closer look when I get a chance.

That repo basically has the complement of what you have here represented as a schema.

TomAugspurger · 2025-06-04T22:04:53Z

If something is missing from the list of exclusions then it will just also validate as an extension, meaning the configuration doesn't get checked.

Mmm here's what I had in mind: With a diff like this that "forgets" to add default to the list of exclusions:

❯ git diff
diff --git a/json-schema/array.json b/json-schema/array.json
index c9d2085..f69dd1c 100644
--- a/json-schema/array.json
+++ b/json-schema/array.json
@@ -559,7 +559,6 @@
                             "type": "string",
                             "not": {
                                 "enum": [
-                                    "default",
                                     "v2"
                                 ]
                             }

We get an error, because the instance now matches both the default and extension chunk key encodings:

❯ check-jsonschema --schemafile json-schema/array.json examples/air_temperature.zarr/air/zarr.json --verbose
Schema validation errors were encountered.
  examples/air_temperature.zarr/air/zarr.json::$.chunk_key_encoding: {'name': 'default', 'configuration': {'separator': '/'}} is valid under each of {'$ref': '#/$defs/extension_chunk_key_encoding'}, {'$ref': '#/$defs/default_chunk_key_encoding'}

But if I forget to add v2 instead, then this passes. Maybe that's what you were saying? I guess if we have 100% coverage of the core spec then we'd be OK...

What do you think about a tool that checks that we didn't forget any keys, rather than generating the json-schema files? That sounds pretty straightforward to write and run in CI.
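Something along these lines is what I have in mind, as a minimal sketch. It assumes a schema layout that may not match array.json exactly: core definitions live under `$defs` with keys ending in the kind (e.g. `default_chunk_key_encoding`) and pin their name via `const`, while the `extension_<kind>` definition lists exclusions under `not.enum`. The helper names are hypothetical.

```python
def core_names(defs: dict, kind: str) -> set[str]:
    """Collect `name` constants from non-extension definitions ending in `_<kind>`."""
    names = set()
    for key, sub in defs.items():
        if key.endswith(f"_{kind}") and not key.startswith("extension_"):
            const = sub.get("properties", {}).get("name", {}).get("const")
            if const is not None:
                names.add(const)
    return names


def excluded_names(defs: dict, kind: str) -> set[str]:
    """Read the exclusion list from the `extension_<kind>` definition."""
    ext = defs.get(f"extension_{kind}", {})
    name_schema = ext.get("properties", {}).get("name", {})
    return set(name_schema.get("not", {}).get("enum", []))


def missing_exclusions(schema: dict, kind: str) -> set[str]:
    """Core names that the extension definition forgot to exclude."""
    defs = schema.get("$defs", {})
    return core_names(defs, kind) - excluded_names(defs, kind)


# Toy schema (assumed layout) where "v2" was forgotten in the exclusion list.
SCHEMA = {
    "$defs": {
        "default_chunk_key_encoding": {"properties": {"name": {"const": "default"}}},
        "v2_chunk_key_encoding": {"properties": {"name": {"const": "v2"}}},
        "extension_chunk_key_encoding": {
            "properties": {"name": {"type": "string", "not": {"enum": ["default"]}}}
        },
    }
}

print(missing_exclusions(SCHEMA, "chunk_key_encoding"))  # {'v2'}
```

In CI this would load the real schema with `json.load` and exit non-zero whenever the returned set is non-empty.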

Additionally, if you make a typo in an identifier it will also just be considered an extension and validate successfully.

Yeah, that seems like a problem... But the core schema can't know anything about the extension schemas, I think. I'm not an expert on json-schema, but perhaps this is why STAC includes a stac_extensions array in its core spec, and STAC-specific tools know to load all of the json-schemas at those URIs and validate the document against each (xref #316).

joshmoore · 2025-06-18T09:33:26Z

Thanks, @TomAugspurger! Happy to help get the permanent, resolvable URI. (My instinct is to put all of this under a v3/ directory.)

TomAugspurger added 2 commits June 1, 2025 20:42

Added json schema

fc211d3

Added examples, linting

6217477
