
Check sharding bounds #6


Merged: 6 commits merged into zarr-developers:main on Aug 15, 2024

Conversation

@brokkoli71 (Member)

fixes #5

This PR implements checks that the dataset shape, shard shape (including nested shard shapes), and chunk shape are compatible with each other.
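
For illustration, a minimal sketch of the kind of containment check this adds, using hypothetical names rather than the actual zarr-java code (as the discussion below shows, the shard-versus-array comparison later turns out to be too strict):

final class ShardingBoundsSketch {

    // Checks, per dimension, that the shard fits inside the array and the
    // inner chunk fits inside the shard; all shapes must have the same rank.
    static void checkShardingBounds(long[] arrayShape, int[] shardShape, int[] chunkShape) {
        if (arrayShape.length != shardShape.length || shardShape.length != chunkShape.length) {
            throw new IllegalArgumentException(
                "Array, shard and chunk shapes must have the same number of dimensions");
        }
        for (int i = 0; i < arrayShape.length; i++) {
            if (shardShape[i] > arrayShape[i]) {
                throw new IllegalArgumentException("Shard shape exceeds array shape in dimension " + i);
            }
            if (chunkShape[i] > shardShape[i]) {
                throw new IllegalArgumentException("Chunk shape exceeds shard shape in dimension " + i);
            }
        }
    }
}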

@normanrz (Member) left a comment


I wonder if this should also be checked in ArrayMetadata? Imagine someone uses the ArrayMetadata constructor directly or creates a faulty zarr.json that zarr-java reads.

@brokkoli71 (Member, Author) commented Aug 14, 2024

> I wonder if this should also be checked in ArrayMetadata? Imagine someone uses the ArrayMetadata constructor directly or creates a faulty zarr.json that zarr-java reads.

Good idea @normanrz, I moved the checks into the ArrayMetadata constructor. That way they are called in all three cases.
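
A rough illustration of the design choice, with hypothetical names (not the actual ArrayMetadata class): since the builder, direct construction and zarr.json parsing all end up in the same constructor, placing the validation there covers all three paths.

final class ArrayMetadataSketch {
    final long[] shape;
    final int[] chunkShape;

    ArrayMetadataSketch(long[] shape, int[] chunkShape) {
        // Validation lives in the constructor, so every construction path runs it.
        if (shape.length != chunkShape.length) {
            throw new IllegalArgumentException("shape and chunkShape must have the same number of dimensions");
        }
        this.shape = shape;
        this.chunkShape = chunkShape;
    }

    // A builder or a zarr.json parser would both funnel into the constructor above.
    static ArrayMetadataSketch fromParsedJson(long[] shape, int[] chunkShape) {
        return new ArrayMetadataSketch(shape, chunkShape);
    }
}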

@normanrz (Member) left a comment


@sbesson (Contributor) left a comment


Thanks @brokkoli71, as we are exploring how users would specify shard sizes via options, it is really useful that the ArrayMetadataBuilder API includes some validation of the various chunk sizes.

In addition to the test case introduced in this PR, can I also offer sbesson@bf3f08f, which tests the validation in the non-sharding scenario. The first two parametrized tests already pass with the current HEAD of zarr-java (a mismatching number of dimensions is already rejected), and the two additional tests are covered by the extra checks introduced in this PR.
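
For reference, a minimal self-contained sketch of the spirit of those tests, assuming JUnit 5; plain @Test methods stand in for the parametrized variants, and the validate helper is a hypothetical stand-in for the validation in ArrayMetadata / ArrayMetadataBuilder (its containment rule is the one that is later relaxed, see below):

import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

class InvalidChunkBoundsTest {

    @Test
    void mismatchingDimensionCountIsRejected() {
        // 2-dimensional array but a 1-dimensional chunk shape
        assertThrows(IllegalArgumentException.class,
            () -> validate(new long[] {10, 10}, new int[] {5}));
    }

    @Test
    void chunkLargerThanArrayIsRejected() {
        // chunk exceeds the array in the second dimension
        assertThrows(IllegalArgumentException.class,
            () -> validate(new long[] {10, 10}, new int[] {5, 20}));
    }

    // Hypothetical stand-in for the validation performed by ArrayMetadata.
    private static void validate(long[] shape, int[] chunkShape) {
        if (shape.length != chunkShape.length) {
            throw new IllegalArgumentException("Mismatching number of dimensions");
        }
        for (int i = 0; i < shape.length; i++) {
            if (chunkShape[i] > shape[i]) {
                throw new IllegalArgumentException("Chunk exceeds array in dimension " + i);
            }
        }
    }
}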

Commit: Add unit test for invalid chunk bounds
@brokkoli71 (Member, Author)

Thanks for the test @sbesson, I merged it into this branch.

Seems like a good addition @normanrz, I will have a look at it.

@normanrz merged commit b53e163 into zarr-developers:main on Aug 15, 2024
3 checks passed
@sbesson mentioned this pull request Aug 22, 2024
@sbesson (Contributor) commented Aug 26, 2024

After picking up the latest release for a round of functional testing, I realised that the containment check introduced in this PR enforces, for every dimension i:

array_shape[i] >= outer_chunk[i] (if applicable) >= inner_chunk[i]

This constraint is problematic when writing multiscale groups as defined in the OME-NGFF specification, where the shape of each array is reduced along 2 or more dimensions while the chunk/shard shape may be kept constant.

Taking the concrete example of one of the Zarr datasets converted as part of the OME-NGFF 2024 challenge using zarr-python, the 11th resolution level of the multiscale group has the following array metadata:

{
  "chunk_grid": {
    "configuration": {
      "chunk_shape": [
        1,
        1,
        1,
        4096,
        4096
      ]
    },
    "name": "regular"
  },
  "chunk_key_encoding": {
    "name": "default"
  },
  "codecs": [
    {
      "configuration": {
        "chunk_shape": [
          1,
          1,
          1,
          1024,
          1024
        ],
        "codecs": [
          {
            "configuration": {
              "endian": "little"
            },
            "name": "bytes"
          },
          {
            "configuration": {
              "blocksize": 0,
              "clevel": 5,
              "cname": "zstd",
              "shuffle": "shuffle",
              "typesize": 2
            },
            "name": "blosc"
          }
        ],
        "index_codecs": [
          {
            "configuration": {
              "endian": "little"
            },
            "name": "bytes"
          },
          {
            "name": "crc32c"
          }
        ]
      },
      "name": "sharding_indexed"
    }
  ],
  "data_type": "uint16",
  "dimension_names": [
    "t",
    "c",
    "z",
    "y",
    "x"
  ],
  "fill_value": 0,
  "node_type": "array",
  "shape": [
    1,
    1,
    1,
    91,
    141
  ],
  "zarr_format": 3
}

With this change, such metadata cannot be written using zarr-java, since the array shape (91, 141) in the y/x dimensions is smaller than the outer chunk shape (4096, 4096). This feels at odds with the Zarr v3 specification, where I see two main constraints around shape sizes:

/cc @melissalinkert @joshmoore

@normanrz (Member)

I think that constraint is too strict. The array size is independent of the chunk size (except that they need to have the same number of dimensions). For sharding, inner chunk sizes need to evenly divide the outer chunk sizes, which implies inner_chunk[i] <= outer_chunk[i].
@brokkoli71 Could you have a look at this please?
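
A hedged sketch of the relaxed rule described above, with hypothetical names (the actual change lands in #9): shapes only need to agree in dimensionality, the inner chunk shape must evenly divide the shard shape, and the array shape is not compared against the shard shape at all.

final class RelaxedShardingBoundsSketch {

    static void check(long[] arrayShape, int[] shardShape, int[] innerChunkShape) {
        if (arrayShape.length != shardShape.length || shardShape.length != innerChunkShape.length) {
            throw new IllegalArgumentException("All shapes must have the same number of dimensions");
        }
        for (int i = 0; i < shardShape.length; i++) {
            // Inner chunks must evenly divide the shard (outer chunk).
            if (shardShape[i] % innerChunkShape[i] != 0) {
                throw new IllegalArgumentException(
                    "Inner chunk shape does not evenly divide shard shape in dimension " + i);
            }
            // Deliberately no arrayShape[i] >= shardShape[i] check: shards may be
            // larger than the array, as in the OME-NGFF example above.
        }
    }
}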

@brokkoli71 (Member, Author)

Right, @sbesson. I have changed the bounds check in #9.


Successfully merging this pull request may close these issues.

Attempting to write with sharding throws ArrayIndexOutOfBoundsException