Add zstd codec #256
Conversation
@normanrz Please take a look.
jstriebel left a comment
Should this still be part of ZEP 1? I'd argue for a mini-ZEP for this. IMO it's a good candidate to show that ZEPs can be quite small and concise, with a short duration for acceptance.
Apart from that LGTM!
Agreed that this should be separate from ZEP 1. I should update this document to not mention ZEP 1. But perhaps there should be a lighter-weight process --- it would be nice to just get a vote on this PR directly, rather than writing up a separate ZEP document, etc. There are going to be a lot of codecs added, and it would be nice to make that process relatively easy while still getting appropriate feedback.
That would indeed be nice, but maybe we could have lean ZEPs, with less boilerplate and just pointing to an MR? @MSanKeys963, what's your take on this?
Is there any chance we could change the default compression level to match
This proposal doesn't specify a default compression level. When creating an array, implementations are free to provide a mechanism for the user to leave the level unspecified, and in that case the implementation must choose a default value. The only note on defaults is that level 0 means the default level, which follows the zstd C API. If the array is intended to be used for writing, then I don't see much value in allowing the configuration options to be unspecified; that just leaves room for implementation variance. However, if someone is creating zarr metadata to match some existing chunked data that is known to be zstd-compressed (but for which the level may not even be known), and which is intended to be read-only, then the level is indeed not needed, and it is a bit unfortunate that a fake value would have to be specified. Still, I don't think we really need to optimize for this case.
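To make the "level 0 means default" convention concrete, here is a minimal sketch of what the stored codec metadata could look like under this proposal. Field names are taken from this discussion, not from a finalized spec, so treat them as assumptions:

```python
# Hypothetical zarr v3 codec metadata entry for the zstd codec.
# "level" and "checksum" are the option names discussed in this PR;
# the final spec may name or structure them differently.
zstd_codec = {
    "name": "zstd",
    "configuration": {
        "level": 0,        # 0 selects libzstd's own default level
        "checksum": False, # whether a content checksum is appended per frame
    },
}
```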
Just to be clear, https://numcodecs.readthedocs.io/en/stable/zstd.html#numcodecs.zstd.Zstd
Furthermore, Python's numcodecs seems to disallow negative compression levels and to interpret any nonpositive compression level as
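The nonpositive-level behaviour described here could be sketched as follows. This is a hypothetical helper illustrating the fallback logic, not the actual numcodecs code:

```python
def normalize_level(level: int, default: int) -> int:
    """Map nonpositive levels to an implementation-chosen default.

    Sketch of the behaviour described above (assumed): instead of passing
    nonpositive values through as zstd's negative "fast" levels, they fall
    back to the default.
    """
    return default if level <= 0 else level
```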
Default values are only really needed to support evolving the spec. Default values in the initial version of the spec basically just serve to save space in the stored metadata, which I think is pretty negligible in this case. The zarr-python v3 implementation is still free to choose whichever parameters it likes if the user passes in

Supporting negative levels should be a trivial fix.
I think the fixes are trivial. The question is whether it would be considered breaking to change the default compression level. It's not even clear that we need to include the compression level or the checksum flag in the codec configuration, since that information is stored by the codec itself. The compression level is perhaps only really useful if someone is trying to add new compressed chunks to the dataset, or if they are trying to update chunks. In that case we only need the information to be consistent with the rest of the dataset.
I don't think the spec for the zstd codec should have default values. All options should be mandatory to set. As @jbms outlined, implementations can choose default values. I'll change the default level to 3 for zarr-python.
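The "all options mandatory in stored metadata, defaults applied only at creation time" position could look like this in an implementation. The function name and error message are illustrative assumptions:

```python
def validate_zstd_config(configuration: dict) -> None:
    # Sketch: all options are mandatory in the stored metadata, as argued
    # above.  Implementations may still apply their own defaults at
    # array-creation time, before the metadata is written.
    missing = {"level", "checksum"} - configuration.keys()
    if missing:
        raise ValueError(f"zstd codec configuration is missing: {sorted(missing)}")
```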
We could permit the parameters to be left unspecified in the metadata if only read operations are performed, and fail if any write operations are performed. However, it is not clear that the added complexity of that is worth it --- it really just saves having to put in some arbitrary values if you don't know the level and only care about reading.
Is the
I know of zstd libraries for Python, C/C++, Java, JavaScript, and Rust that support the checksum flag. Which languages are you referring to?
One example is OCaml: both available libraries lack this flag. Looking at the list of implementations on the Zstandard official site, other languages that don't support it include PHP, Fortran, Swift, Ruby, R, Perl, Common Lisp, Ada, Haskell, Julia, Racket, Nim, and Elixir.
In Julia, the checksum flag is available through the low-level interface. @nhz2, perhaps we should discuss how to expose the entire parameter surface.
Related:
========================

level:
    An integer from -131072 to 22 which controls the speed and level
I don't think these levels should be hard coded as they may change in future libzstd versions. libzstd will clamp out-of-range compression level values to the range it supports, so any int value of level should be accepted to improve forward compatibility.
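The clamping behaviour nhz2 describes can be sketched in a few lines. The bounds below are the current libzstd limits quoted in the spec text above and are hard-coded here only for illustration; the point of the suggestion is that implementations accept any int and let clamping absorb future range changes:

```python
def clamp_level(level: int) -> int:
    # Mirror libzstd's clamping of out-of-range compression levels.
    # Bounds are the current libzstd range; they may change in future
    # versions, which is exactly why accepting any int helps forward
    # compatibility.
    ZSTD_MIN_LEVEL, ZSTD_MAX_LEVEL = -131072, 22
    return max(ZSTD_MIN_LEVEL, min(level, ZSTD_MAX_LEVEL))
```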
    achieve a higher compression ratio at the cost of lower speed.

checksum:
    A boolean that indicates whether to store a checksum when writing that will
Trying to implement this from scratch, it wasn't immediately obvious to me what that checksum was, whether it was some CRC manually appended or something really belonging to libzstd.
Seeing https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/zstd.pyx#L117, it is the latter. It could make sense to be explicit about that.
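For anyone hitting the same question: the checksum here is Zstandard's own optional 4-byte content checksum (derived from XXH64, per RFC 8878), signalled by a bit in the frame header descriptor. A small sketch of detecting it on a raw frame, assuming a standard (non-skippable) frame:

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # magic number 0xFD2FB528, little-endian

def has_content_checksum(frame: bytes) -> bool:
    # Byte 4 is the Frame_Header_Descriptor (RFC 8878); bit 2 is the
    # Content_Checksum_flag.  When set, a 4-byte checksum of the
    # decompressed content follows the last data block.
    if frame[:4] != ZSTD_MAGIC:
        raise ValueError("not a zstd frame")
    return bool(frame[4] & 0x04)
```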
Beyond any updates related to comments above, I'd additionally suggest the following next steps for this PR based on ZEP0009:
I think we need to add some clarity here regarding the pledged size. My recommendation is that all implementations MUST include the pledged size in the Zstandard header. However, they SHOULD be able to decompress data which does not include the pledged size in the Zstandard header.
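The "pledged size" is the Frame_Content_Size field of the frame header. A decoder can tell whether it was included by inspecting the frame header descriptor; a sketch of that check (descriptor layout per RFC 8878, helper name assumed):

```python
def pledged_size_present(frame: bytes) -> bool:
    # Per RFC 8878, Frame_Content_Size is present when the
    # Frame_Content_Size_flag (descriptor bits 7-6) is nonzero, or when
    # the Single_Segment_flag (bit 5) is set (a 1-byte size field).
    descriptor = frame[4]  # byte 4, after the 4-byte magic number
    return (descriptor >> 6) != 0 or bool(descriptor & 0x20)
```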
Should a Zstandard dictionary also be considered here?
Zstandard dictionaries add enough complexity that they should be a new codec. Adding dictionary support is also not backwards compatible when decoding, as a decoder that doesn't know how to read the dictionary will fail to decode the data. RFC8878 already does a good job of describing how a compliant compressor and a compliant decompressor should act, so I don't think we need to add extra requirements about pledged size or other parameters here. It is fairly easy to decode any valid zstd-compressed data; if an implementation wants to limit itself to decoding only a subset of zstd, that is fine as long as there are clear error messages.
RFC8878 leaves a lot of room for compliance:
We should be slightly stricter here, as I have been involved in addressing the inability to decode unknown frame content sizes across four implementations so far. Zarr implementations should define which subset of options and parameters must be supported for communication between them. It does not work well if one implementation writes Zstandard data in a way that another implementation cannot decode.
We can have a collection of example Zstandard-encoded data covering encoding edge cases for implementations to test against, for example skippable frames, multiple frames, missing frame content size, and large window sizes. We can then see what the current status is for the various zarr implementations.
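To illustrate one of those edge cases: skippable frames carry no content and a conforming decoder must skip over them. A sketch of handling them at the start of a stream (helper name and scope are illustrative; a full decoder would handle skippable frames anywhere between frames):

```python
SKIPPABLE_MAGIC_MIN, SKIPPABLE_MAGIC_MAX = 0x184D2A50, 0x184D2A5F

def strip_leading_skippable_frames(data: bytes) -> bytes:
    # Skippable frames (RFC 8878) use a magic number in
    # 0x184D2A50..0x184D2A5F, followed by a 4-byte little-endian frame
    # size and that many bytes of user data.  This sketch only strips
    # frames at the start of the stream.
    i = 0
    while i + 8 <= len(data):
        magic = int.from_bytes(data[i:i + 4], "little")
        if not (SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX):
            break
        size = int.from_bytes(data[i + 4:i + 8], "little")
        i += 8 + size
    return data[i:]
```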
Here are the issues that I encountered.