Add zstd codec #256
Conversation
@normanrz Please take a look.
jstriebel left a comment
Should this still be part of ZEP 1? I'd argue for a mini-ZEP for this. IMO it's a good candidate to show that ZEPs can be quite small and concise, with a short duration for acceptance.
Apart from that LGTM!
Agreed that this should be separate from ZEP 1. I should update this document to not mention ZEP 1. But perhaps there should be a lighter-weight process --- it would be nice to just get a vote on this PR directly, rather than writing up a separate ZEP document, etc. There are going to be a lot of codecs added, and it would be nice to make that process relatively easy while still getting appropriate feedback.
That would indeed be nice, but maybe we could have lean ZEPs, with less boilerplate and just pointing to an MR? @MSanKeys963, what's your take on this?
Is there any chance we could change the default compression level to match
This proposal doesn't specify a default compression level. When creating an array, implementations are free to provide a mechanism for the user to leave the level unspecified, and in that case the implementation must choose a default value. The only note on defaults is that level 0 means the default level, which follows the zstd C API. If the array is intended to be used for writing, then I don't see much value in allowing the configuration options to be unspecified; that just leaves room for implementation variance. However, if someone is creating zarr metadata to match some existing chunked data that is known to be zstd-compressed (but for which the level may not even be known), and which is intended to be read-only, then the level is indeed not needed, and it is a bit unfortunate that a fake value would have to be specified. Still, I don't think we really need to optimize for this case.
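To make the "level 0 means default" convention concrete, here is a minimal sketch of what the stored codec metadata could look like under this proposal. Field names are taken from this discussion, not from a finalized spec, so treat them as assumptions:

```python
# Hypothetical zarr v3 codec metadata entry for the zstd codec.
# "level" and "checksum" are the option names discussed in this PR;
# the final spec may name or structure them differently.
zstd_codec = {
    "name": "zstd",
    "configuration": {
        "level": 0,        # 0 selects libzstd's own default level
        "checksum": False, # whether a content checksum is appended per frame
    },
}
```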
Just to be clear, https://numcodecs.readthedocs.io/en/stable/zstd.html#numcodecs.zstd.Zstd
Furthermore, Python's numcodecs seems to disallow negative compression levels and to interpret any nonpositive compression level as
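The nonpositive-level behaviour described here could be sketched as follows. This is a hypothetical helper illustrating the fallback logic, not the actual numcodecs code:

```python
def normalize_level(level: int, default: int) -> int:
    """Map nonpositive levels to an implementation-chosen default.

    Sketch of the behaviour described above (assumed): instead of passing
    nonpositive values through as zstd's negative "fast" levels, they fall
    back to the default.
    """
    return default if level <= 0 else level
```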
Default values are only really needed to support evolving the spec. Default values in the initial version of the spec basically just serve to save space in the stored metadata, which I think is pretty negligible in this case. The zarr-python v3 implementation is still free to choose whichever parameters it likes if the user passes in

Supporting negative levels should be a trivial fix.
I think the fixes are trivial. The question is whether it would be considered breaking to change the default compression level. It's not even clear that we need to include the compression level or the checksum flag in the codec configuration, since that information is stored by the codec itself. The compression level is perhaps only really useful if someone is trying to add new compressed chunks to the dataset, or if they are trying to update chunks. In that case we only need the information to be consistent with the rest of the dataset.
I don't think the spec for the zstd codec should have default values. All options should be mandatory to set. As @jbms outlined, implementations can choose default values. I'll change the default level to 3 for zarr-python.
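The "all options mandatory in stored metadata, defaults applied only at creation time" position could look like this in an implementation. The function name and error message are illustrative assumptions:

```python
def validate_zstd_config(configuration: dict) -> None:
    # Sketch: all options are mandatory in the stored metadata, as argued
    # above.  Implementations may still apply their own defaults at
    # array-creation time, before the metadata is written.
    missing = {"level", "checksum"} - configuration.keys()
    if missing:
        raise ValueError(f"zstd codec configuration is missing: {sorted(missing)}")
```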
We could permit the parameters to be left unspecified in the metadata if only read operations are performed, and fail if any write operations are performed. However, it is not clear that the added complexity of that is worth it --- it really just saves having to put in some arbitrary values if you don't know the level and only care about reading.
Is the
I know of zstd libraries for Python, C/C++, Java, JavaScript, and Rust that support the checksum flag. Which languages are you referring to?
One example is OCaml: both available libraries lack this flag. Looking at the list of implementations on the Zstandard official site, other languages that don't support it include PHP, Fortran, Swift, Ruby, R, Perl, Common Lisp, Ada, Haskell, Julia, Racket, Nim, and Elixir.
In Julia, the checksum flag is available through the low-level interface. @nhz2, perhaps we should discuss how to expose the entire parameter surface.
Related:
========================

level:
    An integer from -131072 to 22 which controls the speed and level
I don't think these levels should be hard coded as they may change in future libzstd versions. libzstd will clamp out-of-range compression level values to the range it supports, so any int value of level should be accepted to improve forward compatibility.
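The clamping behaviour nhz2 describes can be sketched in a few lines. The bounds below are the current libzstd limits quoted in the spec text above and are hard-coded here only for illustration; the point of the suggestion is that implementations accept any int and let clamping absorb future range changes:

```python
def clamp_level(level: int) -> int:
    # Mirror libzstd's clamping of out-of-range compression levels.
    # Bounds are the current libzstd range; they may change in future
    # versions, which is exactly why accepting any int helps forward
    # compatibility.
    ZSTD_MIN_LEVEL, ZSTD_MAX_LEVEL = -131072, 22
    return max(ZSTD_MIN_LEVEL, min(level, ZSTD_MAX_LEVEL))
```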
    achieve a higher compression ratio at the cost of lower speed.

checksum:
    A boolean that indicates whether to store a checksum when writing that will
Trying to implement this from scratch, it wasn't immediately obvious to me what that checksum was, whether it was some CRC manually appended or something really belonging to libzstd.
Seeing https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/zstd.pyx#L117, it is the latter. It could make sense to be explicit about that.
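For anyone hitting the same question: the checksum here is Zstandard's own optional 4-byte content checksum (derived from XXH64, per RFC 8878), signalled by a bit in the frame header descriptor. A small sketch of detecting it on a raw frame, assuming a standard (non-skippable) frame:

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # magic number 0xFD2FB528, little-endian

def has_content_checksum(frame: bytes) -> bool:
    # Byte 4 is the Frame_Header_Descriptor (RFC 8878); bit 2 is the
    # Content_Checksum_flag.  When set, a 4-byte checksum of the
    # decompressed content follows the last data block.
    if frame[:4] != ZSTD_MAGIC:
        raise ValueError("not a zstd frame")
    return bool(frame[4] & 0x04)
```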
Beyond any updates related to comments above, I'd additionally suggest the following next steps for this PR based on ZEP0009:
I think we need to add some clarity here regarding the pledged size. My recommendation is that all implementations MUST include the pledged size in the Zstandard header. However, they SHOULD be able to decompress data which does not include the pledged size in the Zstandard header.
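The "pledged size" is the Frame_Content_Size field of the frame header. A decoder can tell whether it was included by inspecting the frame header descriptor; a sketch of that check (descriptor layout per RFC 8878, helper name assumed):

```python
def pledged_size_present(frame: bytes) -> bool:
    # Per RFC 8878, Frame_Content_Size is present when the
    # Frame_Content_Size_flag (descriptor bits 7-6) is nonzero, or when
    # the Single_Segment_flag (bit 5) is set (a 1-byte size field).
    descriptor = frame[4]  # byte 4, after the 4-byte magic number
    return (descriptor >> 6) != 0 or bool(descriptor & 0x20)
```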
Should a Zstandard dictionary also be considered here?
Zstandard dictionaries add enough complexity that they should be a new codec. Adding dictionary support is also not backwards compatible when decoding, as a decoder that doesn't know how to read the dictionary will fail to decode the data. RFC8878 already does a good job of describing how a compliant compressor and a compliant decompressor should act, so I don't think we need to add extra requirements about pledged size or other parameters here. It is fairly easy to decode any valid zstd-compressed data; if an implementation wants to limit itself to decoding only a subset of zstd, that is fine as long as there are clear error messages.
RFC8878 leaves a lot of room for compliance:
We should be slightly stricter here, as I have been involved in addressing the inability to decode unknown frame content sizes across four implementations so far. Zarr implementations should define which subset of options and parameters must be supported for communication between them. It does not work well if one implementation writes Zstandard data in a way that another implementation cannot decode.
We can have a collection of example Zstandard-encoded data covering encoding edge cases for implementations to test against, for example skippable frames, multiple frames, missing frame content size, and large window sizes. We can then see what the current status is for the various zarr implementations.
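To illustrate one of those edge cases: skippable frames carry no content and a conforming decoder must skip over them. A sketch of handling them at the start of a stream (helper name and scope are illustrative; a full decoder would handle skippable frames anywhere between frames):

```python
SKIPPABLE_MAGIC_MIN, SKIPPABLE_MAGIC_MAX = 0x184D2A50, 0x184D2A5F

def strip_leading_skippable_frames(data: bytes) -> bytes:
    # Skippable frames (RFC 8878) use a magic number in
    # 0x184D2A50..0x184D2A5F, followed by a 4-byte little-endian frame
    # size and that many bytes of user data.  This sketch only strips
    # frames at the start of the stream.
    i = 0
    while i + 8 <= len(data):
        magic = int.from_bytes(data[i:i + 4], "little")
        if not (SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX):
            break
        size = int.from_bytes(data[i + 4:i + 8], "little")
        i += 8 + size
    return data[i:]
```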
Here are the issues that I encountered.