
Conversation

@crusaderky (Contributor) commented Oct 16, 2025

Blosc and Blosc2 crash when faced with variable-width strings, both the legacy object strings and the new NpyStrings (a.k.a. StringDType).

This is caused by an upstream bug; PyTables is also affected.

This PR:

  • adds unit tests for string dtypes
  • fixes the upstream bugs in the vendored hdf5-blosc and hdf5-blosc2.

Clearly, it should not be merged as is; the fixes should be applied in the upstream repos and a release should be issued.

I can see two options before that happens:

  1. wait and leave this PR as a draft; then incorporate the new releases of hdf5-blosc and hdf5-blosc2
  2. revert the upstream fix and merge this PR as is, with a @pytest.skip for the blosc and blosc2 tests. In a follow-up, update the upstream libraries.

Reproducer

| compression | i8 | S3 | object | T |
|---|---|---|---|---|
| "gzip" | ✔️ | ✔️ | ✔️ | ✔️ |
| "lzf" | ✔️ | ✔️ | ✔️ | ✔️ |
| hdf5plugin.BZip2() | ✔️ | ✔️ | ✔️ | ✔️ |
| hdf5plugin.LZ4() | ✔️ | ✔️ | ✔️ | ✔️ |
| hdf5plugin.Blosc() | ✔️ | ✔️ | segfault | segfault |
| hdf5plugin.Blosc2() | ✔️ | ✔️ | segfault | segfault |

Full reproducer:

import os

import h5py
import hdf5plugin
import numpy as np

fname = "/tmp/ds.h5"

for compression in (
    None,
    "gzip",
    "lzf",
    hdf5plugin.BZip2(),
    hdf5plugin.LZ4(),
    hdf5plugin.Blosc(),
    hdf5plugin.Blosc2(),
):
    for data in (
        np.asarray([1]),  # integers (i8)
        np.asarray(["foo"], dtype="S"),  # fixed-width bytes (S3)
        np.asarray([b"foo"], dtype="O"),  # legacy variable-width object strings
        np.asarray(["foo"], dtype="T"),  # NpyStrings a.k.a. StringDType
    ):
        print("desired compression =", compression)
        print("dtype =", data.dtype)

        # Optional: produce meaningful differences in file size
        data = np.tile(data, 1_000_000)

        with h5py.File(fname, "w") as f:
            f.create_dataset("mydataset", data=data, compression=compression)

        print("file size =", os.path.getsize(fname))
        with h5py.File(fname, "r+") as f:
            ds = f["mydataset"]
            print("actual compression =", ds.compression)
            print("compression_opts =", ds.compression_opts)

            # StringDType must be requested explicitly on read
            actual = (ds.astype("T") if data.dtype.kind == "T" else ds)[:]
        np.testing.assert_array_equal(actual, data)

        print("=" * 80, flush=True)

@crusaderky force-pushed the strings branch 2 times, most recently from 79bb47e to dc6eda7 on October 16, 2025 19:56
@t20100 (Member) commented Oct 17, 2025

Thanks for the issue report and the proposed test and fix.

We aim not to patch vendored filters in this project, so this issue & patch would need to be proposed first to the upstream projects: HDF5-Blosc2 & hdf5-blosc.

BTW, the set_local function returns an error when typesize == 0:

typesize = (unsigned int) H5Tget_size(type);
if (typesize == 0) return -1;

So I'm not sure this kind of data is supported by the blosc/blosc2 filters.
However, whether a dataset is supported looks like something that should be handled by the filter's can_apply function rather than by an error from set_local.
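
For illustration, a minimal sketch of that idea, assuming the standard HDF5 filter callback signature; the function name is hypothetical and this is not the upstream code:

#include <hdf5.h>

/* Hypothetical can_apply callback: decline variable-length string
 * datatypes up front instead of crashing later in the pipeline. */
static htri_t blosc_can_apply(hid_t dcpl_id, hid_t type_id, hid_t space_id)
{
    (void)dcpl_id;
    (void)space_id;

    htri_t is_vlen_str = H5Tis_variable_str(type_id);
    if (is_vlen_str < 0)
        return -1;  /* error while inspecting the datatype */
    if (is_vlen_str > 0)
        return 0;   /* not applicable: the library rejects the filter
                       instead of segfaulting */
    return 1;       /* fixed-size datatypes are fine */
}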

attn @FrancescAlted

@crusaderky (Contributor, Author)

> We aim not to patch vendored filters in this project, so this issue & patch would need to be proposed first to the upstream projects: HDF5-Blosc2 & hdf5-blosc.

Yes, fully agree. What would you like to do with this PR until then?

  1. wait and leave it as a draft; then incorporate the new releases of hdf5-blosc and hdf5-blosc2 when they become available
  2. revert the upstream fix and merge this PR as is, to add the tests for the string types for all other compression algorithms, with @pytest.skip for the blosc and blosc2 tests. In a follow-up, update the upstream libraries and remove the skips.

> BTW, the set_local function returns an error when typesize == 0:

To me this feels like a quirk of H5Tget_size that the blosc/blosc2 plugins should route around (see the sketch below). For predominantly Latin-based text, blosc should optimize for 1 byte per character. Admittedly this will produce suboptimal compression for non-Latin scripts (Cyrillic, Arabic, Chinese, etc.), where most UTF-8 characters take 2 or 3 bytes, but no such information is available in HDF5. This feels like a compression option to be added to the filters as a follow-up improvement.
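
As a sketch of that "route around" idea (assuming set_local can special-case variable-length strings; the heuristic is hypothetical, not the actual upstream fix):

/* Fall back to a typesize of 1 byte for variable-length strings,
 * on the assumption that mostly-Latin text averages roughly one
 * byte per character. */
unsigned int typesize;
if (H5Tis_variable_str(type) > 0) {
    typesize = 1;  /* assumed heuristic; could become a filter option */
} else {
    typesize = (unsigned int) H5Tget_size(type);
    if (typesize == 0)
        return -1;  /* genuine error from H5Tget_size() */
}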

@t20100 (Member) left a comment

Let's go for option 2: revert the upstream fix and merge this PR as is, to add the tests for the string types for all other compression algorithms, with @pytest.skip for the blosc and blosc2 tests.

@crusaderky marked this pull request as ready for review October 17, 2025 11:33
@crusaderky (Contributor, Author)

@t20100 this is ready for final review and merge.
I've opened a tracking issue for the upstream fix: #364.
Once this PR is merged I'll also open a draft PR for those who wish to test the prototype fix early.

@t20100 added this to the Next release milestone Oct 17, 2025
@t20100 (Member) left a comment

LGTM, thanks!

@t20100 merged commit 077d3f7 into silx-kit:main on Oct 17, 2025
4 checks passed
@crusaderky deleted the strings branch October 24, 2025 13:14