
Conversation

@crusaderky (Contributor) commented Oct 16, 2025

Blosc and Blosc2 crash when faced with variable-width strings, both the legacy object strings and the new NpyStrings (a.k.a. StringDType).

This is caused by an upstream bug; PyTables is also affected.

This PR:

  • adds unit tests for string dtypes
  • fixes the upstream bugs in the vendored hdf5-blosc and hdf5-blosc2.

Clearly, it should not be merged as is; the fixes should be applied in the upstream repos and a release should be issued.

I can see two options before that happens:

  1. wait and leave this PR as a draft; then incorporate the new releases of hdf5-blosc and hdf5-blosc2
  2. revert the upstream fix and merge this PR as is, with a @pytest.skip for the blosc and blosc2 tests. In a follow-up, update the upstream libraries.

Reproducer

| compression | i8 | S3 | object | T |
|---|---|---|---|---|
| "gzip" | ✔️ | ✔️ | ✔️ | ✔️ |
| "lzf" | ✔️ | ✔️ | ✔️ | ✔️ |
| hdf5plugin.BZip2() | ✔️ | ✔️ | ✔️ | ✔️ |
| hdf5plugin.LZ4() | ✔️ | ✔️ | ✔️ | ✔️ |
| hdf5plugin.Blosc() | ✔️ | ✔️ | segfault | segfault |
| hdf5plugin.Blosc2() | ✔️ | ✔️ | segfault | segfault |

Full reproducer:

import os

import h5py
import hdf5plugin
import numpy as np

fname = "/tmp/ds.h5"

for compression in (
    None,
    "gzip",
    "lzf",
    hdf5plugin.BZip2(),
    hdf5plugin.LZ4(),
    hdf5plugin.Blosc(),
    hdf5plugin.Blosc2(),
):
    for data in (
        np.asarray([1]),  # integers (i8)
        np.asarray(["foo"], dtype="S"),  # fixed-width bytes (S3)
        np.asarray([b"foo"], dtype="O"),  # legacy variable-width object strings
        np.asarray(["foo"], dtype="T"),  # NpyStrings a.k.a. StringDType
    ):
        print("desired compression =", compression)
        print("dtype =", data.dtype)

        # Optional: produce meaningful differences in file size
        data = np.tile(data, 1_000_000)

        with h5py.File(fname, "w") as f:
            f.create_dataset("mydataset", data=data, compression=compression)

        print("file size =", os.path.getsize(fname))
        with h5py.File(fname, "r+") as f:
            ds = f["mydataset"]
            print("actual compression =", ds.compression)
            print("compression_opts =", ds.compression_opts)

            # StringDType must be requested explicitly on read
            actual = (ds.astype("T") if data.dtype.kind == "T" else ds)[:]
        np.testing.assert_array_equal(actual, data)

        print("=" * 80, flush=True)

@crusaderky force-pushed the strings branch 2 times, most recently from 79bb47e to dc6eda7 on October 16, 2025 19:56
@t20100 (Member) commented Oct 17, 2025

Thanks for the issue report and the proposed test and fix.

We aim not to patch vendored filters in this project, so this issue & patch would need to be proposed first to the upstream projects: HDF5-Blosc2 & hdf5-blosc.

BTW, the set_local function returns an error when typesize == 0:

typesize = (unsigned int) H5Tget_size(type);
if (typesize == 0) return -1;

So I'm not sure this kind of data is supported by the blosc/blosc2 filters.
However, whether a dataset is supported looks like something that should be handled by the filter's can_apply function rather than by an error from set_local.
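
For illustration, a minimal sketch of that idea, assuming the standard HDF5 filter callback signature; the function name is hypothetical and this is not the upstream code:

#include <hdf5.h>

/* Hypothetical can_apply callback: decline variable-length string
 * datatypes up front instead of crashing later in the pipeline. */
static htri_t blosc_can_apply(hid_t dcpl_id, hid_t type_id, hid_t space_id)
{
    (void)dcpl_id;
    (void)space_id;

    htri_t is_vlen_str = H5Tis_variable_str(type_id);
    if (is_vlen_str < 0)
        return -1;  /* error while inspecting the datatype */
    if (is_vlen_str > 0)
        return 0;   /* not applicable: the library rejects the filter
                       instead of segfaulting */
    return 1;       /* fixed-size datatypes are fine */
}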

attn @FrancescAlted

@crusaderky (Contributor, Author)

> We aim not to patch vendored filters in this project, so this issue & patch would need to be proposed first to the upstream projects: HDF5-Blosc2 & hdf5-blosc.

Yes, fully agree. What would you like to do with this PR until then?

  1. wait and leave it as a draft; then incorporate the new releases of hdf5-blosc and hdf5-blosc2 when they become available
  2. revert the upstream fix and merge this PR as is, to add the tests for the string types for all other compression algorithms, with @pytest.skip for the blosc and blosc2 tests. In a follow-up, update the upstream libraries and remove the skips.

> BTW, the set_local function returns an error when typesize == 0:

To me this feels like a quirk of H5Tget_size that the blosc/blosc2 plugins should route around (see the sketch below). For predominantly Latin-based text, blosc should optimize for 1 byte per character. Admittedly this will produce suboptimal compression for non-Latin scripts (Cyrillic, Arabic, Chinese, etc.), where most UTF-8 characters take 2 or 3 bytes, but no such information is available in HDF5. This feels like a compression option to be added to the filters as a follow-up improvement.
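
As a sketch of that "route around" idea (assuming set_local can special-case variable-length strings; the heuristic is hypothetical, not the actual upstream fix):

/* Fall back to a typesize of 1 byte for variable-length strings,
 * on the assumption that mostly-Latin text averages roughly one
 * byte per character. */
unsigned int typesize;
if (H5Tis_variable_str(type) > 0) {
    typesize = 1;  /* assumed heuristic; could become a filter option */
} else {
    typesize = (unsigned int) H5Tget_size(type);
    if (typesize == 0)
        return -1;  /* genuine error from H5Tget_size() */
}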

@t20100 (Member) left a comment

Let's go for option 2: revert the upstream fix and merge this PR as is, to add the tests for the string types for all other compression algorithms, with @pytest.skip for the blosc and blosc2 tests.

@crusaderky marked this pull request as ready for review October 17, 2025 11:33
@crusaderky (Contributor, Author)

@t20100 this is ready for final review and merge.
I've opened a tracking issue for the upstream fix: #364.
Once this PR is merged I'll also open a draft PR for those who wish to test the prototype fix early.

@t20100 added this to the Next release milestone Oct 17, 2025
@t20100 (Member) left a comment

LGTM, thanks!

@t20100 merged commit 077d3f7 into silx-kit:main on Oct 17, 2025
4 checks passed
@crusaderky deleted the strings branch October 24, 2025 13:14