
Blosc/Blosc2 segfault with variable-width strings #364

@crusaderky

Description


Blosc and Blosc2 crash when faced with variable-width strings, both legacy object strings and the new NpyStrings, a.k.a. StringDType.

This is caused by an upstream bug; PyTables is also affected.
#363 introduces unit tests for string dtypes, which have been temporarily skipped for blosc and blosc2.
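
For reference, the two variable-width string flavours involved can be constructed as follows (a minimal sketch; the "T" / StringDType case requires NumPy >= 2.0):

import numpy as np

# Legacy variable-width strings: arbitrary Python objects (here bytes) in an object array
legacy = np.asarray([b"foo", b"a much longer string"], dtype="O")

# Native variable-width strings: NpyStrings a.k.a. StringDType, dtype kind "T" (NumPy >= 2.0)
native = np.asarray(["foo", "a much longer string"], dtype="T")

print(legacy.dtype)  # object
print(native.dtype)  # StringDType()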

Reproducer

compression          | i8 | S3 | object   | T
---------------------|----|----|----------|---------
"gzip"               | ✔️ | ✔️ | ✔️       | ✔️
"lzf"                | ✔️ | ✔️ | ✔️       | ✔️
hdf5plugin.BZip2()   | ✔️ | ✔️ | ✔️       | ✔️
hdf5plugin.LZ4()     | ✔️ | ✔️ | ✔️       | ✔️
hdf5plugin.Blosc()   | ✔️ | ✔️ | segfault | segfault
hdf5plugin.Blosc2()  | ✔️ | ✔️ | segfault | segfault

Full reproducer:

import os

import h5py
import hdf5plugin
import numpy as np

fname = "/tmp/ds.h5"

for compression in (
    None,
    "gzip",
    "lzf",
    hdf5plugin.BZip2(),
    hdf5plugin.LZ4(),
    hdf5plugin.Blosc(),
    hdf5plugin.Blosc2(),
):
    for data in (
        np.asarray([1]),
        np.asarray(["foo"], dtype="S"),
        np.asarray([b"foo"], dtype="O"),
        np.asarray(["foo"], dtype="T"),
    ):
        print("desired compression =", compression)
        print("dtype =", data.dtype)

        # Optional: produce meaningful differences in file size
        data = np.tile(data, 1_000_000)

        with h5py.File(fname, "w") as f:
            f.create_dataset("mydataset", data=data, compression=compression)

        print("file size =", os.path.getsize(fname))
        with h5py.File(fname, "r+") as f:
            ds = f["mydataset"]
            print("actual compression =", ds.compression)
            print("compression_opts =", ds.compression_opts)

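            # Read back through StringDType only for the NpyStrings case; by default
            # h5py 3.x returns variable-length strings as object arrays of bytes.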
            actual = (ds.astype("T") if data.dtype.kind == "T" else ds)[:]
        np.testing.assert_array_equal(actual, data)

        print("=" * 80, flush=True)
