Blosc and Blosc2 crash when faced with variable-width strings, both the legacy object strings and the new NpyStrings, a.k.a. StringDType.
This is caused by an upstream bug. PyTables is also affected.
#363 introduces unit tests for string dtypes, which have been temporarily skipped for blosc and blosc2.
Reproducer
| compression | i8 | S3 | object | T |
|---|---|---|---|---|
| "gzip" | ✔️ | ✔️ | ✔️ | ✔️ |
| "lzf" | ✔️ | ✔️ | ✔️ | ✔️ |
| hdf5plugin.BZip2() | ✔️ | ✔️ | ✔️ | ✔️ |
| hdf5plugin.LZ4() | ✔️ | ✔️ | ✔️ | ✔️ |
| hdf5plugin.Blosc() | ✔️ | ✔️ | segfault | segfault |
| hdf5plugin.Blosc2() | ✔️ | ✔️ | segfault | segfault |
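
The table suggests the crash is confined to the variable-width dtypes (NumPy kinds `"O"` and `"T"`) combined with the Blosc family. Until the upstream bug is fixed, one possible guard is to fall back to another filter for those combinations. The sketch below is only an illustration: `pick_compression` and `AFFECTED_FILTERS` are hypothetical names, not part of h5py or hdf5plugin, and matching filters by class name is an assumption made to keep the example free of third-party imports.

```python
# Hypothetical guard; none of these names exist in h5py or hdf5plugin.
# NumPy dtype kinds observed to segfault above: "O" (object) and "T" (StringDType).
VARIABLE_WIDTH_KINDS = frozenset({"O", "T"})

# Assumption: affected filters are recognised by their class name.
AFFECTED_FILTERS = frozenset({"Blosc", "Blosc2"})


def pick_compression(dtype_kind, preferred, fallback="gzip"):
    """Return `fallback` if `preferred` is a Blosc-family filter and the
    dtype is variable-width; otherwise return `preferred` unchanged."""
    is_affected = type(preferred).__name__ in AFFECTED_FILTERS
    if is_affected and dtype_kind in VARIABLE_WIDTH_KINDS:
        return fallback
    return preferred
```

A caller would pass `data.dtype.kind` and the desired filter, for example `pick_compression(data.dtype.kind, hdf5plugin.Blosc2())`, and hand the result to `create_dataset(..., compression=...)`.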
Full reproducer:
import os

import h5py
import hdf5plugin
import numpy as np

fname = "/tmp/ds.h5"

for compression in (
    None,
    "gzip",
    "lzf",
    hdf5plugin.BZip2(),
    hdf5plugin.LZ4(),
    hdf5plugin.Blosc(),
    hdf5plugin.Blosc2(),
):
    for data in (
        np.asarray([1]),
        np.asarray(["foo"], dtype="S"),
        np.asarray([b"foo"], dtype="O"),
        np.asarray(["foo"], dtype="T"),
    ):
        print("desired compression =", compression)
        print("dtype =", data.dtype)
        # Optional: produce meaningful differences in file size
        data = np.tile(data, 1_000_000)
        with h5py.File(fname, "w") as f:
            f.create_dataset("mydataset", data=data, compression=compression)
        print("file size =", os.path.getsize(fname))
        with h5py.File(fname, "r+") as f:
            ds = f["mydataset"]
            print("actual compression =", ds.compression)
            print("compression_opts =", ds.compression_opts)
            actual = (ds.astype("T") if data.dtype.kind == "T" else ds)[:]
            np.testing.assert_array_equal(actual, data)
        print("=" * 80, flush=True)
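
Because a segfault terminates the interpreter, the loop above stops at the first crashing combination. To record every cell of the table in one run, each case could be executed in its own process. The following is a minimal sketch using only the standard library; `run_isolated` is a hypothetical helper, and the interpretation of negative return codes as signals applies to POSIX systems only.

```python
import subprocess
import sys


def run_isolated(snippet, timeout=60.0):
    """Run `snippet` in a fresh interpreter and summarise how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # POSIX: a negative return code means the child was killed by a
        # signal; a segfault is reported as SIGSEGV (signal 11 on Linux).
        return f"killed by signal {-proc.returncode}"
    return f"exit code {proc.returncode}"
```

Each (compression, dtype) pair from the reproducer would then be turned into a small snippet string and passed to `run_isolated`, so that a crash in one pair does not prevent the others from being reported.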