Add binary/opaque dtype #34

@rly

Related to NeurodataWithoutBorders/nwb-schema#574 to allow the storage of raw binary data that follows a particular format, e.g., MP4, PNG.

In the hdmf schema language, dtype "bytes" maps to variable length string with ascii encoding.
In HDMF, if I try to write an MP4 byte stream with dtype "bytes" to an HDF5 file, I get the error ValueError: VLEN strings do not support embedded NULLs.

Here is the error with a simple h5py-based example:

import h5py
# video_data holds the raw bytes of an MP4 file (a Python bytes object)
f = h5py.File("test.h5", "w")
f.create_dataset(name="data", data=video_data, dtype=h5py.string_dtype('ascii'))
# NOTE: h5py.string_dtype('ascii') is equivalent to h5py.special_dtype(vlen=bytes)
# NOTE: f.create_dataset(name="data", data=video_data) assumes the data is a string and will return the same error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rly/mambaforge/envs/temp/lib/python3.11/site-packages/h5py/_hl/group.py", line 183, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rly/mambaforge/envs/temp/lib/python3.11/site-packages/h5py/_hl/dataset.py", line 166, in make_new_dset
    dset_id.write(h5s.ALL, h5s.ALL, data)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 282, in h5py.h5d.DatasetID.write
  File "h5py/_proxy.pyx", line 147, in h5py._proxy.dset_rw
  File "h5py/_conv.pyx", line 442, in h5py._conv.str2vlen
  File "h5py/_conv.pyx", line 96, in h5py._conv.generic_converter
  File "h5py/_conv.pyx", line 254, in h5py._conv.conv_str2vlen
ValueError: VLEN strings do not support embedded NULLs
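
For anyone without an MP4 file at hand, the failure can be reproduced with a few bytes. This is a minimal sketch, assuming any byte string with an embedded NUL triggers the same check; `payload` is a made-up stand-in, not real video data, and the file is written to a temporary directory.

```python
import os
import tempfile

import h5py

# Stand-in payload (not a real video): a short byte string with an embedded NUL.
payload = b"ftyp\x00mp42"

raised = False
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.h5")
    with h5py.File(path, "w") as f:
        try:
            # Same call as in the report: variable-length ASCII string dtype.
            f.create_dataset(name="data", data=payload, dtype=h5py.string_dtype("ascii"))
        except ValueError as err:
            raised = True
            print("ValueError:", err)

print("write rejected:", raised)
```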

The h5py docs recommend against storing raw binary data as variable-length strings with an encoding. They say:

If you have a non-text blob in a Python byte string (as opposed to ASCII or UTF-8 encoded text, which is fine), you should wrap it in a void type for storage. This will map to the HDF5 OPAQUE datatype, and will prevent your blob from getting mangled by the string machinery.

To enable storage of raw binary data, I propose we add a new dtype to the schema language that maps to HDF5 OPAQUE / void dtype. We can't use the dtype name "bytes" because we use that for ascii data. What about "binary"?

>>> import h5py
>>> import numpy as np
>>> with h5py.File("test.h5", "w") as f:
...     f.create_dataset(name="data", data=np.void(video_data))
... 
<HDF5 dataset "data": shape (), type "|V1048061">
>>> with h5py.File("test.h5", "r") as f:
...     data = f["data"][()].tobytes()
...
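
The same round trip as a self-contained script, offered as a sketch rather than a proposed HDMF implementation. The helper names `write_blob` and `read_blob` are illustrative only (they are not part of h5py or HDMF), and `payload` is a short stand-in byte string instead of a real MP4 stream.

```python
import os
import tempfile

import h5py
import numpy as np


def write_blob(path, name, blob):
    """Store raw bytes as a scalar HDF5 OPAQUE dataset by wrapping them in np.void."""
    with h5py.File(path, "w") as f:
        f.create_dataset(name=name, data=np.void(blob))


def read_blob(path, name):
    """Read the scalar OPAQUE dataset back and return it as Python bytes."""
    with h5py.File(path, "r") as f:
        return f[name][()].tobytes()


# Stand-in payload (not a real video): bytes with embedded and trailing NULs.
payload = b"\x00\x00\x00\x18ftypmp42\x00\x00"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.h5")
    write_blob(path, "data", payload)
    restored = read_blob(path, "data")

print(restored == payload)
```

Because the bytes are stored as a fixed-size void value, the NULs that break the variable-length string path are kept as-is here.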

Alternatively, raw binary data could be stored as a 1-D array of uint8 values. However, unlike OPAQUE, a uint8 dataset looks like ordinary numeric data, so it may be accidentally converted.
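
For comparison, the uint8 alternative might look like the following sketch. It assumes the same stand-in `payload` idea as above (not a real MP4 stream) and a temporary file; the variable names are illustrative.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in payload (not a real video): bytes with embedded and trailing NULs.
payload = b"\x00\x00\x00\x18ftypmp42\x00\x00"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_uint8.h5")
    with h5py.File(path, "w") as f:
        # View the bytes as a 1-D uint8 array; one array element per byte.
        f.create_dataset(name="data", data=np.frombuffer(payload, dtype=np.uint8))
    with h5py.File(path, "r") as f:
        stored_dtype = f["data"].dtype
        stored_shape = f["data"].shape
        restored = f["data"][()].tobytes()

print(stored_dtype, stored_shape, restored == payload)
```

The bytes survive the round trip, but the dataset is typed as a plain integer array, so nothing in the file marks it as an opaque blob. That is the accidental-conversion concern raised above.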

Labels

category: proposal (discussion of proposed enhancements or new features)
priority: low (alternative solution already working and/or relevant to only specific user(s))
