
Feature Request: Parametrization with compression_opts #365

@kmuehlbauer

I'm trying to make hdf5plugin usable within h5netcdf.

It already works nicely with the advertised approach, using either `compression=hdf5plugin.LZ4()` or `**hdf5plugin.Blosc()`.
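For reference, a minimal sketch of that advertised approach, matching the cd_values used below (clevel=4, byte-shuffle, LZ4 inside Blosc):

import numpy
import h5py
import hdf5plugin

with h5py.File('test.h5', 'w') as f:
    f.create_dataset(
        'data',
        data=numpy.arange(100),
        chunks=(50,),
        # Blosc is a mapping, so ** unpacks into compression/compression_opts
        **hdf5plugin.Blosc(cname='lz4', clevel=4, shuffle=hdf5plugin.Blosc.SHUFFLE))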

Now, for conciseness, I want to be able to use it directly like this:

compression = 32001  # blosc
compression_opts = (0, 0, 0, 0, 4, 1, 1)  # should work, since this is what dict-unpacking would provide

This runs without flaws:

import numpy
import h5py
import hdf5plugin

# Compression
with h5py.File('test.h5', 'w') as f:
    f.create_dataset(
        'data',
        data=numpy.arange(100),
        chunks=(50,),
        compression=32001,
        compression_opts=(0, 0, 0, 0, 4, 1, 1))
# Decompression
with h5py.File('test.h5', 'r') as f:
    data = f['data']
    print(data[()])
    print(data._filters)  # filters as recorded by h5py
    print(data.id.get_create_plist().get_nfilters())  # number of filters
    print(data.id.get_create_plist().get_filter(0))  # (filter_id, flags, cd_values, name)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]
{'32001': (2, 2, 8, 400, 4, 1, 1)}
1
(32001, 1, (2, 2, 8, 400, 4, 1, 1), b'blosc')
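Note that the leading zeros I passed come back as (2, 2, 8, 400). As far as I understand, the Blosc filter fills in the first four cd_values itself (filter revision, Blosc format version, typesize and chunk size in bytes), so only clevel, shuffle and the compressor code are really user-facing.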

h5dump shows the filter, with actual compression:

HDF5 "test.h5" {
GROUP "/" {
   DATASET "data" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 100 ) / ( 100 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 50 )
         SIZE 184 (4.348:1 COMPRESSION)
      }
      FILTERS {
         USER_DEFINED_FILTER {
            FILTER_ID 32001
            COMMENT blosc
            PARAMS { 2 2 8 400 4 1 1 }
         }
      }
   }
}
}

But the following (silently) does nothing, although the filter is reported everywhere (I think this is normal h5py/HDF5 behaviour when a filter is not applicable for some reason):

# change this line in the example above to an erroneous value
compression_opts=(0, 0, 0, 0, 10, 1, 1))

Also note the `clevel`-related output:

`clevel` parameter must be between 0 and 9!
`clevel` parameter must be between 0 and 9!
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]
{'32001': (2, 2, 8, 400, 10, 1, 1)}
1
(32001, 1, (2, 2, 8, 400, 10, 1, 1), b'blosc')

In the h5dump we can see that the filter was not applied (no compression), although it is recorded. Bug or feature?

HDF5 "test.h5" {
GROUP "/" {
   DATASET "data" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 100 ) / ( 100 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 50 )
         SIZE 800 (1.000:1 COMPRESSION)
      }
      FILTERS {
         USER_DEFINED_FILTER {
            FILTER_ID 32001
            COMMENT blosc
            PARAMS { 2 2 8 400 10 1 1 }
         }
      }
   }
}
}
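One way to detect this programmatically: the filter metadata alone does not tell whether the filter actually ran, but the storage size does. A minimal sketch, using only public h5py calls:

import h5py

with h5py.File('test.h5', 'r') as f:
    dset = f['data']
    stored = dset.id.get_storage_size()  # bytes actually written to disk
    raw = dset.nbytes                    # uncompressed size of the data
    # stored < raw means the filter really compressed the chunks
    print(stored, raw, stored < raw)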

The dataset is reported to have the Blosc filter applied via h5py and also via netCDF4 (which also reports the out-of-range clevel of 10).

import netCDF4 as nc

with nc.Dataset("test.h5") as ds:
    print(ds["data"][:])
    print(ds["data"].filters())
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]
{'zlib': False, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': {'compressor': 'blosc_lz4', 'shuffle': 1}, 'shuffle': False, 'complevel': 10, 'fletcher32': False}

We get this kind of warning (see above) for a wrong clevel or shuffle, but we do not get any warning if the compressor code is out of range. The filter is just silently not applied.

Any thoughts on that? How can we make sure not to use any problematic settings?
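One option would be client-side validation before create_dataset. A hypothetical guard (the function name and the value ranges for the last three cd_values are my reading of the Blosc filter, not an hdf5plugin API):

def check_blosc_opts(opts):
    # Blosc expects 7 cd_values; only the last three are user-facing
    if len(opts) != 7:
        raise ValueError("Blosc expects 7 cd_values")
    clevel, shuffle, compressor = opts[4], opts[5], opts[6]
    if not 0 <= clevel <= 9:
        raise ValueError("`clevel` must be between 0 and 9")
    if shuffle not in (0, 1, 2):
        raise ValueError("`shuffle` must be 0 (none), 1 (byte) or 2 (bit)")
    if not 0 <= compressor <= 5:
        raise ValueError("compressor code must be in 0..5 (blosclz..zstd)")
    return opts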

Would it be possible to do something like this:

import hdf5plugin

blosc = hdf5plugin.from_id(32001, opts=(0, 0, 0, 0, 4, 1, 1))
f.create_dataset(
    'data',
    data=numpy.arange(100),
    chunks=(50,),
    **blosc)
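A rough sketch of how such a helper could work (`from_id` is the proposed name, not something hdf5plugin provides today):

def from_id(filter_id, opts=None):
    # Hypothetical helper: wrap a raw filter ID and its cd_values into
    # the mapping that create_dataset expects via **-unpacking.
    # A real implementation would look the ID up in a registry of known
    # filters and validate opts against that filter's spec.
    return {'compression': filter_id,
            'compression_opts': tuple(opts) if opts is not None else None}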
