HDF5: Empiric for Optimal Chunk Size #916
Conversation
force-pushed from 287d7ee to fdbbf4f
At the moment, a naïve port of the logic causes a deadlock in the MPI tests. After some tests, it looks like adding chunking makes the dataset declaration a collective operation.
```cpp
//for( auto const& val : parameters.chunkSize )
//    chunk_dims.push_back(static_cast< hsize_t >(val));

herr_t status = H5Pset_chunk(datasetCreationProperty, chunk_dims.size(), chunk_dims.data());
```
Fun fact: H5Pset_chunk_opts (HDF5 1.10.0+):
H5Pset_chunk_opts is used to specify storage options for chunks on the edge of a dataset’s dataspace. This capability allows the user to tune performance in cases where the dataset size may not be a multiple of the chunk size and the handling of partial edge chunks can impact performance.
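If chunking is ever combined with filters, something along these lines could be bolted onto the dataset creation property list from the snippet above. This is only a sketch with placeholder names and sizes, not code from this PR, and it assumes HDF5 >= 1.10.0:

```cpp
#include <hdf5.h>
#include <vector>

// Sketch with placeholder chunk dimensions; not code from this PR.
void configureChunking(hid_t datasetCreationProperty)
{
    std::vector<hsize_t> chunk_dims = {256, 256};

    herr_t status = H5Pset_chunk(
        datasetCreationProperty,
        static_cast<int>(chunk_dims.size()),
        chunk_dims.data());

    // Do not apply filters (e.g. compression) to partial edge chunks;
    // this can help when the dataset extent is not a multiple of the
    // chunk size (requires HDF5 >= 1.10.0).
    status = H5Pset_chunk_opts(
        datasetCreationProperty, H5D_CHUNK_DONT_FILTER_PARTIAL_CHUNKS);
    (void)status;
}
```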
force-pushed from a625114 to 83695d8
force-pushed from 83695d8 to 04c0164
force-pushed from 04c0164 to 44f72a4
I've added some scaffolding for JSON options in HDF5.
force-pushed from 0c2a55a to 34cde08
force-pushed from 34cde08 to 22f70f0
Finished the global options (JSON & env) to disable chunking when needed (mainly for HiPACE's legacy pipeline + potential regressions).
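The exact key names and accepted values are whatever this PR defines; going by the docs snippet discussed further down (`hdf5.dataset.chunks`), a JSON configuration that switches chunking off globally would presumably look roughly like this:

```json
{
  "hdf5": {
    "dataset": {
      "chunks": "none"
    }
  }
}
```

The `OPENPMD_HDF5_CHUNKS` environment variable mentioned in the benchmark notes below appears to be the corresponding env-based switch.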
force-pushed from 368c2c4 to aea7e96
force-pushed from aea7e96 to 2d34cf6
It's a bit concerning that the parallel benchmark (8) in the clang sanitizer run hits a time-out with the new patch, but I don't see an immediate relation. I'll turn the chunking off for this one.
force-pushed from d834f9f to ff1d13d
I tried out whether this PR alone already enables extensible datasets in HDF5: apparently not. According to the documentation, only certain kinds of datasets can have their extents extended, and our datasets seem to fall into the second category. I guess, for extensible datasets, we would have to create unlimited datasets from the beginning?
In order to make a dataset resizable, we need to pass maximum dimensions when creating its dataspace. Setting them to unlimited in turn requires a chunked layout.
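For context, a minimal stand-alone sketch of what resizable datasets need at creation time in plain HDF5 (file name, dataset name, and sizes are made up for illustration): maximum dimensions on the dataspace and a chunked dataset creation property list, after which the dataset can be grown with `H5Dset_extent`.

```cpp
#include <hdf5.h>

int main()
{
    hid_t file = H5Fcreate("resizable.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    // Dataspace with an unlimited maximum extent in its single dimension.
    hsize_t dims[1] = {100};
    hsize_t maxdims[1] = {H5S_UNLIMITED};
    hid_t space = H5Screate_simple(1, dims, maxdims);

    // Unlimited maximum dimensions require a chunked layout.
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk_dims[1] = {100};
    H5Pset_chunk(dcpl, 1, chunk_dims);

    hid_t dset = H5Dcreate(
        file, "data", H5T_NATIVE_DOUBLE, space, H5P_DEFAULT, dcpl, H5P_DEFAULT);

    // Later, the dataset can be extended.
    hsize_t new_dims[1] = {200};
    H5Dset_extent(dset, new_dims);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```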
This ports a prior empirical algorithm from libSplash to determine an optimal (large) chunk size for an HDF5 dataset based on its datatype and global extent.

Original implementation by Felix Schmitt @f-schmitt (ZIH, TU Dresden) in [libSplash](https://github.com/ComputationalRadiationPhysics/libSplash).

Original source:
- https://github.com/ComputationalRadiationPhysics/libSplash/blob/v1.7.0/src/DCDataSet.cpp
- https://github.com/ComputationalRadiationPhysics/libSplash/blob/v1.7.0/src/include/splash/core/DCHelper.hpp

Co-authored-by: Felix Schmitt <[email protected]>
The parallel, independent I/O pattern here is a corner case for what HDF5 can support, due to non-collective declarations of datasets. Testing shows that it does not work with chunking.
Runs into timeout for unclear reasons with this patch:
```
15/32 Test openPMD#15: MPI.8_benchmark_parallel ...............***Timeout 1500.17 sec
```
force-pushed from ff1d13d to 16c14c7
Co-authored-by: Franz Pöschel <[email protected]>
```rst
A full configuration of the HDF5 backend:

.. literalinclude:: hdf5.json
```
@franzpoeschel I just realized I forgot to add an hdf5.json file here 🤪
Ah, I accidentally named it json.json
-> #1169
```rst
All keys found under ``hdf5.dataset`` are applicable globally (future: as well as per dataset).
Explanation of the single keys:

* ``adios2.dataset.chunks``: This key contains options for data chunking via `H5Pset_chunk <https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_chunk.htm>`__.
```
Ouch, that should read hdf5.dataset.chunks...
-> #1169
This ports a prior empirical algorithm from libSplash to determine an optimal (large) chunk size for an HDF5 dataset based on its datatype and global extent.
Original implementation by Felix Schmitt @f-schmitt (ZIH, TU Dresden) in libSplash.
Original source:
- https://github.com/ComputationalRadiationPhysics/libSplash/blob/v1.7.0/src/DCDataSet.cpp
- https://github.com/ComputationalRadiationPhysics/libSplash/blob/v1.7.0/src/include/splash/core/DCHelper.hpp
Close #406
Related to #898 (improve HDF5 baseline performance)
Required for #510: basis to extend resizable data sets (#829) to HDF5
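For readers unfamiliar with the idea, here is a rough illustrative sketch of how a chunk-size heuristic can derive chunk dimensions from the global extent and the datatype size. This is NOT the libSplash algorithm ported in this PR, just a minimal stand-in to show the concept (function name and the 4 MiB budget are invented):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch only -- not the heuristic ported from libSplash.
// Start from the global extent and halve the largest dimension until a
// single chunk fits into a target byte budget (here: 4 MiB).
std::vector<std::uint64_t> naiveChunkSize(
    std::vector<std::uint64_t> extent,
    std::size_t typeSizeBytes,
    std::uint64_t targetBytes = 4u * 1024u * 1024u)
{
    std::vector<std::uint64_t> chunk = extent;
    auto chunkBytes = [&]() {
        std::uint64_t n = typeSizeBytes;
        for (auto c : chunk)
            n *= c;
        return n;
    };
    while (!chunk.empty() && chunkBytes() > targetBytes)
    {
        auto largest = std::max_element(chunk.begin(), chunk.end());
        if (*largest <= 1)
            break; // cannot shrink any further
        *largest = (*largest + 1) / 2;
    }
    return chunk;
}

// Example: a 1024^3 dataset of doubles (8 GiB total) would be split into
// chunks of roughly 4 MiB rather than stored as a single huge chunk:
//   auto chunk = naiveChunkSize({1024, 1024, 1024}, sizeof(double));
```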
To Do
* `OPENPMD_HDF5_INDEPENDENT="OFF"` and `OPENPMD_HDF5_ALIGNMENT="1048576"` for our `8_benchmark_parallel -w` benchmark
* Sample & bin directory: `du -hs bin samples` with `OPENPMD_HDF5_CHUNKS`, on my laptop (4 KiB blocksize)
With `"auto"`, the `MPI.8_benchmark_parallel` test is significantly slower. Changing the 4D test to 3D brings the difference down to about a 20% slowdown (#1010). (Maybe the 4th, 10-element dimension is sub-ideal for chunking?)

Follow-Ups