Skip to content

HDF5 tools performance issue for multiply-linked objects #6399

@jhendersonHDF

Description

@jhendersonHDF

The h5trav interfaces used by various HDF5 tools (h5dump, h5ls, h5diff, h5repack, h5stat and h5format_convert) to traverse file structures keep an internal array of objects with a reference count greater than 1 in order to avoid redundant processing on already-visited objects linked to elsewhere with hard links. Checking if an object has already been visited performs a linear scan over this array that compares object "tokens" with H5Otoken_cmp() until a matching object is found. For file structures that contain many objects with at least two hard links to them (including the original hard link when the object was created), this can result in behavior that is quadratic or near-quadratic with the number of objects.

Additionally, h5repack uses an h5trav interface which keeps a separate internal array of hard link name aliases for already-visited objects. This array is used so that processing can avoid accidentally expanding a hard link into a duplicate object in a repacked file. Checking for visited objects here also performs a linear scan over the array, comparing object "tokens" with H5Otoken_cmp() until a matching object is found. For files with structures similar to the one mentioned above, this can approach 2n2 complexity when combined with the linear scan from above.

Quick initial profiling of repacking a file with ~500,000 objects, where most of the objects are linked with hard links at least twice, using no options for h5repack showed ~95% of the runtime being spent in H5Otoken_cmp() operations. Replacing these arrays with hash tables greatly improves the performance of the HDF5 tools above for specific file structures.

Note that a similar performance issue may exist for soft/external links in the symlink_visit_add() and symlink_is_visited() functions in tools/lib/h5trav.c, but this path hasn't been tested yet.

Running h5repack on the above file with no options:

Current

Manually aborted after 3 hrs, the entirety of which was spent setting up the table of objects to traverse (no actual repacking had occurred yet)

With hash table replacement:

real 5m23.073s
user 3m10.633s
sys 0m47.585s

Running h5stat on the same file:

Current

Manually aborted after 1 hr, the entirety of which was spent setting up the table of objects to traverse

With hash table replacement:

real 0m32.472s
user 0m29.419s
sys 0m3.042s

Metadata

Metadata

Assignees

Labels

Component - ToolsCommand-line tools like h5dump, includes high-level tools
No fields configured for Enhancement.

Projects

Status

In progress

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions