Subset writer tests are surprisingly slow #283

@jchelly

While testing #281 I found that test_subset_write.py runs much faster on remote files accessed via the hdfstream web service than on local files. We seem to incur a lot of overhead in h5py's Dataset.read_direct() method. Using local files on Cosma, the test on EagleDistributed.hdf5 takes 70 seconds and spends 90% of that time in Dataset.read_direct(). Using remote files, it takes 25 seconds, of which 70% is in SSLSocket.read().

Here's a profile of the case where we read local HDF5 files:

[Image: profile of the test reading local HDF5 files]
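For reference, the access pattern being profiled can be sketched roughly as one read_direct() call per slice. This is a minimal illustration and not the actual subset writer code: the helper names `read_ranges_one_by_one` and `profile_call` are hypothetical, and the ranges are assumed to be (start, stop) pairs along the first axis.

```python
import cProfile
import pstats

import h5py
import numpy as np


def read_ranges_one_by_one(dataset, ranges):
    """Read (start, stop) row ranges with one read_direct() call per range."""
    ranges = [(a, b) for a, b in ranges if b > a]  # ignore empty ranges
    nrows = sum(b - a for a, b in ranges)
    out = np.empty((nrows,) + dataset.shape[1:], dtype=dataset.dtype)
    offset = 0
    for start, stop in ranges:
        n = stop - start
        # Each call sets up its own selection and issues its own read
        dataset.read_direct(
            out,
            source_sel=np.s_[start:stop],
            dest_sel=np.s_[offset:offset + n],
        )
        offset += n
    return out


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, stats)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stats = pstats.Stats(profiler).sort_stats("cumulative")
    return result, stats


if __name__ == "__main__":
    # In-memory file so the sketch does not touch the disk
    with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
        dset = f.create_dataset("x", data=np.arange(100000))
        ranges = [(i, i + 10) for i in range(0, 100000, 20)]
        data, stats = profile_call(read_ranges_one_by_one, dset, ranges)
        stats.print_stats(5)
```

With many small ranges, the per-call cost of read_direct() should show up near the top of the cumulative listing, which is the kind of result described above.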

read_direct() is called about 5000 times. I was able to reduce the runtime from 70 seconds to 20 seconds by using the h5py low-level API to select all of the slices and then read them with a single H5Dread(). But I'm not sure how this will affect more realistic cases. It also returns the data in a different order if the slices are not sorted, so one of the SOAP tests fails.
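A minimal sketch of the single-read idea, assuming non-overlapping (start, stop) ranges along the first axis. The function names are hypothetical and this is not the code used for the timings above. It builds a union of hyperslabs on the file dataspace and reads it with one call to the low-level `DatasetID.read()`, which wraps H5Dread(). Because a hyperslab union is read in file order, the second helper shows one possible way to put the rows back into the order the ranges were requested, at the cost of an extra copy.

```python
import h5py
import numpy as np
from h5py import h5s


def read_ranges_single_call(dataset, ranges):
    """Read non-overlapping (start, stop) row ranges with one H5Dread.

    Rows come back in file order, whatever the order of `ranges`.
    """
    ranges = [(a, b) for a, b in ranges if b > a]  # skip empty ranges
    trailing = dataset.shape[1:]
    nrows = sum(b - a for a, b in ranges)
    out = np.empty((nrows,) + trailing, dtype=dataset.dtype)
    if out.size == 0:
        return out

    file_space = dataset.id.get_space()
    op = h5s.SELECT_SET  # first range replaces the default "all" selection
    for start, stop in ranges:
        file_space.select_hyperslab(
            (start,) + (0,) * len(trailing),
            (stop - start,) + trailing,
            op=op,
        )
        op = h5s.SELECT_OR  # later ranges are added to the union

    # Memory dataspace matching the output array, fully selected
    mem_space = h5s.create_simple(out.shape)
    dataset.id.read(mem_space, file_space, out)
    return out


def reorder_to_request(data, ranges):
    """Rearrange file-ordered rows into the order the ranges were given."""
    ranges = [(a, b) for a, b in ranges if b > a]
    if not ranges:
        return data
    lengths = np.array([b - a for a, b in ranges])
    by_start = np.argsort([a for a, _ in ranges], kind="stable")
    # Offset of each requested range within the file-ordered result
    offsets = np.empty(len(ranges), dtype=int)
    offsets[by_start] = np.cumsum(lengths[by_start]) - lengths[by_start]
    index = np.concatenate(
        [np.arange(o, o + n) for o, n in zip(offsets, lengths)]
    )
    return data[index]


if __name__ == "__main__":
    with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
        dset = f.create_dataset("x", data=np.arange(10))
        ranges = [(6, 8), (0, 2), (3, 4)]
        raw = read_ranges_single_call(dset, ranges)
        print(raw)                              # file order
        print(reorder_to_request(raw, ranges))  # requested order
```

Overlapping ranges are not handled here: the union would contain each overlapping row only once, so the output size computed from the range lengths would no longer match the selection.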

The tests in test_mask.py show similar behaviour: running on remote files takes less than half as long, although the individual tests only take ~5 seconds or less in that case.

Labels: performance (an issue that is impacting performance)