core driver performance prohibitively slow on Windows #3275

@AlexKirko

Describe the bug
We tried the 'core' (in-memory) driver on Windows and quickly found that, for a 22 GB file, it was about 300x slower than on macOS. The default driver works fine.

We use HDF5 to store a hierarchy of data points. When building the object, we stream from a binary file, do some preprocessing, and then populate the HDF5 storage with more than 100 data nodes, each containing about two dozen complex-valued numpy arrays.

Expected behavior
The in-memory driver should be faster than the default (on-disk) driver, and performance should be comparable across operating systems.

Platform (please complete the following information)

  • HDF5 version: 1.14.1
  • OS and version: Windows 11 Pro
  • Compiler and version: installed from the official binary, hdf5-1.14.1-2-Std-win10_64-vs17.zip
  • Build system (e.g. CMake, Autotools): Visual Studio 17.6.5
  • Any configure options you specified: Nothing special

Additional context
This issue was originally reported on the h5py GitHub in 2021 (h5py/h5py#1827), but I guess no one raised it with your group.

The problem seems to be that you rely on a realloc() call to get more memory regardless of the OS, and on Windows this often results in the existing data being copied on each additional memory request. In our situation, where we need to request thousands of extensions, hundreds of them larger than 1 MB (the default block_size), storage creation is hundreds of times slower than on Linux/macOS because of the copy overhead. RAM usage also spikes to 2x the size of the dataset on each copy.
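To put a rough number on that copy overhead, here is a minimal model in Python. It is only a sketch, assuming the worst case in which every extension makes the allocator move the whole existing buffer. The function `copy_cost_fixed_increment` and its parameters are hypothetical names for illustration, not HDF5 code.

```python
def copy_cost_fixed_increment(final_size: int, increment: int) -> tuple[int, int]:
    """Model a buffer grown by a fixed increment until it reaches final_size.

    Assumption: each growth step copies all bytes already stored
    (the worst case for realloc). Returns (growth steps, total bytes copied).
    """
    size = 0
    steps = 0
    copied = 0
    while size < final_size:
        copied += size  # existing contents are moved to the new allocation
        size = min(size + increment, final_size)
        steps += 1
    return steps, copied


MiB = 1024 * 1024
steps, copied = copy_cost_fixed_increment(final_size=1024 * MiB, increment=MiB)
print(f"{steps} growth steps, {copied / MiB:.0f} MiB copied")
```

Under this assumption, growing a 1 GiB buffer 1 MiB at a time copies roughly 512 GiB in total, so the cost grows quadratically with the final size.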

Here is an explanation of the realloc differences on Windows and Linux.

You want to look here in your code. A common approach on Windows is to request 1.3x the new dataset size. This requires up to 30% more RAM on Windows, but the number of reallocations then grows as O(log(n)), the way it should, instead of a copy on each increment.
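The effect of the 1.3x over-allocation can be sketched with the same kind of model. Again this assumes the worst case where every reallocation copies the whole buffer; `copy_cost_geometric` and its parameters are hypothetical names, and the starting capacity of 1 MiB is chosen only for illustration.

```python
def copy_cost_geometric(final_size: int, start: int, factor: float = 1.3) -> tuple[int, int]:
    """Model a buffer whose capacity is multiplied by `factor` when it runs out.

    Assumption: each reallocation copies the full current capacity.
    Returns (number of reallocations, total bytes copied).
    """
    capacity = start
    steps = 0
    copied = 0
    while capacity < final_size:
        copied += capacity  # worst case: everything allocated so far is moved
        # Always grow by at least one byte so the loop terminates.
        capacity = max(capacity + 1, int(capacity * factor))
        steps += 1
    return steps, copied


MiB = 1024 * 1024
steps, copied = copy_cost_geometric(final_size=1024 * MiB, start=MiB)
print(f"{steps} reallocations, {copied / MiB:.0f} MiB copied")
```

In this model the number of reallocations is logarithmic in the final size, and the total number of bytes copied stays within a small constant multiple of the final size, at the price of up to 30% unused capacity.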

Metadata

Labels

  • Component - C Library: Core C library issues (usually in the src directory)
  • Priority - 0. Blocker: This MUST be merged for the release to happen

Projects

Status

Scheduled/On-Deck
