Description
I'm using an HDF5 file to store more data than can fit in RAM. In my case, I'm running a simulation that writes out to the HDF5 file as it goes (using chunking and -1 for max_dims so that the arrays can grow over time). Afterwards, I want to analyze that data. For instance, I might want to call mean on the dataset, which efficiently calculates the mean by iterating over the data.
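For context, here's a minimal sketch of the write side I'm describing (the file name, dataset name, and batch size are made up, and the exact create_dataset form may vary across HDF5.jl versions):

```julia
using HDF5

h5open("sim.h5", "w") do file
    # chunked, extensible 1-D dataset; a max dim of -1 means unlimited
    dset = create_dataset(file, "trace", Float64, ((0,), (-1,)); chunk=(1024,))
    for step in 1:10
        batch = rand(1024)                        # one simulation step's output
        old = length(dset)
        HDF5.set_extent_dims(dset, (old + length(batch),))
        dset[old+1:old+length(batch)] = batch     # append in place
    end
end
```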
But iterate is not implemented for HDF5 datasets, and iterating via getindex is prohibitively slow. Clearly, I can't read(the_whole_thing) and then iterate over that, because my data is too big to fit into RAM.
Testing on a simple 10,000-element Vector{Float64} that does fit into RAM, reading the whole array and then calling sum is ~1000 times faster than reading each index individually. I assume this is because each getindex call has to locate the data in the file from scratch. However, if iterate were implemented for an HDF5 dataset, its state could remember where it was in the file, enabling much faster incremental reads.
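To illustrate the comparison (file and dataset names are just for the example, and exact timings will vary):

```julia
using HDF5

h5open("bench.h5", "w") do f
    f["x"] = rand(10_000)
end

h5open("bench.h5", "r") do f
    dset = f["x"]
    @time sum(read(dset))   # one bulk read, then sum in memory
    @time begin             # one HDF5 read per element
        s = 0.0
        for i in 1:length(dset)
            s += dset[i]
        end
        s
    end
end
```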
(As a work-around for myself, I can implement a wrapper that reads in blocks corresponding to the chunk size of the data, and I can implement iterate for that wrapper; see the sketch below. But I wonder if this is really the best way; presumably, whatever HDF5.jl is doing to read all of the elements could be made to work with iterate.)
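Here's a minimal sketch of that work-around, assuming a 1-D dataset and a caller-supplied block length (in practice I'd match it to the dataset's chunk size):

```julia
using HDF5

# Wrapper that yields an HDF5 dataset in fixed-size blocks, so each
# HDF5 read covers many elements instead of one.
struct BlockIterator{D}
    dset::D
    blocklen::Int
end

function Base.iterate(it::BlockIterator, offset::Int = 0)
    offset >= length(it.dset) && return nothing
    stop = min(offset + it.blocklen, length(it.dset))
    return it.dset[offset+1:stop], stop   # one hyperslab read per block
end

Base.IteratorSize(::Type{<:BlockIterator}) = Base.SizeUnknown()

# Usage: sum without loading the whole dataset
# h5open("sim.h5", "r") do f
#     total = sum(sum, BlockIterator(f["trace"], 1024))
# end
```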
I might even consider trying to implement this for HDF5.jl myself. Is there a reason iterate doesn't already exist? Have smarter people than me tried and failed at this? There is a related open issue from 2019.