Description
I'm using an HDF5 file to store more data than can fit in RAM. In my case, I'm running a simulation that writes out to the HDF5 file as it goes (using chunking and -1 for max_dims so that the arrays can grow over time). Afterwards, I want to analyze that data. For instance, I might want to call mean on the dataset, which efficiently calculates the mean by iterating over the data.
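For context, here's a minimal sketch of the write side I'm describing (the file name, dataset name, and batch size are made up, and the exact create_dataset form may vary across HDF5.jl versions):

```julia
using HDF5

h5open("sim.h5", "w") do file
    # chunked, extensible 1-D dataset; a max dim of -1 means unlimited
    dset = create_dataset(file, "trace", Float64, ((0,), (-1,)); chunk=(1024,))
    for step in 1:10
        batch = rand(1024)                        # one simulation step's output
        old = length(dset)
        HDF5.set_extent_dims(dset, (old + length(batch),))
        dset[old+1:old+length(batch)] = batch     # append in place
    end
end
```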
But iterate is not implemented for HDF5 datasets, and iterating via getindex is prohibitively slow. Clearly, I can't read(the_whole_thing) and then iterate over that, because my data is too big to fit into RAM.
Testing on a simple 10,000-element Vector{Float64} that does fit into RAM, reading the whole array and then calling sum is ~1000 times faster than reading each index individually. I assume this is because each getindex call has to locate the data in the file from scratch. However, if iterate were implemented for an HDF5 dataset, its state could remember where it was in the file, enabling much faster incremental reads.
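To illustrate the comparison (file and dataset names are just for the example, and exact timings will vary):

```julia
using HDF5

h5open("bench.h5", "w") do f
    f["x"] = rand(10_000)
end

h5open("bench.h5", "r") do f
    dset = f["x"]
    @time sum(read(dset))   # one bulk read, then sum in memory
    @time begin             # one HDF5 read per element
        s = 0.0
        for i in 1:length(dset)
            s += dset[i]
        end
        s
    end
end
```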
(As a work-around for myself, I can implement a wrapper that reads in blocks corresponding to the chunk size of the data, and I can implement iterate for that wrapper; see the sketch below. But I wonder if this is really the best way; presumably, whatever HDF5.jl is doing to read all of the elements could be made to work with iterate.)
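Here's a minimal sketch of that work-around, assuming a 1-D dataset and a caller-supplied block length (in practice I'd match it to the dataset's chunk size):

```julia
using HDF5

# Wrapper that yields an HDF5 dataset in fixed-size blocks, so each
# HDF5 read covers many elements instead of one.
struct BlockIterator{D}
    dset::D
    blocklen::Int
end

function Base.iterate(it::BlockIterator, offset::Int = 0)
    offset >= length(it.dset) && return nothing
    stop = min(offset + it.blocklen, length(it.dset))
    return it.dset[offset+1:stop], stop   # one hyperslab read per block
end

Base.IteratorSize(::Type{<:BlockIterator}) = Base.SizeUnknown()

# Usage: sum without loading the whole dataset
# h5open("sim.h5", "r") do f
#     total = sum(sum, BlockIterator(f["trace"], 1024))
# end
```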
I might even consider trying to implement this for HDF5.jl myself. Is there a reason iterate doesn't already exist? Have smarter people than me tried and failed at this? There is a related open issue from 2019.