Skip to content

Conversation

@stephprince
Copy link
Contributor

@stephprince stephprince commented Oct 22, 2025

Adds benchmarks to get download times for HDF5 and Zarr NWB files to compare against slicing extrapolations. To be used when deciding whether to download vs. stream an NWB file depending on the amount of data being accessed.

We had discussed making these either in a separate benchmarks folder or adding an environment variable. I thought setting the environment variable with the skip decorator was cleaner since there are other download related benchmarks and only two full file download tests. The download benchmarks can then be run manually with:

RUN_DOWNLOAD_BENCHMARKS=true nwb_benchmarks run --bench "time_download.HDF5DownloadDandiAPIBenchmark.time_download_hdf5_dandi_api"

@stephprince stephprince requested a review from rly October 22, 2025 19:54
@stephprince stephprince marked this pull request as ready for review October 22, 2025 19:55
@stephprince
Copy link
Contributor Author

stephprince commented Oct 22, 2025

Looking back at our initial figure sketches, should these tests also include time to open + slice data after downloading? Or is that timing negligible at that point.

Edit: This would be covered by #163 so probably do not need to also include here.

@CodyCBakerPhD
Copy link
Collaborator

Looking back at our initial figure sketches, should these tests also include time to open + slice data after downloading? Or is that timing negligible at that point

If the benchmarks can re-use the already downloaded file, then timing the file open and slicing would be fair, as per #163 can be added in this PR or in a follow up

@CodyCBakerPhD
Copy link
Collaborator

(both of these are really more of a 'calibration' of average system bandwidth + disk)

@oruebel
Copy link
Contributor

oruebel commented Oct 22, 2025

Looking back at our initial figure sketches, should these tests also include time to open + slice data after downloading? Or is that timing negligible at that point

If the benchmarks can re-use the already downloaded file, then timing the file open and slicing would be fair, as per #163 can be added in this PR or in a follow up

Agreed. Timing slicing from local files can be separate and use already downloaded files.

@rly rly mentioned this pull request Oct 23, 2025
download(urls=params["https_url"], output_dir=self.tmpdir.name)


class LindiDownloadFsspecBenchmark(BaseBenchmark):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class LindiDownloadFsspecBenchmark(BaseBenchmark):
class LindiDownloadBenchmark(BaseBenchmark):

Should this be just LindiDownloadBenchmark?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, it uses download_file which uses fsspec

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency and maintainability, I think we should just use the DANDI API to download the LINDI files. I'll also update the download_file usage in a separate PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users would most likely either be using the dandi api/cli or their browser to download these files, not fsspec anyway

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to keep consistent - I will update here to use the dandi API across all file types

Copy link
Contributor Author

@stephprince stephprince Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use dandi api here but left the download_file definition as it was for now.

The lindi files don't take very long to download, does it make sense to remove the skip decorator in this case or is that more confusing/inconsistent? We discussed adding these lindi download times in our benchmark plots which is another reason to not skip

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lindi files don't take very long to download, does it make sense to remove the skip decorator in this case

For lindi I think it makes sense to always run the download test

@rly
Copy link
Contributor

rly commented Oct 24, 2025

Looks good!

@CodyCBakerPhD CodyCBakerPhD merged commit 7fc6911 into main Oct 24, 2025
3 checks passed
@CodyCBakerPhD CodyCBakerPhD deleted the add-download-benchmarks branch October 24, 2025 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants