The `skhep_testdata.data_path` function should run a checksum on the file that it finds

We just came across another reason why it's dangerous to only check cached files by name (other than https://github.com/scikit-hep/scikit-hep-testdata/issues/147#issuecomment-2015769482): tests of `uproot.update` might change them in place and then subsequent tests are not testing what we think they're testing.

The ultimate solution for this would be for Scikit-HEP-testdata to maintain a mapping (dict? JSON?) of filename → checksum, generated and hard-coded into each release, and then the `skhep_testdata.data_path` function would check both that a file with the right name exists _and_ that it has the right checksum. Computing a checksum of a < 1 MB file shouldn't be _too_ expensive. (It has to be done every time a user requests a file path, or at least once per change of the file's last-modified date, but that's more complicated.) In Python, it can be computed with [hashlib.hash.hexdigest](https://docs.python.org/3/library/hashlib.html#hashlib.hash.hexdigest) ([StackOverflow](https://stackoverflow.com/a/16876405/1623645)). And then there's the added complication of embedding a hard-coded filename → checksum mapping, which would presumably need to be computed during the release phase.
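
To make the proposal concrete, here is a rough sketch of the two pieces: building the filename → checksum mapping at release time, and checking a cached file against it on lookup. Everything here is an assumption for illustration: the helper names (`sha256_of`, `build_manifest`, `verified_path`), the choice of SHA-256, and the plain-dict manifest are hypothetical and are not part of `skhep_testdata` today.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Release-time step (hypothetical): map each file name to its checksum."""
    return {
        p.name: sha256_of(p)
        for p in sorted(Path(data_dir).iterdir())
        if p.is_file()
    }


def verified_path(filename, cache_dir, manifest):
    """Return the cached path only if the file exists AND its checksum matches."""
    path = Path(cache_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(path)
    if sha256_of(path) != manifest[filename]:
        # The file was modified in place (or corrupted) since the release.
        raise ValueError(f"checksum mismatch for {filename}")
    return str(path)
```

A real implementation would probably re-download the file on a mismatch rather than raise, and the manifest could be dumped to JSON with `json.dumps(build_manifest(...))` during the release step. This sketch recomputes the checksum on every call and does not attempt the last-modified-date shortcut.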

The `skhep_testdata.data_path` function should run a checksum on the file that it finds #153
