feat: Add a parquet uuid calculation #3440
Draft
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Calculate a uuid from parquet metadata, utilizing detailed info of the first and last row_groups plus the col_counts of all row_groups of the file or dataset. At the column-page level, parquet should have a checksum AFAIK, but an approximate calculation that would catch differences in numbers of rows, row groups, columns, compression, etc. that deterministically uses two row groups should be sufficient for the equivalent of what coffea does with root files (which is flag them for changes to recalculate the form, steps, etc.).
https://github.com/scikit-hep/coffea/blob/master/src/coffea/dataset_tools/preprocess.py#L46-L48
Also, the ParquetMetadata namedtuple doesn't appear to be used, at least in this file that's touched. Given there's an extra line to handle not changing the length of returned tuple to try and avoid breaking compatibility with outside users, maybe this should be deprecated and the namedtuple should be used instead?