feat: Add a parquet uuid calculation #3440

Draft
wants to merge 2 commits into
base: main

Conversation

NJManganelli

Calculate a UUID from parquet metadata, using the detailed info of the first and last row_groups plus the col_counts of all row_groups in the file or dataset. AFAIK parquet already stores a checksum at the column-page level, but an approximate calculation should be sufficient here: one that deterministically uses two row groups and would catch differences in the number of rows, row groups, columns, compression, etc. That is the equivalent of what coffea does with ROOT files, which is to flag them as changed so that the form, steps, etc. get recalculated.

https://github.com/scikit-hep/coffea/blob/master/src/coffea/dataset_tools/preprocess.py#L46-L48
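
For illustration only, the idea described above can be sketched with the standard library. This is not the code in this PR: the function name `metadata_uuid` and the dictionary layout for a row group (`num_rows`, `num_columns`, `total_byte_size`, `compression`) are assumptions made up for the example, standing in for whatever the parquet reader exposes.

```python
import hashlib
import json
import uuid


def metadata_uuid(row_groups):
    """Derive a deterministic UUID from summarized row-group metadata.

    `row_groups` is a hypothetical list of dicts, one per row group,
    each holding plain values such as num_rows and num_columns.
    """
    if not row_groups:
        raise ValueError("need at least one row group")

    payload = {
        # Detailed info only for the first and last row groups.
        "first": row_groups[0],
        "last": row_groups[-1],
        # Column counts for every row group, which also encodes how many
        # row groups there are.
        "col_counts": [rg["num_columns"] for rg in row_groups],
    }
    # sort_keys keeps the serialization independent of dict insertion order.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(encoded).digest()
    # Use the first 16 bytes of the digest as the UUID value.
    return str(uuid.UUID(bytes=digest[:16]))
```

With this sketch, a change to the number of row groups, to any column count, or to any detail of the first or last row group produces a different UUID. A change confined to the row count of a middle row group would go unnoticed, which is the approximation being traded for speed.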

Also, the ParquetMetadata namedtuple doesn't appear to be used, at least in the file touched here. Given that this change needs an extra line to keep the length of the returned tuple unchanged, to try to avoid breaking compatibility for outside users, maybe the plain tuple should be deprecated and the namedtuple used instead?

Nick Manganelli added 2 commits March 30, 2025 23:14
…e first and last row_groups plus the col_counts of all row_groups of the file or dataset
@NJManganelli changed the title from "Add a parquet uuid calculation" to "feat: Add a parquet uuid calculation" on Mar 31, 2025

codecov bot commented Mar 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.43%. Comparing base (b749e49) to head (5000ce8).
Report is 293 commits behind head on main.

Additional details and impacted files
| Files with missing lines | Coverage Δ |
|---|---|
| src/awkward/operations/ak_from_parquet.py | 92.36% <100.00%> (+1.31%) ⬆️ |
| src/awkward/operations/ak_metadata_from_parquet.py | 100.00% <100.00%> (ø) |

... and 185 files with indirect coverage changes
