feat: Add a parquet uuid calculation #3440


Merged
merged 19 commits into from
May 1, 2025

NJManganelli
Contributor

Calculate a uuid from parquet metadata, using detailed information from the first and last row_groups plus the col_counts of all row_groups of the file or dataset. As far as I know, parquet has a checksum at the column-page level, but an approximate calculation that deterministically uses two row groups, and that would catch differences in the number of rows, row groups, columns, compression, etc., should be sufficient as the equivalent of what coffea does with ROOT files (which is to flag them for changes so that the form, steps, etc. are recalculated).

https://github.com/scikit-hep/coffea/blob/master/src/coffea/dataset_tools/preprocess.py#L46-L48
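
As a rough illustration of the idea described above (not the code merged in this PR), the sketch below hashes the first and last row-group descriptions together with the per-row-group counts. The function name `metadata_fingerprint`, the plain-dict inputs, the sample values, and the choice of SHA-256 over sorted JSON are all assumptions made for the example; with pyarrow, the dicts could plausibly come from something like `RowGroupMetaData.to_dict()`.

```python
import hashlib
import json


def metadata_fingerprint(row_groups, col_counts):
    """Hash a summary of Parquet metadata (hypothetical helper).

    row_groups: list of dicts, one per row group (assumed shape).
    col_counts: list of ints, one per row group.
    """
    if not row_groups:
        raise ValueError("need at least one row group")
    summary = {
        "num_row_groups": len(row_groups),
        "col_counts": list(col_counts),
        # Only the first and last row groups are inspected in detail.
        "first": row_groups[0],
        "last": row_groups[-1],
    }
    # sort_keys makes the serialization independent of dict insertion order;
    # default=str covers values that json cannot encode directly (e.g. bytes).
    encoded = json.dumps(summary, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Made-up sample data, for illustration only.
groups = [
    {"num_rows": 5, "num_columns": 7, "compression": "SNAPPY"},
    {"num_rows": 3, "num_columns": 7, "compression": "SNAPPY"},
]
print(metadata_fingerprint(groups, col_counts=[5, 3]))
```

Because only two row groups are examined in detail, a change confined to a middle row group is caught only if it also changes the counts; that is the trade-off of an approximate fingerprint of this kind.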

Also, the ParquetMetadata namedtuple doesn't appear to be used, at least in the file touched here. Given that there's an extra line just to keep the length of the returned tuple unchanged (to avoid breaking compatibility with outside users), maybe that return value should be deprecated and the namedtuple used instead?

@NJManganelli changed the title from "Add a parquet uuid calculation" to "feat: Add a parquet uuid calculation" on Mar 31, 2025

codecov bot commented Mar 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.44%. Comparing base (b749e49) to head (6f57cec).
Report is 324 commits behind head on main.

Additional details and impacted files
Files with missing lines Coverage Δ
src/awkward/operations/ak_from_parquet.py 93.42% <100.00%> (+2.37%) ⬆️
src/awkward/operations/ak_metadata_from_parquet.py 100.00% <100.00%> (ø)

... and 190 files with indirect coverage changes


@ianna
Collaborator

ianna commented Apr 17, 2025

@NJManganelli - what is the status of this PR? Are you still working on it? Thanks!

@NJManganelli
Contributor Author

Hi @ianna I'll add a test, then I think it'll be ready from my side

@NJManganelli
Contributor Author

This does come with a performance penalty, but if it needs to be optimized, we could work out a smarter yet still sufficient calculation (I'd like to ensure that any change in compression, columns, or rows is captured). It's also possible that I didn't explore thoroughly enough some checksum information that's supposed to be available (but I think those checksums are at the page level or something similar, and just looping over all of them seems like it would be much worse than this).

Without uuid:

python -m timeit -n 1000 "import test_3440_calculate_parquet_uuid; test_3440_calculate_parquet_uuid.test_parquet_uuid()"
1000 loops, best of 5: 83 usec per loop

With uuid:

python -m timeit -n 1000 "import test_3440_calculate_parquet_uuid; test_3440_calculate_parquet_uuid.test_parquet_uuid()"
1000 loops, best of 5: 209 usec per loop
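
The same kind of comparison can also be made from Python with the standard `timeit` module. The sketch below is an illustration only: `build_metadata` and `build_metadata_with_hash` are hypothetical stand-ins rather than awkward functions, and the numbers it prints will not match those quoted above.

```python
import hashlib
import json
import timeit

# Made-up stand-in for already-collected metadata.
SAMPLE = {"num_rows": 5, "num_row_groups": 1, "columns": ["u1", "u4", "u8"]}


def build_metadata():
    # Stand-in for the metadata collection step without a uuid.
    return dict(SAMPLE)


def build_metadata_with_hash():
    # Same step, plus a hash over a sorted JSON serialization.
    meta = dict(SAMPLE)
    encoded = json.dumps(meta, sort_keys=True).encode("utf-8")
    meta["uuid"] = hashlib.sha256(encoded).hexdigest()
    return meta


def best_usec_per_loop(func, number=1000, repeat=5):
    # Mirrors "1000 loops, best of 5": keep the fastest of `repeat` runs.
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best / number * 1e6


print(f"without hash: {best_usec_per_loop(build_metadata):.2f} usec per loop")
print(f"with hash:    {best_usec_per_loop(build_metadata_with_hash):.2f} usec per loop")
```

Taking the minimum over several repeats, as `python -m timeit` does, reduces the influence of background noise on the reported figure.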

Nick Manganelli added 3 commits April 18, 2025 09:54
…e first and last row_groups plus the col_counts of all row_groups of the file or dataset
@NJManganelli
Contributor Author

Rebased for my own sanity, and marked ready (presuming all the tests are going to pass, will fix if otherwise)

@NJManganelli NJManganelli marked this pull request as ready for review April 18, 2025 14:55
Collaborator

@ianna ianna left a comment

@NJManganelli - it looks like uuids do not match:

______________________________ test_parquet_uuid _______________________________

    def test_parquet_uuid():
        meta = metadata_from_parquet(input)
>       assert (
            meta["uuid"]
            == "93fd596534251b1ac750f359d9d55418b02bcddcc79cd5ab1e3d9735941fbcae"
        )
E       AssertionError: assert 'adc236484e10384ad680a1ae2ce1fc675f6f9f0a84a02c651ed91ad97fba41c0' == '93fd596534251b1ac750f359d9d55418b02bcddcc79cd5ab1e3d9735941fbcae'
E         
E         - 93fd596534251b1ac750f359d9d55418b02bcddcc79cd5ab1e3d9735941fbcae
E         + adc236484e10384ad680a1ae2ce1fc675f6f9f0a84a02c651ed91ad97fba41c0

meta       = {'col_counts': [5],
 'columns': ['u1', 'u4', 'u8', 'f4', 'f8', 'raw', 'utf8'],
 'form': RecordForm([BitMaskedForm('u8', NumpyForm('bool'), True, True), BitMaskedForm('u8', NumpyForm('int32'), True, True), BitMaskedForm('u8', NumpyForm('int64'), True, True), BitMaskedForm('u8', NumpyForm('float32'), True, True), BitMaskedForm('u8', NumpyForm('float64'), True, True), BitMaskedForm('u8', ListOffsetForm('i32', NumpyForm('uint8', parameters={'__array__': 'byte'}), parameters={'__array__': 'bytestring'}), True, True), BitMaskedForm('u8', ListOffsetForm('i32', NumpyForm('uint8', parameters={'__array__': 'char'}), parameters={'__array__': 'string'}), True, True)], ['u1', 'u4', 'u8', 'f4', 'f8', 'raw', 'utf8']),
 'fs': <fsspec.implementations.local.LocalFileSystem object at 0x7ff63efc9340>,
 'num_row_groups': 1,
 'num_rows': 5,
 'paths': ['/home/runner/work/awkward/awkward/tests/samples/nullable-record-primitives.parquet'],
 'uuid': 'adc236484e10384ad680a1ae2ce1fc675f6f9f0a84a02c651ed91ad97fba41c0'}

tests/test_3440_calculate_parquet_uuid.py:22: AssertionError

@NJManganelli
Contributor Author

Aye, judging from that printout, this will need to be more selective about what goes into the hash. I'll have a look when I am back from holidays.

@NJManganelli
Contributor Author

@ianna I'm trying a more selective set of key-value pairs, hoping it'll be more stable. It "works on my machine" just as the previous one did, though, so I think we need to see what the tests say.

@NJManganelli NJManganelli requested a review from ianna April 24, 2025 20:46
@ianna
Collaborator

ianna commented Apr 25, 2025

@all-contributors please add @NJManganelli for code

Contributor

@ianna

I've put up a pull request to add @NJManganelli! 🎉

@ianna
Collaborator

ianna commented Apr 25, 2025

strange:

ImportError: to use ak.from_parquet, you must install pyarrow:

    pip install pyarrow

or

    conda install -c conda-forge pyarrow

@NJManganelli
Contributor Author

Nope, it is not actually producing a stable hash, so it's still not a uuid. I don't know whether the dictionary keys are not being sorted the same way every time (because of the parquet data ingestion or something else), whether some of the values are not stable, or whether it's something else entirely.

How should I iterate on this? Temporarily add a printout of all the intermediate info that goes into the hash, and find the difference when a test fails?
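
One way to make such a printout easier to compare is to diff the two nested structures directly instead of reading them side by side. The helper below is a minimal sketch under that assumption; `diff_paths` and the sample dictionaries are made up for the example and are not part of this PR.

```python
def diff_paths(a, b, prefix=""):
    """Return the paths at which two nested dict/list structures differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = []
        for key in sorted(set(a) | set(b), key=str):
            path = f"{prefix}.{key}" if prefix else str(key)
            if key not in a or key not in b:
                # Key present on one side only.
                out.append(path)
            else:
                out.extend(diff_paths(a[key], b[key], path))
        return out
    if isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        out = []
        for i, (x, y) in enumerate(zip(a, b)):
            out.extend(diff_paths(x, y, f"{prefix}[{i}]"))
        return out
    # Scalars, mismatched types, or lists of different length.
    return [] if a == b else [prefix]


# Made-up dumps standing in for a local run and a CI run.
local = {"num_rows": 5, "columns": [{"name": "u1", "extra": None}]}
ci = {"num_rows": 5, "columns": [{"name": "u1", "extra": 0}], "optional_field": []}
print(diff_paths(local, ci))  # ['columns[0].extra', 'optional_field']
```

Dumping the hash input as JSON in the failing CI job and running a comparison like this against a local dump would point at the keys that vary between environments.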

@NJManganelli
Contributor Author

@ianna could the tests be rerun with these debug commits? I couldn't think of another way to discern what is OS-dependent or non-deterministic; I always get the same hash on my system.

ianna and others added 2 commits April 28, 2025 10:14
… value can be None or 0 depending on versions, sorting_columns also different and removed from this list
@NJManganelli
Contributor Author

Found two problems. In the "columns" key, a field called "statistics" stores extra info like min, max, etc., and its distinct_counts is None or 0 depending on whether it's the minimal or the full ubuntu install. The sorting columns entry is also missing in one of them. Let's see whether that's everything; if so, I'll remove the two debug commits and update the test hash.
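
A minimal sketch of the kind of fix this suggests: drop the fields that vary between environments before hashing. The key names in `UNSTABLE_KEYS` follow the comment above but are assumptions about the dictionary layout, and `drop_keys`, `stable_hash`, and the sample data are hypothetical names, not the code merged in this PR.

```python
import hashlib
import json

# Assumed names of the environment-dependent fields.
UNSTABLE_KEYS = frozenset({"statistics", "sorting_columns"})


def drop_keys(obj, unstable=UNSTABLE_KEYS):
    """Recursively remove unstable keys from nested dicts and lists."""
    if isinstance(obj, dict):
        return {k: drop_keys(v, unstable) for k, v in obj.items() if k not in unstable}
    if isinstance(obj, (list, tuple)):
        return [drop_keys(v, unstable) for v in obj]
    return obj


def stable_hash(obj):
    # Hash only what survives the filter, with sorted keys for determinism.
    encoded = json.dumps(drop_keys(obj), sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Made-up metadata as it might look in two different environments.
env_a = {"num_rows": 5, "columns": [{"name": "u1", "statistics": {"distinct_count": None}}]}
env_b = {
    "num_rows": 5,
    "columns": [{"name": "u1", "statistics": {"distinct_count": 0}}],
    "sorting_columns": [],
}
print(stable_hash(env_a) == stable_hash(env_b))  # True
```

The cost of filtering is that a genuine change visible only in the dropped fields no longer changes the hash, so the excluded set is best kept as small as the test matrix allows.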

@NJManganelli
Contributor Author

Alright, 17th time is the charm, as they say

@NJManganelli
Contributor Author

Thanks, @ikrommyd
Passed this time. If we later discover any skew that pops up with pyarrow changes, I'll address it then.

@NJManganelli
Contributor Author

@ianna anything else needed?

Collaborator

@ianna ianna left a comment

@NJManganelli - great! Thanks. I’m merging it.

@ianna ianna merged commit 798d0ee into scikit-hep:main May 1, 2025
43 checks passed