Experimental PQ reader utility to calculate total rows in input row groups #18716
.stats_level(cudf::io::statistics_freq::STATISTICS_COLUMN);

if constexpr (NumTableConcats > 1) {
  out_opts.set_row_group_size_rows(20000);
  out_opts.set_max_page_size_rows(5000);
  out_opts.set_row_group_size_rows(num_ordered_rows);
Use the `constexpr` variables from the header instead of magic numbers to avoid false-positive test failures later on.
what problems do you expect later?
For example, if the value of `num_ordered_rows`, which is used to generate the columns, changes in the header for some reason, the test may fail because magic numbers were used instead of the same constexprs.
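The concern can be illustrated with a small sketch. The names `num_ordered_rows` and `page_size_for_ordered_tests` appear in this PR, but their values here, the `make_ascending` helper, and the `writer_opts` struct are assumptions made up for illustration; they are not the actual test utilities.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Shared test constants, as they might appear in a common header (values assumed).
constexpr std::size_t num_ordered_rows            = 30000;
constexpr std::size_t page_size_for_ordered_tests = 5000;

// Hypothetical data generator whose length is tied to the shared constant.
std::vector<std::uint32_t> make_ascending()
{
  std::vector<std::uint32_t> col(num_ordered_rows);
  std::iota(col.begin(), col.end(), 0u);
  return col;
}

// Hypothetical stand-in for the writer settings used by a test.
struct writer_opts {
  std::size_t row_group_size_rows;
  std::size_t max_page_size_rows;
};

// Deriving the settings from the same constants keeps them in sync with the
// generated data; a literal such as 20000 would silently drift if the header changed.
constexpr writer_opts make_opts()
{
  return {num_ordered_rows, page_size_for_ordered_tests};
}
```

With this arrangement, changing `num_ordered_rows` in one place updates both the generated columns and the row group size the test expects.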
auto col0 = testdata::ascending<uint32_t>();
auto col1 = testdata::descending<int64_t>();
auto col2 = testdata::ascending<cudf::string_view>();
  .row_group_size_rows(page_size_for_ordered_tests)
  .max_page_size_rows(page_size_for_ordered_tests / 5)
  .compression(compression)
  .dictionary_policy(cudf::io::dictionary_policy::ALWAYS)
Always try to write dictionaries so that dictionary page pruning can be tested in the future.
Co-authored-by: David Wendt <[email protected]>
/merge
Description
This PR adds a utility function to the experimental Parquet reader that calculates the total number of rows in the input list of Parquet row groups.
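As a rough sketch of the idea, not the PR's actual implementation: assuming each row group's footer metadata exposes a row count, the total is the sum of those counts over the selected row group indices. The `row_group_meta` struct and the `total_rows_in_row_groups` function below are hypothetical names chosen for illustration and are not part of the cudf API.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for per-row-group footer metadata.
struct row_group_meta {
  std::int64_t num_rows;
};

// Sum the row counts of the selected row groups.
// Throws std::out_of_range if a selected index does not exist.
std::int64_t total_rows_in_row_groups(std::vector<row_group_meta> const& all_row_groups,
                                      std::vector<std::size_t> const& selected)
{
  std::int64_t total = 0;
  for (auto const idx : selected) {
    if (idx >= all_row_groups.size()) {
      throw std::out_of_range("row group index out of range");
    }
    total += all_row_groups[idx].num_rows;
  }
  return total;
}
```

Because only footer metadata is consulted, a calculation like this needs no page data to be read or decoded.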
Checklist