Experimental PQ reader utility to calculate total rows in input row groups #18716

mhaseeb123
Member

Description

This PR adds a utility function to the experimental Parquet reader to calculate the total number of rows in the input list of parquet row groups.
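The idea behind the utility can be sketched as follows. This is a minimal illustration only, not the function added in this PR: `row_group_info` and `total_rows_in_row_groups` are hypothetical names, and the sketch assumes each row group's row count is already available from the file footer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for per-row-group footer metadata; not a libcudf type.
struct row_group_info {
  int64_t num_rows;  // number of rows recorded for this row group
};

// Sum the row counts of the selected row groups.
// `row_group_indices` holds indices into `row_groups`; an out-of-range index
// throws std::out_of_range through std::vector::at().
int64_t total_rows_in_row_groups(std::vector<row_group_info> const& row_groups,
                                 std::vector<std::size_t> const& row_group_indices)
{
  int64_t total = 0;
  for (auto const idx : row_group_indices) {
    auto const& rg = row_groups.at(idx);
    assert(rg.num_rows >= 0);  // a negative count would indicate corrupt metadata
    total += rg.num_rows;
  }
  return total;
}
```

Accumulating in a 64-bit integer keeps the sum safe even when the selected row groups together hold more rows than a 32-bit count can represent.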

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented May 8, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 8, 2025
@mhaseeb123 mhaseeb123 changed the title Experimental PQ reader utility to calculate the total rows in input row groups Utility to calculate total rows in input row groups in experimental PQ reader May 8, 2025
@mhaseeb123 mhaseeb123 changed the title Utility to calculate total rows in input row groups in experimental PQ reader Experimental PQ reader utility to calculate total rows in input row groups May 8, 2025
@mhaseeb123 mhaseeb123 added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels May 8, 2025
.stats_level(cudf::io::statistics_freq::STATISTICS_COLUMN);

if constexpr (NumTableConcats > 1) {
out_opts.set_row_group_size_rows(20000);
out_opts.set_max_page_size_rows(5000);
out_opts.set_row_group_size_rows(num_ordered_rows);
Member Author

@mhaseeb123 mhaseeb123 May 8, 2025

Use the constexpr variables from the header instead of magic numbers to avoid false-positive test failures later on.

Contributor

What problems do you expect later?

Member Author

For example, if the value of num_ordered_rows, which is used to generate the columns, changes in the header for some reason, the test may fail because magic numbers are used instead of the same constexprs.
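A minimal sketch of the point being made, assuming a shared test header that defines `num_ordered_rows`. The value shown, the `writer_settings` struct, and the helper names are hypothetical and only stand in for the real test utilities and writer options:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical shared header content: a single source of truth for the row count.
// The value is illustrative only and is not taken from this PR.
constexpr int32_t num_ordered_rows = 30000;

// Hypothetical stand-in for the Parquet writer options used by the test.
struct writer_settings {
  int32_t row_group_size_rows;
  int32_t max_page_size_rows;
};

// Generate an ascending column with exactly num_ordered_rows values.
std::vector<uint32_t> make_ascending()
{
  std::vector<uint32_t> col(num_ordered_rows);
  std::iota(col.begin(), col.end(), 0u);
  return col;
}

// Derive the writer settings from the same constant, so that changing
// num_ordered_rows in the header updates the data and the settings together.
constexpr writer_settings make_settings()
{
  return writer_settings{num_ordered_rows, num_ordered_rows / 5};
}

static_assert(make_settings().row_group_size_rows == num_ordered_rows,
              "row group size must track the generated row count");
```

Had the settings hard-coded a literal instead, a later change to `num_ordered_rows` would silently desynchronize the generated columns from the row group layout the test expects.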

  auto col0 = testdata::ascending<uint32_t>();
  auto col1 = testdata::descending<int64_t>();
  auto col2 = testdata::ascending<cudf::string_view>();

.row_group_size_rows(page_size_for_ordered_tests)
.max_page_size_rows(page_size_for_ordered_tests / 5)
.compression(compression)
.dictionary_policy(cudf::io::dictionary_policy::ALWAYS)
Member Author

Always try to write dictionaries so that dictionary page pruning can be tested in the future.

@mhaseeb123 mhaseeb123 marked this pull request as ready for review May 8, 2025 03:34
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner May 8, 2025 03:34
@mhaseeb123 mhaseeb123 requested review from shrshi, ttnghia and vuule May 8, 2025 03:34
@mhaseeb123 mhaseeb123 requested review from ttnghia and vuule May 8, 2025 23:24
@mhaseeb123 mhaseeb123 added the 4 - Needs Review Waiting for reviewer to review or respond label May 8, 2025
@mhaseeb123 mhaseeb123 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs Review Waiting for reviewer to review or respond labels May 10, 2025
@mhaseeb123 mhaseeb123 moved this to Burndown in libcudf May 10, 2025
@mhaseeb123
Member Author

/merge

@rapids-bot rapids-bot bot merged commit e213c1c into rapidsai:branch-25.06 May 12, 2025
126 checks passed
@mhaseeb123 mhaseeb123 deleted the fea/hybrid-scan-total-rows-in-row-groups branch May 12, 2025 17:12
@GregoryKimball GregoryKimball moved this from Burndown to Landed in libcudf May 12, 2025
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: Landed
4 participants