Enable `skip_rows` in the chunked parquet reader. #18130

mhaseeb123 · 2025-02-28T03:10:38Z

Description

Closes #16186
Closes #16249

This PR adds multiple bug fixes required to enable the skip_rows option for the Chunked Parquet reader.

CC: @etseidl

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-02-28T03:10:43Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Matt711

I think we can support skipping row groups in cudf.polars (xref #16257) once this PR goes through.

mhaseeb123 · 2025-03-01T04:46:30Z

cpp/src/io/parquet/reader_impl_chunking.cu

@@ -539,19 +539,24 @@ struct get_page_span {
    auto const column_page_end   = page_row_index + page_offsets[column_index + 1];
    auto const num_pages         = column_page_end - column_page_start;
    bool const is_list           = chunks[column_index].max_level[level_type::REPETITION] > 0;
+    // list rows can span page boundaries, so it is not always safe to filter page spans simply
+    // based on `start_row` so select all pages.
+    if (is_list) {


Not 100% sure about this fix yet but seems like do end up needing more pages than found with start_row and end_row for list columns in some cases. I am guessing the page_row_index doesn't contain exact row indices for list columns? Maybe @etseidl or @nvdbaranec can elaborate if possible.

Question: There shouldn't be any performance penalties for this change as we do a trim pass later on with accurate row counts for lists pages, right?

The following case fails (not all pages for the list col c selected) without this change.

import cudf from cudf.testing import assert_eq import pyarrow as pa from io import BytesIO data = [ { "a": "g", "b": { "b_a": 10, "b_b": {"b_b_b": None, "b_b_a": 2}, }, "c": None, }, {"a": None, "b": {"b_a": None, "b_b": None}, "c": [15, 16]}, {"a": "j", "b": None, "c": [8, 10]}, {"a": None, "b": {"b_a": None, "b_b": None}, "c": None}, None, { "a": None, "b": {"b_a": None, "b_b": {"b_b_b": 1}}, "c": [18, 19], }, {"a": None, "b": None, "c": None}, ] * 1000 pa_struct = pa.Table.from_pydict({"struct": data}) df = cudf.DataFrame.from_arrow(pa_struct) buffer = BytesIO() df.to_parquet(buffer, row_group_size_rows=7000, max_page_size_rows=100) # Number of rows to read with cudf.option_context("io.parquet.low_memory", True): actual = cudf.read_parquet( [buffer], _chunk_read_limit=0, _pass_read_limit=1024000000, nrows=999, skip_rows=6001, ).reset_index(drop=True) expected = cudf.read_parquet(buffer, nrows=999, skip_rows=6001).reset_index( drop=True ) assert_eq(expected, actual)

I'm not sure this is the right way to do this. If I'm understanding what's going on here, this will just arbitrarily return the entire row group. Which I'm pretty sure will cause us to read all pages for all columns in a row group, which reverts us back to just being a row group reader in the case where lists exist.

I think the only real option here, if you have lists, is for each eligible row group

march through the entire row group, using the row estimates as we do now.

for each subpass that we read in, after we compute the real row counts (roughly here

cudf/cpp/src/io/parquet/reader_impl_preprocess.cu

Line 1485 in ea2bca3

size_t first_page_index = 0;

) determine re-apply skip_rows/num_rows. If we have managed to read a subpass that contains none of the rows we care about, early out and move on to the next subpass.

stop early if we cross the last row index we care about.

mhaseeb123 · 2025-03-03T23:08:50Z

cpp/src/io/parquet/reader_impl_preprocess.cu

-  subpass.num_rows    = max_row - subpass.skip_rows;
+  max_row             = std::min<size_t>(max_row, pass_end);
+  // Make sure we don't skip past the max rows.
+  CUDF_EXPECTS(max_row > subpass.skip_rows, "Unexpected subpass row count");


Catch the infinite for loop here if occurs.

nvdbaranec · 2025-03-04T20:06:14Z

cpp/src/io/parquet/reader_impl_chunking.cu

@@ -539,19 +539,24 @@ struct get_page_span {
    auto const column_page_end   = page_row_index + page_offsets[column_index + 1];
    auto const num_pages         = column_page_end - column_page_start;
    bool const is_list           = chunks[column_index].max_level[level_type::REPETITION] > 0;
+    // list rows can span page boundaries, so it is not always safe to filter page spans simply
+    // based on `start_row` so select all pages.
+    if (is_list) {


I'm not sure this is the right way to do this. If I'm understanding what's going on here, this will just arbitrarily return the entire row group. Which I'm pretty sure will cause us to read all pages for all columns in a row group, which reverts us back to just being a row group reader in the case where lists exist.

I think the only real option here, if you have lists, is for each eligible row group

march through the entire row group, using the row estimates as we do now.

for each subpass that we read in, after we compute the real row counts (roughly here

cudf/cpp/src/io/parquet/reader_impl_preprocess.cu

Line 1485 in ea2bca3

size_t first_page_index = 0;

) determine re-apply skip_rows/num_rows. If we have managed to read a subpass that contains none of the rows we care about, early out and move on to the next subpass.

stop early if we cross the last row index we care about.

nvdbaranec

Minor note on that one function name.

nvdbaranec · 2025-05-27T19:06:23Z

cpp/src/io/parquet/reader_impl_preprocess.cu

+    thrust::for_each(rmm::exec_policy_nosync(_stream),
+                     thrust::counting_iterator<size_t>(0),
+                     thrust::counting_iterator(pass.pages.size()),
+                     compute_page_num_rows_from_chunk_rows{pass.pages, pass.chunks});


Oh. I was referring to generate_list_column_row_count_estimates()

Matt711

One question about the behavior when skip_rows > len(df) before I approve

python/cudf/cudf/tests/test_parquet.py

Co-authored-by: Matthew Murray <[email protected]>

Matt711 · 2025-05-27T22:47:31Z

PR is blocked by #18971. Will rerun CI on this PR once 18971 gets in

Matt711 · 2025-05-28T21:47:17Z

This should be good to go once #19016 is in. For real this time lol

Matt711 · 2025-05-29T11:44:07Z

/merge

…#19036) Follows up #18130. This PR removes the workaround that manually sliced off `skip_rows` from chunked Parquet reads by instead using built-in `set_skip_rows` API. Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #19036

@etseidl

Closes rapidsai#16186 Closes rapidsai#16249 This PR adds multiple bug fixes required to enable the `skip_rows` option for the Chunked Parquet reader. CC: @etseidl Authors: - Muhammad Haseeb (https://github.com/mhaseeb123) - Matthew Murray (https://github.com/Matt711) Approvers: - https://github.com/nvdbaranec - Vukasin Milovanovic (https://github.com/vuule) - Matthew Murray (https://github.com/Matt711) URL: rapidsai#18130

…rapidsai#19036) Follows up rapidsai#18130. This PR removes the workaround that manually sliced off `skip_rows` from chunked Parquet reads by instead using built-in `set_skip_rows` API. Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: rapidsai#19036

Fix for allowing skip_rows in Chunked parquet reader.

0d9b9cc

github-actions bot assigned mhaseeb123 Feb 28, 2025

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 28, 2025

mhaseeb123 added bug Something isn't working 2 - In Progress Currently a work in progress cuIO cuIO issue breaking Breaking change labels Feb 28, 2025

mhaseeb123 and others added 2 commits February 28, 2025 03:21

Style fix

8a8dc88

Merge branch 'branch-25.04' into fix/enable-skip-rows-chunked-pq-reader

6bd4caa

mhaseeb123 changed the title ~~🚧 Enable skip_rows in the chunked parquet reader.~~ 🚧 Enable skip_rows in the chunked parquet reader. Feb 28, 2025

Matt711 reviewed Feb 28, 2025

View reviewed changes

Add pytests and improvements

1e1cc09

github-actions bot added the Python Affects Python cuDF API. label Feb 28, 2025

github-project-automation bot added this to cuDF Python Feb 28, 2025

mhaseeb123 added 3 commits February 28, 2025 23:45

Corrected bug fix and updated pytests

05d90dd

Fix for lists

5cccd8f

Bug fix for nested types and lists

657117c

mhaseeb123 requested a review from nvdbaranec March 1, 2025 04:31

mhaseeb123 commented Mar 1, 2025

View reviewed changes

mhaseeb123 and others added 4 commits February 28, 2025 20:59

Merge branch 'branch-25.04' into fix/enable-skip-rows-chunked-pq-reader

8484d98

Minor comment update

3b10c45

Codespell

f508dc3

Minor comments update

913eca0

mhaseeb123 commented Mar 3, 2025

View reviewed changes

mhaseeb123 and others added 3 commits March 4, 2025 01:49

Paramterize and remove repeated checks

bd4b752

Merge branch 'branch-25.04' into fix/enable-skip-rows-chunked-pq-reader

75f1360

Merge branch 'branch-25.04' into fix/enable-skip-rows-chunked-pq-reader

8cb2c65

nvdbaranec requested changes Mar 4, 2025

View reviewed changes

Revert list change

c403884

mhaseeb123 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels May 22, 2025

mhaseeb123 marked this pull request as ready for review May 22, 2025 21:29

mhaseeb123 requested a review from nvdbaranec May 23, 2025 20:17

mhaseeb123 and others added 2 commits May 23, 2025 13:18

Merge branch 'branch-25.08' into fix/enable-skip-rows-chunked-pq-reader

51d54e9

Minor improvements

3933c2e

nvdbaranec approved these changes May 27, 2025

View reviewed changes

vuule approved these changes May 27, 2025

View reviewed changes

Matt711 reviewed May 27, 2025

View reviewed changes

python/cudf/cudf/tests/test_parquet.py Outdated Show resolved Hide resolved

python/cudf/cudf/tests/test_parquet.py Show resolved Hide resolved

mhaseeb123 and others added 2 commits May 27, 2025 13:07

Update python/cudf/cudf/tests/test_parquet.py

43852a6

Co-authored-by: Matthew Murray <[email protected]>

Add a new gtest

ebc5442

mhaseeb123 requested a review from Matt711 May 27, 2025 20:15

Matt711 approved these changes May 27, 2025

View reviewed changes

mhaseeb123 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels May 27, 2025

Merge branch 'branch-25.08' into fix/enable-skip-rows-chunked-pq-reader

a5fe1f8

Merge branch 'branch-25.08' into fix/enable-skip-rows-chunked-pq-reader

22cdcf4

github-actions bot assigned Matt711 May 28, 2025

Matt711 added 2 commits May 28, 2025 21:11

Merge branch 'branch-25.08' into fix/enable-skip-rows-chunked-pq-reader

7c5e98d

Merge branch 'branch-25.08' into fix/enable-skip-rows-chunked-pq-reader

e5b3e57

Matt711 mentioned this pull request May 29, 2025

Remove skip_rows workaround for chunked Parquet reader in cudf-polars #19036

Merged

3 tasks

rapids-bot bot merged commit 5be254e into rapidsai:branch-25.08 May 29, 2025
127 checks passed

github-project-automation bot moved this from In Progress to Done in cuDF Python May 29, 2025

mhaseeb123 deleted the fix/enable-skip-rows-chunked-pq-reader branch May 29, 2025 19:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Enable `skip_rows` in the chunked parquet reader. #18130

Enable `skip_rows` in the chunked parquet reader. #18130

Uh oh!

mhaseeb123 commented Feb 28, 2025 •

edited

Loading

Uh oh!

copy-pr-bot bot commented Feb 28, 2025

Uh oh!

Matt711 left a comment

Uh oh!

mhaseeb123 Mar 1, 2025 •

edited

Loading

Uh oh!

nvdbaranec Mar 4, 2025

Uh oh!

mhaseeb123 Mar 3, 2025

Uh oh!

nvdbaranec Mar 4, 2025

Uh oh!

nvdbaranec left a comment

Uh oh!

nvdbaranec May 27, 2025

Uh oh!

Matt711 left a comment

Uh oh!

Uh oh!

Uh oh!

Matt711 commented May 27, 2025 •

edited

Loading

Uh oh!

Matt711 commented May 28, 2025

Uh oh!

Matt711 commented May 29, 2025

Uh oh!

Uh oh!

Uh oh!

Enable skip_rows in the chunked parquet reader. #18130

Enable skip_rows in the chunked parquet reader. #18130

Uh oh!

Conversation

mhaseeb123 commented Feb 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Uh oh!

copy-pr-bot bot commented Feb 28, 2025

Uh oh!

Matt711 left a comment

Choose a reason for hiding this comment

Uh oh!

mhaseeb123 Mar 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nvdbaranec Mar 4, 2025

Choose a reason for hiding this comment

Uh oh!

mhaseeb123 Mar 3, 2025

Choose a reason for hiding this comment

Uh oh!

nvdbaranec Mar 4, 2025

Choose a reason for hiding this comment

Uh oh!

nvdbaranec left a comment

Choose a reason for hiding this comment

Uh oh!

nvdbaranec May 27, 2025

Choose a reason for hiding this comment

Uh oh!

Matt711 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Matt711 commented May 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Matt711 commented May 28, 2025

Uh oh!

Matt711 commented May 29, 2025

Uh oh!

Uh oh!

Uh oh!

Enable `skip_rows` in the chunked parquet reader. #18130

Enable `skip_rows` in the chunked parquet reader. #18130

mhaseeb123 commented Feb 28, 2025 •

edited

Loading

mhaseeb123 Mar 1, 2025 •

edited

Loading

Matt711 commented May 27, 2025 •

edited

Loading