Enable `skip_rows` in the chunked parquet reader #18130
Conversation
I think we can support skipping row groups in cudf.polars (xref #16257) once this PR goes through.
```cpp
@@ -539,19 +539,24 @@ struct get_page_span {
  auto const column_page_end = page_row_index + page_offsets[column_index + 1];
  auto const num_pages       = column_page_end - column_page_start;
  bool const is_list = chunks[column_index].max_level[level_type::REPETITION] > 0;
  // list rows can span page boundaries, so it is not always safe to filter page spans simply
  // based on `start_row` so select all pages.
  if (is_list) {
```
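As a toy illustration of the comment in the hunk above (plain Python, not cudf internals; the page boundaries and row ranges are invented): for a non-list column, the estimated page start rows give an exact page span via binary search, but for a list column a row's values can physically spill past the estimated page boundary, so the conservative choice is to select every page.

```python
import bisect

# Estimated first row of each page in one column chunk (hypothetical numbers).
page_start_rows = [0, 100, 200, 300]
num_pages = len(page_start_rows)

def naive_page_span(start_row, end_row):
    # Select pages [first, last) whose estimated row ranges intersect
    # [start_row, end_row). Exact for non-list columns, where page row
    # indices are precise.
    first = bisect.bisect_right(page_start_rows, start_row) - 1
    last = bisect.bisect_left(page_start_rows, end_row)
    return first, last

# Non-list column: rows 150..250 need exactly pages 1 and 2.
assert naive_page_span(150, 250) == (1, 3)

# List column: the start rows above are only estimates, because a list row
# can straddle a page boundary, so a span computed this way can miss a
# page. Taking the full span sidesteps that:
def list_page_span():
    return 0, num_pages

assert list_page_span() == (0, 4)
```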
Not 100% sure about this fix yet, but it seems like we do end up needing more pages than found with `start_row` and `end_row` for list columns in some cases. I am guessing the `page_row_index` doesn't contain exact row indices for list columns? Maybe @etseidl or @nvdbaranec can elaborate if possible.

Question: there shouldn't be any performance penalties from this change, as we do a trim pass later on with accurate row counts for list pages, right?

The following case fails (not all pages for the list column `c` are selected) without this change:
```python
import cudf
from cudf.testing import assert_eq
import pyarrow as pa
from io import BytesIO

data = [
    {
        "a": "g",
        "b": {
            "b_a": 10,
            "b_b": {"b_b_b": None, "b_b_a": 2},
        },
        "c": None,
    },
    {"a": None, "b": {"b_a": None, "b_b": None}, "c": [15, 16]},
    {"a": "j", "b": None, "c": [8, 10]},
    {"a": None, "b": {"b_a": None, "b_b": None}, "c": None},
    None,
    {
        "a": None,
        "b": {"b_a": None, "b_b": {"b_b_b": 1}},
        "c": [18, 19],
    },
    {"a": None, "b": None, "c": None},
] * 1000

pa_struct = pa.Table.from_pydict({"struct": data})
df = cudf.DataFrame.from_arrow(pa_struct)

buffer = BytesIO()
df.to_parquet(buffer, row_group_size_rows=7000, max_page_size_rows=100)

with cudf.option_context("io.parquet.low_memory", True):
    actual = cudf.read_parquet(
        [buffer],
        _chunk_read_limit=0,
        _pass_read_limit=1024000000,
        nrows=999,
        skip_rows=6001,
    ).reset_index(drop=True)
expected = cudf.read_parquet(buffer, nrows=999, skip_rows=6001).reset_index(
    drop=True
)
assert_eq(expected, actual)
```
I'm not sure this is the right way to do this. If I'm understanding what's going on here, this will just arbitrarily return the entire row group. Which I'm pretty sure will cause us to read all pages for all columns in a row group, which reverts us back to just being a row group reader in the case where lists exist.

I think the only real option here, if you have lists, is for each eligible row group to:

- march through the entire row group, using the row estimates as we do now.
- for each subpass that we read in, after we compute the real row counts (roughly here: `cpp/src/io/parquet/reader_impl_preprocess.cu`, line 1485 in `ea2bca3`, near `size_t first_page_index = 0;`), stop early if we cross the last row index we care about.
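Roughly what that loop could look like, sketched in plain Python (hypothetical shapes, not cudf internals; `subpass_row_counts` stands in for the real per-subpass counts that only become known after each subpass is decoded):

```python
def read_with_early_stop(subpass_row_counts, skip_rows, num_rows):
    """Walk subpasses through a row group and stop once the last
    requested row has been crossed."""
    end_row = skip_rows + num_rows
    rows_seen = 0
    decoded = []
    for i, count in enumerate(subpass_row_counts):
        decoded.append(i)         # decode this subpass
        rows_seen += count        # only now is the real row count known
        if rows_seen >= end_row:  # crossed the last row we care about
            break                 # stop early; later subpasses never read
    return decoded, rows_seen

# With skip_rows=50 and nrows=120, only the first two of four subpasses
# ever need to be decoded.
decoded, rows_seen = read_with_early_stop([100, 100, 100, 100],
                                          skip_rows=50, num_rows=120)
assert decoded == [0, 1]
assert rows_seen == 200
```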
```cpp
max_row = std::min<size_t>(max_row, pass_end);
// Make sure we don't skip past the max rows.
CUDF_EXPECTS(max_row > subpass.skip_rows, "Unexpected subpass row count");
subpass.num_rows = max_row - subpass.skip_rows;
```
Catch the infinite for loop here if it occurs.
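A minimal sketch of the failure mode this guards against (invented numbers, not the actual cudf control flow): the estimated `max_row` can land past the pass boundary, and if the clamped value doesn't advance past `skip_rows`, the subpass would report zero rows and the outer read loop would never make progress.

```python
def next_subpass_num_rows(max_row, pass_end, skip_rows):
    # Mirrors the hunk above: clamp to the pass bound, then make sure the
    # subpass still advances past skip_rows before computing its row count.
    max_row = min(max_row, pass_end)
    if max_row <= skip_rows:
        # Equivalent of the CUDF_EXPECTS: a zero-row subpass would mean
        # the read loop spins forever.
        raise RuntimeError("Unexpected subpass row count")
    return max_row - skip_rows

# Estimate overshoots the pass boundary; the clamp still yields progress.
assert next_subpass_num_rows(max_row=7000, pass_end=6000, skip_rows=5000) == 1000
```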
Minor note on that one function name.
```cpp
thrust::for_each(rmm::exec_policy_nosync(_stream),
                 thrust::counting_iterator<size_t>(0),
                 thrust::counting_iterator(pass.pages.size()),
                 compute_page_num_rows_from_chunk_rows{pass.pages, pass.chunks});
```
Oh. I was referring to `generate_list_column_row_count_estimates()`.
One question about the behavior when `skip_rows > len(df)` before I approve.
Co-authored-by: Matthew Murray <[email protected]>
PR is blocked by #18971. Will rerun CI on this PR once 18971 gets in.
This should be good to go once #19016 is in. For real this time lol
/merge
…#19036) Follows up #18130. This PR removes the workaround that manually sliced off `skip_rows` from chunked Parquet reads by instead using the built-in `set_skip_rows` API.

Authors:
- Matthew Murray (https://github.com/Matt711)

Approvers:
- Matthew Roeschke (https://github.com/mroeschke)

URL: #19036
Closes rapidsai#16186
Closes rapidsai#16249

This PR adds multiple bug fixes required to enable the `skip_rows` option for the Chunked Parquet reader.

CC: @etseidl

Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Matthew Murray (https://github.com/Matt711)

Approvers:
- https://github.com/nvdbaranec
- Vukasin Milovanovic (https://github.com/vuule)
- Matthew Murray (https://github.com/Matt711)

URL: rapidsai#18130
Description
Closes #16186
Closes #16249
This PR adds multiple bug fixes required to enable the `skip_rows` option for the Chunked Parquet reader.

CC: @etseidl
Checklist