Skip to content

Conversation

@lbhm
Copy link
Collaborator

@lbhm lbhm commented Sep 22, 2025

What changes are proposed in this pull request?

This PR follows up on #1272 and addresses open review comments.

  • It removes raw pointer indexing that can theoretically lead to panics.
    • We refactor impl IntoIter to a fallible RowInderBuilder::build method that verifies that row group ordinals are unique and within bounds.
  • It replaces complex type signatures with more intuitive type aliases.
  • It adds more unit tests for RowIndexBuilder.

Note: We cannot add integration tests for row index reads with filter predicates before addressing #860 and implementing predicate pushdown.

How was this change tested?

New UT.

@codecov
Copy link

codecov bot commented Sep 22, 2025

Codecov Report

❌ Patch coverage is 97.11538% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.73%. Comparing base (e8e3ebe) to head (d1efad5).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/engine/default/parquet.rs 0.00% 1 Missing and 1 partial ⚠️
kernel/src/engine/sync/parquet.rs 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1334      +/-   ##
==========================================
+ Coverage   84.66%   84.73%   +0.07%     
==========================================
  Files         113      113              
  Lines       28303    28396      +93     
  Branches    28303    28396      +93     
==========================================
+ Hits        23963    24062      +99     
+ Misses       3205     3198       -7     
- Partials     1135     1136       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lbhm lbhm requested a review from zachschuermann September 22, 2025 12:56
@lbhm lbhm force-pushed the row-index-cleanup branch from d9b401d to d08e062 Compare September 22, 2025 13:02
pub(crate) use prim_array_cmp;

type FieldIndex = usize;
type FlattenedRangeIterator<T> = std::iter::Flatten<std::vec::IntoIter<Range<T>>>;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The various call sites like fixup_parquet_read and reorder_struct_array should start using this instead of <RowIndexBuilder as IntoIterator>::IntoIter

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The various call sites like fixup_parquet_read and reorder_struct_array should start using this instead of <RowIndexBuilder as IntoIterator>::IntoIter

I also did this as part of this PR. I searched for all occurrences of <RowIndexBuilder as IntoIterator> and replaced them.

// We have to clone here to avoid modifying the original vector in each iteration
ordinals
.iter()
.filter_map(|&i| self.row_group_row_index_ranges.get(i).cloned())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think clone is a reasonable trade-off here, since (1) we are only cloning Ranges (i.e., two i64s) and (2) preventing the clone would require us to modify the underlying vector as we iterate (costly) or to design something like a FilteredRowIndexIterator that does not collect intermediate results.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, IMO Range<i64> should be Copy -- it's only 16 bytes, the same size as &[T] which is Copy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While working on this, I learned that Range<T> is not Copy because copyable iterators can be confusing.

Thus, we are only left with cloning as far as I'm aware.

Comment on lines +38 to +40
fn extract_record_batch(
scan_result: ScanResult,
) -> Result<RecordBatch, Box<dyn std::error::Error>> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is sort of a prefactoring. As I understand #860, we currently never produce a mask for filter predicates (only for DVs). Once we resolve #860, this method will be handy.

@lbhm lbhm requested a review from scovich September 23, 2025 08:53
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one nit

Copy link
Member

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@zachschuermann zachschuermann merged commit a27f366 into delta-io:main Sep 25, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants