Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] During Active-AddFile-Log-Replay do not pass the RemoveFile to checkpoint reader #4137

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

anoopj
Copy link
Collaborator

@anoopj anoopj commented Feb 8, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Implemented a minor performance improvement to not read any RemoveFiles when we read checkpoint Parquet files during active-add-file-log-replay. Fixes #4102

How was this patch tested?

Existing unit test, manual test using
delta/kernel/examples/run-kernel-examples.py --use-local
and
./run-tests.py --group kernel

Does this PR introduce any user-facing changes?

No

@anoopj anoopj force-pushed the anoopj-remove branch 3 times, most recently from 4a92f4e to 6d05a97 Compare February 10, 2025 18:24
@anoopj anoopj requested a review from scottsand-db February 10, 2025 18:54
Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just left some minor comments about variable name and comments.

After this -- LGTM!

Thanks!

Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

…to checkpoint reader

#### Which Delta project/connector is this regarding?

- [ ] Spark
- [ ] Standalone
- [ ] Flink
- [X] Kernel
- [ ] Other (fill in here)

## Description

Implemented a minor performance improvement to not read any RemoveFiles when we read checkpoint Parquet files during active-add-file-log-replay.
Fixes delta-io#4102

## How was this patch tested?

Existing unit test, manual test using
./run-tests.py --group kernel
and
delta/kernel/examples/run-kernel-examples.py --use-local

## Does this PR introduce _any_ user-facing changes?

No
Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! thank you!

@@ -60,7 +60,14 @@ public class ActionsIterator implements CloseableIterator<ActionWrapper> {
*/
private final LinkedList<DeltaLogFile> filesList;

private final StructType readSchema;
/** Schema used for reading delta files. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: update the class docs to indicate two schemas of data.

Comment on lines 123 to +125
/**
* @return a tuple of (ColumnarBatch, isFromCheckpoint), where ColumnarBatch conforms to the
* instance {@link #readSchema}.
* instance {@link #deltaReadSchema}.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update these docs to reflect that we return #checkpointReadSchema batches when isFromCheckpoint=true?

// Step 3: Drop the RemoveFile column and use the selection vector to build a new
// FilteredColumnarBatch
ColumnarBatch scanAddFiles = addRemoveColumnarBatch.withDeletedColumnAt(1);
// For Parquet files, we would only have read the adds, not the removes, hence the check.
if (scanAddFiles.getSchema().length() > 1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also use isFromCheckpoint, which I maybe prefer in case in the future there's ever additional columns added or anything?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open to opinions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants