-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kernel] During Active-AddFile-Log-Replay do not pass the RemoveFile to checkpoint reader #4137
base: master
Are you sure you want to change the base?
Conversation
4a92f4e
to
6d05a97
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Just left some minor comments about variable name and comments.
After this -- LGTM!
Thanks!
kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/ActionsIterator.java
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/ActionsIterator.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/ActiveAddFilesIterator.java
Show resolved
Hide resolved
6d05a97
to
e40da17
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/ActionsIterator.java
Outdated
Show resolved
Hide resolved
…to checkpoint reader #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [X] Kernel - [ ] Other (fill in here) ## Description Implemented a minor performance improvement to not read any RemoveFiles when we read checkpoint Parquet files during active-add-file-log-replay. Fixes delta-io#4102 ## How was this patch tested? Existing unit test, manual test using ./run-tests.py --group kernel and delta/kernel/examples/run-kernel-examples.py --use-local ## Does this PR introduce _any_ user-facing changes? No
e40da17
to
fbe3c3d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! thank you!
@@ -60,7 +60,14 @@ public class ActionsIterator implements CloseableIterator<ActionWrapper> { | |||
*/ | |||
private final LinkedList<DeltaLogFile> filesList; | |||
|
|||
private final StructType readSchema; | |||
/** Schema used for reading delta files. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: update the class docs to indicate two schemas of data.
/** | ||
* @return a tuple of (ColumnarBatch, isFromCheckpoint), where ColumnarBatch conforms to the | ||
* instance {@link #readSchema}. | ||
* instance {@link #deltaReadSchema}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update these docs to reflect that we return #checkpointReadSchema batches when isFromCheckpoint=true
?
// Step 3: Drop the RemoveFile column and use the selection vector to build a new | ||
// FilteredColumnarBatch | ||
ColumnarBatch scanAddFiles = addRemoveColumnarBatch.withDeletedColumnAt(1); | ||
// For Parquet files, we would only have read the adds, not the removes, hence the check. | ||
if (scanAddFiles.getSchema().length() > 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could also use isFromCheckpoint
, which I maybe prefer in case in the future there's ever additional columns added or anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open to opinions
Which Delta project/connector is this regarding?
Description
Implemented a minor performance improvement to not read any RemoveFiles when we read checkpoint Parquet files during active-add-file-log-replay. Fixes #4102
How was this patch tested?
Existing unit test, manual test using
delta/kernel/examples/run-kernel-examples.py --use-local
and
./run-tests.py --group kernel
Does this PR introduce any user-facing changes?
No