Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Support IncrementalChangelogScan with deletes #9888

Closed
wants to merge 1 commit into from

Conversation

manuzhang
Copy link
Collaborator

@manuzhang manuzhang commented Mar 7, 2024

Pick up #6182 with an alternative implementation

cc @aokolnychyi @szehon-ho @flyrain for review

@manuzhang manuzhang force-pushed the changelog-deletes branch from 7f0170c to 8318f2b Compare March 7, 2024 08:43
wypoon added a commit to wypoon/iceberg that referenced this pull request Aug 16, 2024
…9888.

This uses the BaseIncrementalChangeloggScan implementation from apache#9888
and the rest of the changes (including tests) from apache#10935 to test
the BaseIncrementalChangeloggScan implementation.
@wypoon
Copy link
Contributor

wypoon commented Aug 16, 2024

You have made changes in BaseIncrementalChangelogScan that produces BaseDeletedRowsScanTask in certain cases. However, you have not made changes in the Spark ChangelogRowReader to handle DeletedRowsScanTasks. As reading the changelog has only been implemented in Spark thus far, the changes cannot be consumed.
You have added some tests to the Spark TestCreateChangelogViewProcedure, which actually makes use of ChangelogRowReader to read the changelog, but those tests currently don't actually test any DeletedRowsScanTasks, because the tables used in the test are not created with write modes set to merge-on-read (the default is copy-on-write); no delete files are ever created in the tests.

I took my changes in #10935 and swapped out my BaseIncrementalChangelogScan implementation for yours (see #10954). I was interested to see how your BaseIncrementalChangelogScan implementation works. (I needed my other changes in order for Spark changelog reading to actually work.) I found that testDataFilters and testDataRewritesAreIgnored in TestChangelogTable failed for the merge-on-read case (in the former, the delete is not detected, and in the latter, the second delete is detected but not the first). In TestChangelogReader, testPositionDeletes, testEqualityDeletes and testMixOfPositionAndEqualityDeletes failed. It seems to me that your implementation is not catching all deletes that can happen.

context.specAsString(),
context.residuals());
} else {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is one way! ;-)
In my implementation, I use a placeholder DummyChangelogScanTask that is filtered out.

@manuzhang
Copy link
Collaborator Author

Closed in favor of #10935

@manuzhang manuzhang closed this Sep 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants