Core/RewriteFiles: Duplicate Data Bug - Fixed dropping delete files that are still required #10962
base: main
Conversation
I'm not sure what the policy on supporting older versions of Iceberg is, but this might be something that should be patched / backported to all supported versions.
Looks good to me, but would appreciate if someone else took a look too.
Flink jobs all failed with timeout after 6h (https://github.com/apache/iceberg/actions/runs/10442521766/job/28914906512?pr=10962).
@findepi sorry for the ping. This causes duplicate data, though only in rare orders of operations, but I still think it's important to merge. Any way to move this forward?
@jasonf20 the CI looks red. Please try rebasing the PR and make sure it's green. I am not aware of any pre-existing flakiness that could cause this.
@findepi I see, it seems to be the Flink tests specifically. I can't tell from the logs which test is failing. I'm trying to run them locally, but I'm not sure how long that will take or if they will work locally / ever finish. Any way to get more information in the build about what failed?
The build times out. Can it be that something takes more time as a result of the changes being introduced here?
@findepi
Debugging that, I found that it reached the code I changed with
I'm pushing a new version. EDIT:
I suppose that fails the SQL execution. The test just seems to hang instead of failing with an error message, so some improvements to the test infrastructure are probably required. Unrelated to this, though, I think the bug has been fixed. It might also be advisable to see why Flink passes data sequence numbers of
@findepi Looks good now :)
Thanks for fixing the build!
LGTM, but someone more experienced with snapshot producers should also take a look.
I'll take a pass over this PR tomorrow! Thanks for your patience @jasonf20!
long minNewFileSequenceNumber =
    addedDataFiles().stream()
        .filter(x -> x.dataSequenceNumber() != null && x.dataSequenceNumber() >= 0)
        .mapToLong(ContentFile::dataSequenceNumber)
        .reduce(
            newDataFilesDataSequenceNumber != null
                ? newDataFilesDataSequenceNumber
                : base.nextSequenceNumber(),
            Math::min);
Do we actually need to iterate through the addedDataFiles?
If I understood the issue correctly, the problem is that it's possible for a user to commit a rewrite operation and specify an older data sequence number, and the current logic would drop delete files that still need to be referenced in the new commit, since it's not considering the specified data file sequence number.
So I think all we would need to do here is keep the existing logic for determining minDataSequenceNumber and then also min that with newDataFilesDataSequenceNumber if it's not null:
long minNewDataSequenceNumber = <Existing logic>
if (newDataFilesDataSequenceNumber != null) {
minNewDataSequenceNumber = Math.min(minNewDataSequenceNumber, newDataFilesDataSequenceNumber);
}
I tried the above and the new unit test still passes, though let me know if you see a flaw in my reasoning. cc @rdblue @aokolnychyi @findepi
Currently I think (I'm also not 100% sure) you are right, as there isn't an API that would allow setting a specific data file to a SN smaller than newDataFilesDataSequenceNumber. But I think that's just a side effect of the current operations inheriting from MergingSnapshotProducer. A new inheritor of MergingSnapshotProducer could simply use minNewDataSequenceNumber as a default and pass in files with explicit SNs on top.
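To make that concern concrete, here is a tiny hypothetical example (the per-file SN assignment and the class below are invented for illustration, not an existing Iceberg API): if a producer assigned one added file an explicit SN of 3 while the operation-wide default was 5, min-ing with the default alone would yield 5, whereas iterating the added files yields 3.

import java.util.List;

// Illustration only: per-file sequence numbers that an imagined MergingSnapshotProducer
// subclass might assign; the values are invented.
final class ExplicitSequenceNumberSketch {
  public static void main(String[] args) {
    long operationDefault = 5L; // stand-in for newDataFilesDataSequenceNumber
    List<Long> perFileSeqNums = List.of(5L, 3L); // one file carries an explicit, smaller SN

    // Min over the added files (the approach taken in this PR): catches the smaller SN.
    long perFileMin =
        perFileSeqNums.stream().mapToLong(Long::longValue).reduce(operationDefault, Math::min);

    System.out.println(operationDefault + " vs " + perFileMin); // prints "5 vs 3"
  }
}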
// `firstRewriteSequenceNumber` explicitly which is the same as that of A2_DELETES and before B2_DELETES

// Technically A1_DELETES could be removed since it's an equality delete and should apply on
Sorry, not sure I really follow the comment? There is no A1_DELETES... there's a FILE_A_DELETES, but that's a positional delete. I don't see those referenced in the above operations.
I think it's true, though, that we should be able to drop equality deletes older than the minimum sequence number, but isn't that already happening in the existing MergingSnapshotProducer check? I don't think anything needs to distinguish between equality and positional deletes there.
I meant to say FILE_A2_DELETES. The existing snapshot producer check drops delete files older than the minimum sequence number only if their sequence number is strictly smaller. So it leaves in place equality deletes whose SN is equal to the minimum SN, while equality deletes only apply to data files with SNs strictly smaller than the delete file's SN. So you can actually drop equality delete files whose sequence number equals the minimum sequence number of the new snapshot.
In an even broader sense, FILE_A2_DELETES could be ignored completely when doing the initial commit here, since there are no data files before it (this is the first commit), so effectively it doesn't do anything.
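As a small illustration of that strict-inequality argument, here is a hypothetical helper (not the Iceberg implementation) that captures when an equality delete can still match anything, assuming every remaining data file has a data SN of at least minDataSeq:

final class EqualityDeleteAgingSketch {
  // Per the Iceberg spec, an equality delete applies only to data files whose data
  // sequence number is strictly smaller than the delete file's sequence number. So if
  // every remaining data file has dataSeq >= minDataSeq, a delete with
  // deleteSeq <= minDataSeq can never match anything and is safe to drop.
  static boolean equalityDeleteStillUseful(long deleteSeq, long minDataSeq) {
    return deleteSeq > minDataSeq;
  }
}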
@amogh-jahagirdar If you'd like, I can change the code to not iterate the files if that will make this quicker to merge. I think it's preferable to iterate them, but it should be fine either way for now.
@amogh-jahagirdar gentle reminder about this
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
I'll send a message about this to the dev@iceberg.apache.org list.
@@ -384,6 +386,116 @@ public void testRewriteDataAndAssignOldSequenceNumber() {
    assertThat(listManifestFiles()).hasSize(4);
  }

  @TestTemplate
  public void testRewriteDataAndAssignOldSequenceNumbersShouldNotDropDeleteFiles() {
I've gone through this test and I think that the assertions are correct for the sequence of changes.
@@ -833,7 +833,17 @@ public List<ManifestFile> apply(TableMetadata base, Snapshot snapshot) {
        filterManager.filterManifests(
            SnapshotUtil.schemaFor(base, targetBranch()),
            snapshot != null ? snapshot.dataManifests(ops.io()) : null);
    long minDataSequenceNumber =

    long minNewFileSequenceNumber =
@jasonf20 can you summarize what the problem was? I didn't see it in the PR description. These changes look correct to me but it would be helpful to know where the issue was.
@rdblue Sure: the filtered manifest list could exclude manifests with a certain sequence number because their data files were re-written.
For example, the current rewrite replaces the manifest that contained the min sequence number, but preserves that file at that SN. The filtered list won't contain that manifest, but the new manifest created will still have that same sequence number if the rewrite kept the sequence number small enough.
So the existing logic here, which just took the min from the filtered list, could miss the actual minimum, since the manifest containing the actual minimum was re-written as part of this apply. So we also need to consider the minimum SN that will be present in the manifest created by this operation.
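For illustration, a minimal sketch of that failure mode with invented stand-in types (SurvivingManifest and RewrittenFile are hypothetical, and this is not the actual MergingSnapshotProducer code): the minimum has to be taken over both the surviving manifests and the files this operation is about to write.

import java.util.List;

// Invented stand-ins for illustration only.
record SurvivingManifest(long minSequenceNumber) {}

record RewrittenFile(Long dataSequenceNumber) {}

final class MinSequenceNumberSketch {
  // Minimum taken only over manifests that survive filtering. If the manifest holding the
  // true minimum was itself rewritten (and therefore dropped from the filtered list), this
  // value is too high and still-needed delete files can be aged out.
  static long minFromFilteredOnly(List<SurvivingManifest> filtered, long defaultSeq) {
    return filtered.stream()
        .mapToLong(SurvivingManifest::minSequenceNumber)
        .reduce(defaultSeq, Math::min);
  }

  // Also considers the sequence numbers this operation will write into its new manifest,
  // so a rewrite that preserves an old SN keeps that SN counted toward the minimum.
  static long minIncludingRewrittenFiles(
      List<SurvivingManifest> filtered, List<RewrittenFile> added, long defaultSeq) {
    long fromAdded =
        added.stream()
            .filter(f -> f.dataSequenceNumber() != null && f.dataSequenceNumber() >= 0)
            .mapToLong(RewrittenFile::dataSequenceNumber)
            .reduce(defaultSeq, Math::min);
    return Math.min(minFromFilteredOnly(filtered, defaultSeq), fromAdded);
  }
}

With this shape, a rewrite that filters out the only manifest holding sequence number 1 while re-adding a file at sequence number 1 still yields a minimum of 1, rather than the higher value computed from the filtered list alone.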
One note on this. Since I created this PR, another PR changed the way the filtered manifest list is calculated, and I don't believe my test reproduces the issue anymore if you undo the code fix.
However, I think the fix is still correct: the minSequenceNumber of the current operation should still be considered in addition to the filtered list.
Okay, that's what I thought was happening, but I wanted to confirm. I think it would be helpful to have a comment in the test that states what would have failed.
So in the test, the file that was missing was the first delete file, FILE_A2_DELETES, which had sequence number 1, right? In the second rewrite, FILE_A is rewritten as FILE_A2, but it is rewritten at the same sequence number (1). The min sequence number of the other files is 3, for FILE_D, after the first rewrite.
Why is the first rewrite needed? Couldn't you reproduce the problem by rewriting just FILE_A to FILE_A2 after separately committing FILE_B? The test case seems over-complicated to me, so I think we need one that:
- Reproduces the problem without this fix, and
- Is a minimal set of steps to reproduce it
All the operations were required to reproduce the issue, but it's been a while, so I don't remember the specifics. I could try to get back into it and see if I can create a new minimal test, but I'm not sure exactly when I will be free to do that.
@rdblue I had some time to look at this today and have managed to reduce the test to the minimal steps required.
I pushed a new commit that fixes a bug introduced in the PR mentioned above. After fixing that bug, my new simplified test reproduces the issue.
@amogh-jahagirdar I would appreciate it if you could look at the first commit, since it's related to your PR. Thanks!
Took a look. @jasonf20, see my comment here: the change to not rewrite deletes when a manifest only has aged-out deletes was actually intentional, and I don't think we want to add that behavior back. Let me take a look to see how we can simplify the tests in this PR, considering that manifests with only aged-out deletes will only be rewritten opportunistically if some other changes prompt a rewrite of the manifest.
@amogh-jahagirdar Responded here.
This got broken when canTrustManifestReferences was added. If the manifest has a small enough minSequenceNumber, it still needs to be checked even if canTrustManifestReferences is true.
This fixes a bug where delete files that are still required by the re-written files could be deleted prematurely. This happened because `MergingSnapshotProducer` calculated `minDataSequenceNumber` in a way that could skip data sequence numbers that do exist in the final snapshot: the filtered manifest list could exclude manifests of a certain sequence number because their data files were re-written. With specific orders of operations it was possible to skip the sequence number of manifests that were re-written.
if (manifest.minSequenceNumber() > 0 && manifest.minSequenceNumber() < minSequenceNumber) {
  return true;
}
I don't think we want to add this; I intentionally removed the behavior of rewriting a manifest when it only has aged-out deletes. See here for a more detailed description: https://github.com/apache/iceberg/pull/11131/files#r1815710103.
If that's the case, then doesn't the last condition here do nothing, since this check can only be reached if one of the earlier checks was true anyway:
deletePaths.contains(file.location())
    || deleteFiles.contains(file)
    || dropPartitions.contains(file.specId(), file.partition())
    || (isDelete
        && entry.isLive()
        && entry.dataSequenceNumber() > 0
        && entry.dataSequenceNumber() < minSequenceNumber);
It seems like perhaps dropDeleteFilesOlderThan has no effect anymore (unless maybe allDeletesReferenceManifests gets set to false or something).
I think not removing by minSequenceNumber leaves undeleted delete files that just never get applied to any files at query time, so it's not the end of the world, but it does lead to some wasted storage and slightly longer scan planning times.
Assuming we want to keep this behaviour, perhaps we should just not use dropDeleteFilesOlderThan in MergingSnapshotProducer anymore?
We can try getting a minimal test working by doing an actual delete or something from a shared manifest. But it seems like dropDeleteFilesOlderThan is not exactly doing anything right now.
As a side note, I think it's better to make sure dropDeleteFilesOlderThan does work. We have encountered tables with many old EQ delete files that were compacted and no longer apply to any data files, but were not removed from the table's manifests.
Note that some manifests contain only irrelevant deletes and would not opportunistically get re-written for other purposes. Also, RewriteManifests currently ignores delete manifests.
Sorry for the delay. Yes, I believe you are right that the last condition, specifically at https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L427, is redundant; we can remove that! There is no longer any need to rely on that condition, because the only time we'd hit that point is when the manifest needs to be rewritten for correctness based on the other criteria. Then, when the manifest does get rewritten, if there happen to be aged-out deletes, we will rewrite them here: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L475
OK, so breaking down the pros/cons of eagerly removing aged-out deletes during a commit where a manifest only has aged-out deletes:
The pro of eager removal is that the space consumed by metadata is reduced, and planning times are at least slightly reduced.
The downside is that even if a manifest does not need to be rewritten, the commit process incurs additional latency and risk of failure (the IO involved) to write out a new manifest.
IMO, the upside of reclaiming some metadata storage and reducing plan times doesn't justify the risk of failure and additional latency incurred on the commit path. Maybe it's worth it only in the most extreme case, where some large percentage of manifests have only aged-out deletes?
I think you brought up a good point, though, on rewrite manifests: the Java API itself will just retain all the delete manifests, but I do think the Spark action supports rewriting delete manifests: https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java#L179
Perhaps it would be better for rewrite manifests to remove the aged-out deletes? At that point the work of removing the aged-out deletes is more intentional, in the context of table maintenance, rather than on the general commit path for MergingSnapshotProducer implementations. I'll need to think more on this.
Perhaps it can be added as a table property. As an anecdote, I have encountered tables with frequent updates that have 100K+ inactive delete files in their manifests. And since Spark isn't used in this env, performing a cleanup isn't simple.
I think the above issue should be addressed, but perhaps we got a bit sidetracked in the context of this bug. How would you like to proceed here? This is still a valid bug fix; can you think of a trick to make the test reproduce the issue easily without reverting the change to dropDeleteFilesOlderThan? If not, can we merge it regardless? It's still a valid fix.
I agree with Amogh here. Unnecessary cleanup is probably not a good idea in the core library when it requires rewriting manifests that would otherwise not be modified.
This fixes a bug where delete files that are still required by the re-written files could be deleted prematurely.
This happened because MergingSnapshotProducer calculated minDataSequenceNumber in a way that could skip data sequence numbers that do exist in the final snapshot. The filtered manifest list could exclude manifests of a certain sequence number because their data files were re-written. With specific orders of operations it was possible to skip the sequence number of manifests that were re-written.