Core/RewriteFiles: Duplicate Data Bug - Fixed dropping delete files that are still required #10962

Open · jasonf20 wants to merge 2 commits into main from rewrite-fix

Conversation

jasonf20 (Contributor)

This fixes a bug where delete files that are still required by the rewritten data files could be dropped prematurely.

This happened because MergingSnapshotProducer calculated minDataSequenceNumber in a way that could skip data sequence numbers that do exist in the final snapshot.

The filtered manifest list could exclude manifests at a certain sequence number because their data files were rewritten. With specific orders of operations, the sequence number of those rewritten manifests could therefore be skipped when computing the minimum.
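
For reference, a minimal sketch of the idea behind the fix (minFromFilteredManifests and defaultNewDataSequenceNumber are illustrative placeholders, not names from the actual patch): the minimum data sequence number used to age out delete files must also account for the data files added by this operation, because the manifest that previously carried that minimum may have been filtered out when its data files were rewritten.

    long minAddedFileSequenceNumber =
        addedDataFiles().stream()
            // ignore unassigned/sentinel values (e.g. the -1 some engines pass through)
            .filter(f -> f.dataSequenceNumber() != null && f.dataSequenceNumber() >= 0)
            .mapToLong(ContentFile::dataSequenceNumber)
            .reduce(defaultNewDataSequenceNumber, Math::min);
    long minDataSequenceNumber = Math.min(minFromFilteredManifests, minAddedFileSequenceNumber);
    filterManager.dropDeleteFilesOlderThan(minDataSequenceNumber);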

@github-actions github-actions bot added the core label Aug 18, 2024
@jasonf20 (Contributor Author) commented Aug 19, 2024

I'm not sure what the policy is on supporting older versions of Iceberg, but this might be something that should be patched / backported to all supported versions.

@findepi (Member) left a comment

Looks good to me, but would appreciate if someone else took a look too.

@findepi (Member) commented Aug 20, 2024

The Flink jobs all failed with a timeout after 6h (https://github.com/apache/iceberg/actions/runs/10442521766/job/28914906512?pr=10962). I restarted them.

@jasonf20 (Contributor Author)
@findepi sorry for the ping. This causes duplicate data, admittedly only under rare orders of operations, but I still think it's important to merge. Any way to move this forward?

@findepi (Member) commented Aug 25, 2024

@jasonf20 the CI looks red. Please try rebasing the PR and make sure it's green. I am not aware of any pre-existing flakiness that could cause this.

@jasonf20 (Contributor Author) commented Aug 26, 2024

@findepi I see; it seems to be the Flink tests specifically. I can't tell from the logs which test is failing. I'm trying to run them locally, but I'm not sure how long that will take or whether they will ever finish. Is there any way to get more information in the build about what failed?

@findepi (Member) commented Aug 26, 2024

The build times out. Can it be that something takes more time as a result of the changes being introduced here?

@jasonf20 (Contributor Author) commented Aug 26, 2024

@findepi
Locally the test TestChangeLogTable seems to run forever.

 sql("INSERT INTO %s SELECT * FROM %s", tableName, SOURCE_TABLE);

Debugging it, I found that it reached the code I changed with addedDataFiles that had a data sequence number of -1 (not null). I don't know why that would cause the hang, but after filtering out those values it runs locally.

I'm pushing a new version.

EDIT:
Looking a little further it seems to fail on:

Preconditions.checkArgument(
    sequenceNumber >= 0, "Invalid minimum data sequence number: %s", sequenceNumber);

I suppose that fails the SQL execution. The test just seems to hang instead of failing with an error message, so some improvements to the test infrastructure are probably required. That's unrelated to this, though; I think the bug has been fixed. It might also be advisable to look into why Flink passes data sequence numbers of -1, but that's also unrelated to this fix.

@jasonf20 (Contributor Author)
@findepi Looks good now :)

@findepi (Member) commented Aug 26, 2024

Thanks for fixing the build!

@findepi (Member) commented Aug 26, 2024

LGTM but someone more experienced with snapshot producers should also take a look.
@Fokko who would be the best person to take the second look?

@jasonf20 jasonf20 changed the title Core/RewriteFiles: Fixed dropping delete files that are still required Core/RewriteFiles: Duplicate Data Bug - Fixed dropping delete files that are still required Sep 2, 2024
@amogh-jahagirdar amogh-jahagirdar self-requested a review September 3, 2024 05:00
@amogh-jahagirdar (Contributor)

I'll take a pass over this PR tomorrow! Thanks for your patience, @jasonf20!

Comment on lines 837 to 927
long minNewFileSequenceNumber =
    addedDataFiles().stream()
        .filter(x -> x.dataSequenceNumber() != null && x.dataSequenceNumber() >= 0)
        .mapToLong(ContentFile::dataSequenceNumber)
        .reduce(
            newDataFilesDataSequenceNumber != null
                ? newDataFilesDataSequenceNumber
                : base.nextSequenceNumber(),
            Math::min);
@amogh-jahagirdar (Contributor)

Do we actually need to iterate through the addedDataFiles?

If I understood the issue correctly, the problem is that it's possible for a user to commit a rewrite operation and specify an older data sequence number, and the current logic would drop delete files which actually need to still be referenced in the new commit since it's not considering the specified data file sequence number.

So I think all we would need to do here is keep the existing logic for determining minDataSequenceNumber and then also min it with newDataFilesDataSequenceNumber if it's not null:

long minNewDataSequenceNumber = <Existing logic>
if (newDataFilesDataSequenceNumber != null) {
    minNewDataSequenceNumber = Math.min(minNewDataSequenceNumber, newDataFilesDataSequenceNumber);
} 

@amogh-jahagirdar (Contributor)

I tried the above and the new unit test still passes, though let me know if you see a flaw in my reasoning. cc @rdblue @aokolnychyi @findepi

@jasonf20 (Contributor Author)

Currently I think (I'm also not 100% sure) you are right, as there isn't an API that would allow setting a specific data file to a sequence number smaller than newDataFilesDataSequenceNumber. But I think that's just a side effect of the current operations inheriting from MergingSnapshotProducer. A new inheritor of MergingSnapshotProducer could simply use minNewDataSequenceNumber as a default and pass in files with explicit sequence numbers on top.

// `firstRewriteSequenceNumber`
// explicitly which is the same as that of A2_DELETES and before B2_DELETES

// Technically A1_DELETES could be removed since it's an equality delete and should apply on
@amogh-jahagirdar (Contributor)

Sorry, not sure I really follow the comment? There is no A1_DELETES... there's a FILE_A_DELETES, but that's a positional delete. I don't see those referenced in the above operations.

I think it's true, though, that we should be able to drop equality deletes older than the minimum sequence number, but isn't that already happening in the existing MergingSnapshotProducer check? I don't think anything needs to distinguish between equality and positional deletes there.

@jasonf20 (Contributor Author)

I meant to say FILE_A2_DELETES. The existing snapshot producer check drops delete files older than the minimum sequence number only if their sequence number is strictly smaller, so it leaves in place equality deletes whose sequence number equals the minimum. But equality deletes are only applied to data files with sequence numbers strictly smaller than the delete file's, so equality delete files whose sequence number equals the minimum data sequence number of the new snapshot can actually be removed as well.

In an even broader sense, FILE_A2_DELETES could be ignored completely in the initial commit here: since this is the first commit there are no data files before it, so effectively it doesn't do anything.
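
For context, a small sketch of the applicability rule being referenced here (an illustrative helper, not Iceberg's actual API): equality deletes apply only to data files with a strictly smaller data sequence number, which is why an equality delete whose sequence number equals the snapshot's minimum data sequence number can never match any remaining data file.

    // illustrative predicate; real applicability checks also consider partition and spec
    boolean equalityDeleteApplies(long dataFileSeq, long equalityDeleteSeq) {
      // strict inequality: an equality delete never applies to data files written
      // at the same or a later data sequence number
      return dataFileSeq < equalityDeleteSeq;
    }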

@jasonf20 (Contributor Author)

@amogh-jahagirdar If you'd like, I can change the code to not iterate the files if that will make this quicker to merge. I think iterating them is preferable, but either way should be fine for now.

@jasonf20 (Contributor Author) commented Oct 7, 2024

@amogh-jahagirdar gentle reminder about this


This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 11, 2024
@jasonf20 (Contributor Author)

I'll send a message about this in [email protected]

@nastra nastra added not-stale and removed stale labels Nov 11, 2024
@@ -384,6 +386,116 @@ public void testRewriteDataAndAssignOldSequenceNumber() {
assertThat(listManifestFiles()).hasSize(4);
}

@TestTemplate
public void testRewriteDataAndAssignOldSequenceNumbersShouldNotDropDeleteFiles() {
Contributor

I've gone through this test and I think that the assertions are correct for the sequence of changes.

@@ -833,7 +833,17 @@ public List<ManifestFile> apply(TableMetadata base, Snapshot snapshot) {
        filterManager.filterManifests(
            SnapshotUtil.schemaFor(base, targetBranch()),
            snapshot != null ? snapshot.dataManifests(ops.io()) : null);
    long minDataSequenceNumber =

    long minNewFileSequenceNumber =
@rdblue (Contributor)

@jasonf20 can you summarize what the problem was? I didn't see it in the PR description. These changes look correct to me but it would be helpful to know where the issue was.

@jasonf20 (Contributor Author) Nov 27, 2024

@rdblue Sure: the filtered manifest list could exclude manifests at a certain sequence number because their data files were rewritten.

For example, the current rewrite replaces the manifest that contained the minimum sequence number but preserves that file at that sequence number. The filtered list won't contain that manifest, yet the new manifest created will still have that same sequence number if the rewrite kept the sequence number small enough.

So the existing logic, which just took the min from the filtered manifests, could miss the actual minimum, since the manifest containing the actual minimum was rewritten as part of this apply. We also need to consider the minimum sequence number that will be present in the manifest created by this operation.
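
A hypothetical walkthrough of the miss (sequence numbers invented for illustration; they roughly mirror the test discussed below):

    // SN 1: FILE_A committed, with delete files at SN 1 that still apply to it.
    // SN 3: other data files committed.
    // Rewrite: FILE_A -> FILE_A2, explicitly preserving data sequence number 1.
    // The manifest that carried SN 1 is filtered out because its data file was rewritten,
    // so scanning only the filtered manifests suggests a minimum of 3:
    long minFromFilteredManifests = 3L;
    long minFromAddedFiles = 1L; // FILE_A2 is re-added at SN 1 by this operation
    long correctMin = Math.min(minFromFilteredManifests, minFromAddedFiles); // 1
    // Using 3 instead of 1 drops the SN 1 delete files that FILE_A2 still needs,
    // so rows they deleted reappear, which is the duplicate data this PR fixes.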

@jasonf20 (Contributor Author)

One note on this. Since I created this PR, another PR changed the way the filtered manifest list is calculated, and I don't believe my test reproduces the issue anymore if you undo the code fix.

However, I think the fix is still correct: the minSequenceNumber of the current operation should still be considered in addition to the filtered list.

@rdblue (Contributor)

Okay, that's what I thought was happening, but I wanted to confirm. I think it would be helpful to have a comment in the test that states what would have failed.

So in the test, the file that was missing was the first delete file, FILE_A2_DELETES that had sequence number 1, right? In the second rewrite, FILE_A is rewritten as FILE_A2, but it is rewritten at the same sequence number (1). The min sequence number of the other files is 3 for FILE_D after the first rewrite.

Why is the first rewrite needed? Couldn't you reproduce the problem by rewriting just FILE_A to FILE_A2 after separately committing FILE_B? The test case seems over-complicated to me so I think we need one that:

  1. Reproduces the problem without this fix, and
  2. Is a minimal set of steps to reproduce it

@jasonf20 (Contributor Author)

All the operations were required to reproduce the issue, but it's been a while, so I don't remember the specifics. I could try to get back into it and see if I can create a new minimal test, but I'm not sure exactly when I will be free to do that.

@jasonf20 (Contributor Author)

@rdblue I had some time to look at this today and managed to reduce the test to the minimal steps required. I pushed a new commit that fixes a bug introduced in the PR mentioned above. After fixing that bug, my new simplified test reproduces the issue.

@amogh-jahagirdar I would appreciate it if you could look at the first commit, since it's related to your PR. Thanks!

@amogh-jahagirdar (Contributor) Nov 28, 2024

Took a look. @jasonf20, see my comment here; the change to not rewrite deletes when a manifest only has aged-out deletes was actually intentional, and I don't think we want to add that behavior back. Let me see how we can simplify the tests in this PR, considering that manifests with only aged-out deletes will be rewritten only opportunistically, when some other change prompts a rewrite of the manifest.

@jasonf20 (Contributor Author)

@amogh-jahagirdar Responded here.

@jasonf20 jasonf20 force-pushed the rewrite-fix branch 2 times, most recently from b586da5 to b5fe594 Compare November 28, 2024 08:33
This got broken when canTrustManifestReferences was added. If the manifest has a small enough minSequenceNumber, it still needs to be checked even if canTrustManifestReferences is true.
This fixes a bug where delete files that are still required by the
re-written file could be deleted prematurely.

This happened because `MergingSnapshotProducer` calculated `minDataSequenceNumber` in a way that
could skip data sequence numbers that do exist in the final snapshot.

The filtered manifest list could exclude manifests of
a certain sequence number because their data files were re-written.
With specific orders of operations it was possible to skip the sequence
number of manifests that were re-written.
Comment on lines +366 to +368
if (manifest.minSequenceNumber() > 0 && manifest.minSequenceNumber() < minSequenceNumber) {
  return true;
}
@amogh-jahagirdar (Contributor) Nov 28, 2024

I don't think we want to add this; I intentionally removed the behavior of rewriting a manifest when it only had aged-out deletes. See here for a more detailed description: https://github.com/apache/iceberg/pull/11131/files#r1815710103.

@jasonf20 (Contributor Author) Nov 28, 2024

If that's the case, then doesn't the last condition here do nothing, since this check can only be reached if one of the earlier checks was true anyway:

deletePaths.contains(file.location())
    || deleteFiles.contains(file)
    || dropPartitions.contains(file.specId(), file.partition())
    || (isDelete
        && entry.isLive()
        && entry.dataSequenceNumber() > 0
        && entry.dataSequenceNumber() < minSequenceNumber);

It seems like perhaps dropDeleteFilesOlderThan has no effect anymore (unless maybe allDeletesReferenceManifests gets set to false or something).

I think not removing by minSequenceNumber leaves behind delete files that just never get applied to any files at query time, so it's not the end of the world, but it does lead to some wasted storage and slightly longer scan planning times.

Assuming we want to keep this behaviour, perhaps we should just stop using dropDeleteFilesOlderThan in MergingSnapshotProducer?

We can try getting a minimal test working by doing some actual delete from a shared manifest or something, but it seems like dropDeleteFilesOlderThan is not really doing anything right now.

@jasonf20 (Contributor Author) Dec 8, 2024

As a side note, I think it's better to make sure dropDeleteFilesOlderThan does work. We have encountered tables with many old equality delete files that were compacted and no longer apply to any data files, but were never removed from the table's manifests.

Note that some manifests contain only irrelevant deletes and would not get opportunistically rewritten for other reasons. Also, RewriteManifests currently ignores delete manifests.

@amogh-jahagirdar (Contributor)

Sorry for the delay. Yes, I believe you are right that the last condition, specifically at https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L427, is redundant; we can remove that! There is no longer any need to rely on that condition, because the only time we'd hit that point is when the manifest needs to be rewritten for correctness based on the other criteria. Then, when the manifest does get rewritten, any aged-out deletes that happen to be in it are filtered out here: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L475

OK so breaking down the pros/cons for eagerly removing aged out deletes during commit where a manifest only has aged out deletes:

The pro of eager removal of aged out deletes is that space consumed by metadata is reduced, and also planning times are at least slightly reduced.

The downside of eager removal is that even if a manifest does not need to be rewritten, the commit process is incurring additional latency and risk of failure (the IO involved) for writing out a new manifest.

IMO, the upside of reclaiming some metadata storage and reducing plan times doesn't justify the risk of failure and additional latency incurred on the commit path. Maybe it's worth it in the most extreme case, where some large percentage of manifests only have aged-out deletes?

I think you brought up a good point on rewrite manifests, though: the Java API itself will just retain all the delete manifests, but the Spark action does support rewriting delete manifests: https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java#L179

Perhaps it would be better for rewrite manifests to remove the aged-out deletes? At that point the work of removing them is more intentional, in the context of table maintenance, rather than on the general commit path for MergingSnapshotProducer implementations. I'll need to think more on this.

@jasonf20 (Contributor Author) Dec 11, 2024

Perhaps it could be added as a table property. As an anecdote, I have encountered tables with frequent updates that have 100K+ inactive delete files in their manifests, and since Spark isn't used in this environment, performing a cleanup isn't simple.

I think the above issue should be addressed, but perhaps we got a bit sidetracked in the context of this bug. How would you like to proceed here? This is still a valid fix; can you think of a trick to make the test reproduce the issue easily without reverting the change to dropDeleteFilesOlderThan? If not, can we merge it regardless?

Contributor

I agree with Amogh here. Unnecessary cleanup is probably not a good idea in the core library when it requires rewriting manifests that would otherwise not be modified.
