Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Enhance remove snapshots efficiency by executing them in bulk #12670

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ricardopereira33
Copy link

This PR changes the remove-snapshots operation to run in bulk instead of single operations on the Rest Catalog server.

The remove-snapshots operation signature already accepts a list of snapshots IDs. However, each snapshot ID is stored in a dedicated Metadata object. This makes the operations not very efficient, causing performance issues, mentioned on: #12642.

The PR changes:

  • Rename MetadataUpdate class RemoveSnapshot to RemoveSnapshots.
  • Compute the entire snapshots IDs list and store them on the RemoveSnapshots instead of an individual snapshot ID.

@github-actions github-actions bot added the core label Mar 27, 2025
Comment on lines +1181 to 1186
- code: "java.class.removed"
old: "class org.apache.iceberg.MetadataUpdate.RemoveSnapshot"
justification: "Changing the RemoveSnapshot class to receive a list of snapshots\
\ IDs instead of a single snapshot ID. This will make the remove snapshots\
\ more efficient."
- code: "java.method.removed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't remove this, this is a still valid update in the protocol.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong here, the protocol is already actually a RemoveSnapshots<list snapshots>. This RemoveSnapshot is really more of how it's mapped internally ,and is unfortunately exposed to the public. I still think we'd need to keep this for library compatibility, but make the implementation less restrictive insetead of only allowing 1 snapshot in the RemoveSnapshots list.

Comment on lines +331 to 344
class RemoveSnapshots implements MetadataUpdate {
private final Set<Long> snapshotIds;

public RemoveSnapshot(long snapshotId) {
this.snapshotId = snapshotId;
public RemoveSnapshots(Set<Long> snapshotIds) {
this.snapshotIds = snapshotIds;
}

public long snapshotId() {
return snapshotId;
public Set<Long> snapshotIds() {
return snapshotIds;
}

@Override
public void applyTo(TableMetadata.Builder metadataBuilder) {
metadataBuilder.removeSnapshots(ImmutableSet.of(snapshotId));
metadataBuilder.removeSnapshots(snapshotIds);
}
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should leaveRemoveSnapshot update as is since this is still a valid update type in the protocol. I think the solution should involve some sort of buffering of the snapshots to remove instead of eagerly rewriting (it's the eager rewriting of the current snapshots in the builder that leads to the n^2 behavior). Only at the end, once all the snapshots to remove have been accumulated, would we want to go ahead and rewrite. Let me take a deeper look and get back with a concrete solution

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should leaveRemoveSnapshot update as is since this is still a valid update type in the protocol.

@amogh-jahagirdar On the protocol, we already defined a list of snapshots to be removed... However, we just pass a single snapshot ID in the list. Why can't we change to have all snapshot IDs? Is there a reason for this to be like that?

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ricardopereira33 Ah you are completely right I think, I misread the protocol, RemoveSnapshots https://github.com/apache/iceberg/blame/main/open-api/rest-catalog-open-api.yaml#L2833 already operates on a list of snapshots. it's seems like the implementation should actually be more open. Currently looks like there's some strict validation that there's only 1 element https://github.com/apache/iceberg/blame/main/core/src/main/java/org/apache/iceberg/MetadataUpdateParser.java#L559

I'm not sure the context on this. But fundamentally I think you are correct there's no protocol violation here. I'd need to see why this was implemented this way to begin with I also see this ToDo https://github.com/apache/iceberg/blame/main/core/src/main/java/org/apache/iceberg/MetadataUpdateParser.java#L420

For general library compatibility though we probably need to keep this update type, potentially deprecating it in favor of the new one.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can keep the RemoveSnapshot as a deprecated class 👍🏽
I'm bringing the contributors of that validation to the loop: @nastra @rdblue. Do you know why do we have this validation to restrict just 1 snapshot?

@ricardopereira33 ricardopereira33 force-pushed the CORE-execute-remove-snapshots-in-bulk branch from e3110ee to 48cfacd Compare March 28, 2025 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants