-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Enhance remove snapshots efficiency by executing them in bulk #12670
base: main
Are you sure you want to change the base?
Core: Enhance remove snapshots efficiency by executing them in bulk #12670
Conversation
- code: "java.class.removed" | ||
old: "class org.apache.iceberg.MetadataUpdate.RemoveSnapshot" | ||
justification: "Changing the RemoveSnapshot class to receive a list of snapshots\ | ||
\ IDs instead of a single snapshot ID. This will make the remove snapshots\ | ||
\ more efficient." | ||
- code: "java.method.removed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't remove this, this is a still valid update in the protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong here, the protocol is already actually a RemoveSnapshots<list snapshots>
. This RemoveSnapshot
is really more of how it's mapped internally ,and is unfortunately exposed to the public. I still think we'd need to keep this for library compatibility, but make the implementation less restrictive insetead of only allowing 1 snapshot in the RemoveSnapshots
list.
class RemoveSnapshots implements MetadataUpdate { | ||
private final Set<Long> snapshotIds; | ||
|
||
public RemoveSnapshot(long snapshotId) { | ||
this.snapshotId = snapshotId; | ||
public RemoveSnapshots(Set<Long> snapshotIds) { | ||
this.snapshotIds = snapshotIds; | ||
} | ||
|
||
public long snapshotId() { | ||
return snapshotId; | ||
public Set<Long> snapshotIds() { | ||
return snapshotIds; | ||
} | ||
|
||
@Override | ||
public void applyTo(TableMetadata.Builder metadataBuilder) { | ||
metadataBuilder.removeSnapshots(ImmutableSet.of(snapshotId)); | ||
metadataBuilder.removeSnapshots(snapshotIds); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should leaveRemoveSnapshot
update as is since this is still a valid update type in the protocol. I think the solution should involve some sort of buffering of the snapshots to remove instead of eagerly rewriting (it's the eager rewriting of the current snapshots in the builder that leads to the n^2 behavior). Only at the end, once all the snapshots to remove have been accumulated, would we want to go ahead and rewrite. Let me take a deeper look and get back with a concrete solution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should leaveRemoveSnapshot update as is since this is still a valid update type in the protocol.
@amogh-jahagirdar On the protocol, we already defined a list of snapshots to be removed... However, we just pass a single snapshot ID in the list. Why can't we change to have all snapshot IDs? Is there a reason for this to be like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ricardopereira33 Ah you are completely right I think, I misread the protocol, RemoveSnapshots
https://github.com/apache/iceberg/blame/main/open-api/rest-catalog-open-api.yaml#L2833 already operates on a list of snapshots. it's seems like the implementation should actually be more open. Currently looks like there's some strict validation that there's only 1 element https://github.com/apache/iceberg/blame/main/core/src/main/java/org/apache/iceberg/MetadataUpdateParser.java#L559
I'm not sure the context on this. But fundamentally I think you are correct there's no protocol violation here. I'd need to see why this was implemented this way to begin with I also see this ToDo https://github.com/apache/iceberg/blame/main/core/src/main/java/org/apache/iceberg/MetadataUpdateParser.java#L420
For general library compatibility though we probably need to keep this update type, potentially deprecating it in favor of the new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can keep the RemoveSnapshot
as a deprecated class 👍🏽
I'm bringing the contributors of that validation to the loop: @nastra @rdblue. Do you know why do we have this validation to restrict just 1 snapshot?
e3110ee
to
48cfacd
Compare
This PR changes the
remove-snapshots
operation to run in bulk instead of single operations on the Rest Catalog server.The remove-snapshots operation signature already accepts a list of snapshots IDs. However, each snapshot ID is stored in a dedicated
Metadata
object. This makes the operations not very efficient, causing performance issues, mentioned on: #12642.The PR changes:
MetadataUpdate
classRemoveSnapshot
toRemoveSnapshots
.RemoveSnapshots
instead of an individual snapshot ID.