-
Notifications
You must be signed in to change notification settings - Fork 344
Object storage operations (proposal) #3256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
API and implementations to perform long-running operations against object stores, mostly to purge files.
adam-christian-software
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a fantastic PR! Thanks for pushing forward with this. I had some smaller questions, but this looks great.
|
|
||
| # Polaris object store operations | ||
|
|
||
| API and implementations to perform long-running operations against object stores, mostly to purge files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I wouldn't limit to purging files. I could see other object store operations which might be helpful in the future.
So, I'd remove "mostly to purge files."
| Purge operations are performed in one or multiple bulk delete operations. | ||
| The implementation takes care of not including the same file more than once within a single bulk delete operation. | ||
|
|
||
| One alternative implementation purges all files within the base location of a table or view metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this alternative implementation inside of the implementation here?
| implementations. | ||
|
|
||
| The actual object store interaction of the current implementation is delegated to Iceberg `FileIO` implementations. | ||
| Only `FileIO` implementations that support prefix-operations (`SupportsPrefixOperations` interface) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What sort of FileIO implementations are not supported then? From what I'm seeing OSSFileIO, InMemoryFileIO, & EcsFileIO?
| @ApplicationScoped | ||
| class FileOperationsFactoryImpl implements FileOperationsFactory { | ||
|
|
||
| private final Clock clock; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we use this anywhere?
|
|
||
| /** Figure out the hard coded max batch size limit for a particular FileIO implementation. */ | ||
| static int implSpecificDeleteBatchLimit(SupportsBulkOperations bulkOps) { | ||
| var className = bulkOps.getClass().getName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work? Like, I think we are trying use getClass().getSimpleName() rather than getClass().getName() .
| * <p>All functions of this interface rather yield incomplete results and continue over throwing | ||
| * exceptions. | ||
| */ | ||
| public interface FileOperations { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: For readability, I would introduce a FilePurgeOperations interface which could contain all of the purge operations and have this interface extend that one.
| return FileType.ICEBERG_METADATA; | ||
| } else if (fileName.startsWith("snap-")) { | ||
| return FileType.ICEBERG_MANIFEST_LIST; | ||
| } else if (fileName.contains("-m")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this could lead to false positives. Like, could we accidentally purge a file with this name.
| */ | ||
| @Value.Default | ||
| default int deleteBatchSize() { | ||
| return 250; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my understanding, what was your thought process when choosing 250?
| * Callback being invoked right before a file location is being submitted to be purged. | ||
| * | ||
| * <p>Due to API constraints of {@link | ||
| * org.apache.iceberg.io.SupportsBulkOperations#deleteFiles(Iterable)} it's barely possible to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I'd rephrase to "This callback allows users to identify files that failed a deletion."
| /** | ||
| * The number of purged files. | ||
| * | ||
| * <p>The returned value may be wrong and include non-existing files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment for failedPurges. Is there a way that users can understand this number better? If it is wrong, then why is it helpful to return?
flyrain
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! This seems to overlap with the existing TableCleanupTaskHandler. My understanding is that the direction is to invest in a more general task execution framework (e.g., a delegation service outside of Polaris), rather than re-implementing file-level operations in individual features. It would be great to align this work with that direction.
API and implementations to perform long-running operations against object stores, mostly to purge files.