Fix S3 createExclusive in case of AWS SDK retries #27388
Conversation
- register resources with closer as early as possible
- use toxic name without spaces; it seems that names with spaces don't work correctly
- drop redundant container configuration (network alias)
The network chaos (toxiproxy) setup is specific to the assertions being performed. While we could test different setups and assertions with separate, similar classes, that would mean a lot of near-duplicate boilerplate code. This commit refactors the test class so that each test method configures its own network breakers.
Just a refactor; kept as a separate commit to make the subsequent logic changes visible.
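For illustration, a test method configuring its own network breaker could look roughly like the following. This is a minimal sketch assuming Testcontainers' `ToxiproxyContainer` and the `eu.rekawek.toxiproxy` client; the proxy name, addresses, and toxic parameters are made up for this example and are not taken from `TestS3Retries`.

```java
import eu.rekawek.toxiproxy.Proxy;
import eu.rekawek.toxiproxy.ToxiproxyClient;
import eu.rekawek.toxiproxy.model.ToxicDirection;
import org.testcontainers.containers.ToxiproxyContainer;

class NetworkBreakerSketch
{
    void breakDownstreamReads(ToxiproxyContainer toxiproxy)
            throws Exception
    {
        // Control client talking to the toxiproxy admin API exposed by the container
        ToxiproxyClient client = new ToxiproxyClient(toxiproxy.getHost(), toxiproxy.getControlPort());

        // Route S3 traffic through the proxy; listen/upstream addresses are illustrative
        Proxy proxy = client.createProxy("s3-proxy", "0.0.0.0:8666", "minio:9000");

        // Toxic names must not contain spaces (see the commit note above);
        // this toxic closes downstream connections after a very short timeout,
        // which is one way to provoke read timeouts in the client under test
        proxy.toxics().timeout("read-timeout", ToxicDirection.DOWNSTREAM, 1);
    }
}
```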
Simplify the logic flow in `S3OutputStream.createExclusive` and `S3OutputStream.createOrOverwrite`. Before the change, a call to either of these methods could result in a single PutObject request or in a multi-part upload. Doing a multi-part upload makes sense in a streaming context to avoid buffering data in memory. However, in the case of `createExclusive` and `createOrOverwrite`, the data is already in memory. This reduces memory usage during these calls: there is no longer an intermediate buffer within `S3OutputStream` involved.
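As an illustration of the single-request path, writing data that is already fully in memory might look like this. This is a minimal sketch using the AWS SDK for Java v2; the method and parameter names are illustrative, not the actual `S3OutputStream` code.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

class SingleRequestWriteSketch
{
    // The data is already fully in memory, so one PutObject is enough;
    // no multi-part upload and no intermediate buffer are involved.
    void createOrOverwrite(S3Client client, String bucket, String key, byte[] data)
    {
        PutObjectRequest request = PutObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .build();
        client.putObject(request, RequestBody.fromBytes(data));
    }
}
```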
AWS S3 supports conditional writes, which Trino uses to implement `S3OutputFile.createExclusive`. The AWS SDK also retries requests implicitly in case of certain failures, including read timeouts. It is possible, as reproduced by `TestS3Retries` modified in this commit, that the object is created successfully but the AWS SDK is not aware of that success and retries the request. The retried request then fails with a precondition failure: from the S3 server's perspective the object already exists. In fact, in the case of single PutObject requests, there seems to be no way to distinguish the two requests on the server side. The ETag cannot be used for that, since it's usually content-based, and `createExclusive` cannot succeed for two different callers even if they happen to be writing the same key.

What's surprising is that this problem is also present with multi-part uploads, at least in MinIO. It's unclear whether AWS SDK retries combined with multi-part uploads to AWS S3 would also trigger a precondition failure.

The solution implemented in this commit is to tag objects created by Trino with a random value. When the precondition fails and the AWS SDK reports that the request was retried (`numAttempts > 1`), the tags on the object are checked. `FileAlreadyExistsException` is raised only when that tag lookup fails, or when the object does not have a tag indicating it was created by the current thread.
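A rough sketch of the tag-based disambiguation described above, using the AWS SDK for Java v2. The tag name, the `numAttempts()` accessor, and the helper structure are assumptions made for illustration; they are not copied from the actual `S3OutputStream` change.

```java
import java.nio.file.FileAlreadyExistsException;
import java.util.UUID;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectTaggingRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.S3Exception;

class ExclusiveCreateSketch
{
    private static final String CREATED_BY_TAG = "trino-created-by"; // hypothetical tag name

    static void createExclusive(S3Client s3, String bucket, String key, byte[] data)
            throws FileAlreadyExistsException
    {
        // Random value identifying this particular create attempt
        String createdBy = UUID.randomUUID().toString();

        PutObjectRequest request = PutObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .ifNoneMatch("*") // conditional write: succeed only if the key does not exist yet
                .tagging(CREATED_BY_TAG + "=" + createdBy)
                .build();

        try {
            s3.putObject(request, RequestBody.fromBytes(data));
        }
        catch (S3Exception e) {
            if (e.statusCode() != 412) { // not a precondition failure
                throw e;
            }
            // If the SDK retried internally, the first attempt may have created the object
            // even though the caller only sees the failed retry. Check the tag to find out.
            // numAttempts() is assumed here to expose the SDK's attempt count.
            if (e.numAttempts() > 1 && createdByUs(s3, bucket, key, createdBy)) {
                return; // our own earlier attempt succeeded; treat the create as successful
            }
            throw new FileAlreadyExistsException(bucket + "/" + key);
        }
    }

    private static boolean createdByUs(S3Client s3, String bucket, String key, String createdBy)
    {
        try {
            return s3.getObjectTagging(GetObjectTaggingRequest.builder().bucket(bucket).key(key).build())
                    .tagSet().stream()
                    .anyMatch(tag -> tag.key().equals(CREATED_BY_TAG) && tag.value().equals(createdBy));
        }
        catch (S3Exception e) {
            return false; // tag lookup failed; report the conflict to the caller
        }
    }
}
```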
The issue makes sense to me, and the approach of using tags to disambiguate should work, but it's worth noting that you do pay for S3 object tags, so this does increase the cost profile. How frequently is `createExclusive` called?

You mention that ETags can't be used to distinguish between multiple calls to create the same object with the same content, but maybe the other question is: do we actually need to? If we're creating the same object that some other writer created, is it actually necessary to fail in that case? Because if we can tolerate that ambiguity, we could use ETags for this purpose and avoid the cost of using tags.
`FileAlreadyExistsException` #27402

Release notes considerations
Delta Lake
- `FileAlreadyExistsException` (does not matter when someone uses the `s3.exclusive-create=false` configuration)
- `PutObjectTagging` and `GetObjectTagging` operations