@findepi
Member

@findepi findepi commented Nov 20, 2025

AWS S3 supports conditional writes, which Trino uses to implement
`S3OutputFile.createExclusive`. The AWS SDK also performs implicit request
retries on certain failures, including read timeouts.

It is possible, as reproduced by `TestS3Retries` modified in this
commit, that an object is created successfully, but the AWS SDK is not
aware of that success and retries the request. The retried request then
fails with a precondition failure -- from the S3 server's perspective the
object already exists. In fact, in the case of single PutObject requests,
there seems to be no way to distinguish the two situations on the server
side. ETag cannot be used for that, since it is usually content-based:
`createExclusive` cannot succeed for two different callers even if they
happen to be writing the same key.
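The failure mode can be illustrated with a minimal in-memory sketch (the `FakeS3` class and method names below are hypothetical, not Trino code): a conditional PUT uses `If-None-Match: *` semantics, so a retried PUT whose first attempt actually succeeded looks exactly like a genuine conflict to the server.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the S3 "If-None-Match: *" conditional PUT,
// illustrating why an SDK-level retry of a successful request must fail.
public class FakeS3 {
    private final Map<String, String> objects = new HashMap<>();

    // Returns true on success, false on HTTP 412 Precondition Failed
    public boolean putIfNoneMatch(String key, String body) {
        if (objects.containsKey(key)) {
            return false; // object already exists: precondition failure
        }
        objects.put(key, body);
        return true;
    }
}
```

The first attempt succeeds; if its response is lost to a read timeout, the SDK's transparent retry of the very same write now returns a 412, and the server has no way to tell that retry apart from a competing writer.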

What's surprising is that this problem is also present in the case of
multi-part upload, at least with MinIO. It is unclear whether AWS SDK
retries combined with multi-part uploads to AWS S3 would also trigger
a precondition failure.

The solution implemented in this commit is to tag objects created by
Trino with a random value. When the precondition fails and the AWS SDK
reports that the request was retried (`numAttempts > 1`), the tags on
the object are checked. `FileAlreadyExistsException` is raised only
when this tag lookup fails, or when the object does not have a tag
indicating it was created by the current thread.
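The decision described above can be sketched as a small pure helper (the class and parameter names here are illustrative, not actual Trino code; `numAttempts` would come from SDK response metadata, and the tag arguments from a follow-up GetObjectTagging call):

```java
// Hypothetical sketch of the retry-disambiguation decision: given a
// precondition failure, decide whether it is a genuine conflict that
// should surface as FileAlreadyExistsException.
public class ExclusiveCreateRetry {
    public static boolean isGenuineConflict(
            int numAttempts,
            boolean tagLookupSucceeded,
            boolean tagMatchesOurRandomValue)
    {
        if (numAttempts <= 1) {
            // the SDK never retried, so our own write cannot have
            // silently succeeded: the object belongs to someone else
            return true;
        }
        if (!tagLookupSucceeded) {
            // cannot verify ownership; conservatively treat as a conflict
            return true;
        }
        // the object carries our random tag only if our first attempt
        // is what created it
        return !tagMatchesOurRandomValue;
    }
}
```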

Release notes considerations

Delta Lake

  • Prevent the log writer from failing with `FileAlreadyExistsException` (not applicable when the `s3.exclusive-create=false` configuration is used)
  • Writing to S3 now requires permissions for the PutObjectTagging and GetObjectTagging operations

@findepi findepi requested a review from wendigo November 20, 2025 17:29
@cla-bot cla-bot bot added the cla-signed label Nov 20, 2025
@findepi findepi removed the request for review from wendigo November 20, 2025 17:29
@findepi findepi marked this pull request as draft November 20, 2025 17:29
@findepi findepi force-pushed the findepi/s3-create-exclusive branch from c7b6d7c to e3034b8 on November 21, 2025 08:21
@findepi findepi changed the base branch from findepi/s3-create-exclusive to findepi/delta-s3 on November 21, 2025 08:58
@findepi findepi force-pushed the findepi/delta-s3 branch 2 times, most recently from c036c9b to 11e34a7 on November 21, 2025 10:03
@findepi findepi force-pushed the findepi/s3-exclusive-retries branch from 435963b to c276976 on November 21, 2025 10:41
@findepi findepi force-pushed the findepi/s3-exclusive-retries branch from c276976 to 742403c on November 21, 2025 13:04
@github-actions github-actions bot added docs delta-lake Delta Lake connector labels Nov 21, 2025
Base automatically changed from findepi/delta-s3 to master November 21, 2025 13:28
@findepi findepi marked this pull request as ready for review November 21, 2025 13:29
- register resources with the closer as early as possible
- use a toxic name without spaces; it seems that names with spaces don't
  work correctly
- drop redundant container configuration (network alias)

The network chaos (Toxiproxy) setup is specific to the assertions to be
performed. While we could test different setups and assertions with
separate, similar classes, this would be a lot of boilerplate code.
This commit refactors the test class so that each test method
configures its network breakers.

Just a refactor. Separate commit to make subsequent logic changes
visible.
Simplify the logic flow in `S3OutputStream.createExclusive` and
`S3OutputStream.createOrOverwrite`. Before this change, a call to either
of these methods could result in a single PutObject request or in a
multi-part upload. A multi-part upload makes sense in a streaming context
to avoid buffering data in memory. However, in the case of
`createExclusive` and `createOrOverwrite` the data is already in memory.

This reduces memory usage during these calls: there is no longer an
intermediate buffer within `S3OutputStream` involved.
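The simplification can be sketched as a hypothetical upload-path decision (illustrative names only, not actual Trino code): multi-part only pays off when the payload is not already fully buffered.

```java
// Hypothetical sketch: pick the upload path based on whether the whole
// payload is already resident in memory. Before the refactor, in-memory
// payloads could still take the multi-part path through an extra buffer.
public class UploadPlanner {
    public enum Plan { SINGLE_PUT, MULTI_PART }

    public static Plan planFor(boolean dataFullyInMemory) {
        // a single PutObject avoids the intermediate buffering that
        // multi-part upload requires
        return dataFullyInMemory ? Plan.SINGLE_PUT : Plan.MULTI_PART;
    }
}
```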
@findepi
Member Author

findepi commented Nov 21, 2025

CI

@findepi findepi force-pushed the findepi/s3-exclusive-retries branch from 2c1beb6 to 5aed9cd on November 24, 2025 12:27
@pettyjamesm
Member

The issue makes sense to me, and the approach of using tags to disambiguate should work, but it's worth noting that you do pay for S3 object tags, so this does increase the cost profile. How frequently is `createExclusive` actually called here? Is it for each data file in Iceberg tables, or only when updating a single manifest per table / partition?

You mention that ETags can't be used to distinguish between multiple calls to create the same object with the same content, but maybe the other question is: do we actually need to? If we're creating the same object that some other writer created, is it actually necessary to fail in that case? Because if we can tolerate that ambiguity, we could use ETags for this purpose and avoid the cost of using tags.
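If tolerating same-content duplicates were acceptable, the check could compare the existing object's ETag against the MD5 of the bytes we attempted to write, since for single (non-multi-part) PutObject uploads without SSE-KMS, S3's ETag is the hex MD5 of the object body. A hedged sketch of that comparison, with hypothetical helper names:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of the ETag-based alternative discussed above.
public class EtagCheck {
    public static String contentMd5Hex(byte[] body) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(body);
            return String.format("%032x", new BigInteger(1, digest));
        }
        catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }

    // true if the existing object's content is byte-identical to ours
    public static boolean sameContent(String existingEtag, byte[] ourBody) {
        // S3 returns the ETag wrapped in double quotes
        return existingEtag.replace("\"", "").equalsIgnoreCase(contentMd5Hex(ourBody));
    }
}
```

This would not apply to multi-part uploads (whose ETags are not a plain content MD5) or SSE-KMS encrypted objects, which is part of why the tag-based approach is more general.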


Successfully merging this pull request may close these issues.

Exclusive write to a new S3 location may fail with FileAlreadyExistsException

4 participants