@findepi
Member

@findepi findepi commented Nov 20, 2025

AWS S3 supports conditional writes, which Trino uses to implement
`S3OutputFile.createExclusive`. The AWS SDK also performs implicit request
retries on certain failures, including read timeouts.

It is possible, as reproduced by `TestS3Retries` modified in this
commit, that an object is created successfully, but the AWS SDK is not
aware of that success and retries the request. The retried request then
fails with a precondition failure -- from the S3 server's perspective the
object already exists. In fact, in the case of single PutObject requests,
there seems to be no way to distinguish the two situations on the server
side. ETag cannot be used for that, since it is usually content-based:
`createExclusive` cannot succeed for two different callers even if they
happen to be writing the same key.
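The failure mode can be illustrated with a minimal in-memory sketch (the `FakeS3` class and method names below are hypothetical, not Trino code): a conditional PUT uses `If-None-Match: *` semantics, so a retried PUT whose first attempt actually succeeded looks exactly like a genuine conflict to the server.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the S3 "If-None-Match: *" conditional PUT,
// illustrating why an SDK-level retry of a successful request must fail.
public class FakeS3 {
    private final Map<String, String> objects = new HashMap<>();

    // Returns true on success, false on HTTP 412 Precondition Failed
    public boolean putIfNoneMatch(String key, String body) {
        if (objects.containsKey(key)) {
            return false; // object already exists: precondition failure
        }
        objects.put(key, body);
        return true;
    }
}
```

The first attempt succeeds; if its response is lost to a read timeout, the SDK's transparent retry of the very same write now returns a 412, and the server has no way to tell that retry apart from a competing writer.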

What's surprising is that this problem is also present in the case of
multi-part upload, at least with MinIO. It is unclear whether AWS SDK
retries combined with multi-part uploads to AWS S3 would also trigger
a precondition failure.

The solution implemented in this commit is to tag objects created by
Trino with a random value. When the precondition fails and the AWS SDK
reports that the request was retried (`numAttempts > 1`), the tags on
the object are checked. `FileAlreadyExistsException` is raised only
when this tag lookup fails, or when the object does not have a tag
indicating it was created by the current thread.
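The decision described above can be sketched as a small pure helper (the class and parameter names here are illustrative, not actual Trino code; `numAttempts` would come from SDK response metadata, and the tag arguments from a follow-up GetObjectTagging call):

```java
// Hypothetical sketch of the retry-disambiguation decision: given a
// precondition failure, decide whether it is a genuine conflict that
// should surface as FileAlreadyExistsException.
public class ExclusiveCreateRetry {
    public static boolean isGenuineConflict(
            int numAttempts,
            boolean tagLookupSucceeded,
            boolean tagMatchesOurRandomValue)
    {
        if (numAttempts <= 1) {
            // the SDK never retried, so our own write cannot have
            // silently succeeded: the object belongs to someone else
            return true;
        }
        if (!tagLookupSucceeded) {
            // cannot verify ownership; conservatively treat as a conflict
            return true;
        }
        // the object carries our random tag only if our first attempt
        // is what created it
        return !tagMatchesOurRandomValue;
    }
}
```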

Release notes considerations

Delta Lake

  • Prevent the log writer from failing with `FileAlreadyExistsException` (not applicable when the `s3.exclusive-create=false` configuration is used)
  • Writing to S3 now requires permissions for the PutObjectTagging and GetObjectTagging operations

@findepi findepi requested a review from wendigo November 20, 2025 17:29
@cla-bot cla-bot bot added the cla-signed label Nov 20, 2025
@findepi findepi removed the request for review from wendigo November 20, 2025 17:29
@findepi findepi marked this pull request as draft November 20, 2025 17:29
@findepi findepi force-pushed the findepi/s3-create-exclusive branch from c7b6d7c to e3034b8 on November 21, 2025 08:21
@findepi findepi changed the base branch from findepi/s3-create-exclusive to findepi/delta-s3 on November 21, 2025 08:58
@findepi findepi force-pushed the findepi/delta-s3 branch 2 times, most recently from c036c9b to 11e34a7 on November 21, 2025 10:03
@findepi findepi force-pushed the findepi/s3-exclusive-retries branch from 435963b to c276976 on November 21, 2025 10:41
@findepi findepi force-pushed the findepi/s3-exclusive-retries branch from c276976 to 742403c on November 21, 2025 13:04
@github-actions github-actions bot added docs delta-lake Delta Lake connector labels Nov 21, 2025
Base automatically changed from findepi/delta-s3 to master November 21, 2025 13:28
@findepi findepi marked this pull request as ready for review November 21, 2025 13:29
- register resources with the closer as early as possible
- use a toxic name without spaces; it seems that names with spaces don't
  work correctly
- drop redundant container configuration (network alias)

The network chaos (Toxiproxy) setup is specific to the assertions to be
performed. While we could test different setups and assertions with
separate, similar classes, this would be a lot of boilerplate code.
This commit refactors the test class so that each test method
configures its network breakers.

Just a refactor. Separate commit to make subsequent logic changes
visible.
Simplify the logic flow in `S3OutputStream.createExclusive` and
`S3OutputStream.createOrOverwrite`. Before this change, a call to either
of these methods could result in a single PutObject request or in a
multi-part upload. A multi-part upload makes sense in a streaming context
to avoid buffering data in memory. However, in the case of
`createExclusive` and `createOrOverwrite` the data is already in memory.

This reduces memory usage during these calls: there is no longer an
intermediate buffer within `S3OutputStream` involved.
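The simplification can be sketched as a hypothetical upload-path decision (illustrative names only, not actual Trino code): multi-part only pays off when the payload is not already fully buffered.

```java
// Hypothetical sketch: pick the upload path based on whether the whole
// payload is already resident in memory. Before the refactor, in-memory
// payloads could still take the multi-part path through an extra buffer.
public class UploadPlanner {
    public enum Plan { SINGLE_PUT, MULTI_PART }

    public static Plan planFor(boolean dataFullyInMemory) {
        // a single PutObject avoids the intermediate buffering that
        // multi-part upload requires
        return dataFullyInMemory ? Plan.SINGLE_PUT : Plan.MULTI_PART;
    }
}
```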
@findepi
Member Author

findepi commented Nov 21, 2025

CI

@findepi findepi force-pushed the findepi/s3-exclusive-retries branch from 2c1beb6 to 5aed9cd on November 24, 2025 12:27
@pettyjamesm
Member

The issue makes sense to me, and the approach of using tags to disambiguate should work, but it's worth noting that you do pay for S3 object tags, so this does increase the cost profile. How frequently is `createExclusive` actually called here? Is it for each data file in Iceberg tables, or only when updating a single manifest per table / partition?

You mention that ETags can't be used to distinguish between multiple calls to create the same object with the same content, but maybe the other question is: do we actually need to? If we're creating the same object that some other writer created, is it actually necessary to fail in that case? Because if we can tolerate that ambiguity, we could use ETags for this purpose and avoid the cost of using tags.
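If tolerating same-content duplicates were acceptable, the check could compare the existing object's ETag against the MD5 of the bytes we attempted to write, since for single (non-multi-part) PutObject uploads without SSE-KMS, S3's ETag is the hex MD5 of the object body. A hedged sketch of that comparison, with hypothetical helper names:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of the ETag-based alternative discussed above.
public class EtagCheck {
    public static String contentMd5Hex(byte[] body) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(body);
            return String.format("%032x", new BigInteger(1, digest));
        }
        catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }

    // true if the existing object's content is byte-identical to ours
    public static boolean sameContent(String existingEtag, byte[] ourBody) {
        // S3 returns the ETag wrapped in double quotes
        return existingEtag.replace("\"", "").equalsIgnoreCase(contentMd5Hex(ourBody));
    }
}
```

This would not apply to multi-part uploads (whose ETags are not a plain content MD5) or SSE-KMS encrypted objects, which is part of why the tag-based approach is more general.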


Successfully merging this pull request may close these issues.

Exclusive write to a new S3 location may fail with FileAlreadyExistsException

4 participants