Skip to content

[RFR] S3 bugfixes #2329

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Apr 2, 2025
Merged

[RFR] S3 bugfixes #2329

merged 14 commits into from
Apr 2, 2025

Conversation

npow
Copy link
Contributor

@npow npow commented Mar 3, 2025

This PR fixes several issues which caused s3op to be stuck:

  1. Need to call queue.cancel_join_thread() so that the workers can exit without flushing the queue, otherwise there is a deadlock
  2. Catch the correct exceptions in download/upload (they don't actually raise ClientError)
  3. Handle InternalError
  4. Handle SSLError
  5. Optimistically assume all other unhandled exceptions are transient

Additional improvements:

  1. Added exponential backoff to jitter_sleep()
  2. Set default retry config in s3op.py to match that in aws_client.py
  3. Fix a bug where the retry setting was not being applied properly if config was missing
  4. Fail early on fatal errors
  5. Don't restart from scratch when there's no progress

@npow npow requested review from savingoyal and romain-intel March 3, 2025 22:01
@npow npow added the ok-to-test label Mar 3, 2025
@npow npow added ok-to-test and removed ok-to-test labels Mar 4, 2025
@npow npow added ok-to-test and removed ok-to-test labels Mar 4, 2025
@npow npow added ok-to-test and removed ok-to-test labels Mar 5, 2025
@npow npow changed the title S3 bugfixes [RFR] S3 bugfixes Mar 5, 2025
Copy link
Collaborator

@savingoyal savingoyal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any testing scripts for me to use to reproduce the issues this PR is trying to address? it's okay if it's clunky etc. also it would be good to run the battery of s3 tests for these changes. are you able to set up the s3 tests against your env?

@npow
Copy link
Contributor Author

npow commented Mar 13, 2025

any testing scripts for me to use to reproduce the issues this PR is trying to address? it's okay if it's clunky etc. also it would be good to run the battery of s3 tests for these changes. are you able to set up the s3 tests against your env?

We run the s3 tests on our end for every commit and have been able to reproduce the issue there. I also have a flow that reproduces the issue by reading and writing data to/from S3 using hundreds of workers, but the data isn't public.

@nflx-mf-bot
Copy link
Collaborator

Testing[4287105] @ 00b184d

@nflx-mf-bot
Copy link
Collaborator

Testing for OSS PR [1011] @ commit 00b184d had 12 FAILUREs.

@npow
Copy link
Contributor Author

npow commented Mar 17, 2025

Testing for OSS PR [1011] @ commit 00b184d had 12 FAILUREs.

Failures are from the card tests and the tag mutation test, which have been flaky.

romain-intel
romain-intel previously approved these changes Mar 18, 2025
Copy link
Contributor

@romain-intel romain-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change in logic looks correct.

romain-intel
romain-intel previously approved these changes Mar 18, 2025
def convert_to_client_error(e):
match = BOTOCORE_MSG_TEMPLATE_MATCH.search(str(e))
if not match:
raise e
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you see a problem here?

romain-intel
romain-intel previously approved these changes Mar 28, 2025
Copy link
Contributor

@romain-intel romain-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please re-run black but otherwise LGTM.

savingoyal
savingoyal previously approved these changes Mar 31, 2025
@npow npow dismissed stale reviews from savingoyal and romain-intel via 55f7e15 April 1, 2025 19:11
@npow npow merged commit b9d0249 into master Apr 2, 2025
29 checks passed
@npow npow deleted the npow/s3-bugfix branch April 2, 2025 05:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants