fix: Avoid calling in to Python function while holding lock during FetchedCredentialsCache refresh on expiry (#25763)
Conversation
On the line `let mut inner = self.0.lock().await;`:
I think one thread holds the GIL while trying to acquire this lock, while the lock is held by a second thread - and that second thread is trying to acquire the GIL to call the Python function.
It should be possible to do a simple fix by just releasing the GIL before this point?
Feel free to update accordingly, if so!
I'm away for the weekend, so might not be updating anything until Monday.
Out of curiosity I did appeal to the AI gods and asked Claude Code about just releasing the GIL. It had the following to say about the deadlock, and about modifying the current fix (I have not looked at its comments very hard, as I'm heading out to dinner):
The deadlock occurs because of lock ordering:
1. Thread A (Rust async task): Holds the credentials cache lock → tries to acquire GIL (via Python::attach)
2. Thread B (Python thread): Holds the GIL → tries to acquire the credentials cache lock (via another credential fetch)
Re: releasing the GIL instead:
You'd need to ensure the GIL is released before entering `get_maybe_update`.
However, this is tricky because:
- The GIL acquisition happens inside `Python::attach` within `update_func.await`.
- You'd need to structure the caller to explicitly release the GIL, but the callers
(into_aws_provider, etc. at lines 623, 698, 763) already use `Python::attach` internally.
- This would require restructuring the entire call chain.
So... I leave it to your discretion :)
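The lock-ordering inversion described above can be modeled with two plain mutexes. This is an illustrative toy, not the actual pyo3/polars types: `gil` stands in for Python's GIL and `cache` for the credentials cache lock. Deadlock requires opposite acquisition orders (thread A holds `cache` and waits for `gil` while thread B holds `gil` and waits for `cache`); in the sketch below both threads acquire in the same order, so it completes.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Toy model of the two contended resources (names are illustrative, not pyo3's):
// `gil` stands in for Python's GIL, `cache` for the credentials cache lock.
fn run_with_consistent_order() -> u32 {
    let gil = Arc::new(Mutex::new(()));
    let cache = Arc::new(Mutex::new(0u32));

    // Both threads take the locks in the SAME order (gil, then cache),
    // so this run completes. Swapping the order in one thread recreates
    // the reported hazard: each thread holds one lock and waits on the other.
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let (gil, cache) = (Arc::clone(&gil), Arc::clone(&cache));
            thread::spawn(move || {
                let _g = gil.lock().unwrap(); // "acquire the GIL"
                let mut n = cache.lock().unwrap(); // then the cache lock
                *n += 1;
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let n = *cache.lock().unwrap();
    n
}
```

This is why the alternative fix of releasing the GIL before taking the cache lock also works in principle: it removes one edge of the wait cycle rather than reordering it.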
Codecov Report ❌

@@ Coverage Diff @@
##             main   #25763   +/-   ##
=======================================
  Coverage   80.56%   80.57%
=======================================
  Files        1764     1764
  Lines      242683   242685      +2
  Branches     3041     3041
=======================================
+ Hits       195528   195544     +16
+ Misses      46372    46358     -14
  Partials      783      783
Possibly fixes #25762.
Looking at the trace indicates the deadlock occurred around `update_func` when refreshing credentials, so I read through the `get_maybe_update` function in cloud/credential_provider.rs. Since `update_func` is a Python function, and we take a lock for the duration of the update/call, I think we might have a mutex+GIL interaction here - concurrent calls could get stuck on credentials update/refresh if they have expired? (So we wouldn't see it often.)

Definitely needs careful review, but the PR makes the following adjustment to the flow:
This avoids holding the lock while we call in to Python. The tradeoff is we could have some redundant credential fetches if there's a lot of concurrency, and the race in this case is resolved by simply taking the credentials with the latest expiry...
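A minimal sketch of that adjusted flow, assuming a much-simplified synchronous cache (`CredCache` and its fields are illustrative, not the actual `FetchedCredentialsCache` API): check validity under the lock, drop the lock before invoking the fetch function, then re-acquire and keep whichever credentials expire latest.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Illustrative simplified cache; not the actual polars FetchedCredentialsCache.
struct CredCache {
    inner: Mutex<Option<(String, Instant)>>, // (credentials, expiry)
}

impl CredCache {
    fn get_maybe_update(&self, fetch: impl FnOnce() -> (String, Instant)) -> String {
        // 1. Under the lock: return early while the cached entry is valid.
        {
            let guard = self.inner.lock().unwrap();
            if let Some((creds, expiry)) = guard.as_ref() {
                if *expiry > Instant::now() {
                    return creds.clone();
                }
            }
        } // lock dropped here, BEFORE calling the (Python) fetch function

        // 2. Outside the lock: fetch new credentials. Concurrent callers may
        //    each fetch redundantly; that is the accepted tradeoff.
        let (new_creds, new_expiry) = fetch();

        // 3. Re-acquire the lock; resolve the race by keeping whichever
        //    credentials have the latest expiry.
        let mut guard = self.inner.lock().unwrap();
        let replace = match guard.as_ref() {
            Some((_, existing_expiry)) => *existing_expiry < new_expiry,
            None => true,
        };
        if replace {
            *guard = Some((new_creds, new_expiry));
        }
        guard.as_ref().unwrap().0.clone()
    }
}
```

The actual code uses an async lock and awaits `update_func` under `Python::attach`; the sketch only captures the lock scoping, not the async or pyo3 details.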
Anyway, have commented the code inline, but am parking the PR in Draft so people who know better than I can take a good look over it and see if it needs further adjustments, or if there's a cleaner solution 😄