Skip to content

Conversation

@alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Dec 13, 2025

Possibly fixes #25762.

Looking at the trace indicates the deadlock occurred around update_func when refreshing credentials, so I read through the get_maybe_update function in cloud/credential_provider.rs. Since update_func is a Python function, and we take a lock for the duration of the update/call, I think we might have a mutex+GIL interaction here - concurrent calls could get stuck on credentials update/refresh if they have expired? (So we wouldn't see it often).

lock() 
→ check expiry 
→ [if expired: call Python while holding lock]  🔒🐍
→ return

Definitely needs careful review, but the PR makes the following adjustment to the flow:

lock() 
→ check expiry 
→ [if expired: unlock 
  → call Python without lock 
  → lock() 
  → compare expiry 
  → update if newer] 
→ return

This avoids holding the lock while we call in to Python. The tradeoff is we could have some redundant credential fetches if there's a lot of concurrency, and the race in this case is resolved by simply taking the credentials with the latest expiry...

Anyway, have commented the code inline, but am parking the PR in Draft so people who know better than I can take a good look over it and see if it needs further adjustments, or if there's a cleaner solution 😄

@github-actions github-actions bot added A-io-cloud Area: reading/writing to cloud storage fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Dec 13, 2025
@alexander-beedie alexander-beedie changed the title fix: Avoid calling in to Python function while holding lock during cloud credentials refresh fix: Avoid calling in to Python function while holding lock in FetchedCredentialsCache refresh on expiry Dec 13, 2025
@alexander-beedie alexander-beedie changed the title fix: Avoid calling in to Python function while holding lock in FetchedCredentialsCache refresh on expiry fix: Avoid calling in to Python function while holding lock during FetchedCredentialsCache refresh on expiry Dec 13, 2025
}
}

let mut inner = self.0.lock().await;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one thread holds GIL while trying to acquire this lock, where this lock is being held by a 2nd thread - and the 2nd thread is trying to acquire GIL to call the Python function.

It should be possible to do a simple fix by just releasing the GIL before this point?

Copy link
Collaborator Author

@alexander-beedie alexander-beedie Dec 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to update accordingly, if so!
I'm away for the weekend, so might not be updating anything until Monday.

Out of curiosity I did appeal to the AI gods and asked Claude Code about just releasing the GIL - it has the following to say about the deadlock, and about modifying the current fix ( have not looked at its comments very hard as I'm heading out to dinner):

The deadlock occurs because of lock ordering:
1. Thread A (Rust async task): Holds the credentials cache lock → tries to acquire GIL (via Python::attach)
2. Thread B (Python thread): Holds the GIL → tries to acquire the credentials cache lock (via another credential fetch)

Ref, releasing the GIL instead:

You'd need to ensure the GIL is released before entering `get_maybe_update`. 
However, this is tricky because:
- The GIL acquisition happens inside `Python::attach` within `update_func.await`.
- You'd need to structure the caller to explicitly release the GIL, but the callers 
  (into_aws_provider, etc. at lines 623, 698, 763) already use `Python::attach` internally.
- This would require restructuring the entire call chain.

So... I leave it to your discretion :)

@codecov
Copy link

codecov bot commented Dec 13, 2025

Codecov Report

❌ Patch coverage is 76.66667% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.57%. Comparing base (fe85677) to head (4e508af).

Files with missing lines Patch % Lines
crates/polars-io/src/cloud/credential_provider.rs 76.66% 7 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #25763   +/-   ##
=======================================
  Coverage   80.56%   80.57%           
=======================================
  Files        1764     1764           
  Lines      242683   242685    +2     
  Branches     3041     3041           
=======================================
+ Hits       195528   195544   +16     
+ Misses      46372    46358   -14     
  Partials      783      783           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-io-cloud Area: reading/writing to cloud storage fix Bug fix python Related to Python Polars rust Related to Rust Polars

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rare S3 scan_parquet deadlock

2 participants