fix: Avoid calling in to Python function while holding lock during FetchedCredentialsCache refresh on expiry (#25763)
Conversation
On the line `let mut inner = self.0.lock().await;`:
I think one thread holds the GIL while trying to acquire this lock, while the lock is held by a second thread - and that second thread is trying to acquire the GIL to call the Python function.
It should be possible to do a simple fix by just releasing the GIL before this point?
Feel free to update accordingly, if so!
I'm away for the weekend, so might not be updating anything until Monday.
Out of curiosity I did appeal to the AI gods and asked Claude Code about just releasing the GIL. It had the following to say about the deadlock, and about modifying the current fix (I have not looked at its comments very hard, as I'm heading out to dinner):
The deadlock occurs because of lock ordering:
1. Thread A (Rust async task): Holds the credentials cache lock → tries to acquire GIL (via Python::attach)
2. Thread B (Python thread): Holds the GIL → tries to acquire the credentials cache lock (via another credential fetch)
Re: releasing the GIL instead:
You'd need to ensure the GIL is released before entering `get_maybe_update`.
However, this is tricky because:
- The GIL acquisition happens inside `Python::attach` within `update_func.await`.
- You'd need to structure the caller to explicitly release the GIL, but the callers
(into_aws_provider, etc. at lines 623, 698, 763) already use `Python::attach` internally.
- This would require restructuring the entire call chain.
So... I leave it to your discretion :)
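The lock-ordering inversion described above can be modeled with two plain mutexes. This is an illustrative toy, not the actual pyo3/polars types: `gil` stands in for Python's GIL and `cache` for the credentials cache lock. Deadlock requires opposite acquisition orders (thread A holds `cache` and waits for `gil` while thread B holds `gil` and waits for `cache`); in the sketch below both threads acquire in the same order, so it completes.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Toy model of the two contended resources (names are illustrative, not pyo3's):
// `gil` stands in for Python's GIL, `cache` for the credentials cache lock.
fn run_with_consistent_order() -> u32 {
    let gil = Arc::new(Mutex::new(()));
    let cache = Arc::new(Mutex::new(0u32));

    // Both threads take the locks in the SAME order (gil, then cache),
    // so this run completes. Swapping the order in one thread recreates
    // the reported hazard: each thread holds one lock and waits on the other.
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let (gil, cache) = (Arc::clone(&gil), Arc::clone(&cache));
            thread::spawn(move || {
                let _g = gil.lock().unwrap(); // "acquire the GIL"
                let mut n = cache.lock().unwrap(); // then the cache lock
                *n += 1;
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let n = *cache.lock().unwrap();
    n
}
```

This is why the alternative fix of releasing the GIL before taking the cache lock also works in principle: it removes one edge of the wait cycle rather than reordering it.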
Codecov Report ❌

@@ Coverage Diff @@
##             main   #25763   +/-   ##
=======================================
  Coverage   80.56%   80.57%
=======================================
  Files        1764     1764
  Lines      242683   242685      +2
  Branches     3041     3041
=======================================
+ Hits       195528   195544     +16
+ Misses      46372    46358     -14
  Partials      783      783
Possibly fixes #25762.
Looking at the trace indicates the deadlock occurred around `update_func` when refreshing credentials, so I read through the `get_maybe_update` function in cloud/credential_provider.rs. Since `update_func` is a Python function, and we take a lock for the duration of the update/call, I think we might have a mutex+GIL interaction here - concurrent calls could get stuck on credentials update/refresh if they have expired? (So we wouldn't see it often.)

Definitely needs careful review, but the PR makes the following adjustment to the flow:
This avoids holding the lock while we call in to Python. The tradeoff is we could have some redundant credential fetches if there's a lot of concurrency, and the race in this case is resolved by simply taking the credentials with the latest expiry...
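A minimal sketch of that adjusted flow, assuming a much-simplified synchronous cache (`CredCache` and its fields are illustrative, not the actual `FetchedCredentialsCache` API): check validity under the lock, drop the lock before invoking the fetch function, then re-acquire and keep whichever credentials expire latest.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Illustrative simplified cache; not the actual polars FetchedCredentialsCache.
struct CredCache {
    inner: Mutex<Option<(String, Instant)>>, // (credentials, expiry)
}

impl CredCache {
    fn get_maybe_update(&self, fetch: impl FnOnce() -> (String, Instant)) -> String {
        // 1. Under the lock: return early while the cached entry is valid.
        {
            let guard = self.inner.lock().unwrap();
            if let Some((creds, expiry)) = guard.as_ref() {
                if *expiry > Instant::now() {
                    return creds.clone();
                }
            }
        } // lock dropped here, BEFORE calling the (Python) fetch function

        // 2. Outside the lock: fetch new credentials. Concurrent callers may
        //    each fetch redundantly; that is the accepted tradeoff.
        let (new_creds, new_expiry) = fetch();

        // 3. Re-acquire the lock; resolve the race by keeping whichever
        //    credentials have the latest expiry.
        let mut guard = self.inner.lock().unwrap();
        let replace = match guard.as_ref() {
            Some((_, existing_expiry)) => *existing_expiry < new_expiry,
            None => true,
        };
        if replace {
            *guard = Some((new_creds, new_expiry));
        }
        guard.as_ref().unwrap().0.clone()
    }
}
```

The actual code uses an async lock and awaits `update_func` under `Python::attach`; the sketch only captures the lock scoping, not the async or pyo3 details.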
Anyway, have commented the code inline, but am parking the PR in Draft so people who know better than I can take a good look over it and see if it needs further adjustments, or if there's a cleaner solution 😄