Skip to content

Re-authenticate if the session is closed by a concurrent request #1031

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

nolar
Copy link
Owner

@nolar nolar commented Jun 11, 2023

Several concurrent requests can step on each others toes while catching HTTP 401s: one of them will catch 401 and cause the re-auth normally (by invalidating the APIContext via the Vault), while others will be left with the closed session and will not be able even to try executing their retried requests to get that 401. Instead, they will get a generic RuntimeError("Session is closed") from aiohttp.

The concurrent requests are highly likely if the API returned some other errors previously, so that the requests went into the back-off sleep for a few seconds before retrying.

A proper fix would be to retry with a new session, but we currently have no mechanisms to replace the session in a context object (safely). Hence, escalate and try again with new credentials (loosing the retry counter as a side effect).


Under the hood, this concurrency caused conflicts in the vault over the state of the current and invalid credentials:

The credentials invalidation happens by the key only, which is the login handler name. It does not feed the actual credentials object that failed with 401 and thus must be remembered as broken. As a result, the 2nd, 3rd, and following failed API streams invalidated the valid credentials, which was acquired after the 1st failure and re-authentication. Therefore, the following re-authentications that tried to add the same credentials, failed — because that credentials was known to be invalid.


The second contributng factor was the inconsistency between the vault's credentials set (._current) and its boolean readiness (._state), protected by different locks.

Specifically, the case was this:

  • Client A waits for the vault readiness state (the presence of credentials) under Vault._ready._condition — and gets it, and proceeds further.
  • Client B gets 401, invalidates the credentials under the lock Vault._lock, leaves the lock, initiates the re-auth again.
  • Client A reaches the lock Vault._lock and acquires it.
  • Client A tries to get the next avalable valid credentials — but there is nothing. It fails.
  • The re-authentication catches up and populates the vault with a new credentials, but it is too late.

Most likely it was overlooked before in Python 3.10-3.11 or so — because they were relatively slow and the sequence of operations managed to be fast, the event loop control was not going out of the routines fast enough on await. Now everything is fasy, so the control leaves the coroutines more often.

Now, all these operations are protected under the same condition object, and there should be no unprotected inconsistencies between the current credentials and the vault's boolean readiness state.


TODO: Requires tests to simulate the situation — if the hypothesis is correct.

Presumably fixes #980 #1158

@nolar nolar force-pushed the session-closed-in-reauth branch from b3be3e2 to e11dd73 Compare March 26, 2025 17:55
… one by the handler name

Signed-off-by: Sergey Vasilyev <[email protected]>
@nolar nolar force-pushed the session-closed-in-reauth branch from e11dd73 to a305e26 Compare March 26, 2025 18:49
…fety

Otherwise, with the lock & toggle combined, it led to situations where the lock was released after invalidation, leaving the "current" set empty and only triggering the new re-authentication. At the same time, the lock was acquired by a parallel task that has already checked the readiness state and got it "on" (before invalidation), but was waiting on the lock to actually yield the credentials. As a result, the `select()` call failed because the current set of credentials turned out to be empty by that moment.

Now, all operations of both credentials population/invalidation, so as checking for the readiness state, are done under the same lock/condition. This should ensure the lack of inconsistent states between the toggle & the sets of credentials in different moments in time.

Signed-off-by: Sergey Vasilyev <[email protected]>
@nolar
Copy link
Owner Author

nolar commented Apr 25, 2025

Confirmed to be fixing the issue — see the #980 comments. This change was suspected to cause memory leaks, but later it was proven the leak is not related to this particular change and is reproduced with the main branch, too. Hence, this PR is good to go (after a final quick look, for certainty & safety).

@nolar nolar marked this pull request as ready for review April 25, 2025 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Watchers on Custom Resources throw RuntimeError("Session is closed") and permanently die
1 participant