Re-authenticate if the session is closed by a concurrent request #1031
Several concurrent requests can step on each other's toes while catching HTTP 401s: one of them catches the 401 and triggers the re-auth normally (by invalidating the APIContext via the Vault), while the others are left with a closed session and cannot even attempt their retried requests to get that 401. Instead, they get a generic RuntimeError("Session is closed") from aiohttp.
Concurrent requests are highly likely if the API previously returned some other errors, so that the requests went into the back-off sleep for a few seconds before retrying.
A proper fix would be to retry with a new session, but we currently have no mechanism to safely replace the session in a context object. Hence, escalate and try again with new credentials (losing the retry counter as a side effect).
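The escalation described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: LoginError, FakeSession, and request_with_escalation are hypothetical names, and the fake session only mimics the closed-session RuntimeError so that the sketch runs without aiohttp installed.

```python
import asyncio


class LoginError(Exception):
    """Hypothetical signal that the caller must re-authenticate and retry."""


class FakeSession:
    """Stand-in for an HTTP session; it only mimics the closed-session error."""

    def __init__(self) -> None:
        self.closed = False

    async def get(self, url: str) -> str:
        if self.closed:
            # The same generic error message that the description quotes.
            raise RuntimeError("Session is closed")
        return f"response from {url}"


async def request_with_escalation(session: FakeSession, url: str) -> str:
    try:
        return await session.get(url)
    except RuntimeError as e:
        # Escalate only the closed-session case; other RuntimeErrors propagate.
        if session.closed and "Session is closed" in str(e):
            raise LoginError("session closed by a concurrent re-auth") from e
        raise


async def main() -> list[str]:
    session = FakeSession()
    outcomes = [await request_with_escalation(session, "/api")]
    session.closed = True  # a concurrent request got a 401 and closed the session
    try:
        await request_with_escalation(session, "/api")
    except LoginError:
        outcomes.append("escalated")
    return outcomes


outcomes = asyncio.run(main())
```

The caller that handles the hypothetical LoginError would restart the request with fresh credentials, which is why the retry counter is lost.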
Under the hood, this concurrency caused conflicts in the vault over the state of the current and invalid credentials:
The credentials are invalidated by the key only, which is the login handler name. The invalidation is not given the actual credentials object that failed with 401 and thus must be remembered as broken. As a result, the 2nd, 3rd, and following failed API streams invalidated the valid credentials, which were acquired after the 1st failure and re-authentication. Therefore, the following re-authentications that tried to add the same credentials failed, because those credentials were already known to be invalid.
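The difference between invalidating by key and invalidating the exact failed object can be illustrated with a toy model. Everything here (KeyedVault, Credentials, the method names) is hypothetical and much simpler than the real Vault; it only reproduces the sequence of events described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credentials:
    token: str


class KeyedVault:
    """Hypothetical vault: one current credentials item per handler key."""

    def __init__(self) -> None:
        self.current: dict[str, Credentials] = {}
        self.invalid: set[Credentials] = set()

    def populate(self, key: str, creds: Credentials) -> bool:
        # Credentials already known to be broken are rejected.
        if creds in self.invalid:
            return False
        self.current[key] = creds
        return True

    def invalidate_by_key(self, key: str) -> None:
        # Problematic variant: whatever is current under the key is marked broken.
        creds = self.current.pop(key, None)
        if creds is not None:
            self.invalid.add(creds)

    def invalidate_exact(self, key: str, failed: Credentials) -> None:
        # Safer variant: only the object that actually got the 401 is marked broken.
        self.invalid.add(failed)
        if self.current.get(key) == failed:
            del self.current[key]


old, new = Credentials("old"), Credentials("new")

by_key = KeyedVault()
by_key.populate("login", old)
by_key.invalidate_by_key("login")   # 1st stream fails with 401
by_key.populate("login", new)       # re-authentication succeeds
by_key.invalidate_by_key("login")   # 2nd stream fails late, while still using `old`
readded_by_key = by_key.populate("login", new)  # rejected: `new` was wrongly marked

exact = KeyedVault()
exact.populate("login", old)
exact.invalidate_exact("login", old)
exact.populate("login", new)
exact.invalidate_exact("login", old)  # the late failure names the stale object
readded_exact = exact.populate("login", new)
```

In the by-key variant the late failure poisons the fresh credentials, so re-adding them is refused; in the exact-object variant the fresh credentials stay valid.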
The second contributing factor was the inconsistency between the vault's credentials set (._current) and its boolean readiness (._state), protected by different locks. Specifically, the case was this:
- [...] Vault._ready._condition, and gets it, and proceeds further.
- [...] Vault._lock, leaves the lock, initiates the re-auth again.
- [...] Vault._lock and acquires it.
Most likely it was overlooked before in Python 3.10-3.11 or so, because those versions were relatively slow: the sequence of operations managed to finish before the event loop control left the routines on await. Now everything is fast, so the control leaves the coroutines more often.
Now, all these operations are protected by the same condition object, and there should be no unprotected inconsistencies between the current credentials and the vault's boolean readiness state.
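As a rough sketch of the "single condition object" idea, assuming a simplified vault that is not the real implementation: ConditionVault and its methods are hypothetical names, and asyncio.Condition is used here only to show the credentials and the readiness flag changing under one lock.

```python
import asyncio


class ConditionVault:
    """Hypothetical vault: credentials and readiness share one condition."""

    def __init__(self) -> None:
        self._cond = asyncio.Condition()
        self._current: dict[str, str] = {}
        self._ready = False

    async def populate(self, key: str, token: str) -> None:
        async with self._cond:
            # Both fields change inside the same critical section.
            self._current[key] = token
            self._ready = True
            self._cond.notify_all()

    async def invalidate(self, key: str) -> None:
        async with self._cond:
            self._current.pop(key, None)
            self._ready = bool(self._current)

    async def wait_for_credentials(self) -> dict[str, str]:
        async with self._cond:
            # The readiness check and the read happen under the same lock,
            # so no other coroutine can empty the set in between.
            await self._cond.wait_for(lambda: self._ready)
            return dict(self._current)


async def main() -> dict[str, str]:
    vault = ConditionVault()
    waiter = asyncio.create_task(vault.wait_for_credentials())
    await asyncio.sleep(0)  # let the waiter block on the condition first
    await vault.populate("login", "token-1")
    return await waiter


result = asyncio.run(main())
```

With two separate locks, a coroutine could observe the vault as ready and then find the credentials set already emptied by another coroutine; a shared condition removes that window in this toy model.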
TODO: Requires tests to simulate the situation — if the hypothesis is correct.
Presumably fixes #980 #1158