fix(framework): Fix token orphan expiry causing run failure#6916
fix(framework): Fix token orphan expiry causing run failure#6916
Conversation
…orphan window failure
There was a problem hiding this comment.
Pull request overview
This PR adjusts SuperLink’s token-expiry cleanup so that expiring an orphaned token does not incorrectly transition a never-started (PENDING) run into a FAILED state, addressing a crash window in the new RequestToken flow.
Changes:
- Narrow run-failure-on-token-expiry logic in
SqlLinkStateto only affect runs that have actually started. - Add a regression test ensuring token expiry does not fail a PENDING run and that the run can be re-claimed after token cleanup.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
framework/py/flwr/server/superlink/linkstate/sql_linkstate.py |
Restricts the token-expiry run-failure UPDATE to runs with starting_at != '' to avoid failing PENDING runs. |
framework/py/flwr/server/superlink/linkstate/linkstate_test.py |
Adds a test validating that token expiry doesn’t fail a never-started run and that a new token can be minted afterward. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…tistics." This reverts commit f5fda2e.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Issue
In the new RequestToken flow, it is possible for SuperLink to crash after minting a token but before setting the run status, which would cause an eventual failure when the token expires, as the PENDING run would be set to FAILED.
Proposal
Token expiry now only sets run status to FAILED if the run already started. This way, if the SuperLink crashes in the window and leaves an orphaned token, the expiry of that token won't set the run to FAILED.
In current operation, this then means that the SuperExec will loop on failure to get a token for a run until the orphaned token expires, at which point SuperExec will get a token and everything will proceed as normal.
The alternative of a completely atomic operation for minting tokens and setting run-states would be a much larger change and would only save 30 seconds.
NOTE: In review, copilot found an issue where federation_manager.report_run_usage() can be called regardless of if an UPDATE occurred. Present implementations for report_run_usage() are no-ops, so I am opting to treat it as an issue to be addressed later, as the brittleness of the correct behavior (by contract in federation_manager.report_run_usage()) is an independent problem.
Checklist
#contributions)