[nexus] retry indefinitely in `siu_lock_instance()` #10177
Merged
As described in #10166, a particularly unlucky sequence of events can result in a second execution of the instance-update saga's `siu_lock_instance()` failing and unwinding while the instance is locked by that saga, resulting in the instance record being locked forever (or until a human being manually picks the lock).

This occurs because that action [will currently fail][1] if the query that tries to lock the instance record in the database fails with an error that *does not* indicate that another saga has locked the instance. However, the saga node may execute multiple times in the event of a Nexus crash, so when this node executes, it is possible that the saga is *already* holding the lock but the current execution of the node is unaware of this. Suppose that:

1. A Nexus starts executing this action, successfully locks the instance record, and then crashes *before* marking the saga node as having completed.
2. Subsequently, a new Nexus resumes executing the saga and runs this action again. It hits a query failure trying to lock the instance record, and unwinds.
3. Because the saga node has not *completed*, [its undo action, which releases the lock][2], is not executed. Therefore, the instance is still locked.

This commit fixes the bug by changing `siu_lock_instance()` so that database errors that do not positively indicate that the lock is held by another saga are retried forever, rather than failing the action.

In #10166, we discussed a few potential solutions to this:

1. Retrying the lock operation forever, as implemented here,
2. Attempting to *release* the lock forever if the lock operation fails with a database error,
3. Changing `steno` so that we can have `siu_lock_instance()`'s `siu_lock_instance_undo()` unwinding action execute if `siu_lock_instance()` fails (which is essentially (2) with extra steps, since `siu_lock_instance_undo()` will retry releasing the lock forever...)

Of these options, retrying the lock operation felt like the best solution to me, since they all involve retrying *something* forever, and retrying the lock rather than the unlock means that we will keep moving forward with the *current* saga in the face of a transient DB error, rather than locking the instance, unwinding, releasing the lock, and having to start a new saga before the state update can make progress --- which just involves a lot more steps. This way, we don't "waste" the already-acquired lock if a transient error occurs after a Nexus crash results in the node executing twice.

The retry loop is based on [the one we already have in the `unwind_instance_lock()` function][3], and will complain increasingly loudly if we have been retrying for a long time, or if the database error appears to be a client rather than server error.

Fixes #10166

[1]: https://github.com/oxidecomputer/omicron/blob/d7c3b00d743bcc9212b222a74ae27cc970b1ee2c/nexus/src/app/sagas/instance_update/start.rs#L111-L112
[2]: https://github.com/oxidecomputer/omicron/blob/d7c3b00d743bcc9212b222a74ae27cc970b1ee2c/nexus/src/app/sagas/instance_update/start.rs#L116-L139
[3]: https://github.com/oxidecomputer/omicron/blob/d7c3b00d743bcc9212b222a74ae27cc970b1ee2c/nexus/src/app/sagas/instance_update/mod.rs#L1469-L1534
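For illustration, here is a minimal, self-contained Rust sketch of the retry-forever shape described above. The names (`LockError`, `lock_instance_forever`), the 20-second warning threshold, and the fixed 500ms delay are stand-ins, not omicron's actual types or tuning; the real implementation lives in `siu_lock_instance()` and follows the loop in `unwind_instance_lock()` linked above (which also escalates its logging for client-looking errors and uses real backoff, both omitted here):

```rust
use std::time::{Duration, Instant};

/// Hypothetical error type standing in for the real database error.
#[derive(Debug)]
enum LockError {
    /// Another saga positively holds the lock: fail the action so the
    /// saga unwinds normally.
    AlreadyLocked,
    /// Any other database error: possibly transient, so keep retrying.
    Database(String),
}

/// Sketch of the retry-forever lock loop. `try_lock` stands in for the
/// real query that attempts to take the instance-updater lock.
fn lock_instance_forever(
    mut try_lock: impl FnMut() -> Result<(), LockError>,
) -> Result<(), LockError> {
    const WARN_AFTER: Duration = Duration::from_secs(20);
    let started = Instant::now();
    let mut attempts: u64 = 0;
    loop {
        attempts += 1;
        match try_lock() {
            Ok(()) => return Ok(()),
            // A positive indication that another saga holds the lock is
            // a real failure: let the action fail and the saga unwind.
            Err(e @ LockError::AlreadyLocked) => return Err(e),
            // Anything else may be transient (or a previous execution of
            // this node may already hold the lock): retry, complaining
            // increasingly loudly the longer this goes on.
            Err(LockError::Database(msg)) => {
                let elapsed = started.elapsed();
                if elapsed > WARN_AFTER {
                    eprintln!(
                        "WARN: still unable to lock instance after \
                         {attempts} attempts over {elapsed:?}: {msg}"
                    );
                } else {
                    eprintln!("retrying instance lock: {msg}");
                }
                std::thread::sleep(Duration::from_millis(500));
            }
        }
    }
}

fn main() {
    // Simulate two transient database failures before the lock succeeds.
    let mut failures = 2;
    let result = lock_instance_forever(|| {
        if failures > 0 {
            failures -= 1;
            Err(LockError::Database("connection reset".into()))
        } else {
            Ok(())
        }
    });
    println!("lock result: {result:?}");
}
```

Note how only a positive `AlreadyLocked` indication escapes the loop; that is the property the fix relies on, since any ambiguous error might mean a prior execution of this very node already took the lock.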
davepacheco approved these changes on Apr 1, 2026
Co-authored-by: David Pacheco <[email protected]>