
[nexus] retry indefinitely in siu_lock_instance() #10177

Merged
hawkw merged 3 commits into main from eliza/always-lock
Apr 2, 2026

Conversation


@hawkw hawkw commented Mar 27, 2026


As described in #10166, a particularly unlucky sequence of events can
result in a second execution of the instance-update saga's
`siu_lock_instance()` failing and unwinding while the instance is locked
by that saga, resulting in the instance record being locked forever (or,
until a human being manually picks the lock).

This occurs because that action [will currently fail][1] if the query
that tries to lock the instance record in the database fails with an
error that *does not* indicate that another saga has locked the
instance. However, the saga node may execute multiple times in the event
of a Nexus crash, so when this node executes, it is possible that the
saga is *already* holding the lock but the current execution of the node
is unaware of this.

Suppose that:

1. A Nexus starts executing this action, successfully locks the instance
   record, and then crashes *before* marking the saga node as having
   completed.
2. Subsequently, a new Nexus resumes executing the saga and runs this
   action again. It hits a query failure trying to lock the instance
   record, and unwinds.
3. Because the saga node has not *completed*, [its undo action, which
   releases the lock][2], is not executed. Therefore, the instance is
   still locked.
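
Concretely, the pre-fix control flow looks roughly like the following
sketch. The names here (`LockError`, `try_lock`) are hypothetical
stand-ins for illustration, not the actual omicron types, and the sketch
is synchronous where the real saga action is async:

```rust
/// Hypothetical error type for illustration; the real query returns
/// different types.
enum LockError {
    /// The database positively reports that another saga holds the lock.
    HeldByAnotherSaga,
    /// Anything else: a transient connection failure, timeout, etc.
    Other(String),
}

/// Sketch of the *pre-fix* behavior: both error arms fail the action.
fn lock_instance_before_fix(
    try_lock: impl Fn() -> Result<(), LockError>,
) -> Result<(), String> {
    match try_lock() {
        Ok(()) => Ok(()),
        Err(LockError::HeldByAnotherSaga) => {
            Err("instance locked by another saga".into())
        }
        // The bug: failing here unwinds the saga, but because this node
        // never completed, its undo action never runs, and a lock
        // acquired by a previous execution of the node is never released.
        Err(LockError::Other(e)) => Err(e),
    }
}
```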

This commit fixes the bug by changing `siu_lock_instance()` so that
database errors that do not positively indicate that the lock is held by
another saga are retried forever, rather than failing the action.
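
A minimal sketch of the fixed logic, reusing the hypothetical `LockError`
from the sketch above (the real `siu_lock_instance()` is async and backs
off between attempts rather than using a fixed sleep):

```rust
use std::time::Duration;

/// Sketch of the *post-fix* behavior: only a positive "held by another
/// saga" result fails the action; any other database error is retried
/// indefinitely, because this node may already hold the lock from an
/// execution that crashed before the node was recorded as complete.
fn lock_instance_after_fix(
    mut try_lock: impl FnMut() -> Result<(), LockError>,
) -> Result<(), String> {
    loop {
        match try_lock() {
            Ok(()) => return Ok(()),
            // Unwinding is safe here: we know we do *not* hold the lock.
            Err(LockError::HeldByAnotherSaga) => {
                return Err("instance locked by another saga".into());
            }
            // Transient or unclassified DB error: retry rather than
            // unwind, so an already-acquired lock is never stranded.
            Err(LockError::Other(_)) => {
                std::thread::sleep(Duration::from_secs(1));
            }
        }
    }
}
```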

In #10166, we discussed a few potential solutions to this:

1. Retrying the lock operation forever, as implemented here,
2. Attempting to *release* the lock forever if the lock operation fails
   with a database error,
3. Changing `steno` so that we can have `siu_lock_instance()`'s
   `siu_lock_instance_undo()` unwinding action execute if
   `siu_lock_instance()` fails (which is essentially (2) with extra
   steps, since `siu_lock_instance_undo()` will retry releasing the lock
   forever...)

Of these options, retrying the lock operation felt like the best
solution to me, since they all involve retrying *something* forever, and
retrying the lock rather than the unlock means that we will keep moving
forwards with the *current* saga in the face of a transient DB error,
rather than locking the instance, unwinding, releasing the lock, and
having to start a new saga before the state update can make progress ---
which just involves a lot more steps. This way, we don't "waste" the
already-acquired lock if a transient error occurs after a Nexus crash
results in the node executing twice. The retry loop is based on [the one
we already have in the `unwind_instance_lock()` function][3], and will
complain increasingly loudly if we have been retrying for a long time,
or if the database error appears to be a client rather than server
error.
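
The "complain increasingly loudly" part can be as simple as choosing a
log level from the elapsed retry time. Here is a sketch with illustrative
thresholds (the actual intervals, and the client-vs-server error
classification, live in the real retry loop):

```rust
use std::time::{Duration, Instant};

/// Illustrative threshold only, not the actual value used.
const WARN_AFTER: Duration = Duration::from_secs(20);

/// Escalate log severity the longer the retry loop has been spinning.
/// The real code logs through `slog`; `eprintln!` keeps this sketch
/// dependency-free.
fn log_lock_retry(started: Instant, attempts: u64, err: &str) {
    let elapsed = started.elapsed();
    if elapsed > WARN_AFTER {
        eprintln!(
            "WARN: still unable to lock instance after {attempts} \
             attempts ({elapsed:?} elapsed): {err}"
        );
    } else {
        eprintln!("INFO: transient error locking instance, retrying: {err}");
    }
}
```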

Fixes #10166

[1]: https://github.com/oxidecomputer/omicron/blob/d7c3b00d743bcc9212b222a74ae27cc970b1ee2c/nexus/src/app/sagas/instance_update/start.rs#L111-L112
[2]: https://github.com/oxidecomputer/omicron/blob/d7c3b00d743bcc9212b222a74ae27cc970b1ee2c/nexus/src/app/sagas/instance_update/start.rs#L116-L139
[3]: https://github.com/oxidecomputer/omicron/blob/d7c3b00d743bcc9212b222a74ae27cc970b1ee2c/nexus/src/app/sagas/instance_update/mod.rs#L1469-L1534
@hawkw hawkw requested a review from davepacheco March 27, 2026 17:13
@hawkw hawkw added bug Something that isn't working. nexus Related to nexus labels Mar 27, 2026
Contributor

@jmpesp jmpesp left a comment


Nice find 🚀

Comment threads (3, outdated): nexus/src/app/sagas/instance_update/start.rs
@hawkw hawkw enabled auto-merge (squash) April 2, 2026 17:13
@hawkw hawkw merged commit 1cdab5e into main Apr 2, 2026
16 checks passed
@hawkw hawkw deleted the eliza/always-lock branch April 2, 2026 18:29


Development

Successfully merging this pull request may close these issues.

a well-timed sequence of a Nexus crash and transient db error can leave an instance locked by a failed update saga forever

3 participants