
[nexus] retry indefinitely in siu_lock_instance() #10177

Merged
hawkw merged 3 commits into main from eliza/always-lock
Apr 2, 2026

Conversation


@hawkw hawkw commented Mar 27, 2026


As described in #10166, a particularly unlucky sequence of events can
result in a second execution of the instance-update saga's
`siu_lock_instance()` failing and unwinding while the instance is locked
by that saga, resulting in the instance record being locked forever (or,
until a human being manually picks the lock).

This occurs because that action [will currently fail][1] if the query
that tries to lock the instance record in the database fails with an
error that *does not* indicate that another saga has locked the
instance. However, the saga node may execute multiple times in the event
of a Nexus crash, so when this node executes, it is possible that the
saga is *already* holding the lock but the current execution of the node
is unaware of this.

Suppose that:

1. A Nexus starts executing this action, successfully locks the instance
   record, and then crashes *before* marking the saga node as having
   completed.
2. Subsequently, a new Nexus resumes executing the saga and runs this
   action again. It hits a query failure trying to lock the instance
   record, and unwinds.
3. Because the saga node has not *completed*, [its undo action, which
   releases the lock][2], is not executed. Therefore, the instance is
   still locked.
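
Concretely, the pre-fix control flow looks roughly like the following
sketch. The names here (`LockError`, `try_lock`) are hypothetical
stand-ins for illustration, not the actual omicron types, and the sketch
is synchronous where the real saga action is async:

```rust
/// Hypothetical error type for illustration; the real query returns
/// different types.
enum LockError {
    /// The database positively reports that another saga holds the lock.
    HeldByAnotherSaga,
    /// Anything else: a transient connection failure, timeout, etc.
    Other(String),
}

/// Sketch of the *pre-fix* behavior: both error arms fail the action.
fn lock_instance_before_fix(
    try_lock: impl Fn() -> Result<(), LockError>,
) -> Result<(), String> {
    match try_lock() {
        Ok(()) => Ok(()),
        Err(LockError::HeldByAnotherSaga) => {
            Err("instance locked by another saga".into())
        }
        // The bug: failing here unwinds the saga, but because this node
        // never completed, its undo action never runs, and a lock
        // acquired by a previous execution of the node is never released.
        Err(LockError::Other(e)) => Err(e),
    }
}
```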

This commit fixes the bug by changing `siu_lock_instance()` so that
database errors that do not positively indicate that the lock is held by
another saga are retried forever, rather than failing the action.
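
A minimal sketch of the fixed logic, reusing the hypothetical `LockError`
from the sketch above (the real `siu_lock_instance()` is async and backs
off between attempts rather than using a fixed sleep):

```rust
use std::time::Duration;

/// Sketch of the *post-fix* behavior: only a positive "held by another
/// saga" result fails the action; any other database error is retried
/// indefinitely, because this node may already hold the lock from an
/// execution that crashed before the node was recorded as complete.
fn lock_instance_after_fix(
    mut try_lock: impl FnMut() -> Result<(), LockError>,
) -> Result<(), String> {
    loop {
        match try_lock() {
            Ok(()) => return Ok(()),
            // Unwinding is safe here: we know we do *not* hold the lock.
            Err(LockError::HeldByAnotherSaga) => {
                return Err("instance locked by another saga".into());
            }
            // Transient or unclassified DB error: retry rather than
            // unwind, so an already-acquired lock is never stranded.
            Err(LockError::Other(_)) => {
                std::thread::sleep(Duration::from_secs(1));
            }
        }
    }
}
```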

In #10166, we discussed a few potential solutions to this:

1. Retrying the lock operation forever, as implemented here,
2. Attempting to *release* the lock forever if the lock operation fails
   with a database error,
3. Changing `steno` so that we can have `siu_lock_instance()`'s
   `siu_lock_instance_undo()` unwinding action execute if
   `siu_lock_instance()` fails (which is essentially (2) with extra
   steps, since `siu_lock_instance_undo()` will retry releasing the lock
   forever...)

Of these options, retrying the lock operation felt like the best
solution to me, since they all involve retrying *something* forever, and
retrying the lock rather than the unlock means that we will keep moving
forwards with the *current* saga in the face of a transient DB error,
rather than locking the instance, unwinding, releasing the lock, and
having to start a new saga before the state update can make progress ---
which just involves a lot more steps. This way, we don't "waste" the
already-acquired lock if a transient error occurs after a Nexus crash
results in the node executing twice. The retry loop is based on [the one
we already have in the `unwind_instance_lock()` function][3], and will
complain increasingly loudly if we have been retrying for a long time,
or if the database error appears to be a client rather than server
error.
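
The "complain increasingly loudly" part can be as simple as choosing a
log level from the elapsed retry time. Here is a sketch with illustrative
thresholds (the actual intervals, and the client-vs-server error
classification, live in the real retry loop):

```rust
use std::time::{Duration, Instant};

/// Illustrative threshold only, not the actual value used.
const WARN_AFTER: Duration = Duration::from_secs(20);

/// Escalate log severity the longer the retry loop has been spinning.
/// The real code logs through `slog`; `eprintln!` keeps this sketch
/// dependency-free.
fn log_lock_retry(started: Instant, attempts: u64, err: &str) {
    let elapsed = started.elapsed();
    if elapsed > WARN_AFTER {
        eprintln!(
            "WARN: still unable to lock instance after {attempts} \
             attempts ({elapsed:?} elapsed): {err}"
        );
    } else {
        eprintln!("INFO: transient error locking instance, retrying: {err}");
    }
}
```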

Fixes #10166

[1]: https://github.com/oxidecomputer/omicron/blob/d7c3b00d743bcc9212b222a74ae27cc970b1ee2c/nexus/src/app/sagas/instance_update/start.rs#L111-L112
[2]: https://github.com/oxidecomputer/omicron/blob/d7c3b00d743bcc9212b222a74ae27cc970b1ee2c/nexus/src/app/sagas/instance_update/start.rs#L116-L139
[3]: https://github.com/oxidecomputer/omicron/blob/d7c3b00d743bcc9212b222a74ae27cc970b1ee2c/nexus/src/app/sagas/instance_update/mod.rs#L1469-L1534
@hawkw hawkw requested a review from davepacheco March 27, 2026 17:13
@hawkw hawkw added bug Something that isn't working. nexus Related to nexus labels Mar 27, 2026
Contributor

@jmpesp jmpesp left a comment


Nice find 🚀

Comment threads (3, outdated): nexus/src/app/sagas/instance_update/start.rs
@hawkw hawkw enabled auto-merge (squash) April 2, 2026 17:13
@hawkw hawkw merged commit 1cdab5e into main Apr 2, 2026
16 checks passed
@hawkw hawkw deleted the eliza/always-lock branch April 2, 2026 18:29


Development

Successfully merging this pull request may close these issues.

a well-timed sequence of a Nexus crash and transient db error can leave an instance locked by a failed update saga forever

3 participants