[1/N] omdb db sagas: list running and inject fault #7732

jmpesp · 2025-03-05T17:34:01Z

Breaking apart #4378 and copying the structure of #7695, add omdb db saga as a command and implement the following sub-commands:

Usage: omdb db saga [OPTIONS] <COMMAND>

Commands:
  running  List running sagas
  fault    Inject an error into a saga's currently running node

This addresses part of the minimum amount required during a release deployment:

1. after quiescing (Quiesce sagas when parking the rack for updates #6804), omdb can query if there are any running sagas.
2. if those running sagas are stuck in a loop and cannot be drained (Sagas with extended retry loops may be undrainable during upgrade windows #7623), and the release contains a change to the DAG that causes Nexus to panic after an upgrade (want a tool for saga abandonment #7730), then omdb can inject a fault into the database that would cause that saga to unwind when the affected Nexus is restarted.

Note for 2, unwinding a saga that is stuck in this way may not be valid if there were significant changes between releases.

gjcolombo · 2025-03-05T21:45:36Z

@jmpesp and I discussed this PR offline for a spell. One noteworthy thing about the fault operation is that it introduces new ways for a node's side effects to be "leaked" in a way that we don't currently test for. Even a very simple (and completely infallible!) node like this one...

async fn do_action(ctx: &SagaContext) -> Result<(), ActionError> {
    // The side effect happens here, before Steno records the node's outcome.
    take_action().await;
    Ok(())
}

...can behave oddly if the user happens to fault the saga at a point where take_action has returned but Steno has not yet recorded that the node succeeded: when the saga is unwound, the unwinder will execute the compensating actions for every predecessor of do_action, but won't undo do_action itself. In fact, do_action may not even have a defined compensating action (consider that it may be the last node of the saga, such that there is never anything after it that could fail and cause the saga to unwind).
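To make the leak concrete, here is a minimal sketch of the interleaving described above. It uses a toy event log and a toy unwinder; the `Event` enum, `nodes_to_undo`, and `faulted_log` are hypothetical names for illustration and are not Steno's actual types or logic. Node ids stand in for DAG order.

```rust
use std::collections::HashMap;

/// Events a saga log might record for a node (a simplified, hypothetical
/// model; these are not Steno's real event types).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Event {
    Started,
    Succeeded,
    Failed,
}

/// Returns the ids of the nodes a toy unwinder would undo: only nodes whose
/// last recorded event is `Succeeded`. A node recorded as `Failed` is treated
/// as if it never took effect, so it is skipped.
fn nodes_to_undo(log: &[(u32, Event)]) -> Vec<u32> {
    let mut last: HashMap<u32, Event> = HashMap::new();
    for (node, event) in log {
        last.insert(*node, *event);
    }
    let mut undo: Vec<u32> = last
        .into_iter()
        .filter(|(_, e)| *e == Event::Succeeded)
        .map(|(n, _)| n)
        .collect();
    // Undo in reverse id order (a stand-in for reverse DAG order).
    undo.sort_unstable_by(|a, b| b.cmp(a));
    undo
}

/// Builds the log for the problematic case: node 2's side effect has already
/// happened, but `Succeeded` was never recorded, and an operator then
/// injected `Failed` for it.
fn faulted_log() -> Vec<(u32, Event)> {
    vec![
        (1, Event::Started),
        (1, Event::Succeeded),
        (2, Event::Started),
        // take_action() returned at this point; its effect is live.
        (2, Event::Failed), // injected by the operator, not by the action
    ]
}
```

In this model only node 1 is undone, so whatever node 2 did stays behind in the system.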

While we do have some saga unwind tests, they all verify what happens when a failure is injected before a new node begins to execute, so they're unlikely to help us uncover errors like this.

I think saga abandonment (see #7730) presents similar problems; the main difference is that an "abandoned" saga won't unwind at all, which might leave the system in a more comprehensible state than undoing all of its actions except for a partially-completed final action.

A safer, but less useful, alternative might be a Steno hook that allows Nexus to inject failure into all the nodes in a running saga. This would be similar to what Steno provides for testing, but instead of saying "fail before you begin to execute this specific action," Nexus would say, "fail before you start any subsequent action." This would give running sagas a chance to reach a node boundary before unwinding, which is the case we regularly test. The catch is that this only works for sagas that are actually making forward progress and yielding control to Steno occasionally; if a saga node is stuck in a retry loop then this isn't useful.
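A rough sketch of the "fail before you start any subsequent action" idea, assuming a linear chain of nodes and a shared flag. `run_with_boundary_check` and `Outcome` are hypothetical; Steno does not expose this hook today, and a real saga is a DAG rather than a list.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Outcome of running a linear chain of nodes in this toy executor.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Completed,
    /// Unwound after `undone` nodes had finished and were then undone.
    Unwound { undone: usize },
}

/// Runs `actions` in order, checking `fail_requested` only at node
/// boundaries. An action is never interrupted part-way, so when the flag is
/// observed every finished node has a fully recorded result and can be
/// undone in reverse order.
fn run_with_boundary_check(
    actions: &[&dyn Fn()],
    undo: &dyn Fn(usize),
    fail_requested: &AtomicBool,
) -> Outcome {
    for (i, action) in actions.iter().enumerate() {
        if fail_requested.load(Ordering::SeqCst) {
            for done in (0..i).rev() {
                undo(done);
            }
            return Outcome::Unwound { undone: i };
        }
        action();
    }
    Outcome::Completed
}
```

The limitation noted above shows up directly: if `action()` never returns because it is stuck in a retry loop, the flag is never checked again.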

None of this is to say that fault injection or abandonment aren't potentially useful support tools--they just need to come with very big warning labels. (Mostly I just wanted to note all of this down before completely forgetting all of it.)

davepacheco · 2025-03-06T00:16:48Z

I'm working on an RFD to discuss some of these options. I hope to have that up for discussion by the end of the week.

davepacheco · 2025-03-07T05:30:07Z

It's still a bit rough but I've moved RFD 555 ("Addressing operational challenges with sagas") into discussion.

As it relates here, I think:

Both inject-error and abandon are unsafe while the corresponding Nexus is running. I think omdb should attempt to detect this case.
We could also have Nexus support a runtime configurable to inject errors into running sagas, but I propose doing that through cooperation with actions/undo actions (i.e., check the flag before you call sleep). The only alternative I can imagine here is using async cancellation to stop a saga action while it's running and then inject the error. That feels untenable for the reasons outlined in RFD 397.
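The cooperative approach ("check the flag before you call sleep") could look roughly like the sketch below. This is an assumption about shape only: `retry_until_done`, the `inject_error` flag, and the error string are hypothetical, and real saga actions are async rather than thread-blocking.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

/// Retries `attempt` until it reports success. Before each sleep it checks
/// `inject_error`; if an operator has set the flag, the action gives up and
/// returns an error, which would let the saga unwind from a known point.
fn retry_until_done(
    mut attempt: impl FnMut() -> bool,
    inject_error: &AtomicBool,
    backoff: Duration,
) -> Result<(), String> {
    loop {
        if attempt() {
            return Ok(());
        }
        // The action decides where it is safe to stop, so no work is
        // cancelled part-way through.
        if inject_error.load(Ordering::SeqCst) {
            return Err(String::from("error injected by operator"));
        }
        thread::sleep(backoff);
    }
}
```

Because the action itself returns the error, nothing is cancelled mid-operation, which sidesteps the async cancellation concern mentioned above.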

Thanks for putting up this PR and raising these issues. I haven't had a chance to look at this yet but it looks like exactly two of the things we need!

davepacheco · 2025-03-07T05:31:43Z

dev-tools/omdb/src/bin/omdb/db/saga.rs

+    Running,
+
+    /// Inject an error into a saga's currently running node
+    Fault(SagaFaultArgs),


I'd suggest calling this inject-error, only because fault feels like it could mean a bunch of different specific things. I think inject-error is less ambiguous?

done in b1108a3

davepacheco · 2025-03-07T05:32:26Z

dev-tools/omdb/src/bin/omdb/db/saga.rs

+    #[derive(Tabled)]
+    struct SagaRow {
+        id: Uuid,
+        creator_id: Uuid,


I'd suggest putting the current sec here instead of the creator. That's almost always more relevant I think.

done in b1108a3

davepacheco · 2025-03-07T05:33:33Z

dev-tools/omdb/src/bin/omdb/db/saga.rs

+    // Find the most recent node for a given saga
+    let most_recent_node: SagaNodeEvent = {
+        use db::schema::saga_node_event::dsl;
+
+        dsl::saga_node_event
+            .filter(dsl::saga_id.eq(args.saga_id))
+            .order(dsl::event_time.desc())
+            .limit(1)
+            .first_async(&*conn)
+            .await?
+    };


I think you want to enumerate all nodes for which you have an ActionStarted but no ActionDone or ActionFailed.

I say this because:

There might be more than one of them.

They might not be the most recent ones.

done in b1108a3
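For illustration, the suggested enumeration can be sketched in memory as follows. `NodeEvent`, `EventType`, and `incomplete_nodes` are hypothetical stand-ins; the PR does this against the `saga_node_event` table, and its exact query is not reproduced here.

```rust
use std::collections::BTreeSet;

/// Simplified event kinds (hypothetical; not the database representation).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum EventType {
    Started,
    Succeeded,
    Failed,
}

/// One row of a saga's event log, reduced to the fields needed here.
struct NodeEvent {
    node_id: u32,
    event: EventType,
}

/// Returns every node that has a `Started` event but no `Succeeded` or
/// `Failed` event, in ascending node id order.
fn incomplete_nodes(events: &[NodeEvent]) -> Vec<u32> {
    let mut started = BTreeSet::new();
    let mut finished = BTreeSet::new();
    for e in events {
        match e.event {
            EventType::Started => {
                started.insert(e.node_id);
            }
            EventType::Succeeded | EventType::Failed => {
                finished.insert(e.node_id);
            }
        }
    }
    started.difference(&finished).copied().collect()
}
```

With three nodes running in parallel, where node 3 started last and has already succeeded, the most recent event belongs to a completed node while nodes 1 and 2 are the ones still in flight. That is the case a "most recent event" query would miss.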

davepacheco · 2025-03-07T05:34:07Z

dev-tools/omdb/src/bin/omdb/db/saga.rs

+    // Inject a fault for that node, which will cause the saga to unwind
+    let action_error = steno::ActionError::action_failed(String::from(
+        "error injected with omdb",
+    ));


Have you tested that Nexus doesn't choke on this? Like does it have expectations about what the contents of an error from the saga should look like?

Yeah - my testing procedure was to:

1. launch a debug saga
2. stop Nexus
3. inject an error
4. restart Nexus
5. ensure that the saga unwound correctly

saga id                              | event time               | sub saga | node id             | event type    | data
------------------------------------ | ------------------------ | -------- | ------------------- | ------------- | ---
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:38:51.337Z |          | 1: start            | started       |
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:38:51.342Z |          | 1: start            | succeeded     |
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:38:51.346Z |          | 0: demo.demo_wait   | started       |
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:39:39.077Z |          | 0: demo.demo_wait   | failed        | "demo_wait" => {"ActionFailed":{"source_error":"error injected with omdb"}}
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:39:53.753Z |          | 1: start            | undo_started  |
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:39:53.767Z |          | 1: start            | undo_finished |

davepacheco · 2025-03-07T05:35:10Z

dev-tools/omdb/src/bin/omdb/db/saga.rs

+    _destruction_token: DestructiveOperationToken,
+) -> Result<(), anyhow::Error> {
+    let conn = datastore.pool_connection_for_tests().await?;
+


I know it's a bunch more work but I think it's worth adding a safety check here (overrideable, I guess) that tries to contact the Nexus that we think is running this saga and fails if that succeeds.

done in b1108a3

davepacheco · 2025-03-07T05:35:42Z

dev-tools/omdb/src/bin/omdb/db/saga.rs

+        event_time: chrono::Utc::now(),
+        creator: most_recent_node.creator,
+    };
+


Can we print out exactly what it's doing? Something like:

injecting fault into running node ${node_id}

done in b1108a3

davepacheco · 2025-03-07T05:36:21Z

dev-tools/omdb/src/bin/omdb/db/saga.rs

+            .to_string(),
+        data: Some(serde_json::to_value(action_error)?),
+        event_time: chrono::Utc::now(),
+        creator: most_recent_node.creator,


I wonder if we should make a specific well-known uuid for omdb and use that here.

done in b1108a3 (let me know if you think the value is appropriate :))

- change fault to inject-error
- show the current sec for a saga instead of the creator
- inject an error for all started (but not completed) nodes of a saga: remember, it's a dag!
- add a /v1/ping endpoint to the internal api, and ping to see if the current sec is up - it's not normally safe to inject an error while the saga is running
- add a bypass for this check
- clearly state what errors we're injecting
- inject errors using a specific uuid for omdb

gjcolombo · 2025-03-13T00:40:04Z

dev-tools/omdb/src/bin/omdb/db/saga.rs

+    // For each incomplete node, find the current SEC, and ping it to ensure
+    // that the Nexus is down.
+    if !args.bypass_sec_check {
+        for node in &incomplete_nodes {


Does this check need to be per-node? The set of incomplete nodes all came from a single saga (the one in which the error is being injected), and the SEC is being read from that saga, so it seems like this could be done just once. (In fact you might even be able to do it before doing any node lookups.)

nope, and I agree! :) ab81699

davepacheco

Thanks for adding the safety check! The changes so far look good and most of my feedback here is on clearer messaging for the user.

Based on this discussion in RFD 555, I'm still pretty worried that having this tool without adequate safeties could too easily result in silent breakage on customer systems. I'm trying to think how to mitigate that.

I'd add a confirmation prompt saying something like:

WARNING: Injecting an error into a saga will cause most of it to be unwound, but if the actions into which errors are injected have taken effect, those effects will not be undone. This can result in corruption of control plane state, even if the Nexus assigned to this saga is not currently running. You should only do this if:

you've stopped Nexus and then verified that the currently-running nodes either have no side effects, have not made any changes to the system, or you've already undone them by hand

this is a development system whose state can be wiped

Also, I know I'm the one that suggested inject-error, but that's when I thought it was totally safe. Now I wonder if the name could communicate some danger? attempt-unwind? force-unsafe-unwind?

davepacheco · 2025-03-13T16:53:46Z

dev-tools/omdb/src/bin/omdb/db/saga.rs

-    let most_recent_node: SagaNodeEvent = {
+    // Before doing anything: find the current SEC for the saga, and ping it to
+    // ensure that the Nexus is down.
+    if !args.bypass_sec_check {


How about printing a scary warning here if this is set?

By user request, skipping check of whether the Nexus assigned to this saga is running. If this Nexus is running, the control plane state managed by this saga may become corrupted!

And if it is not set:

Attempting to verify that the Nexus assigned to this saga is not running before proceeding.

(or something like that)

see also 6cde273

davepacheco · 2025-03-13T16:57:28Z

dev-tools/omdb/src/bin/omdb/db/saga.rs

+                let Some(addr) = resolver.ipv6_lookup(&target).await? else {
+                    bail!("dns lookup for {target} found nothing");
+                };
+
+                let client = nexus_client::Client::new(
+                    &format!("http://[{addr}]:{port}/"),
+                    opctx.log.clone(),
+                );
+
+                match client.ping().await {
+                    Ok(_) => {
+                        bail!("{current_sec} answered a ping");
+                    }
+
+                    Err(e) => match e {
+                        nexus_client::Error::InvalidRequest(_)
+                        | nexus_client::Error::InvalidUpgrade(_)
+                        | nexus_client::Error::ErrorResponse(_)
+                        | nexus_client::Error::ResponseBodyError(_)
+                        | nexus_client::Error::InvalidResponsePayload(_, _)
+                        | nexus_client::Error::UnexpectedResponse(_)
+                        | nexus_client::Error::PreHookError(_)
+                        | nexus_client::Error::PostHookError(_) => {
+                            bail!("{current_sec} failed a ping with {e}");
+                        }
+
+                        nexus_client::Error::CommunicationError(_) => {
+                            // Assume communication error means that it could
+                            // not be contacted.
+                            //
+                            // Note: this could be seen if Nexus is up but
+                            // unreachable from where omdb is run!
+                        }
+                    },
+                }


I think these errors need more context. I'm assuming they might be seen by support engineers on a bad day and we want to be really clear with what's going on.

failed to verify that the Nexus instance running this saga is not currently running: found no DNS record for that Nexus instance

The Nexus instance running this saga appears to be still running. Injecting errors into running sagas is not safe. Please ensure Nexus is stopped before proceeding.

put more descriptive errors in 6cde273, let me know what you think!
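As a sketch of how the ping results map to operator-facing decisions, the function below collapses the cases from the diff into four outcomes. `PingOutcome`, `Decision`, and `decide` are hypothetical simplifications, not `nexus_client` types, and the final wording in 6cde273 may differ from the strings used here.

```rust
/// Simplified, hypothetical outcomes of trying to reach the Nexus assigned
/// to a saga. These are not nexus_client's real error variants.
#[derive(Debug, PartialEq, Eq)]
enum PingOutcome {
    NoDnsRecord,
    Answered,
    ErrorResponse(String),
    Unreachable(String),
}

/// What the tool should do next.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// Stop with an error that explains what was being verified.
    Refuse(String),
    /// Safety could not be verified; warn and ask before continuing.
    WarnAndConfirm(String),
}

/// Maps a ping outcome to a decision, attaching context to every message.
fn decide(outcome: PingOutcome) -> Decision {
    const PREFIX: &str = "failed to verify that the Nexus instance running \
        this saga is not currently running";
    match outcome {
        PingOutcome::NoDnsRecord => Decision::Refuse(format!(
            "{PREFIX}: found no DNS record for that Nexus instance"
        )),
        PingOutcome::Answered => Decision::Refuse(String::from(
            "The Nexus instance running this saga appears to be still \
             running. Injecting errors into running sagas is not safe. \
             Please ensure Nexus is stopped before proceeding.",
        )),
        PingOutcome::ErrorResponse(e) => {
            Decision::Refuse(format!("{PREFIX}: ping failed with {e}"))
        }
        // Nexus may be down, or merely unreachable from where omdb runs.
        PingOutcome::Unreachable(e) => Decision::WarnAndConfirm(format!(
            "could not contact the Nexus instance running this saga ({e}); \
             it may still be running"
        )),
    }
}
```

Only the unreachable case proceeds, and even then only after the operator confirms, since an unreachable Nexus is not necessarily a stopped one.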

jmpesp · 2025-03-26T14:33:31Z

I'd add a confirmation prompt saying something like:

WARNING: Injecting an error into a saga will cause most of it to be unwound, but if the actions into which errors are injected have taken effect, those effects will not be undone. This can result in corruption of control plane state, even if the Nexus assigned to this saga is not currently running. You should only do this if:

you've stopped Nexus and then verified that the currently-running nodes either have no side effects, have not made any changes to the system, or you've already undone them by hand

this is a development system whose state can be wiped

Added in 332a523

Also, I know I'm the one that suggested inject-error, but that's when I thought it was totally safe. Now I wonder if the name could communicate some danger? attempt-unwind? force-unsafe-unwind?

I'm still liking inject-error, especially now that there are plenty of warnings and confirmation prompts.

davepacheco

Also, I know I'm the one that suggested inject-error, but that's when I thought it was totally safe. Now I wonder if the name could communicate some danger? attempt-unwind? force-unsafe-unwind?

I'm still liking inject-error, especially now that there are plenty of warnings and confirmation prompts.

Your call. I still feel like it sounds more innocuous than it is. But I agree the warnings make it harder to go through with it and not realize what it is.

davepacheco · 2025-03-26T20:37:28Z

dev-tools/omdb/src/bin/omdb/db/saga.rs

+                let text = "warning: saga has no assigned SEC, so cannot \
+                verify that the saga is not still running!";


I like these messages. But it's too late at this point, right? Shouldn't we print the confirmation prompt after we report any extra risks?

oh, definitely. added a prompt in 8c3c39b

Sorry, I meant that to apply to all of the warnings here. There are I think three other places where we report that we couldn't verify the safety but we proceed anyway.

I wonder if the simplest thing is to just call read_and_validate() once, before making any changes, right before we make the changes (so like L321 in the current PR). That can cover the general warning up front plus any failures to verify safety.

There are I think three other places where we report that we couldn't verify the safety but we proceed anyway.

Most spots do a bail!. Right now there's a y/n prompt:

at the beginning after the large warning

if the SEC is None

after displaying the ping's CommunicationError

We can elide the first one, and the second and third ones amount to the two ways to exit the match on L181, so yeah - d459190 has one confirmation prompt :)

Ah, sorry I misread those paths.
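The "one confirmation prompt" shape discussed here could be sketched as below: gather every warning first, print them all, then ask once before making any change. This is not the PR's `ConfirmationPrompt` (which wraps Reedline); `confirm_once` is a hypothetical helper that reads from any `BufRead` so it can be exercised without a terminal.

```rust
use std::io::{self, BufRead, Write};

/// Prints every collected warning, then asks a single yes/no question.
/// Returns Ok(true) only if the answer is "y" or "yes" (case-insensitive);
/// anything else, including an empty line, counts as "no".
fn confirm_once<R: BufRead, W: Write>(
    warnings: &[String],
    mut input: R,
    mut output: W,
) -> io::Result<bool> {
    for w in warnings {
        writeln!(output, "WARNING: {w}")?;
    }
    write!(output, "Proceed? [y/N] ")?;
    output.flush()?;

    let mut line = String::new();
    input.read_line(&mut line)?;
    let answer = line.trim().to_ascii_lowercase();
    Ok(answer == "y" || answer == "yes")
}
```

In a real tool this would be called with `io::stdin().lock()` and `io::stdout()`, after the general warning and any "could not verify" warnings have been pushed into `warnings`.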

davepacheco · 2025-03-26T20:39:34Z

dev-tools/omdb/src/bin/omdb/main.rs

@@ -58,7 +61,38 @@ mod oxql;
 mod reconfigurator;
 mod sled_agent;

-const OMDB_UUID: Uuid = Uuid::from_u128(0xAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAu128);
+struct ConfirmationPrompt(Reedline);


nit: I'd probably put this in helpers.rs.

done in 8c3c39b

jmpesp requested a review from davepacheco March 5, 2025 17:34

jmpesp mentioned this pull request Mar 5, 2025

add omdb commands for printing basic saga information #7695

Closed

davepacheco reviewed Mar 7, 2025

jmpesp added 2 commits March 11, 2025 15:18

update omdb expectorate output

f4c82d8

gjcolombo reviewed Mar 13, 2025

check for SEC being up before looking at saga nodes, and only do it once!

ab81699

davepacheco reviewed Mar 13, 2025

gjcolombo mentioned this pull request Mar 13, 2025

omdb: add facility for abandoning a saga #7791

Merged

jmpesp added 6 commits March 24, 2025 21:32

scope the OMDB_SEC_UUID to only the saga.rs code, and document it

a6e1d38

disclaimer on ipv6_lookup

8d2de09

many more descriptive errors

6cde273

unwrap from started_nodes

2df69c2

Merge branch 'main' into omdb_sagas_part_1

29582a6

informed consent

332a523

davepacheco reviewed Mar 26, 2025

jmpesp added 2 commits March 27, 2025 16:39

move ConfirmationPrompt to helpers; add a prompt when no assigned SEC

8c3c39b

only one confirmation prompt

d459190

davepacheco approved these changes Mar 27, 2025

jmpesp enabled auto-merge (squash) March 27, 2025 20:30

jmpesp merged commit 33698ff into main Mar 27, 2025
18 checks passed

jmpesp deleted the omdb_sagas_part_1 branch March 27, 2025 21:50
