[1/N] omdb db sagas: list running and inject fault #7732

Open
wants to merge 4 commits into
base: main

jmpesp
Contributor

@jmpesp jmpesp commented Mar 5, 2025

Breaking apart #4378 and copying the structure of #7695, add omdb db saga as a command and implement the following sub-commands:

Usage: omdb db saga [OPTIONS] <COMMAND>

Commands:
  running  List running sagas
    fault  Inject an error into a saga's currently running node

This addresses part of the minimum tooling required during a release deployment:

  1. After quiescing (#6804, "Quiesce sagas when parking the rack for updates"), omdb can query whether any sagas are still running.

  2. If those running sagas are stuck in a loop and cannot be drained (#7623, "Sagas with extended retry loops may be undrainable during upgrade windows"), and the release contains a change to the DAG that causes Nexus to panic after an upgrade (#7730, "want a tool for saga abandonment"), then omdb can inject a fault into the database that causes that saga to unwind when the affected Nexus is restarted.

Note for 2: unwinding a saga that is stuck in this way may not be valid if there were significant changes between releases.

@gjcolombo
Contributor

gjcolombo commented Mar 5, 2025

@jmpesp and I discussed this PR offline for a spell. One noteworthy thing about the fault operation is that it introduces new ways for a node's side effects to be "leaked" in a way that we don't currently test for. Even a very simple (and completely infallible!) node like this one...

async fn do_action(ctx: &SagaContext) -> Result<(), ActionError> {
    take_action().await;
    Ok(())
}

...can behave oddly if the user happens to fault the saga at a point where take_action has returned but Steno has not yet recorded that the node succeeded: when the saga is unwound, the unwinder will execute the compensating actions for every predecessor of do_action, but won't undo do_action itself. In fact, do_action may not even have a defined compensating action (consider that it may be the last node of the saga, such that there is never anything after it that could fail and cause the saga to unwind).
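The leak described above can be illustrated with a toy model. This is only a sketch: `NodeState`, `nodes_to_undo`, and `leaked_effects` are hypothetical names invented for the illustration and are not Steno or Nexus types. It assumes the unwinder runs undo actions only for nodes the log recorded as done.

```rust
/// Toy model (not Steno's real types) of what the saga log knows about a node.
#[derive(Clone, Copy, Debug, PartialEq)]
enum NodeState {
    /// The action finished and the log recorded its success.
    Done,
    /// The action was started; its side effect may or may not have happened.
    Started,
    /// An error was recorded for this node (for example, injected by an operator).
    Failed,
}

/// Indices of the nodes whose undo actions an unwind would run, in reverse
/// order. Only nodes recorded as `Done` are undone.
fn nodes_to_undo(log: &[NodeState]) -> Vec<usize> {
    log.iter()
        .enumerate()
        .rev()
        .filter(|(_, state)| **state == NodeState::Done)
        .map(|(index, _)| index)
        .collect()
}

/// Indices of nodes whose side effect really happened but which the log marks
/// as `Failed`, so the unwinder never undoes them.
fn leaked_effects(log: &[NodeState], effect_applied: &[bool]) -> Vec<usize> {
    log.iter()
        .zip(effect_applied)
        .enumerate()
        .filter(|(_, (state, applied))| **state == NodeState::Failed && **applied)
        .map(|(index, _)| index)
        .collect()
}
```

If the last node's side effect had already been applied when the error was written, `leaked_effects` reports it while `nodes_to_undo` skips it, which is the gap the existing unwind tests do not exercise.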

While we do have some saga unwind tests, they all verify what happens when a failure is injected before a new node begins to execute, so they're unlikely to help us uncover errors like this.

I think saga abandonment (see #7730) presents similar problems; the main difference is that an "abandoned" saga won't unwind at all, which might leave the system in a more comprehensible state than undoing all of its actions except for a partially-completed final action.

A safer, but less useful, alternative might be a Steno hook that allows Nexus to inject failure into all the nodes in a running saga. This would be similar to what Steno provides for testing, but instead of saying "fail before you begin to execute this specific action," Nexus would say, "fail before you start any subsequent action." This would give running sagas a chance to reach a node boundary before unwinding, which is the case we regularly test. The catch is that this only works for sagas that are actually making forward progress and yielding control to Steno occasionally; if a saga node is stuck in a retry loop then this isn't useful.
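A rough sketch of the "fail before you start any subsequent action" idea follows. It is not a Steno API: `run_with_boundary_check` is a hypothetical executor, the flag stands in for whatever hook Nexus would set, and nodes are modeled as plain indices run in sequence rather than as a DAG.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Runs nodes `0..node_count` in order, checking `fail_requested` at every
/// node boundary. Returns `Ok(())` if every node ran, or `Err(next)` where
/// `next` is the first node that was NOT started because failure was requested.
fn run_with_boundary_check(
    node_count: usize,
    fail_requested: &AtomicBool,
    mut run_node: impl FnMut(usize),
) -> Result<(), usize> {
    for node in 0..node_count {
        // The check happens only between nodes, never in the middle of one,
        // so every node that started is allowed to finish.
        if fail_requested.load(Ordering::SeqCst) {
            return Err(node);
        }
        run_node(node);
    }
    Ok(())
}
```

Because the flag is consulted only between nodes, a node that never returns (the stuck retry loop case) never reaches the check, which matches the limitation noted above.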

None of this is to say that fault injection or abandonment aren't potentially useful support tools--they just need to come with very big warning labels. (Mostly I just wanted to note all of this down before completely forgetting all of it.)

@davepacheco
Collaborator

I'm working on an RFD to discuss some of these options. I hope to have that up for discussion by the end of the week.

@davepacheco
Collaborator

It's still a bit rough but I've moved RFD 555 ("Addressing operational challenges with sagas") into discussion.

As it relates here, I think:

  • Both inject-error and abandon are unsafe while the corresponding Nexus is running. I think omdb should attempt to detect this case.
  • We could also have Nexus support a runtime configurable to inject errors into running sagas, but I propose doing that through cooperation with actions/undo actions (i.e., check the flag before you call sleep). The only alternative I can imagine here is using async cancellation to stop a saga action while it's running and then inject the error. That feels untenable for the reasons outlined in RFD 397.
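The cooperative approach ("check the flag before you call sleep") might look roughly like the sketch below. It is a synchronous stand-in for what would be async code in Nexus; `retry_until_done_or_injected`, `RetryOutcome`, and the flag are hypothetical names, not existing Nexus or Steno APIs.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

/// How the retry loop ended (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum RetryOutcome {
    Succeeded { attempts: u32 },
    ErrorInjected { attempts: u32 },
}

/// Calls `attempt` until it returns true. Before each sleep, the loop checks
/// `inject_error`; if the flag is set it stops and reports an injected error
/// instead of sleeping and retrying again.
fn retry_until_done_or_injected(
    inject_error: &AtomicBool,
    backoff: Duration,
    mut attempt: impl FnMut() -> bool,
) -> RetryOutcome {
    let mut attempts = 0;
    loop {
        attempts += 1;
        if attempt() {
            return RetryOutcome::Succeeded { attempts };
        }
        // Cooperative check: the action itself decides to stop, so no async
        // cancellation of a running action is needed.
        if inject_error.load(Ordering::SeqCst) {
            return RetryOutcome::ErrorInjected { attempts };
        }
        std::thread::sleep(backoff);
    }
}
```

The point of this design is that the action stops at a place it chose, so it can return an error after a complete attempt rather than being interrupted partway through one.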

Thanks for putting up this PR and raising these issues. I haven't had a chance to look at this yet but it looks like exactly two of the things we need!

Running,

/// Inject an error into a saga's currently running node
Fault(SagaFaultArgs),
Collaborator

I'd suggest calling this inject-error, only because fault feels like it could mean a bunch of different specific things. I think inject-error is less ambiguous?

Contributor Author

done in b1108a3

#[derive(Tabled)]
struct SagaRow {
id: Uuid,
creator_id: Uuid,
Collaborator

I'd suggest putting the current sec here instead of the creator. That's almost always more relevant I think.

Contributor Author

done in b1108a3

Comment on lines 112 to 122
// Find the most recent node for a given saga
let most_recent_node: SagaNodeEvent = {
    use db::schema::saga_node_event::dsl;

    dsl::saga_node_event
        .filter(dsl::saga_id.eq(args.saga_id))
        .order(dsl::event_time.desc())
        .limit(1)
        .first_async(&*conn)
        .await?
};
Collaborator

I think you want to enumerate all nodes for which you have an ActionStarted but no ActionDone or ActionFailed.

I say this because:

  • There might be more than one of them.
  • They might not be the most recent ones.
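The selection suggested here can be sketched over an in-memory event list. This is an illustration only: `EventType`, `NodeEvent`, and `incomplete_nodes` are simplified, hypothetical stand-ins for the real `saga_node_event` rows and do not reflect the actual database query in the PR.

```rust
use std::collections::BTreeSet;

/// Simplified event kinds, named after the ones mentioned in the comment above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum EventType {
    ActionStarted,
    ActionDone,
    ActionFailed,
}

/// Simplified stand-in for one row of the saga node event log.
struct NodeEvent {
    node_id: u32,
    event_type: EventType,
}

/// Returns every node that has an `ActionStarted` event but neither an
/// `ActionDone` nor an `ActionFailed` event. Because a saga is a DAG, there
/// can be several such nodes, and they need not be the most recent events.
fn incomplete_nodes(events: &[NodeEvent]) -> BTreeSet<u32> {
    let mut started = BTreeSet::new();
    let mut finished = BTreeSet::new();
    for event in events {
        match event.event_type {
            EventType::ActionStarted => {
                started.insert(event.node_id);
            }
            EventType::ActionDone | EventType::ActionFailed => {
                finished.insert(event.node_id);
            }
        }
    }
    started.difference(&finished).copied().collect()
}
```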

Contributor Author

done in b1108a3

Comment on lines 124 to 127
// Inject a fault for that node, which will cause the saga to unwind
let action_error = steno::ActionError::action_failed(String::from(
    "error injected with omdb",
));
Collaborator

Have you tested that Nexus doesn't choke on this? Like does it have expectations about what the contents of an error from the saga should look like?

Contributor Author

Yeah - my testing procedure was to

  • launch a debug saga
  • stop Nexus
  • inject an error
  • restart Nexus
  • ensure that the saga unwound correctly

Contributor Author

                             saga id | event time               | sub saga | node id             | event type    | data
------------------------------------ | ------------------------ | -------- | ------------------- | ------------- | ---
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:38:51.337Z |          |   1: start          | started       | 
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:38:51.342Z |          |   1: start          | succeeded     | 
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:38:51.346Z |          |   0: demo.demo_wait | started       | 
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:39:39.077Z |          |   0: demo.demo_wait | failed        | "demo_wait" => {"ActionFailed":{"source_error":"error injected with omdb"}}
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:39:53.753Z |          |   1: start          | undo_started  | 
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:39:53.767Z |          |   1: start          | undo_finished | 

_destruction_token: DestructiveOperationToken,
) -> Result<(), anyhow::Error> {
let conn = datastore.pool_connection_for_tests().await?;

Collaborator

I know it's a bunch more work but I think it's worth adding a safety check here (overrideable, I guess) that tries to contact the Nexus that we think is running this saga and fails if that succeeds.

Contributor Author

done in b1108a3

event_time: chrono::Utc::now(),
creator: most_recent_node.creator,
};

Collaborator

Can we print out exactly what it's doing? Something like:

injecting fault into running node ${node_id}

Contributor Author

done in b1108a3

.to_string(),
data: Some(serde_json::to_value(action_error)?),
event_time: chrono::Utc::now(),
creator: most_recent_node.creator,
Collaborator

I wonder if we should make a specific well-known uuid for omdb and use that here.

Contributor Author

done in b1108a3 (let me know if you think the value is appropriate :))

jmpesp added 2 commits March 11, 2025 15:18
- change fault to inject-error
- show the current sec for a saga instead of the creator
- inject an error for all started (but not completed) nodes of a saga:
  remember, it's a dag!
- add a /v1/ping endpoint to the internal api, and ping to see if the
  current sec is up
   - it's not normally safe to inject an error while the saga is running
   - add a bypass for this check
- clearly state what errors we're injecting
- inject errors using a specific uuid for omdb
// For each incomplete node, find the current SEC, and ping it to ensure
// that the Nexus is down.
if !args.bypass_sec_check {
for node in &incomplete_nodes {
Contributor

Does this check need to be per-node? The set of incomplete nodes all came from a single saga (the one in which the error is being injected), and the SEC is being read from that saga, so it seems like this could be done just once. (In fact you might even be able to do it before doing any node lookups.)

Contributor Author

nope, and I agree! :) ab81699

Collaborator

@davepacheco davepacheco left a comment

Thanks for adding the safety check! The changes so far look good and most of my feedback here is on clearer messaging for the user.

Based on this discussion in RFD 555, I'm still pretty worried that having this tool without adequate safeties could too easily result in silent breakage on customer systems. I'm trying to think how to mitigate that.

I'd add a confirmation prompt saying something like:

WARNING: Injecting an error into a saga will cause most of it to be unwound, but if the actions into which errors are injected have taken effect, those effects will not be undone. This can result in corruption of control plane state, even if the Nexus assigned to this saga is not currently running. You should only do this if:

  • you've stopped Nexus and then verified that the currently-running nodes either have no side effects, have not made any changes to the system, or you've already undone them by hand
  • this is a development system whose state can be wiped
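A confirmation prompt along these lines could be sketched as below. This is not how omdb implements confirmations: `confirm` and `WARNING` are hypothetical, the warning text is abbreviated from the suggestion above, and the function is generic over its input and output so it can be exercised without a terminal.

```rust
use std::io::{self, BufRead, Write};

/// Abbreviated form of the warning proposed in the review comment.
const WARNING: &str = "WARNING: Injecting an error into a saga will cause most \
of it to be unwound, but if the actions into which errors are injected have \
taken effect, those effects will not be undone.";

/// Prints the warning, asks the user to type `expected`, and returns whether
/// the (trimmed) reply matched exactly. Anything else counts as a refusal.
fn confirm(
    mut input: impl BufRead,
    mut output: impl Write,
    expected: &str,
) -> io::Result<bool> {
    writeln!(output, "{}", WARNING)?;
    write!(output, "Type \"{}\" to continue: ", expected)?;
    output.flush()?;

    let mut line = String::new();
    input.read_line(&mut line)?;
    Ok(line.trim() == expected)
}
```

In a real tool the caller would pass locked stdin and stdout and abort unless `confirm` returns `Ok(true)`.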

Also, I know I'm the one that suggested inject-error, but that's when I thought it was totally safe. Now I wonder if the name could communicate some danger? attempt-unwind? force-unsafe-unwind?

@@ -57,6 +58,8 @@ mod oxql;
mod reconfigurator;
mod sled_agent;

const OMDB_UUID: Uuid = Uuid::from_u128(0xAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAu128);
Collaborator

When I suggested omdb having a uuid I was only thinking of that specific purpose (a creator for saga node events) so I think this can live in db/saga.rs unless you think it will be more generally useful? Either way, let's document it.

@@ -385,6 +385,19 @@ impl Resolver {
})
.flatten()
}

pub async fn ipv6_lookup(
Collaborator

Could you document this and make sure to add a comment that it's for omdb and that it's generally not the right thing for deployed software? (I'm worried people will reach for this when they should be using one of the ServiceName variants that uses the SRV records.)

let most_recent_node: SagaNodeEvent = {
// Before doing anything: find the current SEC for the saga, and ping it to
// ensure that the Nexus is down.
if !args.bypass_sec_check {
Collaborator

How about printing a scary warning here if this is set?

By user request, skipping check of whether the Nexus assigned to this saga is running. If this Nexus is running, the control plane state managed by this saga may become corrupted!

And if it is not set:

Attempting to verify that the Nexus assigned to this saga is not running before proceeding.

(or something like that)


match saga.current_sec {
None => {
// If there's no current SEC, then we don't need to check if
Collaborator

Print a warning here? Something like:

warn: saga has no assigned SEC, so cannot verify that the saga is not still running

I believe this case is impossible. I think the schema allowed this for takeover, but I think today a saga always has this set.

Comment on lines +158 to +192
let Some(addr) = resolver.ipv6_lookup(&target).await? else {
    bail!("dns lookup for {target} found nothing");
};

let client = nexus_client::Client::new(
    &format!("http://[{addr}]:{port}/"),
    opctx.log.clone(),
);

match client.ping().await {
    Ok(_) => {
        bail!("{current_sec} answered a ping");
    }

    Err(e) => match e {
        nexus_client::Error::InvalidRequest(_)
        | nexus_client::Error::InvalidUpgrade(_)
        | nexus_client::Error::ErrorResponse(_)
        | nexus_client::Error::ResponseBodyError(_)
        | nexus_client::Error::InvalidResponsePayload(_, _)
        | nexus_client::Error::UnexpectedResponse(_)
        | nexus_client::Error::PreHookError(_)
        | nexus_client::Error::PostHookError(_) => {
            bail!("{current_sec} failed a ping with {e}");
        }

        nexus_client::Error::CommunicationError(_) => {
            // Assume communication error means that it could
            // not be contacted.
            //
            // Note: this could be seen if Nexus is up but
            // unreachable from where omdb is run!
        }
    },
}
Collaborator

I think these errors need more context. I'm assuming they might be seen by support engineers on a bad day and we want to be really clear with what's going on.

failed to verify that the Nexus instance running this saga is not currently running: found no DNS record for that Nexus instance

The Nexus instance running this saga appears to be still running. Injecting errors into running sagas is not safe. Please ensure Nexus is stopped before proceeding.
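One way to attach that context is to collapse the possible results of the liveness check into a small enum and map each case to a full message. The sketch below is illustrative: `PingOutcome` and `verify_nexus_stopped` are hypothetical names, the variants are not the `nexus_client::Error` variants, and the wording for the "other error" case is an assumption.

```rust
/// Hypothetical summary of what happened when omdb tried to reach the Nexus
/// assigned to the saga.
#[derive(Debug, PartialEq)]
enum PingOutcome {
    /// DNS had no record for the Nexus instance.
    NoDnsRecord,
    /// Nexus answered, so it is running.
    Answered,
    /// The connection could not be made; treated as "Nexus is down".
    Unreachable,
    /// Some other error, which proves nothing either way.
    OtherError(String),
}

/// Returns Ok only when the check indicates Nexus is not running; every other
/// case yields a message that states what was being verified and why it failed.
fn verify_nexus_stopped(outcome: PingOutcome) -> Result<(), String> {
    const PREFIX: &str = "failed to verify that the Nexus instance running \
this saga is not currently running";
    match outcome {
        PingOutcome::Unreachable => Ok(()),
        PingOutcome::NoDnsRecord => Err(format!(
            "{}: found no DNS record for that Nexus instance",
            PREFIX
        )),
        PingOutcome::Answered => Err(String::from(
            "The Nexus instance running this saga appears to be still \
running. Injecting errors into running sagas is not safe. Please ensure \
Nexus is stopped before proceeding.",
        )),
        PingOutcome::OtherError(detail) => {
            Err(format!("{}: ping failed with: {}", PREFIX, detail))
        }
    }
}
```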

.iter()
.find(|node| node.node_id.0 == node_id.into())
else {
bail!("could not find node?");
Collaborator

Should this be a panic? When would this ever happen?

3 participants