[1/N] omdb db sagas: list running and inject fault #7732
Conversation
Breaking apart #4378 and copying the structure of #7695, add `omdb db saga` as a command and implement the following sub-commands:

```
Usage: omdb db saga [OPTIONS] <COMMAND>

Commands:
  running  List running sagas
  fault    Inject an error into a saga's currently running node
```

This addresses part of the minimum amount required during a release deployment:

1. After quiescing (#6804, "Quiesce sagas when parking the rack for updates"), omdb can query if there are any running sagas.
2. If those running sagas are stuck in a loop and cannot be drained (#7623, "Sagas with extended retry loops may be undrainable during upgrade windows"), and the release contains a change to the DAG that causes Nexus to panic after an upgrade (#7730, "want a tool for saga abandonment"), then omdb can inject a fault into the database that would cause that saga to unwind when the affected Nexus is restarted.

Note for 2: unwinding a saga that is stuck in this way may not be valid if there were significant changes between releases.
@jmpesp and I discussed this PR offline for a spell. One noteworthy thing: an action of the form

```rust
async fn do_action(ctx: &SagaContext) -> Result<(), ActionError> {
    take_action().await;
    Ok(())
}
```

...can behave oddly if the user happens to fault the saga at a point where `take_action` has already taken effect but the node has not yet been recorded as complete. While we do have some saga unwind tests, they all verify what happens when a failure is injected before a new node begins to execute, so they're unlikely to help us uncover errors like this.

I think saga abandonment (see #7730) presents similar problems; the main difference is that an "abandoned" saga won't unwind at all, which might leave the system in a more comprehensible state than undoing all of its actions except for a partially-completed final action.

A safer, but less useful, alternative might be a Steno hook that allows Nexus to inject failure into all the nodes in a running saga. This would be similar to what Steno provides for testing, but instead of saying "fail before you begin to execute this specific action," Nexus would say, "fail before you start any subsequent action." This would give running sagas a chance to reach a node boundary before unwinding, which is the case we regularly test. The catch is that this only works for sagas that are actually making forward progress and yielding control to Steno occasionally; if a saga node is stuck in a retry loop then this isn't useful.

None of this is to say that fault injection or abandonment aren't potentially useful support tools--they just need to come with very big warning labels. (Mostly I just wanted to note all of this down before completely forgetting all of it.)
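To make that hypothetical hook concrete, here is a minimal sketch of the shape it could take; `UnwindRequest` and its methods are illustrative names, not actual Steno API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical flag Nexus could set to ask a running saga to unwind at
/// the next node boundary (not actual Steno API).
pub struct UnwindRequest(AtomicBool);

impl UnwindRequest {
    pub fn new() -> Self {
        UnwindRequest(AtomicBool::new(false))
    }

    /// Nexus calls this to request an unwind.
    pub fn request(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    /// The saga executor would consult this before starting each action;
    /// `true` means "fail this node now", producing an ordinary unwind
    /// from a node boundary -- the case that's regularly tested.
    pub fn should_fail_next_action(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}
```

As noted above, this only helps sagas that keep reaching node boundaries; a node stuck in a retry loop never consults the flag.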
I'm working on an RFD to discuss some of these options. I hope to have that up for discussion by the end of the week.
It's still a bit rough, but I've moved RFD 555 ("Addressing operational challenges with sagas") into discussion. As it relates here, I think:
Thanks for putting up this PR and raising these issues. I haven't had a chance to look at this yet but it looks like exactly two of the things we need!
```rust
    Running,

    /// Inject an error into a saga's currently running node
    Fault(SagaFaultArgs),
```
I'd suggest calling this `inject-error`, only because `fault` feels like it could mean a bunch of different specific things. I think `inject-error` is less ambiguous?
done in b1108a3
```rust
#[derive(Tabled)]
struct SagaRow {
    id: Uuid,
    creator_id: Uuid,
```
I'd suggest putting the current SEC here instead of the creator. That's almost always more relevant, I think.
done in b1108a3
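For reference, a sketch of the change suggested above; the exact field set is illustrative, not the actual omdb code:

```rust
use tabled::Tabled;
use uuid::Uuid;

/// Hypothetical revision of SagaRow: report the saga's current SEC
/// (the Nexus currently responsible for it) rather than its creator.
#[derive(Tabled)]
struct SagaRow {
    id: Uuid,
    // Rendered as a string so an unassigned SEC can show as "-".
    current_sec: String,
    state: String,
}
```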
```rust
// Find the most recent node for a given saga
let most_recent_node: SagaNodeEvent = {
    use db::schema::saga_node_event::dsl;

    dsl::saga_node_event
        .filter(dsl::saga_id.eq(args.saga_id))
        .order(dsl::event_time.desc())
        .limit(1)
        .first_async(&*conn)
        .await?
};
```
I think you want to enumerate all nodes for which you have an `ActionStarted` but no `ActionDone` or `ActionFailed` (one way to do this is sketched after this thread). I say this because:
- There might be more than one of them.
- They might not be the most recent ones.
done in b1108a3
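A sketch of that enumeration, assuming the event-type strings visible in the event log later in this thread ("started", "succeeded", "failed") and the field names from the original snippet; the actual change in b1108a3 may differ:

```rust
// Load every node event for this saga, then keep the events of nodes
// that started but never recorded a completion. Filtering in memory
// keeps the query simple; a saga has a bounded number of nodes.
let events: Vec<SagaNodeEvent> = {
    use db::schema::saga_node_event::dsl;

    dsl::saga_node_event
        .filter(dsl::saga_id.eq(args.saga_id))
        .load_async(&*conn)
        .await?
};

let incomplete_nodes: Vec<&SagaNodeEvent> = events
    .iter()
    .filter(|e| e.event_type == "started")
    .filter(|started| {
        !events.iter().any(|e| {
            e.node_id == started.node_id
                && (e.event_type == "succeeded" || e.event_type == "failed")
        })
    })
    .collect();
```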
```rust
// Inject a fault for that node, which will cause the saga to unwind
let action_error = steno::ActionError::action_failed(String::from(
    "error injected with omdb",
));
```
Have you tested that Nexus doesn't choke on this? Like does it have expectations about what the contents of an error from the saga should look like?
Yeah, my testing procedure was to:
- launch a debug saga
- stop Nexus
- inject an error
- restart Nexus
- ensure that the saga unwound correctly
```
saga id                              | event time               | sub saga | node id             | event type    | data
------------------------------------ | ------------------------ | -------- | ------------------- | ------------- | ---
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:38:51.337Z |          | 1: start            | started       |
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:38:51.342Z |          | 1: start            | succeeded     |
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:38:51.346Z |          | 0: demo.demo_wait   | started       |
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:39:39.077Z |          | 0: demo.demo_wait   | failed        | "demo_wait" => {"ActionFailed":{"source_error":"error injected with omdb"}}
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:39:53.753Z |          | 1: start            | undo_started  |
02589f1f-b612-46b0-a460-d77442b57345 | 2025-03-11T17:39:53.767Z |          | 1: start            | undo_finished |
```
```rust
    _destruction_token: DestructiveOperationToken,
) -> Result<(), anyhow::Error> {
    let conn = datastore.pool_connection_for_tests().await?;
```
I know it's a bunch more work but I think it's worth adding a safety check here (overrideable, I guess) that tries to contact the Nexus that we think is running this saga and fails if that succeeds.
done in b1108a3
```rust
    event_time: chrono::Utc::now(),
    creator: most_recent_node.creator,
};
```
Can we print out exactly what it's doing? Something like:

```
injecting fault into running node ${node_id}
```
done in b1108a3
```rust
        .to_string(),
    data: Some(serde_json::to_value(action_error)?),
    event_time: chrono::Utc::now(),
    creator: most_recent_node.creator,
```
I wonder if we should make a specific well-known uuid for `omdb` and use that here.
done in b1108a3 (let me know if you think the value is appropriate :))
- change `fault` to `inject-error`
- show the current SEC for a saga instead of the creator
- inject an error for all started (but not completed) nodes of a saga: remember, it's a DAG!
- add a `/v1/ping` endpoint to the internal API, and ping to see if the current SEC is up; it's not normally safe to inject an error while the saga is running (see the sketch below)
- add a bypass for this check
- clearly state what errors we're injecting
- inject errors using a specific uuid for omdb
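Regarding the new `/v1/ping` endpoint: a minimal sketch of what such an endpoint could look like under dropshot; the `Ping` type and the unit server context are placeholders, not the actual Nexus internal API:

```rust
use dropshot::{endpoint, HttpError, HttpResponseOk, RequestContext};
use schemars::JsonSchema;
use serde::Serialize;

#[derive(Serialize, JsonSchema)]
struct Ping {
    status: String,
}

/// Report that this server is up and answering requests; omdb can treat
/// a communication error on this endpoint as evidence that Nexus is down.
#[endpoint {
    method = GET,
    path = "/v1/ping",
}]
async fn ping(
    _rqctx: RequestContext<()>,
) -> Result<HttpResponseOk<Ping>, HttpError> {
    Ok(HttpResponseOk(Ping { status: "ok".to_string() }))
}
```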
```rust
// For each incomplete node, find the current SEC, and ping it to ensure
// that the Nexus is down.
if !args.bypass_sec_check {
    for node in &incomplete_nodes {
```
Does this check need to be per-node? The set of incomplete nodes all came from a single saga (the one in which the error is being injected), and the SEC is being read from that saga, so it seems like this could be done just once. (In fact you might even be able to do it before doing any node lookups.)
nope, and I agree! :) ab81699
Thanks for adding the safety check! The changes so far look good and most of my feedback here is on clearer messaging for the user.
Based on this discussion in RFD 555, I'm still pretty worried that having this tool without adequate safeties could too easily result in silent breakage on customer systems. I'm trying to think how to mitigate that.
I'd add a confirmation prompt saying something like (one possible shape is sketched below):

```
WARNING: Injecting an error into a saga will cause most of it to be unwound, but if the actions into which errors are injected have taken effect, those effects will not be undone. This can result in corruption of control plane state, even if the Nexus assigned to this saga is not currently running. You should only do this if:

- you've stopped Nexus and then verified that the currently-running nodes either have no side effects, have not made any changes to the system, or you've already undone them by hand
- this is a development system whose state can be wiped
```
Also, I know I'm the one that suggested `inject-error`, but that's when I thought it was totally safe. Now I wonder if the name could communicate some danger? `attempt-unwind`? `force-unsafe-unwind`?
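A sketch of what such a confirmation prompt could look like, assuming a plain stdin-based flow; the function name and exact wording are illustrative:

```rust
use std::io::{self, BufRead, Write};

/// Ask the operator to confirm an unsafe unwind. Returns Ok(true) only
/// if they type the confirmation word exactly.
fn confirm_unsafe_unwind() -> io::Result<bool> {
    let mut out = io::stdout();
    writeln!(
        out,
        "WARNING: injecting an error into a saga will cause most of it to be \
         unwound, but effects already taken by running nodes will NOT be undone."
    )?;
    write!(out, "type 'unwind' to proceed: ")?;
    out.flush()?;

    let mut line = String::new();
    io::stdin().lock().read_line(&mut line)?;
    Ok(line.trim() == "unwind")
}
```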
```diff
@@ -57,6 +58,8 @@ mod oxql;
 mod reconfigurator;
 mod sled_agent;
 
+const OMDB_UUID: Uuid = Uuid::from_u128(0xAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAu128);
```
When I suggested omdb having a uuid I was only thinking of that specific purpose (a creator for saga node events) so I think this can live in db/saga.rs unless you think it will be more generally useful? Either way, let's document it.
```diff
@@ -385,6 +385,19 @@ impl Resolver {
         })
         .flatten()
     }
 
+    pub async fn ipv6_lookup(
```
Could you document this and make sure to add a comment that it's for omdb and that it's generally not the right thing for deployed software? (I'm worried people will reach for this when they should be using one of the `ServiceName` variants that uses the SRV records.)
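A sketch of the kind of doc comment that could address this, inferring the signature from the call site shown later in this thread (`resolver.ipv6_lookup(&target).await?` yielding an `Option` of an address); the actual signature and body may differ:

```rust
/// Resolve a single AAAA record for `name` in internal DNS.
///
/// NOTE: this exists for omdb's use. Deployed software should almost
/// always use one of the `ServiceName` variants instead, which resolve
/// SRV records and can return every instance of a service.
pub async fn ipv6_lookup(
    &self,
    name: &str,
) -> Result<Option<std::net::Ipv6Addr>, anyhow::Error> {
    todo!("resolver-specific lookup elided in this sketch")
}
```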
```diff
-    let most_recent_node: SagaNodeEvent = {
+    // Before doing anything: find the current SEC for the saga, and ping it
+    // to ensure that the Nexus is down.
+    if !args.bypass_sec_check {
```
How about printing a scary warning here if this is set?

```
By user request, skipping check of whether the Nexus assigned to this saga is running. If this Nexus is running, the control plane state managed by this saga may become corrupted!
```

And if it's not set:

```
Attempting to verify that the Nexus assigned to this saga is not running before proceeding.
```

(or something like that)
```rust
match saga.current_sec {
    None => {
        // If there's no current SEC, then we don't need to check if
```
Print a warning here? Something like:

```
warn: saga has no assigned SEC, so cannot verify that the saga is not still running
```
I believe this case is impossible. I think the schema allowed this for takeover, but I think today a saga always has this set.
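A sketch of that `None` arm with the suggested warning; the helper shape is illustrative:

```rust
use uuid::Uuid;

/// Warn when the saga record has no assigned SEC: there is no Nexus to
/// ping, so the "is the assigned Nexus down?" safety check can't run.
fn check_sec_assignment(current_sec: Option<Uuid>) {
    match current_sec {
        None => {
            eprintln!(
                "warn: saga has no assigned SEC, so cannot verify that \
                 the saga is not still running"
            );
        }
        Some(sec_id) => {
            // Resolve and ping the Nexus identified by `sec_id`, as in
            // the fragment below; bail if it answers.
            let _ = sec_id;
        }
    }
}
```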
```rust
let Some(addr) = resolver.ipv6_lookup(&target).await? else {
    bail!("dns lookup for {target} found nothing");
};

let client = nexus_client::Client::new(
    &format!("http://[{addr}]:{port}/"),
    opctx.log.clone(),
);

match client.ping().await {
    Ok(_) => {
        bail!("{current_sec} answered a ping");
    }

    Err(e) => match e {
        nexus_client::Error::InvalidRequest(_)
        | nexus_client::Error::InvalidUpgrade(_)
        | nexus_client::Error::ErrorResponse(_)
        | nexus_client::Error::ResponseBodyError(_)
        | nexus_client::Error::InvalidResponsePayload(_, _)
        | nexus_client::Error::UnexpectedResponse(_)
        | nexus_client::Error::PreHookError(_)
        | nexus_client::Error::PostHookError(_) => {
            bail!("{current_sec} failed a ping with {e}");
        }

        nexus_client::Error::CommunicationError(_) => {
            // Assume communication error means that it could
            // not be contacted.
            //
            // Note: this could be seen if Nexus is up but
            // unreachable from where omdb is run!
        }
    },
}
```
I think these errors need more context. I'm assuming they might be seen by support engineers on a bad day and we want to be really clear about what's going on. For example:

```
failed to verify that the Nexus instance running this saga is not currently running: found no DNS record for that Nexus instance
```

```
The Nexus instance running this saga appears to be still running. Injecting errors into running sagas is not safe. Please ensure Nexus is stopped before proceeding.
```
```rust
    .iter()
    .find(|node| node.node_id.0 == node_id.into())
else {
    bail!("could not find node?");
```
Should this be a panic? When would this ever happen?