omdb: add facility for abandoning a saga #7791

Draft: wants to merge 9 commits into main

gjcolombo
Contributor

(N.B. This reuses a bunch of omdb code added by #7732 that I've merged into this branch. If we add more warnings or prompts to the inject-error command before that PR merges, I'll probably want to reuse them here, so this PR is just a draft until 7732 is in main and I can clean up the history here. I don't expect the schema or Nexus bits to change that much more, though, unless review and testing reveal I've messed something up.)

Add an Abandoned saga state. This state disqualifies a saga from being picked up by Nexus saga recovery. (A running saga will continue running if it is Abandoned, and continued saga execution may end up clobbering the Abandoned state entirely.) Add an omdb subcommand to move a saga to this state (and refactor a bit to avoid duplicating code with the inject-error subcommand).
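
To illustrate the recovery rule described above, here is a minimal sketch of a candidate filter that skips abandoned sagas. The `SagaState` and `SagaRecord` types and the `recovery_candidates` function are hypothetical stand-ins, not the actual Nexus datastore code or schema.

```rust
// Hypothetical types for illustration only; the real saga table and
// datastore query in Nexus will look different.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SagaState {
    Running,
    Unwinding,
    Done,
    Abandoned,
}

#[derive(Debug)]
struct SagaRecord {
    id: u32,
    state: SagaState,
}

/// Returns the IDs of sagas that recovery should consider: sagas that are
/// still in progress. Done and Abandoned sagas are both excluded.
fn recovery_candidates(sagas: &[SagaRecord]) -> Vec<u32> {
    sagas
        .iter()
        .filter(|s| matches!(s.state, SagaState::Running | SagaState::Unwinding))
        .map(|s| s.id)
        .collect()
}

fn main() {
    let sagas = vec![
        SagaRecord { id: 1, state: SagaState::Running },
        SagaRecord { id: 2, state: SagaState::Unwinding },
        SagaRecord { id: 3, state: SagaState::Done },
        SagaRecord { id: 4, state: SagaState::Abandoned },
    ];
    println!("{:?}", recovery_candidates(&sagas));
}
```

Because the filter only looks at the stored state, a saga that is still executing when it is marked Abandoned is unaffected until its Nexus restarts, which matches the caveat in the parenthetical above.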

Tested (so far) by:

  • amending the datastore test that lists candidates for saga recovery
  • starting a demo saga in a dev cluster and verifying (via Nexus logs) that the saga is normally recovered when its Nexus is restarted (via svcadm restart), but is no longer recovered once abandoned

Fixes #7730.

jmpesp and others added 9 commits March 5, 2025 17:29
Breaking apart #4378 and copying the structure of #7695, add `omdb db
saga` as a command and implement the following sub-commands:

    Usage: omdb db saga [OPTIONS] <COMMAND>

    Commands:
      running  List running sagas
      fault    Inject an error into a saga's currently running node

This addresses part of the minimum amount required during a release
deployment:

1. after quiescing (#6804), omdb can query if there are any running
   sagas.

2. if those running sagas are stuck in a loop and cannot be drained
   (#7623), and the release contains a change to the DAG that causes
   Nexus to panic after an upgrade (#7730), then omdb can inject a fault
   into the database that would cause that saga to unwind when the
   affected Nexus is restarted.

Note for 2, unwinding a saga that is stuck in this way may not be valid
if there were significant changes between releases.

- change fault to inject-error
- show the current sec for a saga instead of the creator
- inject an error for all started (but not completed) nodes of a saga:
  remember, it's a dag!
- add a /v1/ping endpoint to the internal api, and ping to see if the
  current sec is up
   - it's not normally safe to inject an error while the saga is running
   - add a bypass for this check
- clearly state what errors we're injecting
- inject errors using a specific uuid for omdb
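
The "all started (but not completed) nodes" rule in the list above can be sketched as a set difference over a saga's node event log. This is only an illustration: `NodeEvent`, the `(node, event)` log shape, and `nodes_needing_error` are hypothetical and do not reflect the real saga log schema or omdb code.

```rust
use std::collections::BTreeSet;

// Hypothetical event kinds; the real saga node event log records more detail.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NodeEvent {
    Started,
    Succeeded,
    Failed,
}

/// Given a log of (node index, event) pairs, returns every node that has a
/// Started event but no Succeeded or Failed event. Since a saga is a DAG,
/// several nodes may be in this state at the same time.
fn nodes_needing_error(log: &[(u32, NodeEvent)]) -> Vec<u32> {
    let mut started = BTreeSet::new();
    let mut finished = BTreeSet::new();
    for &(node, event) in log {
        match event {
            NodeEvent::Started => {
                started.insert(node);
            }
            NodeEvent::Succeeded | NodeEvent::Failed => {
                finished.insert(node);
            }
        }
    }
    // BTreeSet iteration is ordered, so the result is sorted by node index.
    started.difference(&finished).copied().collect()
}

fn main() {
    let log = [
        (0, NodeEvent::Started),
        (0, NodeEvent::Succeeded),
        (1, NodeEvent::Started),
        (2, NodeEvent::Started),
        (2, NodeEvent::Failed),
        (3, NodeEvent::Started),
    ];
    println!("{:?}", nodes_needing_error(&log));
}
```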

Define an "abandoned" saga state. An abandoned saga will not begin to be
executed by any SEC. Technicians mark sagas as abandoned using omdb;
this requires the saga's current executor not to be running (otherwise
it could receive a state update from Steno that will clobber the
Abandoned state).

This commit defines the new state in the database schema and fixes up
the DB crates accordingly, but adds no affordances for applying the new
saga state or considering it when deciding what sagas to recover.
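
The requirement that a saga's current executor not be running before the saga is marked abandoned could be expressed as a small guard like the one below. This is a sketch under assumptions: `SecStatus`, `Decision`, and `check_safe_to_modify` are hypothetical names, and applying a bypass option to abandonment (as the inject-error command does for its own check) is assumed rather than confirmed here. An unreachable SEC also only suggests, and does not prove, that the executor is down.

```rust
// Hypothetical result of probing the saga's current SEC (for example, via
// a ping to its Nexus).
#[derive(Debug, PartialEq, Eq)]
enum SecStatus {
    Reachable,
    Unreachable,
}

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// The SEC appears to be down, so the state change is unlikely to be
    /// clobbered by a later update from the executor.
    Proceed,
    /// The SEC appears to be up, but the operator explicitly bypassed the check.
    ProceedWithBypass,
    /// The SEC appears to be up and no bypass was requested.
    Refuse,
}

fn check_safe_to_modify(sec: SecStatus, bypass: bool) -> Decision {
    match (sec, bypass) {
        (SecStatus::Unreachable, _) => Decision::Proceed,
        (SecStatus::Reachable, true) => Decision::ProceedWithBypass,
        (SecStatus::Reachable, false) => Decision::Refuse,
    }
}

fn main() {
    println!("{:?}", check_safe_to_modify(SecStatus::Reachable, false));
    println!("{:?}", check_safe_to_modify(SecStatus::Unreachable, false));
}
```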

This picks up the `omdb db sagas` subcommand and some other useful omdb
bits (like common logic for determining whether a saga's SEC is likely
to be running).
Development

Successfully merging this pull request may close these issues.

want a tool for saga abandonment
2 participants