
sm-executor: stop dropping completed job results #163

Merged
Zk2u merged 5 commits into main from fix/153 on Apr 9, 2026

Conversation

@Zk2u (Collaborator) commented Apr 8, 2026

Closes #153.

Summary

This change closes the completion-loss gaps behind #153 while keeping the scheduler/SM boundary aligned with the current design assumptions.

First, sm-executor no longer drops completed results when applying them fails transiently. Completed jobs are now queued locally and retried with backoff for Storage and Commit failures instead of being logged and discarded.

Second, completion-side STF rejections are no longer treated as infrastructure failures. If a worker completion is delivered successfully but the state machine rejects it (for example UnexpectedInput from a stale completion), sm-executor now logs and drops that completion instead of shutting down the executor.

Third, the scheduler no longer silently loses a completed result if the completion channel is closed. At that point the underlying job has already run and may already have caused protocol-visible side effects, so replaying it is unsafe. Instead, pool workers and garbling workers now signal an internal fatal scheduler fault, and the scheduler shuts down fail-closed.
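The fail-closed delivery path described above can be sketched roughly as follows. `Completed`, `SchedulerFault`, and the channel shapes here are illustrative assumptions, not the PR's actual types: the point is only that a closed completion channel becomes a fatal fault rather than a silent drop.

```rust
use std::sync::mpsc::SyncSender;

// Hypothetical result of a job that has already run (and may already
// have produced protocol-visible side effects).
struct Completed {
    action_id: u64,
}

// Internal fatal fault path; the scheduler shuts down on receipt.
enum SchedulerFault {
    CompletionChannelClosed { action_id: u64 },
}

/// Deliver a finished job's result back to the SM path. If the channel
/// is closed, the work has already executed, so replaying it is unsafe:
/// report a fatal scheduler fault instead of discarding the completion.
fn deliver(
    completions: &SyncSender<Completed>,
    faults: &SyncSender<SchedulerFault>,
    done: Completed,
) {
    let action_id = done.action_id;
    if completions.send(done).is_err() {
        // Fail closed: surface the loss loudly rather than drop it.
        let _ = faults.send(SchedulerFault::CompletionChannelClosed { action_id });
    }
}
```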

Why

The old behavior had two bad outcomes:

  • sm-executor could lose a completion after the work had already finished, purely because applying or committing state failed transiently.
  • the scheduler could lose a completion if delivery back to the SM path failed, and it would do so silently.

Those need different fixes:

  • apply-time failure inside sm-executor is safe to retry because we are retrying completion application, not re-running the job
  • completion-side STF rejection is an SM/protocol semantic problem, not a scheduler correctness failure, so it should be logged and dropped rather than crashing the executor
  • delivery failure after the job already executed is not safe to replay blindly, because many jobs can already have produced network/protocol side effects
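The three-way split above boils down to classifying the error before deciding what to do with the completion. A minimal sketch, using hypothetical variant names rather than the crate's real `SmExecutorError` definition:

```rust
// Illustrative error variants; the real SmExecutorError is assumed
// from the PR description, not reproduced here.
enum SmExecutorError {
    Storage,      // transient apply failure
    Commit,       // transient commit failure
    Stf,          // SM rejected the completion (e.g. stale input)
    SourceClosed, // infrastructure breakage
    StfPanic,     // infrastructure breakage
}

enum CompletionPolicy {
    Retry,      // requeue with backoff; the job is NOT re-run
    Drop,       // log and discard; an SM semantic rejection
    FailClosed, // shut down the executor
}

fn classify(err: &SmExecutorError) -> CompletionPolicy {
    match err {
        SmExecutorError::Storage | SmExecutorError::Commit => CompletionPolicy::Retry,
        SmExecutorError::Stf => CompletionPolicy::Drop,
        SmExecutorError::SourceClosed | SmExecutorError::StfPanic => {
            CompletionPolicy::FailClosed
        }
    }
}
```

Note that `Retry` only retries applying the completion; the underlying job is never re-executed.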

What changed

  • sm-executor now keeps a pending-completion queue with exponential backoff and retries Storage / Commit failures
  • completion-side SmExecutorError::Stf now logs and drops the offending completion instead of stopping the executor
  • dropped/requeued completion logs now include the action id for debugging
  • JobCompletion and ActionCompletion are now Clone so completions can be retained across retries
  • the job scheduler now has an internal SchedulerFault path
  • pool workers and garbling workers report fatal completion-delivery failures instead of silently discarding completions
  • the scheduler shuts down fail-closed when that happens
  • formatted the repo TOML files that were blocking taplo format --check
  • added regression tests for:
    • retry classification and requeue behavior in sm-executor
    • full executor-loop retry after a transient completion-application failure
    • drop behavior for completion-side STF errors
    • pool-worker completion-channel closure
    • garbling-worker completion-channel closure
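The pending-completion queue with exponential backoff might look something like the sketch below. Field and type names are illustrative, not the PR's actual identifiers; the `Clone` bound on the completion is what lets it be retained across retries.

```rust
use std::collections::VecDeque;
use std::time::Duration;

// Cloneable so a completion can be retained across retry attempts.
#[derive(Clone)]
struct JobCompletion {
    action_id: u64,
}

struct PendingCompletion {
    completion: JobCompletion,
    attempts: u32,
}

impl PendingCompletion {
    /// Exponential backoff: 100ms, 200ms, 400ms, ... capped at 30s.
    fn backoff(&self) -> Duration {
        let base = Duration::from_millis(100);
        let exp = self.attempts.min(8); // bound the shift to avoid overflow
        (base * 2u32.pow(exp)).min(Duration::from_secs(30))
    }
}

struct Executor {
    pending: VecDeque<PendingCompletion>,
}

impl Executor {
    /// Requeue a completion whose application failed transiently
    /// (Storage / Commit), instead of logging and discarding it.
    fn requeue(&mut self, completion: JobCompletion, attempts: u32) {
        self.pending.push_back(PendingCompletion {
            completion,
            attempts: attempts + 1,
        });
    }
}
```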

Validation

Passed:

  • just -f .justfile ci
  • PATH="/usr/local/libexec:$PATH" ./run_tests.sh -t tests/fn_mosaic_setup.py
  • cargo test -p mosaic-sm-executor
  • cargo clippy -p mosaic-sm-executor --tests -- -D warnings

The focused functional test still exposes the separate protocol bug tracked in #165, but it no longer fails by shutting down sm-executor on a stale completion.

Reviewer notes

This PR intentionally does not try to recover from scheduler-side completion-channel closure without shutdown. Once the job has already completed, replaying it can duplicate side effects. In the current architecture, fail loud / fail closed is the safe behavior there.

The completion policy is now intentionally split three ways:

  • retry transient apply failures (Storage, Commit)
  • log/drop SM semantic rejections (Stf)
  • fail closed on infrastructure/runtime breakage (SourceClosed, NetRecv, StfPanic, scheduler delivery failure, etc.)

sapinb added a commit that referenced this pull request Apr 9, 2026
This avoids adaptor chunks getting sent before garbler is ready to receive them in e2e tests.
Workaround until #163 is merged

@sapinb (Collaborator) left a comment


ACK


sapinb commented Apr 9, 2026

@Zk2u Please rebase over latest main and resolve conflicts

Zk2u merged commit 94f6393 into main on Apr 9, 2026
14 checks passed
Zk2u deleted the fix/153 branch on April 9, 2026 at 13:15
Zk2u pushed a commit that referenced this pull request Apr 9, 2026
This avoids adaptor chunks getting sent before garbler is ready to receive them in e2e tests.
Workaround until #163 is merged
Zk2u added a commit that referenced this pull request Apr 9, 2026
* add more stf logging

* fix: ack protocol messages from peers in later steps

When receiving messages from peers, if the message is already seen,
or received in a later step after the expecting step has completed,
ignore the message but send an ack back.

fix: headers processed only once

refactor: remove unused validation checks

* add tests for ack and ignore behavior

* feat: add comparable phase for each step

* feat: duplicate or late commit ack should not error

* refactor: simplify check for later steps

* fix: concurrency test should check all pairs of nodes

* nit: correct logging filter for fn tests

* fix: init garbler deposit first

This avoids adaptor chunks getting sent before garbler is ready to receive them in e2e tests.
Workaround until #163 is merged

* fix: reject unchallenged response chunks

---------

Co-authored-by: azz <azz@alpenlabs.io>


Development

Successfully merging this pull request may close these issues.

sm-executor: stop dropping completed job results on transient apply/commit failures

2 participants