
sm-executor: stop dropping completed job results #163

Merged
Zk2u merged 5 commits into main from fix/153 on Apr 9, 2026

Conversation

@Zk2u (Collaborator) commented Apr 8, 2026

Closes #153.

Summary

This change closes the completion-loss gaps behind #153 while keeping the scheduler/SM boundary aligned with the current design assumptions.

First, sm-executor no longer drops completed results when applying them fails transiently. Completed jobs are now queued locally and retried with backoff for Storage and Commit failures instead of being logged and discarded.

Second, completion-side STF rejections are no longer treated as infrastructure failures. If a worker completion is delivered successfully but the state machine rejects it (for example UnexpectedInput from a stale completion), sm-executor now logs and drops that completion instead of shutting down the executor.

Third, the scheduler no longer silently loses a completed result if the completion channel is closed. At that point the underlying job has already run and may already have caused protocol-visible side effects, so replaying it is unsafe. Instead, pool workers and garbling workers now signal an internal fatal scheduler fault, and the scheduler shuts down fail-closed.
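The fail-closed delivery path described above can be sketched roughly as follows. `Completed`, `SchedulerFault`, and the channel shapes here are illustrative assumptions, not the PR's actual types: the point is only that a closed completion channel becomes a fatal fault rather than a silent drop.

```rust
use std::sync::mpsc::SyncSender;

// Hypothetical result of a job that has already run (and may already
// have produced protocol-visible side effects).
struct Completed {
    action_id: u64,
}

// Internal fatal fault path; the scheduler shuts down on receipt.
enum SchedulerFault {
    CompletionChannelClosed { action_id: u64 },
}

/// Deliver a finished job's result back to the SM path. If the channel
/// is closed, the work has already executed, so replaying it is unsafe:
/// report a fatal scheduler fault instead of discarding the completion.
fn deliver(
    completions: &SyncSender<Completed>,
    faults: &SyncSender<SchedulerFault>,
    done: Completed,
) {
    let action_id = done.action_id;
    if completions.send(done).is_err() {
        // Fail closed: surface the loss loudly rather than drop it.
        let _ = faults.send(SchedulerFault::CompletionChannelClosed { action_id });
    }
}
```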

Why

The old behavior had two bad outcomes:

  • sm-executor could lose a completion after the work had already finished, purely because applying or committing state failed transiently.
  • the scheduler could lose a completion if delivery back to the SM path failed, and it would do so silently.

Those need different fixes:

  • apply-time failure inside sm-executor is safe to retry because we are retrying completion application, not re-running the job
  • completion-side STF rejection is an SM/protocol semantic problem, not a scheduler correctness failure, so it should be logged and dropped rather than crashing the executor
  • delivery failure after the job already executed is not safe to replay blindly, because many jobs can already have produced network/protocol side effects
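The three-way split above boils down to classifying the error before deciding what to do with the completion. A minimal sketch, using hypothetical variant names rather than the crate's real `SmExecutorError` definition:

```rust
// Illustrative error variants; the real SmExecutorError is assumed
// from the PR description, not reproduced here.
enum SmExecutorError {
    Storage,      // transient apply failure
    Commit,       // transient commit failure
    Stf,          // SM rejected the completion (e.g. stale input)
    SourceClosed, // infrastructure breakage
    StfPanic,     // infrastructure breakage
}

enum CompletionPolicy {
    Retry,      // requeue with backoff; the job is NOT re-run
    Drop,       // log and discard; an SM semantic rejection
    FailClosed, // shut down the executor
}

fn classify(err: &SmExecutorError) -> CompletionPolicy {
    match err {
        SmExecutorError::Storage | SmExecutorError::Commit => CompletionPolicy::Retry,
        SmExecutorError::Stf => CompletionPolicy::Drop,
        SmExecutorError::SourceClosed | SmExecutorError::StfPanic => {
            CompletionPolicy::FailClosed
        }
    }
}
```

Note that `Retry` only retries applying the completion; the underlying job is never re-executed.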

What changed

  • sm-executor now keeps a pending-completion queue with exponential backoff and retries Storage / Commit failures
  • completion-side SmExecutorError::Stf now logs and drops the offending completion instead of stopping the executor
  • dropped/requeued completion logs now include the action id for debugging
  • JobCompletion and ActionCompletion are now Clone so completions can be retained across retries
  • the job scheduler now has an internal SchedulerFault path
  • pool workers and garbling workers report fatal completion-delivery failures instead of silently discarding completions
  • the scheduler shuts down fail-closed when that happens
  • formatted the repo TOML files that were blocking taplo format --check
  • added regression tests for:
    • retry classification and requeue behavior in sm-executor
    • full executor-loop retry after a transient completion-application failure
    • drop behavior for completion-side STF errors
    • pool-worker completion-channel closure
    • garbling-worker completion-channel closure
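The pending-completion queue with exponential backoff might look something like the sketch below. Field and type names are illustrative, not the PR's actual identifiers; the `Clone` bound on the completion is what lets it be retained across retries.

```rust
use std::collections::VecDeque;
use std::time::Duration;

// Cloneable so a completion can be retained across retry attempts.
#[derive(Clone)]
struct JobCompletion {
    action_id: u64,
}

struct PendingCompletion {
    completion: JobCompletion,
    attempts: u32,
}

impl PendingCompletion {
    /// Exponential backoff: 100ms, 200ms, 400ms, ... capped at 30s.
    fn backoff(&self) -> Duration {
        let base = Duration::from_millis(100);
        let exp = self.attempts.min(8); // bound the shift to avoid overflow
        (base * 2u32.pow(exp)).min(Duration::from_secs(30))
    }
}

struct Executor {
    pending: VecDeque<PendingCompletion>,
}

impl Executor {
    /// Requeue a completion whose application failed transiently
    /// (Storage / Commit), instead of logging and discarding it.
    fn requeue(&mut self, completion: JobCompletion, attempts: u32) {
        self.pending.push_back(PendingCompletion {
            completion,
            attempts: attempts + 1,
        });
    }
}
```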

Validation

Passed:

  • just -f .justfile ci
  • PATH="/usr/local/libexec:$PATH" ./run_tests.sh -t tests/fn_mosaic_setup.py
  • cargo test -p mosaic-sm-executor
  • cargo clippy -p mosaic-sm-executor --tests -- -D warnings

The focused functional test still exposes the separate protocol bug tracked in #165, but it no longer fails by shutting down sm-executor on a stale completion.

Reviewer notes

This PR intentionally does not try to recover from scheduler-side completion-channel closure without shutdown. Once the job has already completed, replaying it can duplicate side effects. In the current architecture, fail loud / fail closed is the safe behavior there.

The completion policy is now intentionally split three ways:

  • retry transient apply failures (Storage, Commit)
  • log/drop SM semantic rejections (Stf)
  • fail closed on infrastructure/runtime breakage (SourceClosed, NetRecv, StfPanic, scheduler delivery failure, etc.)

sapinb added a commit that referenced this pull request Apr 9, 2026
This avoids adaptor chunks getting sent before garbler is ready to receive them in e2e tests.
Workaround until #163 is merged

@sapinb (Collaborator) left a comment


ACK


sapinb commented Apr 9, 2026

@Zk2u Please rebase over latest main and resolve conflicts

Zk2u merged commit 94f6393 into main on Apr 9, 2026
14 checks passed
Zk2u deleted the fix/153 branch on April 9, 2026 at 13:15
Zk2u pushed a commit that referenced this pull request Apr 9, 2026
This avoids adaptor chunks getting sent before garbler is ready to receive them in e2e tests.
Workaround until #163 is merged
Zk2u added a commit that referenced this pull request Apr 9, 2026
* add more stf logging

* fix: ack protocol messages from peers in later steps

When receiving messages from peers, if the message is already seen,
or received in a later step after the expecting step has completed,
ignore the message but send an ack back.

fix: headers processed only once

refactor: remove unused validation checks

* add tests for ack and ignore behavior

* feat: add comparable phase for each step

* feat: duplicate or late commit ack should not error

* refactor: simplify check for later steps

* fix: concurrency test should check all pairs of nodes

* nit: correct logging filter for fn tests

* fix: init garbler deposit first

This avoids adaptor chunks getting sent before garbler is ready to receive them in e2e tests.
Workaround until #163 is merged

* fix: reject unchallenged response chunks

---------

Co-authored-by: azz <azz@alpenlabs.io>


Development

Successfully merging this pull request may close these issues.

sm-executor: stop dropping completed job results on transient apply/commit failures

2 participants