sapinb
reviewed
Apr 8, 2026
sapinb
reviewed
Apr 8, 2026
sapinb
added a commit
that referenced
this pull request
Apr 9, 2026
This avoids adaptor chunks getting sent before garbler is ready to receive them in e2e tests. Workaround until #163 is merged
Collaborator
@Zk2u Please rebase over latest main and resolve conflicts
Zk2u
pushed a commit
that referenced
this pull request
Apr 9, 2026
This avoids adaptor chunks getting sent before garbler is ready to receive them in e2e tests. Workaround until #163 is merged
Zk2u
added a commit
that referenced
this pull request
Apr 9, 2026
* add more stf logging
* fix: ack protocol messages from peers in later steps

  When receiving messages from peers, if the message is already seen, or received in a later step after the expecting step has completed, ignore the message but send an ack back.

  fix: headers processed only once

  refactor: remove unused validation checks
* add tests for ack and ignore behavior
* feat: add comparable phase for each step
* feat: duplicate or late commit ack should not error
* refactor: simplify check for later steps
* fix: concurrency test should check all pairs of nodes
* nit: correct logging filter for fn tests
* fix: init garbler deposit first

  This avoids adaptor chunks getting sent before garbler is ready to receive them in e2e tests. Workaround until #163 is merged
* fix: reject unchallenged response chunks

---------

Co-authored-by: azz <azz@alpenlabs.io>
Closes #153.
Summary
This change closes the completion-loss gaps behind #153 while keeping the scheduler/SM boundary aligned with the current design assumptions.
First, `sm-executor` no longer drops completed results when applying them fails transiently. Completed jobs are now queued locally and retried with backoff for `Storage` and `Commit` failures instead of being logged and discarded.

Second, completion-side STF rejections are no longer treated as infrastructure failures. If a worker completion is delivered successfully but the state machine rejects it (for example `UnexpectedInput` from a stale completion), `sm-executor` now logs and drops that completion instead of shutting down the executor.

Third, the scheduler no longer silently loses a completed result if the completion channel is closed. At that point the underlying job has already run and may already have caused protocol-visible side effects, so replaying it is unsafe. Instead, pool workers and garbling workers now signal an internal fatal scheduler fault, and the scheduler shuts down fail-closed.
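The retry behavior described in the first point can be pictured with a minimal sketch. It is not the code from this PR: `Completion`, `TransientError`, `PendingQueue`, and `try_apply_next` are hypothetical names chosen for illustration, and the base and maximum delays are assumed parameters. The point it illustrates is that a failed apply keeps the completion in the queue and only increases the wait, so the underlying job is never re-run.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Hypothetical stand-in for a completed job result (not the real `JobCompletion`).
#[derive(Clone, Debug, PartialEq)]
pub struct Completion {
    pub job_id: u64,
}

/// Hypothetical transient failure kinds, named after the retried categories.
#[derive(Debug, PartialEq)]
pub enum TransientError {
    Storage,
    Commit,
}

/// Completions waiting to be applied, each paired with its failed-attempt count.
pub struct PendingQueue {
    queue: VecDeque<(Completion, u32)>,
    base: Duration,
    max: Duration,
}

impl PendingQueue {
    pub fn new(base: Duration, max: Duration) -> Self {
        Self { queue: VecDeque::new(), base, max }
    }

    pub fn push(&mut self, completion: Completion) {
        self.queue.push_back((completion, 0));
    }

    pub fn len(&self) -> usize {
        self.queue.len()
    }

    pub fn is_empty(&self) -> bool {
        self.queue.is_empty()
    }

    /// Delay after `attempts` earlier failures: base * 2^attempts, capped at `max`.
    pub fn backoff(&self, attempts: u32) -> Duration {
        self.base
            .saturating_mul(2u32.saturating_pow(attempts))
            .min(self.max)
    }

    /// Applies the oldest completion once. On a transient failure the entry is
    /// put back at the front and the delay to wait before retrying is returned.
    /// Returns `None` when the queue was empty or the apply succeeded.
    pub fn try_apply_next<F>(&mut self, mut apply: F) -> Option<Duration>
    where
        F: FnMut(&Completion) -> Result<(), TransientError>,
    {
        let (completion, attempts) = self.queue.pop_front()?;
        match apply(&completion) {
            Ok(()) => None,
            Err(_) => {
                let delay = self.backoff(attempts);
                self.queue.push_front((completion, attempts.saturating_add(1)));
                Some(delay)
            }
        }
    }
}
```

A caller would sleep for the returned delay and call `try_apply_next` again. Keeping the failed entry at the front preserves completion order, which is a design assumption of this sketch and not something the PR text states.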
Why
The old behavior had two bad outcomes:
* `sm-executor` could lose a completion after the work had already finished, purely because applying or committing state failed transiently.

Those need different fixes:
* `sm-executor` is safe to retry because we are retrying completion application, not re-running the job

What changed
* `sm-executor` now keeps a pending-completion queue with exponential backoff and retries `Storage`/`Commit` failures
* `SmExecutorError::Stf` now logs and drops the offending completion instead of stopping the executor
* `JobCompletion` and `ActionCompletion` are now `Clone` so completions can be retained across retries
* … `SchedulerFault` path
* … `taplo format --check`
* … `sm-executor`

Validation
Passed:
* `just -f .justfile ci`
* `PATH="/usr/local/libexec:$PATH" ./run_tests.sh -t tests/fn_mosaic_setup.py`
* `cargo test -p mosaic-sm-executor`
* `cargo clippy -p mosaic-sm-executor --tests -- -D warnings`

The focused functional test still exposes the separate protocol bug tracked in #165, but it no longer fails by shutting down `sm-executor` on a stale completion.

Reviewer notes
This PR intentionally does not try to recover from scheduler-side completion-channel closure without shutdown. Once the job has already completed, replaying it can duplicate side effects. In the current architecture, fail loud / fail closed is the safe behavior there.
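As a rough illustration of the fail-closed choice, the sketch below uses standard-library channels. It is an assumption-laden stand-in and not the scheduler's actual code: the `deliver` function, the `CompletionChannelClosed` variant, and the use of a separate fault channel are all hypothetical. It shows a worker that has finished a job and finds the completion channel closed; it reports a fault instead of dropping the result silently or re-running the job.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical fault type; only the name `SchedulerFault` appears in the PR text.
#[derive(Debug, PartialEq)]
pub enum SchedulerFault {
    CompletionChannelClosed { job_id: u64 },
}

/// Delivers the result of a job that has already run.
/// Returns `true` if the completion was handed over, `false` if a fault was raised.
pub fn deliver(
    completions: &Sender<u64>,
    faults: &Sender<SchedulerFault>,
    job_id: u64,
) -> bool {
    match completions.send(job_id) {
        Ok(()) => true,
        Err(_) => {
            // The receiver is gone. The job may already have had visible side
            // effects, so it is not replayed; the fault is escalated instead.
            let _ = faults.send(SchedulerFault::CompletionChannelClosed { job_id });
            false
        }
    }
}
```

In this sketch whoever owns the receiving end of the fault channel is expected to shut the scheduler down once a fault arrives.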
The completion policy is now intentionally split three ways:
* retry with backoff (`Storage`, `Commit`)
* log and drop (`Stf`)
* fatal (`SourceClosed`, `NetRecv`, `StfPanic`, scheduler delivery failure, etc.)
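One way to picture the three-way split is a single classification function. The sketch below is hypothetical: `CompletionError`, `Policy`, and `policy_for` are illustrative names, the real `SmExecutorError` variants likely carry payloads, and it assumes the last group (`SourceClosed`, `NetRecv`, `StfPanic`, scheduler delivery failure) is treated as fatal.

```rust
/// Hypothetical, payload-free mirror of the error categories named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CompletionError {
    Storage,
    Commit,
    Stf,
    SourceClosed,
    NetRecv,
    StfPanic,
    SchedulerDelivery,
}

/// What the executor does with a completion whose handling failed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Policy {
    RetryWithBackoff,
    LogAndDrop,
    Fatal,
}

/// Maps each error category to its handling policy.
pub fn policy_for(err: CompletionError) -> Policy {
    use CompletionError::*;
    match err {
        // Transient infrastructure failures: keep the completion and retry.
        Storage | Commit => Policy::RetryWithBackoff,
        // The state machine rejected the input: drop this completion only.
        Stf => Policy::LogAndDrop,
        // Assumed fatal in this sketch: shut down instead of continuing.
        SourceClosed | NetRecv | StfPanic | SchedulerDelivery => Policy::Fatal,
    }
}
```

Writing the match without a wildcard arm means a newly added error variant fails to compile until someone assigns it a policy.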