Summary
The garbler state machine can transition out of SendingCommit too early.
Today, RecvChallengeMsg is accepted while the garbler is still in Step::SendingCommit, not only after it has fully reached WaitingForChallenge. If any SendCommitMsgChunk(...) completions are still in flight when the inbound Challenge arrives, the garbler advances to SendingChallengeResponse and those late commit-chunk acknowledgments then become UnexpectedInput.
On current branches, that propagates as a fatal completion-handling error in sm-executor, which shuts down the executor and then cascades into scheduler shutdown.
This is a protocol/state-machine ordering bug.
Severity
Recommended priority: P0
This can occur during normal setup flow and can shut down the executor. It is not dependent on worker crash or storage failure.
Recommended size: M
Exact interaction sequence
- Garbler enters
SendingCommit and emits:
SendCommitMsgHeader
SendCommitMsgChunk(0..N-1)
-
Some commit sends have completed, but not all commit-chunk ack completions have been delivered back to the garbler yet.
-
Evaluator receives enough of the commit data to proceed and sends Challenge.
-
Garbler receives that inbound Challenge while still in SendingCommit.
-
Current garbler STF accepts RecvChallengeMsg in both:
Step::SendingCommit
Step::WaitingForChallenge
-
On receipt of Challenge, the garbler immediately transitions to SendingChallengeResponse and emits challenge-response send actions.
-
A late completion then arrives for one of the old commit sends, e.g. SendCommitMsgChunk(101) -> CommitMsgChunkAcked.
-
That completion is routed back into garbler STF as TrackedActionCompleted, but garbler STF only accepts CommitMsgChunkAcked while still in SendingCommit.
-
Because the garbler has already transitioned to SendingChallengeResponse, the late commit ack becomes UnexpectedInput.
-
sm-executor treats completion-side STF errors as fatal, so the executor shuts down.
Evidence
Relevant code:
RecvChallengeMsg currently accepted while still in SendingCommit:
crates/cac/protocol/src/garbler/stf.rs
- commit ack completions only accepted in
SendingCommit:
crates/cac/protocol/src/garbler/stf.rs
- completion-side STF errors propagate through
sm-executor and are currently fatal:
crates/sm-executor/src/lib.rs
Observed in functional test:
functional-tests/run_tests.sh -t tests/fn_mosaic_setup.py
Observed failure path:
- garbler receives inbound
Challenge
- garbler transitions and emits challenge-response actions
- late
SendCommitMsgChunk(101) completion arrives
- garbler STF returns
UnexpectedInput
sm-executor shuts down
Correct model
A challenge must not be able to advance the garbler past SendingCommit while commit-send completions for the current commit phase are still valid and in flight.
The protocol must be robust to delayed completion delivery. Late commit acks must either:
- still be valid to consume after the transition, or
- the transition itself must not happen until commit completion is fully drained.
Desired solution
Pick one coherent model and make the STF match it:
- Strict phase boundary
- Only accept
RecvChallengeMsg in WaitingForChallenge
- Do not allow the garbler to leave
SendingCommit until all commit acks have been consumed
- Idempotent late commit acks
- If
RecvChallengeMsg is intentionally allowed during SendingCommit, then late CommitMsgChunkAcked / CommitMsgHeaderAcked completions must remain harmless after the transition
- In other words, the completion side must tolerate delayed commit acks across the phase boundary
I suspect option 1 is the cleaner fix unless there is a strong protocol reason to overlap the phases.
Acceptance criteria
- A garbler cannot transition out of
SendingCommit in a way that makes valid in-flight commit acks become UnexpectedInput
functional-tests/run_tests.sh -t tests/fn_mosaic_setup.py no longer fails with executor shutdown from a late SendCommitMsgChunk(...) completion
- Delayed commit-send completions do not wedge or kill the executor
- The chosen phase-boundary semantics are documented clearly in the STF/tests
Related
This is related to the broader phase-boundary/liveness work tracked in #154, but this issue is a concrete root cause that is independently reproducible and should be fixed directly.
Summary
The garbler state machine can transition out of
SendingCommittoo early.Today,
RecvChallengeMsgis accepted while the garbler is still inStep::SendingCommit, not only after it has fully reachedWaitingForChallenge. If anySendCommitMsgChunk(...)completions are still in flight when the inboundChallengearrives, the garbler advances toSendingChallengeResponseand those late commit-chunk acknowledgments then becomeUnexpectedInput.On current branches, that propagates as a fatal completion-handling error in
sm-executor, which shuts down the executor and then cascades into scheduler shutdown.This is a protocol/state-machine ordering bug.
Severity
Recommended priority:
P0This can occur during normal setup flow and can shut down the executor. It is not dependent on worker crash or storage failure.
Recommended size:
MExact interaction sequence
SendingCommitand emits:SendCommitMsgHeaderSendCommitMsgChunk(0..N-1)Some commit sends have completed, but not all commit-chunk ack completions have been delivered back to the garbler yet.
Evaluator receives enough of the commit data to proceed and sends
Challenge.Garbler receives that inbound
Challengewhile still inSendingCommit.Current garbler STF accepts
RecvChallengeMsgin both:Step::SendingCommitStep::WaitingForChallengeOn receipt of
Challenge, the garbler immediately transitions toSendingChallengeResponseand emits challenge-response send actions.A late completion then arrives for one of the old commit sends, e.g.
SendCommitMsgChunk(101) -> CommitMsgChunkAcked.That completion is routed back into garbler STF as
TrackedActionCompleted, but garbler STF only acceptsCommitMsgChunkAckedwhile still inSendingCommit.Because the garbler has already transitioned to
SendingChallengeResponse, the late commit ack becomesUnexpectedInput.sm-executortreats completion-side STF errors as fatal, so the executor shuts down.Evidence
Relevant code:
RecvChallengeMsgcurrently accepted while still inSendingCommit:crates/cac/protocol/src/garbler/stf.rsSendingCommit:crates/cac/protocol/src/garbler/stf.rssm-executorand are currently fatal:crates/sm-executor/src/lib.rsObserved in functional test:
functional-tests/run_tests.sh -t tests/fn_mosaic_setup.pyObserved failure path:
ChallengeSendCommitMsgChunk(101)completion arrivesUnexpectedInputsm-executorshuts downCorrect model
A challenge must not be able to advance the garbler past
SendingCommitwhile commit-send completions for the current commit phase are still valid and in flight.The protocol must be robust to delayed completion delivery. Late commit acks must either:
Desired solution
Pick one coherent model and make the STF match it:
RecvChallengeMsginWaitingForChallengeSendingCommituntil all commit acks have been consumedRecvChallengeMsgis intentionally allowed duringSendingCommit, then lateCommitMsgChunkAcked/CommitMsgHeaderAckedcompletions must remain harmless after the transitionI suspect option 1 is the cleaner fix unless there is a strong protocol reason to overlap the phases.
Acceptance criteria
SendingCommitin a way that makes valid in-flight commit acks becomeUnexpectedInputfunctional-tests/run_tests.sh -t tests/fn_mosaic_setup.pyno longer fails with executor shutdown from a lateSendCommitMsgChunk(...)completionRelated
This is related to the broader phase-boundary/liveness work tracked in #154, but this issue is a concrete root cause that is independently reproducible and should be fixed directly.