Skip to content

cac/protocol: garbler can accept Challenge before commit acks are drained, making late commit acks UnexpectedInput #165

@Zk2u

Description

@Zk2u

Summary

The garbler state machine can transition out of SendingCommit too early.

Today, RecvChallengeMsg is accepted while the garbler is still in Step::SendingCommit, not only after it has fully reached WaitingForChallenge. If any SendCommitMsgChunk(...) completions are still in flight when the inbound Challenge arrives, the garbler advances to SendingChallengeResponse and those late commit-chunk acknowledgments then become UnexpectedInput.

On current branches, that propagates as a fatal completion-handling error in sm-executor, which shuts down the executor and then cascades into scheduler shutdown.

This is a protocol/state-machine ordering bug.

Severity

Recommended priority: P0

This can occur during normal setup flow and can shut down the executor. It is not dependent on worker crash or storage failure.

Recommended size: M

Exact interaction sequence

  1. Garbler enters SendingCommit and emits:
  • SendCommitMsgHeader
  • SendCommitMsgChunk(0..N-1)
  1. Some commit sends have completed, but not all commit-chunk ack completions have been delivered back to the garbler yet.

  2. Evaluator receives enough of the commit data to proceed and sends Challenge.

  3. Garbler receives that inbound Challenge while still in SendingCommit.

  4. Current garbler STF accepts RecvChallengeMsg in both:

  • Step::SendingCommit
  • Step::WaitingForChallenge
  1. On receipt of Challenge, the garbler immediately transitions to SendingChallengeResponse and emits challenge-response send actions.

  2. A late completion then arrives for one of the old commit sends, e.g. SendCommitMsgChunk(101) -> CommitMsgChunkAcked.

  3. That completion is routed back into garbler STF as TrackedActionCompleted, but garbler STF only accepts CommitMsgChunkAcked while still in SendingCommit.

  4. Because the garbler has already transitioned to SendingChallengeResponse, the late commit ack becomes UnexpectedInput.

  5. sm-executor treats completion-side STF errors as fatal, so the executor shuts down.

Evidence

Relevant code:

  • RecvChallengeMsg currently accepted while still in SendingCommit:
    • crates/cac/protocol/src/garbler/stf.rs
  • commit ack completions only accepted in SendingCommit:
    • crates/cac/protocol/src/garbler/stf.rs
  • completion-side STF errors propagate through sm-executor and are currently fatal:
    • crates/sm-executor/src/lib.rs

Observed in functional test:

  • functional-tests/run_tests.sh -t tests/fn_mosaic_setup.py

Observed failure path:

  • garbler receives inbound Challenge
  • garbler transitions and emits challenge-response actions
  • late SendCommitMsgChunk(101) completion arrives
  • garbler STF returns UnexpectedInput
  • sm-executor shuts down

Correct model

A challenge must not be able to advance the garbler past SendingCommit while commit-send completions for the current commit phase are still valid and in flight.

The protocol must be robust to delayed completion delivery. Late commit acks must either:

  • still be valid to consume after the transition, or
  • the transition itself must not happen until commit completion is fully drained.

Desired solution

Pick one coherent model and make the STF match it:

  1. Strict phase boundary
  • Only accept RecvChallengeMsg in WaitingForChallenge
  • Do not allow the garbler to leave SendingCommit until all commit acks have been consumed
  1. Idempotent late commit acks
  • If RecvChallengeMsg is intentionally allowed during SendingCommit, then late CommitMsgChunkAcked / CommitMsgHeaderAcked completions must remain harmless after the transition
  • In other words, the completion side must tolerate delayed commit acks across the phase boundary

I suspect option 1 is the cleaner fix unless there is a strong protocol reason to overlap the phases.

Acceptance criteria

  • A garbler cannot transition out of SendingCommit in a way that makes valid in-flight commit acks become UnexpectedInput
  • functional-tests/run_tests.sh -t tests/fn_mosaic_setup.py no longer fails with executor shutdown from a late SendCommitMsgChunk(...) completion
  • Delayed commit-send completions do not wedge or kill the executor
  • The chosen phase-boundary semantics are documented clearly in the STF/tests

Related

This is related to the broader phase-boundary/liveness work tracked in #154, but this issue is a concrete root cause that is independently reproducible and should be fixed directly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions