Summary
The garbling coordinator currently has explicit failure paths where active sessions are lost if a worker dies, fails to report, or exits mid-pass. In those cases, the original PendingCircuitJobs are not reconstructed and requeued, so in-flight work can be lost permanently.
This issue is about making the coordinator the durable owner of in-flight circuit jobs so worker loss becomes recoverable rather than terminal.
This issue is specifically about garbling coordinator ownership/recovery for circuit-session work. It does not cover the separate GeneratingPolynomialCommitments wedge in the regular worker-pool / sm-executor completion path; that is now tracked separately in #166.
Correct model
A worker is an execution vehicle, not the owner of job durability.
If a worker disappears, times out, or stops reporting, the coordinator must still know which jobs were assigned to that worker and must be able to requeue them.
The correct behavior is:
- Worker/report failure
  - recover and requeue all affected in-flight jobs
- Transient session execution failure
  - return the job to pending retry
- True permanent setup/programming failure
  - classify explicitly and drop only if the error is genuinely permanent
The key point is that the coordinator must never move a job into a worker and then lose the ability to reconstruct it.
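The three cases above can be written down as an explicit mapping from failure class to recovery action. The following is a minimal sketch only: `FailureKind`, `Recovery`, and `recovery_for` are hypothetical names invented for illustration and are not taken from the coordinator's code.

```rust
// Hypothetical types for illustration; none of these names exist in the
// real coordinator as far as this issue states.

/// What went wrong while a circuit job was in flight.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureKind {
    /// The worker died, exited mid-pass, or a chunk/finish report never arrived.
    WorkerOrReport,
    /// The session failed in a way that may succeed on a later attempt.
    TransientSession,
    /// A setup or programming error that cannot succeed on retry.
    PermanentSetup,
}

/// What the coordinator does in response.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Recovery {
    /// Recover every job known to be on the affected worker and requeue it.
    RequeueAllOnWorker,
    /// Return only the affected job to the retry queue.
    RetryJob,
    /// Drop the job; reserved for genuinely permanent failures.
    Drop,
}

/// Map a failure class to its recovery action.
/// The match is exhaustive, so adding a new failure class forces an
/// explicit decision instead of silently falling into a "drop" path.
pub fn recovery_for(kind: FailureKind) -> Recovery {
    match kind {
        FailureKind::WorkerOrReport => Recovery::RequeueAllOnWorker,
        FailureKind::TransientSession => Recovery::RetryJob,
        FailureKind::PermanentSetup => Recovery::Drop,
    }
}
```

Keeping the mapping in one exhaustive match makes the "drop" outcome reachable from exactly one variant, which is the property the classification audit is meant to establish.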
Desired solution
- Keep authoritative ownership of in-flight PendingCircuitJobs at the coordinator level even after assignment to workers.
- Track which jobs are assigned to each worker for the current pass.
- On assignment failure, missing chunk report, missing finish report, or worker exit:
  - recover all sessions known to be on that worker
  - requeue them into pending_retry
- Audit and narrow SetupFailed classification so only truly permanent/setup-programming failures are dropped.
- Add tests that simulate worker loss or report timeout mid-pass and verify all affected jobs are retried.
The implementation can choose the exact tracking structure, but it must no longer be possible for worker disappearance to imply unrecoverable job loss.
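One possible shape for that tracking structure is sketched below. It is an assumption-laden illustration, not the real coordinator: `Coordinator`, `WorkerId`, the `job_id` field, and every method name are hypothetical, and `PendingCircuitJob` is reduced to a stand-in struct. The only idea it demonstrates is that the coordinator keeps its own copy of each assigned job, so any worker failure can be resolved by draining that worker's entry into the retry queue.

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical stand-ins; the real job and worker types will differ.
pub type WorkerId = u32;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PendingCircuitJob {
    pub job_id: u64,
}

#[derive(Debug, Default)]
pub struct Coordinator {
    /// Jobs assigned to each worker for the current pass.
    /// The coordinator stays the owner; workers only receive a copy.
    in_flight: HashMap<WorkerId, Vec<PendingCircuitJob>>,
    /// Jobs waiting to be retried on a later pass.
    pending_retry: VecDeque<PendingCircuitJob>,
}

impl Coordinator {
    /// Record the assignment, then hand a copy of the job to the worker.
    pub fn assign(&mut self, worker: WorkerId, job: PendingCircuitJob) -> PendingCircuitJob {
        self.in_flight.entry(worker).or_default().push(job.clone());
        job
    }

    /// Release a job after the worker reports a successful finish.
    /// Returns false if the job was not tracked for that worker.
    pub fn complete(&mut self, worker: WorkerId, job_id: u64) -> bool {
        let Some(jobs) = self.in_flight.get_mut(&worker) else {
            return false;
        };
        let before = jobs.len();
        jobs.retain(|j| j.job_id != job_id);
        let removed = jobs.len() != before;
        let now_empty = jobs.is_empty();
        if now_empty {
            self.in_flight.remove(&worker);
        }
        removed
    }

    /// Handle assignment failure, a missing chunk or finish report, or
    /// worker exit: move every job still tracked for the worker into
    /// `pending_retry`. Returns the number of jobs requeued.
    pub fn recover_worker(&mut self, worker: WorkerId) -> usize {
        let jobs = self.in_flight.remove(&worker).unwrap_or_default();
        let requeued = jobs.len();
        self.pending_retry.extend(jobs);
        requeued
    }

    pub fn pending_retry_len(&self) -> usize {
        self.pending_retry.len()
    }

    pub fn in_flight_len(&self) -> usize {
        self.in_flight.values().map(Vec::len).sum()
    }
}
```

In this sketch `recover_worker` is idempotent: calling it twice for the same worker requeues nothing the second time, which matters if a report timeout and a worker-exit notification both fire for the same failure. Cloning the job on assignment is the simplest choice here; if jobs are large, the real implementation might instead track a descriptor sufficient to reconstruct them.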
Scope of this issue
- Audit worker assignment and pass bookkeeping in the garbling coordinator.
- Introduce recoverable tracking for assigned/in-flight circuit jobs.
- Rework worker failure/report-timeout handling to requeue affected jobs.
- Audit permanent-error classification for circuit session creation and finish paths.
- Add failure-injection tests for assignment failure, missing chunk report, missing finish report, and worker exit with active sessions.
Acceptance criteria
- Worker death or report timeout does not cause unrecoverable garbling job loss.
- All in-flight circuit jobs remain reconstructible by the coordinator.
- Existing “sessions are lost” paths are removed or replaced with retry/recovery behavior.
- Only genuinely permanent setup/programming failures are dropped.
- Tests cover the major worker/report failure modes and prove retry/recovery.