job-scheduler: recover garbling sessions on worker/report failure instead of losing jobs #156

## Summary

The garbling coordinator currently has explicit failure paths where active sessions are lost if a worker dies, fails to report, or exits mid-pass. In those cases, the original `PendingCircuitJob`s are not reconstructed and requeued, so in-flight work can be lost permanently.

This issue is about making the coordinator the durable owner of in-flight circuit jobs so worker loss becomes recoverable rather than terminal.

The scope is limited to garbling coordinator ownership and recovery of circuit-session work. It does **not** cover the separate `GeneratingPolynomialCommitments` wedge in the regular worker-pool / `sm-executor` completion path; that is now tracked separately in #166.

## Correct model

A worker is an execution vehicle, not the owner of job durability.

If a worker disappears, times out, or stops reporting, the coordinator must still know which jobs were assigned to that worker and must be able to requeue them.

The correct behavior is:

1. **Worker/report failure**
   - recover and requeue all affected in-flight jobs

2. **Transient session execution failure**
   - return the job to pending retry

3. **True permanent setup/programming failure**
   - classify explicitly and drop only if the error is genuinely permanent

The key point is that the coordinator must never move a job into a worker and then lose the ability to reconstruct it.
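The three cases above can be sketched as an explicit classification step. This is only an illustration: `SessionFailure`, `Disposition`, and `disposition` are hypothetical names, not types from the existing coordinator.

```rust
/// Hypothetical failure classes mirroring the three cases in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SessionFailure {
    /// Worker died, exited mid-pass, or missed a report.
    WorkerLost,
    /// Session execution failed in a way that may succeed on retry.
    Transient,
    /// Setup/programming failure that cannot succeed on retry.
    PermanentSetup,
}

/// What the coordinator does with the affected job (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Disposition {
    Requeue,
    Drop,
}

/// Only an explicitly permanent failure is dropped; everything else is retried.
fn disposition(failure: SessionFailure) -> Disposition {
    match failure {
        SessionFailure::WorkerLost | SessionFailure::Transient => Disposition::Requeue,
        SessionFailure::PermanentSetup => Disposition::Drop,
    }
}
```

Making the match exhaustive means a newly added failure class has to be classified deliberately rather than falling into a drop path by default.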

## Desired solution

1. Keep authoritative ownership of in-flight `PendingCircuitJob`s at the coordinator level even after assignment to workers.
2. Track which jobs are assigned to each worker for the current pass.
3. On assignment failure, missing chunk report, missing finish report, or worker exit:
   - recover all sessions known to be on that worker
   - requeue them into `pending_retry`
4. Audit and narrow `SetupFailed` classification so only truly permanent setup/programming failures are dropped.
5. Add tests that simulate worker loss or report timeout mid-pass and verify all affected jobs are retried.

The implementation can choose the exact tracking structure, but it must no longer be possible for worker disappearance to imply unrecoverable job loss.
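One possible shape for that tracking, as a minimal sketch rather than a proposed final design: the coordinator keeps each assigned job in a per-worker map and moves the whole entry back into `pending_retry` when the worker is lost. The `Coordinator` struct, its methods, `WorkerId`, and the `id` field on `PendingCircuitJob` are assumptions made for illustration; only the names `PendingCircuitJob` and `pending_retry` come from this issue.

```rust
use std::collections::{HashMap, VecDeque};

type WorkerId = u32; // hypothetical worker identifier

/// Stand-in for the real `PendingCircuitJob`; the `id` field is invented.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PendingCircuitJob {
    id: u64,
}

#[derive(Default)]
struct Coordinator {
    pending_retry: VecDeque<PendingCircuitJob>,
    /// Jobs assigned per worker for the current pass. The coordinator keeps
    /// the job itself, so losing the worker never loses the job.
    assigned: HashMap<WorkerId, Vec<PendingCircuitJob>>,
}

impl Coordinator {
    /// Record the assignment before (or while) handing work to the worker.
    fn assign(&mut self, worker: WorkerId, job: PendingCircuitJob) {
        self.assigned.entry(worker).or_default().push(job);
    }

    /// Release a job once the worker has reported it finished.
    fn complete(&mut self, worker: WorkerId, job_id: u64) -> Option<PendingCircuitJob> {
        let jobs = self.assigned.get_mut(&worker)?;
        let idx = jobs.iter().position(|j| j.id == job_id)?;
        Some(jobs.remove(idx))
    }

    /// Called on assignment failure, a missing chunk or finish report, or
    /// worker exit. Requeues everything still assigned to the worker and
    /// returns how many jobs were requeued.
    fn recover_worker(&mut self, worker: WorkerId) -> usize {
        let jobs = self.assigned.remove(&worker).unwrap_or_default();
        let requeued = jobs.len();
        self.pending_retry.extend(jobs);
        requeued
    }
}
```

Because `recover_worker` removes the map entry, calling it twice for the same worker is harmless: the second call requeues nothing.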

## Scope of this issue

1. Audit worker assignment and pass bookkeeping in the garbling coordinator.
2. Introduce recoverable tracking for assigned/in-flight circuit jobs.
3. Rework worker failure/report-timeout handling to requeue affected jobs.
4. Audit permanent-error classification for circuit session creation and finish paths.
5. Add failure-injection tests for assignment failure, missing chunk report, missing finish report, and worker exit with active sessions.
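For the report-timeout cases, tests are easier to write if the deadline check takes the current time as a parameter instead of reading the clock itself. The sketch below assumes a single timeout shared by chunk and finish reports; `ReportDeadlines` and its methods are hypothetical and do not exist in the current code.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

type WorkerId = u32; // hypothetical worker identifier

/// Hypothetical tracker of the last report seen from each worker.
struct ReportDeadlines {
    timeout: Duration,
    last_report: HashMap<WorkerId, Instant>,
}

impl ReportDeadlines {
    fn new(timeout: Duration) -> Self {
        Self { timeout, last_report: HashMap::new() }
    }

    /// Record an assignment, chunk report, or finish report observed at `now`.
    fn record(&mut self, worker: WorkerId, now: Instant) {
        self.last_report.insert(worker, now);
    }

    /// Workers whose last report is older than `timeout` as of `now`,
    /// sorted so test assertions are deterministic.
    fn timed_out(&self, now: Instant) -> Vec<WorkerId> {
        let mut late: Vec<WorkerId> = self
            .last_report
            .iter()
            .filter(|(_, seen)| now.duration_since(**seen) > self.timeout)
            .map(|(worker, _)| *worker)
            .collect();
        late.sort_unstable();
        late
    }
}
```

A failure-injection test can then advance `now` past the timeout without sleeping, and assert that every job assigned to the returned workers ends up back in `pending_retry`.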

## Acceptance criteria

- Worker death or report timeout does not cause unrecoverable garbling job loss.
- All in-flight circuit jobs remain reconstructible by the coordinator.
- Existing “sessions are lost” paths are removed or replaced with retry/recovery behavior.
- Only genuinely permanent setup/programming failures are dropped.
- Tests cover the major worker/report failure modes and prove retry/recovery.
