job-system: encode protocol/domain failure in action results while reserving retries for transient execution failures #116

## Summary

We need to distinguish two different failure classes in the job system:

1. `Retry`
   - the job did **not** complete execution successfully
   - the failure is transient/infrastructural
   - the scheduler should keep retrying

2. Domain/protocol failure result
   - the job **did** complete execution successfully
   - the outcome of that execution is that the protocol/state-machine-level check failed
   - this must be delivered back to the state machine as a normal completion payload

The current system already has the right shape for (1): `HandlerOutcome::Retry`.
The work needed is to ensure (2) is consistently represented in action result types and routed back through normal `ActionCompletion` handling.
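
As a rough illustration of the split, the sketch below models the two classes as types. Only `HandlerOutcome::Retry` and `ActionCompletion` are names taken from this issue; the `Done` variant, the `Verdict` enum, the struct fields, and `should_reschedule` are assumptions made for the example and may not match the real definitions.

```rust
// Hypothetical shapes for illustration only. `HandlerOutcome::Retry` and
// `ActionCompletion` are named in this issue; everything else is assumed.

/// Verdict carried inside an action result (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Verdict {
    Valid,
    Invalid { reason: String },
}

/// Payload handed back to the state machine (fields are assumed).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ActionCompletion {
    pub action_id: u64,
    pub verdict: Verdict,
}

/// What a handler reports to the scheduler.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HandlerOutcome {
    /// Execution did not complete; the scheduler should try again.
    Retry,
    /// Execution completed; the payload may still carry a negative verdict.
    Done(ActionCompletion),
}

/// Scheduler-side view: only `Retry` keeps the job in the queue.
pub fn should_reschedule(outcome: &HandlerOutcome) -> bool {
    matches!(outcome, HandlerOutcome::Retry)
}
```

Under these assumptions, a `Done` completion whose verdict is `Invalid` is never rescheduled; it is simply handed to the state machine like any other completion.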

This issue also covers the audit of existing jobs: every job must be reviewed so that its current error handling is classified correctly, including cases where the right state-machine behavior is to abort/fail closed rather than retry indefinitely.

## Correct model

### Scheduler/executor concern: transient execution failure

This remains:
- `HandlerOutcome::Retry`

Examples:
- network temporarily unavailable
- storage/session not yet available
- peer not ready / expected stream not yet registered
- backpressure / temporary runtime unavailability

These are not protocol outcomes. The job has not produced a meaningful result yet, so the scheduler should retry.

### State machine concern: completed execution with a failing protocol result

This should be represented in the action result payload returned via `ActionCompletion`.

Examples:
- adaptor verification returned false
- opened-share verification failed
- garbling/evaluation validity check failed
- commitment/consistency validation completed and produced a negative verdict

These are not scheduler failures. The job completed and produced a result that the STF must handle.
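
A minimal sketch of how a handler might keep the two concerns apart, using adaptor verification as the example. The names `ExecError`, `AdaptorCheckResult`, `handle_verify_adaptor`, and the `Done` variant are hypothetical; the actual handler signatures and result enums in the codebase may differ.

```rust
/// Hypothetical errors raised while trying to obtain the job's input.
#[derive(Debug)]
pub enum ExecError {
    NetworkUnavailable,
    StreamNotRegistered,
}

/// Hypothetical action result for an adaptor verification job.
#[derive(Debug, PartialEq, Eq)]
pub enum AdaptorCheckResult {
    Verified,
    Rejected,
}

#[derive(Debug, PartialEq, Eq)]
pub enum HandlerOutcome {
    Retry,
    Done(AdaptorCheckResult),
}

/// `fetch` models the infrastructure step, `verify` the protocol check.
pub fn handle_verify_adaptor(
    fetch: Result<Vec<u8>, ExecError>,
    verify: impl Fn(&[u8]) -> bool,
) -> HandlerOutcome {
    // Failing to obtain the input is an execution problem: retry.
    let input = match fetch {
        Ok(bytes) => bytes,
        Err(_) => return HandlerOutcome::Retry,
    };
    // The check ran to completion; a `false` verdict is a result, not a retry.
    if verify(&input) {
        HandlerOutcome::Done(AdaptorCheckResult::Verified)
    } else {
        HandlerOutcome::Done(AdaptorCheckResult::Rejected)
    }
}
```

The design point is that `Retry` is only reachable before the verification has run; once `verify` returns, both verdicts travel the same completion path.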

## Abort / fail-closed requirement

As part of this work, every existing job path must be checked to determine whether the current behavior is correct when an error occurs.

For each current `Retry` or equivalent error path, classify it as one of:

1. **Transient execution failure**
   - keep as `Retry`

2. **Completed domain/protocol failure**
   - return a normal `ActionCompletion` with an action result that encodes the negative verdict

3. **Abort / fail-closed condition**
   - the resulting completion/error semantics must allow the state machine to transition to its aborted/failure state if that is what the protocol requires
   - this must not silently degrade into infinite retry

The key point is that indefinite retry is only correct for genuinely transient execution problems.
If the meaning of the failure is “the protocol instance should abort”, then the job system must surface that in a way the STF can consume and act on.
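
The three-way classification could be sketched as follows. This is an illustration under assumptions, not the intended design: `FailureClass`, the `ActionResult` variants, `SmState`, `outcome_for`, and `stf` are all hypothetical names, and the real state machines will have richer states and transitions.

```rust
/// Hypothetical label assigned to each error site during the audit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureClass {
    Transient,
    NegativeVerdict,
    AbortRequired,
}

/// Hypothetical action result variants visible to the STF.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActionResult {
    Passed,
    NegativeVerdict,
    AbortRequired,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HandlerOutcome {
    Retry,
    Done(ActionResult),
}

/// Only transient failures map to `Retry`; the rest become completions.
pub fn outcome_for(class: FailureClass) -> HandlerOutcome {
    match class {
        FailureClass::Transient => HandlerOutcome::Retry,
        FailureClass::NegativeVerdict => HandlerOutcome::Done(ActionResult::NegativeVerdict),
        FailureClass::AbortRequired => HandlerOutcome::Done(ActionResult::AbortRequired),
    }
}

/// Toy state machine states (assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SmState {
    Running,
    Failed,
    Aborted,
}

/// Toy STF: consumes a completed result and may fail closed.
pub fn stf(state: SmState, result: ActionResult) -> SmState {
    match (state, result) {
        (SmState::Running, ActionResult::Passed) => SmState::Running,
        (SmState::Running, ActionResult::NegativeVerdict) => SmState::Failed,
        (SmState::Running, ActionResult::AbortRequired) => SmState::Aborted,
        // Terminal states ignore further results.
        (terminal, _) => terminal,
    }
}
```

In this sketch an abort-required condition reaches the STF as data, so the decision to abort stays in the state machine rather than in the scheduler.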

## Why not add a generic scheduler-level `Failed` variant first

A generic terminal `Failed` in `HandlerOutcome` / `JobCompletion` would conflate:
- execution failure
- domain failure

That is the wrong abstraction for the majority of cases we care about here.

For Mosaic, the primary requirement is:
- transient execution problems => retry
- completed verification/processing with a negative verdict => return that verdict in the action result
- cases whose meaning is “abort this protocol instance” => surface that through the completion/result model so the STF can fail closed

Only if we later identify true terminal execution failures that are:
- non-transient,
- not safely retryable, and
- not representable as a domain result,

should we consider adding a third scheduler-level terminal failure channel.

## Scope of this issue

1. Audit all existing job handlers and classify every `Retry` / equivalent error site as either:
   - genuinely transient execution failure,
   - completed domain/protocol failure that should be encoded in the action result, or
   - abort/fail-closed condition that must be surfaced to the STF
2. Update action result enums and handler implementations where negative protocol outcomes are currently being treated incorrectly.
3. Ensure abort-required conditions are surfaced in a way the state machines can consume to transition into aborted/failure states.
4. Preserve the invariant that real executor/runtime/setup bugs are **not** silently disguised as protocol failures.
5. Add tests proving:
   - transient failures are retried
   - domain/protocol failures are delivered to the SM as normal completions and are **not** retried forever
   - abort-relevant failures lead to STF-visible outcomes rather than silent infinite retry
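
One possible shape for such tests is a small harness that drives a handler and counts attempts. The `run_job` function and the generic `HandlerOutcome<R>` below are hypothetical stand-ins for the real scheduler and types; the attempt bound exists only so the test terminates and is not a proposal to cap retries in production.

```rust
/// Generic stand-in for the handler outcome (the `Done` variant is assumed).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HandlerOutcome<R> {
    Retry,
    Done(R),
}

/// Drives `handler` until it completes or `max_attempts` is exhausted.
/// Returns the number of attempts made and the result, if any.
pub fn run_job<R>(
    max_attempts: u32,
    mut handler: impl FnMut(u32) -> HandlerOutcome<R>,
) -> (u32, Option<R>) {
    for attempt in 1..=max_attempts {
        // A completed job stops the loop regardless of its verdict.
        if let HandlerOutcome::Done(result) = handler(attempt) {
            return (attempt, Some(result));
        }
    }
    (max_attempts, None)
}
```

A test can then assert that a handler which fails transiently twice is attempted three times, while a handler that completes with a negative verdict is attempted exactly once and its result is delivered.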

## Acceptance criteria

- `HandlerOutcome::Retry` is reserved for transient execution failures.
- Protocol/domain negative outcomes are represented in `ActionResult` types where appropriate.
- Existing jobs have been audited and reclassified where necessary.
- No job that has completed meaningful verification/processing is retried forever solely because the verdict was negative.
- Fail-closed / abort-required conditions are surfaced in a form the STF can act on.
- Tests cover retry, negative-result completion, and abort-relevant paths.

