Summary
We need to distinguish two different failure classes in the job system:
1. Retry
   - the job did not complete execution successfully
   - the failure is transient/infrastructural
   - the scheduler should keep retrying
2. Domain/protocol failure result
   - the job did complete execution successfully
   - the outcome of that execution is that the protocol/state-machine-level check failed
   - this must be delivered back to the state machine as a normal completion payload
The current system already has the right shape for (1): HandlerOutcome::Retry.
The work needed is to ensure (2) is consistently represented in action result types and routed back through normal ActionCompletion handling.
This issue also subsumes the review work: all existing jobs need to be audited so we classify current error handling correctly, including cases where the right state-machine behavior is to abort/fail closed rather than retry indefinitely.
Correct model
Scheduler/executor concern: transient execution failure
This remains: HandlerOutcome::Retry
Examples:
- network temporarily unavailable
- storage/session not yet available
- peer not ready / expected stream not yet registered
- backpressure / temporary runtime unavailability
These are not protocol outcomes. The job has not finished meaningfully, so the scheduler should retry.
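A minimal sketch of this mapping, under stated assumptions: only the name HandlerOutcome::Retry comes from this issue. The `Done` variant, the `TransientError` enum, and the `run_step` helper are hypothetical stand-ins, not the actual Mosaic types.

```rust
/// Hypothetical, simplified stand-in for the scheduler-facing outcome type.
/// Only the `Retry` variant name is taken from the issue; `Done` is assumed.
#[derive(Debug, PartialEq)]
enum HandlerOutcome<T> {
    Done(T),
    Retry,
}

/// Hypothetical transient conditions mirroring the examples listed above.
#[derive(Debug)]
enum TransientError {
    NetworkUnavailable,
    SessionNotReady,
    StreamNotRegistered,
    Backpressure,
}

/// Runs one fallible execution step. Every transient error maps to `Retry`,
/// because the job has not produced a meaningful result yet.
fn run_step<T>(step: impl FnOnce() -> Result<T, TransientError>) -> HandlerOutcome<T> {
    match step() {
        Ok(value) => HandlerOutcome::Done(value),
        Err(_) => HandlerOutcome::Retry,
    }
}
```

The error type here deliberately contains only transient conditions, so a negative protocol verdict cannot be routed into `Retry` by accident.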
State machine concern: completed execution with a failing protocol result
This should be represented in the action result payload returned via ActionCompletion.
Examples:
- adaptor verification returned false
- opened-share verification failed
- garbling/evaluation validity check failed
- commitment/consistency validation completed and produced a negative verdict
These are not scheduler failures. The job completed and produced a result that the STF must handle.
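One possible shape for such a payload is sketched below. Everything in it is an assumption made for illustration: the `Verdict` enum, the `OpenedShareChecked` variant, the fields of `ActionCompletion`, and the integer comparison standing in for real opened-share verification.

```rust
/// Hypothetical verdict carried inside an action result.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Verdict {
    Passed,
    Failed,
}

/// Hypothetical action result type; the variant is illustrative only.
#[derive(Debug, PartialEq)]
enum ActionResult {
    OpenedShareChecked(Verdict),
}

/// Simplified stand-in for the completion handed back to the state machine.
#[derive(Debug, PartialEq)]
struct ActionCompletion {
    result: ActionResult,
}

/// Toy check standing in for real verification: plain equality of two integers.
/// The job always completes; a mismatch is reported as data, not as an error.
fn check_opened_share(committed: u64, opened: u64) -> ActionCompletion {
    let verdict = if committed == opened {
        Verdict::Passed
    } else {
        Verdict::Failed
    };
    ActionCompletion {
        result: ActionResult::OpenedShareChecked(verdict),
    }
}
```

Because the function returns a completion in both cases, the scheduler has nothing to retry; the decision about what a `Failed` verdict means stays with the STF.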
Abort / fail-closed requirement
As part of this work, every existing job path must be checked to determine whether the current behavior is correct when an error occurs.
For each current Retry or equivalent error path, classify it as one of:
- Transient execution failure
  - keep as Retry
- Completed domain/protocol failure
  - return a normal ActionCompletion with an action result that encodes the negative verdict
- Abort / fail-closed condition
  - the resulting completion/error semantics must allow the state machine to transition to its aborted/failure state if that is what the protocol requires
  - this must not silently degrade into infinite retry
The key point is that indefinite retry is only correct for genuinely transient execution problems.
If the meaning of the failure is “the protocol instance should abort”, then the job system must surface that in a way the STF can consume and act on.
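The three-way classification could look roughly like the sketch below. The `JobError` variants, the `AbortRequired` result, the `SmState` enum, and the toy `stf` function are all hypothetical names chosen for illustration; how the real STF reacts to a negative verdict is protocol-specific.

```rust
/// Hypothetical action results, including an explicit abort signal.
#[derive(Debug, PartialEq)]
enum ActionResult {
    Verified,
    VerificationFailed,
    AbortRequired,
}

/// Simplified stand-in for the handler outcome; `Done` is an assumed variant.
#[derive(Debug, PartialEq)]
enum HandlerOutcome {
    Retry,
    Done(ActionResult),
}

/// Hypothetical error sites, one per class named in the list above.
#[derive(Debug)]
enum JobError {
    PeerNotReady,       // transient execution failure
    CheckReturnedFalse, // completed domain/protocol failure
    ProtocolViolation,  // abort / fail-closed condition
}

/// Only the transient class is retried; the other two reach the STF as results.
fn to_outcome(step: Result<(), JobError>) -> HandlerOutcome {
    match step {
        Ok(()) => HandlerOutcome::Done(ActionResult::Verified),
        Err(JobError::PeerNotReady) => HandlerOutcome::Retry,
        Err(JobError::CheckReturnedFalse) => HandlerOutcome::Done(ActionResult::VerificationFailed),
        Err(JobError::ProtocolViolation) => HandlerOutcome::Done(ActionResult::AbortRequired),
    }
}

/// Hypothetical state machine states.
#[derive(Debug, PartialEq)]
enum SmState {
    Running,
    Completed,
    Aborted,
}

/// Toy STF that fails closed on any non-success result while running.
/// Terminal states ignore further results.
fn stf(state: SmState, result: ActionResult) -> SmState {
    match (state, result) {
        (SmState::Running, ActionResult::Verified) => SmState::Completed,
        (SmState::Running, _) => SmState::Aborted,
        (other, _) => other,
    }
}
```

The exhaustive `match` in `to_outcome` forces every new error site to be assigned to one of the three classes at compile time, which is one way to keep the audit result from drifting.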
Why not add a generic scheduler-level Failed variant first
A generic terminal Failed in HandlerOutcome / JobCompletion would conflate:
- execution failure
- domain failure
That is the wrong abstraction for the majority of cases we care about here.
For Mosaic, the primary requirement is:
- transient execution problems => retry
- completed verification/processing with a negative verdict => return that verdict in the action result
- cases whose meaning is “abort this protocol instance” => surface that through the completion/result model so the STF can fail closed
Only if we later identify true terminal execution failures that are:
- non-transient,
- not safely retryable, and
- not representable as a domain result,
should we consider adding a third scheduler-level terminal failure channel.
Scope of this issue
- Audit all existing job handlers and classify every Retry / equivalent error site as either:
  - genuinely transient execution failure,
  - completed domain/protocol failure that should be encoded in the action result, or
  - abort/fail-closed condition that must be surfaced to the STF
- Update action result enums and handler implementations where negative protocol outcomes are currently being treated incorrectly.
- Ensure abort-required conditions are surfaced in a way the state machines can consume to transition into aborted/failure states.
- Preserve the invariant that real executor/runtime/setup bugs are not silently disguised as protocol failures.
- Add tests proving:
  - transient failures are retried
  - domain/protocol failures are delivered to the SM as normal completions and are not retried forever
  - abort-relevant failures lead to STF-visible outcomes rather than silent infinite retry
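The retry-related tests could be built around a small bounded driver like the one sketched here. The `drive` helper and the `Done` variant are assumptions for test purposes only; the real scheduler may retry with different policies and has no attempt cap implied by this issue.

```rust
/// Simplified stand-in for the handler outcome; `Done` is an assumed variant.
#[derive(Debug, PartialEq)]
enum HandlerOutcome<T> {
    Done(T),
    Retry,
}

/// Hypothetical test driver: calls the handler until it returns `Done` or the
/// attempt budget is exhausted. Returns the result (if any) and the number of
/// attempts made, so tests can assert on how often a job was retried.
fn drive<T>(
    mut handler: impl FnMut(u32) -> HandlerOutcome<T>,
    max_attempts: u32,
) -> (Option<T>, u32) {
    for attempt in 1..=max_attempts {
        if let HandlerOutcome::Done(value) = handler(attempt) {
            return (Some(value), attempt);
        }
    }
    (None, max_attempts)
}
```

A test can then check that a handler failing transiently twice completes on the third attempt, and that a handler returning a negative verdict (for example `Done(false)`) is invoked exactly once.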
Acceptance criteria
- HandlerOutcome::Retry is reserved for transient execution failures.
- Protocol/domain negative outcomes are represented in ActionResult types where appropriate.
- Existing jobs have been audited and reclassified where necessary.
- No job that has completed meaningful verification/processing is retried forever solely because the verdict was negative.
- Fail-closed / abort-required conditions are surfaced in a form the STF can act on.
- Tests cover retry, negative-result completion, and abort-relevant paths.