job-system: encode protocol/domain failure in action results while reserving retries for transient execution failures #116

@Zk2u

Summary

We need to distinguish two different failure classes in the job system:

  1. Retry
  • the job did not complete execution successfully
  • the failure is transient/infrastructural
  • the scheduler should keep retrying
  2. Domain/protocol failure result
  • the job did complete execution successfully
  • the outcome of that execution is that the protocol/state-machine-level check failed
  • this must be delivered back to the state machine as a normal completion payload

The current system already has the right shape for (1): HandlerOutcome::Retry.
The work needed is to ensure (2) is consistently represented in action result types and routed back through normal ActionCompletion handling.
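
A minimal sketch of the two classes in types, for illustration only. `HandlerOutcome::Retry` and `ActionCompletion` are named in this issue; everything else here (the `Done` variant, the `ActionResult` variants, the fields, `should_retry`) is a hypothetical stand-in and may not match the real Mosaic definitions.

```rust
// Hypothetical, simplified stand-ins for the job-system types named in this issue.
// The real definitions may differ.

/// Result payload of an action that ran to completion.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ActionResult {
    /// The check ran and passed.
    Verified,
    /// The check ran and failed: a domain/protocol verdict, not an execution error.
    VerificationFailed { reason: String },
}

/// Completion delivered back to the state machine.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ActionCompletion {
    pub action_id: u64,
    pub result: ActionResult,
}

/// What a handler reports to the scheduler.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HandlerOutcome {
    /// (1) Execution did not complete; the scheduler should try again.
    Retry,
    /// (2) Execution completed; the payload may carry a negative verdict.
    Done(ActionCompletion),
}

/// Returns true only when the scheduler should re-enqueue the job.
pub fn should_retry(outcome: &HandlerOutcome) -> bool {
    matches!(outcome, HandlerOutcome::Retry)
}
```

With this shape, a `Done` carrying `VerificationFailed` is not a retry from the scheduler's point of view; it is an ordinary completion whose payload the state machine has to interpret.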

This issue also subsumes the review work: all existing jobs need to be audited so we classify current error handling correctly, including cases where the right state-machine behavior is to abort/fail closed rather than retry indefinitely.

Correct model

Scheduler/executor concern: transient execution failure

This remains:

  • HandlerOutcome::Retry

Examples:

  • network temporarily unavailable
  • storage/session not yet available
  • peer not ready / expected stream not yet registered
  • backpressure / temporary runtime unavailability

These are not protocol outcomes. The job has not run to completion, so there is no verdict to report and the scheduler should retry.
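
As a rough sketch of the handler side, assuming a hypothetical `TransientError` enum that mirrors the examples above and a simplified generic `HandlerOutcome<T>` (neither is the real Mosaic type):

```rust
// Hypothetical stand-ins; the real job-system types may differ.

/// Transient execution failures, mirroring the examples listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransientError {
    NetworkUnavailable,
    StorageNotReady,
    PeerNotReady,
    Backpressure,
}

/// Simplified handler outcome, generic over the completion payload.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HandlerOutcome<T> {
    Retry,
    Done(T),
}

/// Runs one execution attempt and maps any transient error to `Retry`.
/// The error carries no protocol meaning, so nothing is reported to the state machine.
pub fn attempt<T>(step: impl FnOnce() -> Result<T, TransientError>) -> HandlerOutcome<T> {
    match step() {
        Ok(payload) => HandlerOutcome::Done(payload),
        Err(_transient) => HandlerOutcome::Retry,
    }
}
```

The error type here deliberately contains only transient conditions. A negative protocol verdict would not fit into it and would have to travel in the `Ok` payload instead.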

State machine concern: completed execution with a failing protocol result

This should be represented in the action result payload returned via ActionCompletion.

Examples:

  • adaptor verification returned false
  • opened-share verification failed
  • garbling/evaluation validity check failed
  • commitment/consistency validation completed and produced a negative verdict

These are not scheduler failures. The job completed and produced a result that the STF must handle.
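
A sketch of what this could look like for an opened-share check. The names `ShareCheckResult`, `handle_opened_share`, and `toy_commit` are invented for illustration, and the "commitment" is a toy arithmetic function, not a real cryptographic check.

```rust
// Hypothetical stand-ins; only HandlerOutcome::Retry and ActionCompletion
// are names taken from this issue.

/// Verdict of a completed opened-share verification.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ShareCheckResult {
    Valid,
    Invalid,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ActionCompletion {
    pub result: ShareCheckResult,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HandlerOutcome {
    Retry,
    Done(ActionCompletion),
}

/// Toy stand-in for a commitment function. NOT cryptographically meaningful.
fn toy_commit(share: u64) -> u64 {
    share.wrapping_mul(31).wrapping_add(7)
}

/// `share` is `None` while the opened share has not arrived yet.
pub fn handle_opened_share(share: Option<u64>, commitment: u64) -> HandlerOutcome {
    let share = match share {
        Some(s) => s,
        // Input not available yet: transient, so the scheduler retries.
        None => return HandlerOutcome::Retry,
    };
    // The check itself ran to completion; either verdict is a normal completion.
    let result = if toy_commit(share) == commitment {
        ShareCheckResult::Valid
    } else {
        ShareCheckResult::Invalid
    };
    HandlerOutcome::Done(ActionCompletion { result })
}
```

The point of the sketch is that `Invalid` leaves the handler through the same `Done` path as `Valid`, so the scheduler never sees it as a failure.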

Abort / fail-closed requirement

As part of this work, every existing job path must be checked to determine whether the current behavior is correct when an error occurs.

For each current Retry or equivalent error path, classify it as one of:

  1. Transient execution failure
  • keep as Retry
  2. Completed domain/protocol failure
  • return a normal ActionCompletion with an action result that encodes the negative verdict
  3. Abort / fail-closed condition
  • the resulting completion/error semantics must allow the state machine to transition to its aborted/failure state if that is what the protocol requires
  • this must not silently degrade into infinite retry

The key point is that indefinite retry is only correct for genuinely transient execution problems.
If the meaning of the failure is “the protocol instance should abort”, then the job system must surface that in a way the STF can consume and act on.
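
One possible way for the STF to consume such results, sketched with invented names (`State`, `stf`, and the `ActionResult` variants are assumptions, not the existing Mosaic state machines). Whether a negative verdict should abort is protocol-specific; this sketch simply assumes it does.

```rust
// Hypothetical stand-ins for illustration; the real state machines may differ.

/// Payload carried by a normal completion.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ActionResult {
    Verified,
    /// Completed check with a negative verdict.
    VerificationFailed,
    /// Abort-required condition detected while executing the job.
    AbortRequired { reason: String },
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum State {
    AwaitingVerification,
    Ready,
    Aborted { reason: String },
}

/// State transition function: consumes a completion payload and fails closed
/// on anything other than a positive verdict.
pub fn stf(state: State, result: ActionResult) -> State {
    match (state, result) {
        (State::AwaitingVerification, ActionResult::Verified) => State::Ready,
        (State::AwaitingVerification, ActionResult::VerificationFailed) => State::Aborted {
            reason: "verification failed".to_string(),
        },
        (State::AwaitingVerification, ActionResult::AbortRequired { reason }) => {
            State::Aborted { reason }
        }
        // In this sketch, other states ignore late or unrelated completions.
        (other, _) => other,
    }
}
```

Because the abort condition arrives as a regular payload, the job is finished from the scheduler's perspective and cannot fall back into the retry loop.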

Why not add a generic scheduler-level Failed variant first

A generic terminal Failed in HandlerOutcome / JobCompletion would conflate:

  • execution failure
  • domain failure

That is the wrong abstraction for the majority of cases we care about here.

For Mosaic, the primary requirement is:

  • transient execution problems => retry
  • completed verification/processing with a negative verdict => return that verdict in the action result
  • cases whose meaning is “abort this protocol instance” => surface that through the completion/result model so the STF can fail closed

Only if we later identify true terminal execution failures that are:

  • non-transient,
  • not safely retryable, and
  • not representable as a domain result,

should we consider adding a third scheduler-level terminal failure channel.

Scope of this issue

  1. Audit all existing job handlers and classify every Retry / equivalent error site as either:
    • genuinely transient execution failure,
    • completed domain/protocol failure that should be encoded in the action result, or
    • abort/fail-closed condition that must be surfaced to the STF
  2. Update action result enums and handler implementations where negative protocol outcomes are currently being treated incorrectly.
  3. Ensure abort-required conditions are surfaced in a way the state machines can consume to transition into aborted/failure states.
  4. Preserve the invariant that real executor/runtime/setup bugs are not silently disguised as protocol failures.
  5. Add tests proving:
    • transient failures are retried
    • domain/protocol failures are delivered to the SM as normal completions and are not retried forever
    • abort-relevant failures lead to STF-visible outcomes rather than silent infinite retry
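
The tests in item 5 could be built around something like the following sketch. `run_job`, `flaky_handler`, `rejecting_handler`, and `Verdict` are hypothetical helpers, not existing test utilities, and the attempt cap exists only so the sketch terminates; the real scheduler's retry policy is not described here.

```rust
// Hypothetical test scaffolding; the real scheduler API may differ.

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Verdict {
    Accepted,
    Rejected,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HandlerOutcome {
    Retry,
    Done(Verdict),
}

/// Minimal scheduler loop: re-runs the handler while it reports `Retry`,
/// up to `max_attempts`. Returns the delivered verdict (if any) and the
/// number of attempts used.
pub fn run_job(
    mut handler: impl FnMut(u32) -> HandlerOutcome,
    max_attempts: u32,
) -> (Option<Verdict>, u32) {
    for attempt in 1..=max_attempts {
        if let HandlerOutcome::Done(verdict) = handler(attempt) {
            return (Some(verdict), attempt);
        }
    }
    (None, max_attempts)
}

/// Simulates a transient failure on the first two attempts, then success.
pub fn flaky_handler(attempt: u32) -> HandlerOutcome {
    if attempt < 3 {
        HandlerOutcome::Retry
    } else {
        HandlerOutcome::Done(Verdict::Accepted)
    }
}

/// Completes on the first attempt with a negative verdict.
pub fn rejecting_handler(_attempt: u32) -> HandlerOutcome {
    HandlerOutcome::Done(Verdict::Rejected)
}
```

A test would then assert that `flaky_handler` completes on its third attempt, while `rejecting_handler` delivers `Rejected` after exactly one attempt and is never re-run.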

Acceptance criteria

  • HandlerOutcome::Retry is reserved for transient execution failures.
  • Protocol/domain negative outcomes are represented in ActionResult types where appropriate.
  • Existing jobs have been audited and reclassified where necessary.
  • No job that has completed meaningful verification/processing is retried forever solely because the verdict was negative.
  • Fail-closed / abort-required conditions are surfaced in a form the STF can act on.
  • Tests cover retry, negative-result completion, and abort-relevant paths.

Labels: bug, enhancement, help wanted
