sm-executor: stop dropping completed job results on transient apply/commit failures #153

@Zk2u

Summary

A job can complete successfully and still never update the state machine.

Today, if sm-executor receives a completion and then hits a non-fatal error while applying it, it logs and drops the completion permanently. The same general problem exists if a successful completion cannot be delivered back through the completion channel. This means meaningful completed work can be lost after execution, which violates the core durability/correctness invariant of the job system.

This is also the most likely explanation for the original GeneratingPolynomialCommitments wedge, in which a small number of wire results stayed missing until a restart/restore re-emitted the work.

This issue is about preserving the invariant that once meaningful work has completed, its result is not silently lost during delivery or application.

Correct model

There are three distinct phases:

  1. job execution
  2. completion delivery
  3. completion application to SM state

If phase 1 succeeded, the system must not lose the result during phases 2 or 3 just because storage, commit, or application hit a transient failure.

The correct behavior is:

  1. Transient execution failure before a completion exists
  • retry the job
  2. Transient failure while delivering or applying an already-produced completion
  • internally retry/requeue application of that completion inside sm-executor
  • do not drop the completion
  3. Completed domain/protocol failure verdict
  • surface it through the normal completion/result model
  • do not silently convert it into a dropped completion or infinite retry
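
To make the three cases concrete, here is a minimal sketch of the mapping from failure kind to handling. It is written in Python only for brevity; `FailureKind`, `Action`, `HANDLING`, and `handling_for` are hypothetical names for illustration, not sm-executor's actual types. How a real error gets classified into one of these kinds is exactly what the audit has to decide.

```python
from enum import Enum, auto


class FailureKind(Enum):
    # Hypothetical classification mirroring the three cases listed above.
    TRANSIENT_BEFORE_COMPLETION = auto()  # execution failed, no completion exists
    TRANSIENT_AFTER_COMPLETION = auto()   # delivery/application of a completion failed
    DOMAIN_VERDICT = auto()               # the job completed with a failure verdict


class Action(Enum):
    RETRY_JOB = auto()           # re-run the job
    REQUEUE_COMPLETION = auto()  # keep the completion and retry applying it
    SURFACE_VERDICT = auto()     # report via the normal completion/result model
    # Deliberately no "drop" action: none of the three cases allows it.


HANDLING = {
    FailureKind.TRANSIENT_BEFORE_COMPLETION: Action.RETRY_JOB,
    FailureKind.TRANSIENT_AFTER_COMPLETION: Action.REQUEUE_COMPLETION,
    FailureKind.DOMAIN_VERDICT: Action.SURFACE_VERDICT,
}


def handling_for(kind: FailureKind) -> Action:
    """Return the handling for a classified failure."""
    return HANDLING[kind]
```

Keeping the table total over `FailureKind` means a newly added kind fails loudly with a `KeyError` instead of falling through to a default that might drop the result.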

The key point is that “work completed” and “state durably updated” are separate concerns, and the system currently treats some failures to durably update state as if the completed result could be discarded.

Desired solution

  1. Audit all completion handling paths in sm-executor and classify which apply-time errors are transient and must be retried rather than dropped.
  2. Implement internal retry/requeue for already-produced completions inside sm-executor.
  3. Ensure storage/commit failures during completion application do not permanently wedge the protocol instance.
  4. Ensure successful completions are not silently lost if a completion channel closes after the work has already finished.
  5. Add explicit logging/metrics around:
  • completion produced
  • completion delivery failed
  • completion application failed transiently
  • completion requeued
  • completion durably applied

The intended fix here is internal retry/requeue in sm-executor, not a separate external durability mechanism. Once a completion exists, sm-executor should keep trying to apply it until it either succeeds or hits a genuinely terminal condition.
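
A rough sketch of what "keep trying until success or a terminal condition" could look like, again in Python for brevity. Everything here is an assumption for illustration: `Completion`, `TransientApplyError`, `TerminalApplyError`, `drain`, and the metric names (which follow the logging/metrics list above) are hypothetical. A real implementation would likely add backoff between attempts; `max_rounds` exists only so the sketch always terminates.

```python
from collections import Counter, deque
from dataclasses import dataclass


class TransientApplyError(Exception):
    """Hypothetical: storage/commit failure that is safe to retry."""


class TerminalApplyError(Exception):
    """Hypothetical: genuinely terminal condition; retrying cannot help."""


@dataclass
class Completion:
    job_id: str
    result: str


def drain(pending, apply, metrics, max_rounds=1000):
    """Apply pending completions, requeueing on transient failure.

    `pending` is a deque of Completion, `apply` is a callable that may raise
    one of the two errors above, and `metrics` is a Counter.
    Returns the (completion, error) pairs that hit a terminal condition.
    """
    terminal = []
    rounds = 0
    while pending and rounds < max_rounds:
        rounds += 1
        completion = pending.popleft()
        try:
            apply(completion)
        except TransientApplyError:
            # Never drop: put the completion back and count both events.
            metrics["completion_application_failed_transiently"] += 1
            pending.append(completion)
            metrics["completion_requeued"] += 1
        except TerminalApplyError as err:
            # Terminal outcomes are returned to the caller, not swallowed.
            terminal.append((completion, err))
        else:
            metrics["completion_durably_applied"] += 1
    return terminal
```

Note that the only ways a completion leaves `pending` for good are a successful apply or an explicit terminal classification that is handed back to the caller. If `max_rounds` is exhausted, any completion still failing transiently stays in `pending` rather than being discarded.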

Scope of this issue

  1. Audit completion handling in sm-executor.
  2. Audit completion send/drop behavior in the regular worker pool and garbling coordinator.
  3. Reclassify transient apply-time failures so they trigger internal requeue behavior rather than “warn and drop”.
  4. Add tests covering completion delivery/application failure after successful job execution.
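
For the send/drop audit, one possible shape of a non-dropping send is sketched below. This is not a description of the current worker pool or garbling coordinator code: `CompletionChannel`, `ChannelClosed`, and `Worker` are hypothetical stand-ins, and parking undelivered completions in memory is just one option that stays inside the process.

```python
class ChannelClosed(Exception):
    """Hypothetical: raised when sending on a closed completion channel."""


class CompletionChannel:
    """Hypothetical stand-in for the completion channel."""

    def __init__(self):
        self.closed = False
        self.items = []

    def send(self, item):
        if self.closed:
            raise ChannelClosed()
        self.items.append(item)


class Worker:
    def __init__(self, channel):
        self.channel = channel
        self.undelivered = []  # finished work that could not be delivered yet

    def deliver(self, completion):
        """Send a completion; park it instead of dropping it on failure."""
        try:
            self.channel.send(completion)
            return True
        except ChannelClosed:
            self.undelivered.append(completion)
            return False

    def redeliver(self, channel):
        """Switch to a new channel and flush parked completions in order."""
        self.channel = channel
        parked, self.undelivered = self.undelivered, []
        for completion in parked:
            self.deliver(completion)
```

The audit question for each real send site is then simply: on a failed send, does the completion end up somewhere it can be redelivered from, or does it fall out of scope?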

Acceptance criteria

  • A transient storage or commit failure during completion application cannot permanently wedge setup.
  • Successful polynomial-commitment jobs are not lost due to apply-time failures.
  • Restart/restore is no longer required to recover from transient completion-application failures.
  • Successful job completions are not silently discarded if delivery/application temporarily fails.
  • sm-executor internally requeues/retries failed completion application until success or terminal classification.
  • Tests cover transient failure after successful job execution and prove eventual durable application.
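
As a sketch of the kind of test the last criterion asks for: inject a fixed number of commit failures after the job has already succeeded, then assert the result still becomes durable without any restart. `FlakyStore`, `TransientCommitError`, and `apply_until_durable` are hypothetical; in the real test, `apply_until_durable` would be replaced by sm-executor's own completion-application path.

```python
import unittest


class TransientCommitError(Exception):
    """Hypothetical: injected transient commit failure."""


class FlakyStore:
    """Hypothetical in-memory store whose first `failures` commits fail."""

    def __init__(self, failures):
        self.failures = failures
        self.committed = {}
        self.commit_attempts = 0

    def commit(self, key, value):
        self.commit_attempts += 1
        if self.failures > 0:
            self.failures -= 1
            raise TransientCommitError("injected commit failure")
        self.committed[key] = value


def apply_until_durable(store, key, value, max_attempts=10):
    """Stand-in for the code under test: retry commit on transient failure.

    Returns the number of attempts it took to commit.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            store.commit(key, value)
            return attempt
        except TransientCommitError:
            continue
    raise RuntimeError("commit still failing after max_attempts")


class CompletionSurvivesTransientCommitFailure(unittest.TestCase):
    def test_result_is_eventually_durable(self):
        store = FlakyStore(failures=3)
        attempts = apply_until_durable(store, "wire-7", "commitment")
        # Three injected failures, then one successful commit.
        self.assertEqual(attempts, 4)
        self.assertEqual(store.committed, {"wire-7": "commitment"})
```

The same harness can cover the negative case by setting `failures` above `max_attempts` and asserting that the failure is raised rather than silently swallowed.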

Labels: bug
