Summary
A job can complete successfully and still never update the state machine.
Today, if sm-executor receives a completion and then hits a non-fatal error while applying it, it logs and drops the completion permanently. The same general problem exists if a successful completion cannot be delivered back through the completion channel. This means meaningful completed work can be lost after execution, which violates the core durability/correctness invariant of the job system.
This is also the most likely explanation for the original GeneratingPolynomialCommitments wedge where a small number of wire results were missing until restart/restore re-emitted the work.
This issue is about preserving the invariant that once meaningful work has completed, its result is not silently lost during delivery or application.
Correct model
There are three distinct phases:
1. job execution
2. completion delivery
3. completion application to SM state
If phase 1 succeeded, the system must not lose the result during phases 2 or 3 just because storage, commit, or application hit a transient failure.
The correct behavior is:
- Transient execution failure before a completion exists
- Transient failure while delivering or applying an already-produced completion
  - internally retry/requeue application of that completion inside sm-executor
  - do not drop the completion
- Completed domain/protocol failure verdict
  - surface it through the normal completion/result model
  - do not silently convert it into a dropped completion or infinite retry
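The classification above can be sketched as a small policy table. This is an illustration only: `Outcome`, `Action`, and `decide` are hypothetical names, not sm-executor's actual types, and the `RETRY_JOB` action for the first case is an assumption, since the list does not state how that case is handled.

```python
from enum import Enum, auto


class Outcome(Enum):
    """Hypothetical classes of failure, mirroring the three cases in the list."""
    EXECUTION_TRANSIENT = auto()  # job failed before any completion existed
    APPLY_TRANSIENT = auto()      # completion exists; delivery or apply failed transiently
    DOMAIN_VERDICT = auto()       # job completed with a domain/protocol failure verdict


class Action(Enum):
    """Hypothetical responses; none of them discards a produced completion."""
    RETRY_JOB = auto()            # ASSUMPTION: the text does not spell out this case
    REQUEUE_COMPLETION = auto()   # keep the completion and try to apply it again
    SURFACE_RESULT = auto()       # report through the normal completion/result model


POLICY = {
    Outcome.EXECUTION_TRANSIENT: Action.RETRY_JOB,
    Outcome.APPLY_TRANSIENT: Action.REQUEUE_COMPLETION,
    Outcome.DOMAIN_VERDICT: Action.SURFACE_RESULT,
}


def decide(outcome: Outcome) -> Action:
    """Look up the response for a classified outcome."""
    return POLICY[outcome]
```

Keeping the mapping in one table makes the audit result reviewable in a single place: every error class has an explicit response, and "drop" is not among the available actions.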
The key point is that “work completed” and “state durably updated” are separate concerns, and the system currently treats some failures in the second category as if the result can be discarded.
Desired solution
- Audit all completion handling paths in sm-executor and classify which apply-time errors are transient and must be retried rather than dropped.
- Implement internal retry/requeue for already-produced completions inside sm-executor.
- Ensure storage/commit failures during completion application do not permanently wedge the protocol instance.
- Ensure successful completions are not silently lost if a completion channel closes after the work has already finished.
- Add explicit logging/metrics around:
  - completion produced
  - completion delivery failed
  - completion application failed transiently
  - completion requeued
  - completion durably applied
The intended fix here is internal retry/requeue in sm-executor, not a separate external durability mechanism. Once a completion exists, sm-executor should keep trying to apply it until it either succeeds or hits a genuinely terminal condition.
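A minimal sketch of such a requeue loop, under stated assumptions: `TransientApplyError`, `TerminalApplyError`, `drain`, and the metric names are hypothetical and do not come from sm-executor. The `max_rounds` bound exists only to keep the sketch finite; a real loop would more likely back off between attempts.

```python
from collections import Counter, deque


class TransientApplyError(Exception):
    """Hypothetical: storage/commit failure that may succeed on a later attempt."""


class TerminalApplyError(Exception):
    """Hypothetical: condition under which retrying can never succeed."""


def drain(pending, apply, metrics, max_rounds=1000):
    """Apply queued completions, requeueing on transient failure.

    Returns the completions that hit a terminal condition so the caller
    can surface them instead of losing them.
    """
    terminal = []
    rounds = 0
    while pending and rounds < max_rounds:
        rounds += 1
        completion = pending.popleft()
        try:
            apply(completion)
        except TransientApplyError:
            metrics["completion_application_failed_transiently"] += 1
            pending.append(completion)  # requeue instead of "warn and drop"
            metrics["completion_requeued"] += 1
        except TerminalApplyError:
            terminal.append(completion)
        else:
            metrics["completion_durably_applied"] += 1
    return terminal


# Usage: "wire-7" fails twice with a transient error, then applies.
applied = []
failures_left = {"wire-7": 2}


def flaky_apply(completion):
    if failures_left.get(completion, 0) > 0:
        failures_left[completion] -= 1
        raise TransientApplyError(completion)
    applied.append(completion)


metrics = Counter()
queue = deque(["wire-6", "wire-7", "wire-8"])
leftover = drain(queue, flaky_apply, metrics)
```

In this run all three completions end up applied, with "wire-7" requeued twice along the way. Requeueing at the back of the queue lets other completions make progress while one is failing; whether ordering between completions matters for the state machine is not addressed by this sketch.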
Scope of this issue
- Audit completion handling in sm-executor.
- Audit completion send/drop behavior in the regular worker pool and garbling coordinator.
- Reclassify transient apply-time failures so they trigger internal requeue behavior rather than “warn and drop”.
- Add tests covering completion delivery/application failure after successful job execution.
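For the send/drop audit, the delivery-side behavior can be illustrated with a toy model. Everything here is an assumption made for illustration: `CompletionChannel`, `ChannelClosed`, `Worker`, and the idea that a replacement channel becomes available are hypothetical and not taken from the worker pool or garbling coordinator code.

```python
class ChannelClosed(Exception):
    """Hypothetical error raised when the completion channel is gone."""


class CompletionChannel:
    """Toy stand-in for the channel between a worker and sm-executor."""

    def __init__(self):
        self.closed = False
        self.received = []

    def send(self, completion):
        if self.closed:
            raise ChannelClosed()
        self.received.append(completion)


class Worker:
    def __init__(self, channel):
        self.channel = channel
        self.undelivered = []  # completions produced but not yet delivered

    def report(self, completion):
        """Record the completion first, then try to deliver it."""
        self.undelivered.append(completion)
        return self.flush()

    def flush(self):
        """Deliver held completions in order; stop at the first failure."""
        while self.undelivered:
            try:
                self.channel.send(self.undelivered[0])
            except ChannelClosed:
                return False  # still held, not dropped
            self.undelivered.pop(0)
        return True

    def reconnect(self, channel):
        """Swap in a new channel and retry everything still held."""
        self.channel = channel
        return self.flush()


# Usage: the channel closes after job-2 has already finished.
first = CompletionChannel()
worker = Worker(first)
worker.report("job-1")
first.closed = True
delivered = worker.report("job-2")
second = CompletionChannel()
worker.reconnect(second)
```

The point of the sketch is the ordering in `report`: the completion is recorded before the send is attempted, so a failed send leaves it held rather than discarded.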
Acceptance criteria
- A transient storage or commit failure during completion application cannot permanently wedge setup.
- Successful polynomial-commitment jobs are not lost due to apply-time failures.
- Restart/restore is no longer required to recover from transient completion-application failures.
- Successful job completions are not silently discarded if delivery/application temporarily fails.
- sm-executor internally requeues/retries failed completion application until success or terminal classification.
- Tests cover transient failure after successful job execution and prove eventual durable application.