sm-executor: stop dropping completed job results on transient apply/commit failures #153

## Summary

A job can complete successfully and still never update the state machine.

Today, if `sm-executor` receives a completion and then hits a non-fatal error while applying it, it logs a warning and permanently drops the completion. The same problem occurs when a successful completion cannot be delivered back through the completion channel. In both cases completed work is lost after execution, which violates the core durability/correctness invariant of the job system.

This is also the most likely explanation for the original `GeneratingPolynomialCommitments` wedge, in which a small number of wire results were missing until restart/restore re-emitted the work.

This issue is about preserving the invariant that once meaningful work has completed, its result is not silently lost during delivery or application.

## Correct model

There are three distinct phases:

1. job execution
2. completion delivery
3. completion application to SM state

If phase 1 succeeded, the system must not lose the result during phases 2 or 3 just because storage, commit, or application hit a transient failure.

The correct behavior is:

1. **Transient execution failure before a completion exists**
   - retry the job

2. **Transient failure while delivering or applying an already-produced completion**
   - internally retry/requeue application of that completion inside `sm-executor`
   - do not drop the completion

3. **Completed domain/protocol failure verdict**
   - surface it through the normal completion/result model
   - do not silently convert it into a dropped completion or infinite retry

The key point is that “work completed” and “state durably updated” are separate concerns, and the system currently treats some failures to durably update state as if the completed result can be discarded.
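
The three cases above can be read as a small decision table. The following is a minimal sketch in Python, not the `sm-executor` implementation; `FailureKind`, `Action`, and `decide` are hypothetical names introduced only for illustration. The point it shows is that no branch maps to "drop the completion".

```python
from enum import Enum, auto


class FailureKind(Enum):
    """Hypothetical classification of what went wrong and when."""
    TRANSIENT_EXECUTION = auto()    # phase 1 failed, no completion exists yet
    TRANSIENT_DELIVERY = auto()     # phase 2 failed, completion already produced
    TRANSIENT_APPLICATION = auto()  # phase 3 failed, completion already produced
    PROTOCOL_VERDICT = auto()       # the job finished with a domain/protocol failure


class Action(Enum):
    """Hypothetical handling options; note there is no DROP variant."""
    RETRY_JOB = auto()
    REQUEUE_COMPLETION = auto()
    SURFACE_VERDICT = auto()


_DECISIONS = {
    FailureKind.TRANSIENT_EXECUTION: Action.RETRY_JOB,
    FailureKind.TRANSIENT_DELIVERY: Action.REQUEUE_COMPLETION,
    FailureKind.TRANSIENT_APPLICATION: Action.REQUEUE_COMPLETION,
    FailureKind.PROTOCOL_VERDICT: Action.SURFACE_VERDICT,
}


def decide(kind: FailureKind) -> Action:
    """Return the handling for a failure kind, per the three cases above."""
    return _DECISIONS[kind]
```

Keeping the mapping total over `FailureKind` means a newly added failure kind raises a `KeyError` in this sketch until someone classifies it, rather than silently falling through to a default.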

## Desired solution

1. Audit all completion handling paths in `sm-executor` and classify which apply-time errors are transient and must be retried rather than dropped.
2. Implement internal retry/requeue for already-produced completions inside `sm-executor`.
3. Ensure storage/commit failures during completion application do not permanently wedge the protocol instance.
4. Ensure successful completions are not silently lost if a completion channel closes after the work has already finished.
5. Add explicit logging/metrics around:
   - completion produced
   - completion delivery failed
   - completion application failed transiently
   - completion requeued
   - completion durably applied

The intended fix here is internal retry/requeue in `sm-executor`, not a separate external durability mechanism. Once a completion exists, `sm-executor` should keep trying to apply it until it either succeeds or hits a genuinely terminal condition.
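
As a rough illustration of "keep trying until success or a terminal condition", here is a minimal Python sketch of a requeue loop. It is written under assumptions: the exception types, the `drain` function, the backoff constants, and the metric names are all hypothetical and only mirror the logging/metrics list in this issue.

```python
import collections
import time


class TransientApplyError(Exception):
    """Hypothetical: storage or commit failed but may succeed on retry."""


class TerminalApplyError(Exception):
    """Hypothetical: the completion can never be applied."""


def drain(pending, apply, metrics, max_backoff=1.0, sleep=time.sleep):
    """Apply every queued completion, requeueing on transient failure.

    `pending` is a deque of (completion, attempts) pairs. Returns the
    completions that hit a terminal condition so the caller can surface them.
    """
    terminal = []
    while pending:
        completion, attempts = pending.popleft()
        try:
            apply(completion)
        except TransientApplyError:
            metrics["completion_application_failed_transiently"] += 1
            # Capped exponential backoff, then put the completion back.
            sleep(min(max_backoff, 0.01 * 2 ** min(attempts, 10)))
            pending.append((completion, attempts + 1))
            metrics["completion_requeued"] += 1
        except TerminalApplyError:
            terminal.append(completion)
        else:
            metrics["completion_durably_applied"] += 1
    return terminal
```

A call might look like `drain(collections.deque([(completion, 0)]), apply_fn, collections.Counter())`. The `sleep` parameter is injectable so tests can skip real delays; an actual executor would more likely schedule the retry than block on it.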

## Scope of this issue

1. Audit completion handling in `sm-executor`.
2. Audit completion send/drop behavior in the regular worker pool and garbling coordinator.
3. Reclassify transient apply-time failures so they trigger internal requeue behavior rather than “warn and drop”.
4. Add tests covering completion delivery/application failure after successful job execution.
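
For the send/drop audit, the behavior to look for is a sender that discards a finished result when the channel is closed. The sketch below shows one possible alternative in Python: park the completion and redeliver later. `CompletionChannel`, `ChannelClosed`, `deliver`, and `redeliver` are hypothetical stand-ins, not the worker pool's or garbling coordinator's real types.

```python
class ChannelClosed(Exception):
    """Hypothetical: raised when the receiving side has gone away."""


class CompletionChannel:
    """Toy stand-in for the completion channel; not the real transport."""

    def __init__(self):
        self.closed = False
        self.received = []

    def send(self, completion):
        if self.closed:
            raise ChannelClosed()
        self.received.append(completion)


def deliver(completion, channel, undelivered):
    """Send a finished result, parking it instead of dropping it on failure."""
    try:
        channel.send(completion)
        return True
    except ChannelClosed:
        undelivered.append(completion)
        return False


def redeliver(undelivered, channel):
    """Retry parked completions, e.g. once the channel is usable again."""
    parked = list(undelivered)
    undelivered.clear()
    for completion in parked:
        # Anything that fails again is parked again by deliver().
        deliver(completion, channel, undelivered)
```

Whatever form the real fix takes, the property to preserve is the one shown here: a failed send leaves the completion somewhere it can be found again.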

## Acceptance criteria

- A transient storage or commit failure during completion application cannot permanently wedge setup.
- Successful polynomial-commitment jobs are not lost due to apply-time failures.
- Restart/restore is no longer required to recover from transient completion-application failures.
- Successful job completions are not silently discarded if delivery/application temporarily fails.
- `sm-executor` internally requeues/retries failed completion application until success or terminal classification.
- Tests cover transient failure after successful job execution and prove eventual durable application.
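
One possible shape for such a test, sketched in Python with a fake store whose commit fails a fixed number of times. Everything here is an assumption made for illustration: `FlakyStore`, `TransientStoreError`, and `apply_until_committed` are hypothetical, and the `max_attempts` bound exists only so a broken retry path fails the test instead of hanging it.

```python
class TransientStoreError(Exception):
    """Hypothetical transient storage/commit error."""


class FlakyStore:
    """Fake SM store whose commit fails a fixed number of times."""

    def __init__(self, failures):
        self.failures_left = failures
        self.staged = []
        self.committed = []

    def stage(self, completion):
        self.staged.append(completion)

    def commit(self):
        if self.failures_left > 0:
            self.failures_left -= 1
            self.staged.clear()  # a failed commit discards staged changes
            raise TransientStoreError("commit failed")
        self.committed.extend(self.staged)
        self.staged.clear()


def apply_until_committed(store, completion, max_attempts=10):
    """Minimal stand-in for the executor's retry path (an assumption)."""
    for attempt in range(1, max_attempts + 1):
        try:
            store.stage(completion)
            store.commit()
            return attempt
        except TransientStoreError:
            continue
    raise RuntimeError("still failing after max_attempts")


def test_completion_survives_transient_commit_failures():
    store = FlakyStore(failures=3)
    attempts = apply_until_committed(store, "poly-commitment-wire-42")
    assert attempts == 4
    # The result is durably applied, and exactly once, with no restart.
    assert store.committed == ["poly-commitment-wire-42"]
```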

