Recover garbling jobs and make sm-executor receives cancel-safe #167

Merged: sapinb merged 2 commits into main from fix/156 on Apr 9, 2026

@Zk2u (Collaborator) commented on Apr 8, 2026

Summary

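This PR addresses two independent reliability bugs in the scheduler/executor layer: garbling jobs that became unrecoverable once handed off to a worker, and completions that could be dropped when sm-executor canceled a waiting recv() future.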

What changed

Garbling coordinator recovery

  • the coordinator now tracks assigned jobs per worker instead of letting them become unrecoverable once handed off
  • assignment failure, missing chunk report, missing finish report, and unexpected worker exit now requeue the affected jobs
  • finish completions are forwarded back through the coordinator while preserving retry ownership
  • PendingCircuitJob is clonable so coordinator-owned recovery bookkeeping stays explicit
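
The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Coordinator` struct, the `WorkerFailure` enum, the `WorkerId` alias, and every method name are hypothetical, and `PendingCircuitJob` here is a stand-in with a single `id` field. It assumes only what the list states: the coordinator keeps a clone of each assigned job per worker and requeues those jobs when the worker fails.

```rust
use std::collections::{HashMap, VecDeque};

/// Stand-in for the real PendingCircuitJob (hypothetical fields).
/// Clone lets the coordinator keep its own copy after handoff.
#[derive(Clone, Debug, PartialEq)]
struct PendingCircuitJob {
    id: u64,
}

type WorkerId = usize; // hypothetical worker identifier

/// Failure modes that trigger a requeue, mirroring the list above.
#[allow(dead_code)]
#[derive(Debug)]
enum WorkerFailure {
    AssignmentFailed,
    MissingChunkReport,
    MissingFinishReport,
    UnexpectedExit,
}

#[derive(Default)]
struct Coordinator {
    queue: VecDeque<PendingCircuitJob>,
    /// Jobs handed to each worker that have not finished yet.
    assigned: HashMap<WorkerId, Vec<PendingCircuitJob>>,
}

impl Coordinator {
    fn submit(&mut self, job: PendingCircuitJob) {
        self.queue.push_back(job);
    }

    /// Hand the next queued job to `worker`, keeping a clone for recovery.
    fn assign(&mut self, worker: WorkerId) -> Option<PendingCircuitJob> {
        let job = self.queue.pop_front()?;
        self.assigned.entry(worker).or_default().push(job.clone());
        Some(job)
    }

    /// The worker reported a finish: the coordinator stops tracking the job.
    /// Returns false if the job was not assigned to this worker.
    fn finish(&mut self, worker: WorkerId, job_id: u64) -> bool {
        let Some(jobs) = self.assigned.get_mut(&worker) else {
            return false;
        };
        let before = jobs.len();
        jobs.retain(|j| j.id != job_id);
        jobs.len() != before
    }

    /// Requeue every unfinished job the worker held; returns how many.
    fn on_failure(&mut self, worker: WorkerId, _why: WorkerFailure) -> usize {
        let lost = self.assigned.remove(&worker).unwrap_or_default();
        let count = lost.len();
        self.queue.extend(lost);
        count
    }
}
```

In this sketch a job is retried only while the coordinator still tracks it, so a job that was already finished is never requeued when its worker later exits.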

sm-executor cancel safety

  • sm-executor no longer recreates recv() futures inside monoio::select! every loop iteration
  • it now keeps persistent pinned receive futures for:
    • job completions
    • inbound network requests
    • executor commands
    • shutdown
  • this avoids the proven kanal direct-handoff cancellation hole where a completion can be dropped if a waiting recv() future is canceled before being polled to completion
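
To illustrate the failure mode, here is a toy, std-only model of a direct-handoff channel. It is a sketch under stated assumptions, not kanal's implementation and not the sm-executor code: `Chan`, `Recv`, `send`, `recv`, `poll_once`, and both scenario functions are hypothetical names, futures are polled by hand instead of through monoio::select!, and wakers and unregister-on-drop are omitted. The only behavior it models is the one described above: a sender writes straight into a waiting receiver's slot, so dropping that receive future afterwards loses the value.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

type Slot<T> = Rc<RefCell<Option<T>>>;

/// Toy channel state (hypothetical, not kanal's real internals).
struct Chan<T> {
    queue: VecDeque<T>,
    waiter: Option<Slot<T>>, // slot of the receiver that is currently waiting
}
type Shared<T> = Rc<RefCell<Chan<T>>>;

fn channel<T>() -> Shared<T> {
    Rc::new(RefCell::new(Chan { queue: VecDeque::new(), waiter: None }))
}

/// Direct handoff: a waiting receiver gets the value in its private slot,
/// bypassing the shared queue.
fn send<T>(chan: &Shared<T>, value: T) {
    let mut c = chan.borrow_mut();
    let waiter = c.waiter.take();
    match waiter {
        Some(slot) => *slot.borrow_mut() = Some(value),
        None => c.queue.push_back(value),
    }
}

/// Receive future; its first Pending poll registers `slot` as the waiter.
struct Recv<T> {
    chan: Shared<T>,
    slot: Option<Slot<T>>,
}

fn recv<T>(chan: &Shared<T>) -> Recv<T> {
    Recv { chan: Rc::clone(chan), slot: None }
}

impl<T> Future for Recv<T> {
    type Output = T;
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<T> {
        let this = self.get_mut(); // allowed because Recv<T> is Unpin
        if let Some(slot) = &this.slot {
            // Already registered: check for a handed-off value.
            let handed_off = slot.borrow_mut().take();
            return match handed_off {
                Some(v) => Poll::Ready(v),
                None => Poll::Pending,
            };
        }
        let queued = this.chan.borrow_mut().queue.pop_front();
        if let Some(v) = queued {
            return Poll::Ready(v);
        }
        let slot: Slot<T> = Rc::new(RefCell::new(None));
        this.chan.borrow_mut().waiter = Some(Rc::clone(&slot));
        this.slot = Some(slot);
        Poll::Pending
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Poll a future exactly once with a waker that does nothing.
fn poll_once<F: Future + Unpin>(fut: &mut F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    Pin::new(fut).poll(&mut cx)
}

/// Old pattern: a fresh receive future per loop iteration.
fn recreate_each_iteration() -> Option<u32> {
    let chan = channel();
    let mut first = recv(&chan);
    assert!(poll_once(&mut first).is_pending()); // registers as waiter
    send(&chan, 42); // value lands in `first`'s slot
    drop(first); // another select branch won; the future is canceled
    let mut second = recv(&chan); // next iteration starts from scratch
    match poll_once(&mut second) {
        Poll::Ready(v) => Some(v),
        Poll::Pending => None, // the handed-off value is gone
    }
}

/// New pattern: the same receive future survives across iterations.
fn persistent_future() -> Option<u32> {
    let chan = channel();
    let mut fut = recv(&chan);
    assert!(poll_once(&mut fut).is_pending());
    send(&chan, 42);
    // Another branch may win this iteration, but `fut` is kept and polled again.
    match poll_once(&mut fut) {
        Poll::Ready(v) => Some(v),
        Poll::Pending => None,
    }
}
```

In this model `recreate_each_iteration()` returns `None` because the value dies with the dropped future, while `persistent_future()` returns `Some(42)` because the same future is polled again after the handoff.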

Tests

Added/updated tests for:

  • garbling coordinator recovery on assignment failure
  • garbling coordinator recovery on missing chunk report
  • garbling coordinator recovery on missing finish report
  • finish report forwarding plus retry preservation
  • kanal waiting-receive drop loses direct-handoff message
  • persistent receive future preserves direct-handoff message

Validation

Passed:

  • cargo test -p mosaic-sm-executor
  • cargo clippy -p mosaic-sm-executor --tests -- -D warnings
  • cargo test -p mosaic-job-scheduler
  • cargo clippy -p mosaic-job-scheduler --tests -- -D warnings
  • cargo test -p mosaic-job-api
  • cargo fmt --check --all

just -f .justfile ci passed the formatting, clippy, docs, and scheduler/executor stages cleanly, but timed out in the unrelated, long-running mosaic-storage-fdb tests in this environment.

Notes

I also updated issue #156 to clarify that the rare GeneratingPolynomialCommitments wedge is tracked separately in #166 and is not part of the coordinator worker-loss bug.

@sapinb (Collaborator) left a comment

LGTM

@sapinb sapinb merged commit f786c88 into main Apr 9, 2026
14 checks passed
@sapinb sapinb deleted the fix/156 branch April 9, 2026 06:21