Recover garbling jobs and make sm-executor receives cancel-safe by Zk2u · Pull Request #167 · alpenlabs/mosaic

Zk2u · 2026-04-08T22:50:16Z

Summary

This PR addresses two independent reliability bugs in the scheduler/executor layer:

- fixes #156 (job-scheduler: recover garbling sessions on worker/report failure instead of losing jobs) by making the garbling coordinator retain authoritative ownership of in-flight PendingCircuitJobs and requeue them on assignment/report/worker failure
- fixes #166 (sm-executor: avoid cancel-unsafe job completion receives in main select loop) by making sm-executor main-loop receives cancel-safe so completed jobs cannot be lost when another select! branch wins

What changed

Garbling coordinator recovery

- the coordinator now tracks assigned jobs per worker instead of letting them become unrecoverable once handed off
- assignment failure, missing chunk report, missing finish report, and unexpected worker exit now requeue the affected jobs
- finish completions are forwarded back through the coordinator while preserving retry ownership
- PendingCircuitJob is clonable so coordinator-owned recovery bookkeeping stays explicit
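
For readers who have not opened the diff, the ownership model above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the Coordinator struct, the WorkerId alias, the method names, and the single id field on PendingCircuitJob are all hypothetical stand-ins.

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical stand-in for the real job type; the actual fields differ.
#[derive(Clone, Debug, PartialEq)]
struct PendingCircuitJob {
    id: u64,
}

// Hypothetical worker identifier.
type WorkerId = usize;

#[derive(Default)]
struct Coordinator {
    // Jobs waiting to be handed to a worker.
    queue: VecDeque<PendingCircuitJob>,
    // Jobs currently handed out, keyed by the worker holding them.
    assigned: HashMap<WorkerId, Vec<PendingCircuitJob>>,
}

impl Coordinator {
    // Hand the next queued job to `worker`, keeping a clone so the
    // coordinator can still requeue it if the worker never reports back.
    fn assign(&mut self, worker: WorkerId) -> Option<PendingCircuitJob> {
        let job = self.queue.pop_front()?;
        self.assigned.entry(worker).or_default().push(job.clone());
        Some(job)
    }

    // A finish report arrived: the job no longer needs recovery bookkeeping.
    fn complete(&mut self, worker: WorkerId, job_id: u64) {
        if let Some(jobs) = self.assigned.get_mut(&worker) {
            jobs.retain(|job| job.id != job_id);
        }
    }

    // Assignment failure, a missing report, or a worker exit: put every job
    // the worker still held back on the queue. Returns how many were requeued.
    fn worker_failed(&mut self, worker: WorkerId) -> usize {
        let jobs = self.assigned.remove(&worker).unwrap_or_default();
        let requeued = jobs.len();
        self.queue.extend(jobs);
        requeued
    }
}
```

The point of cloning on assignment is that the coordinator, not the worker, stays the authoritative owner: a worker that disappears takes only its copy with it.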

sm-executor cancel safety

- sm-executor no longer recreates recv() futures inside monoio::select! every loop iteration
- it now keeps persistent pinned receive futures for:
  - job completions
  - inbound network requests
  - executor commands
  - shutdown
- this avoids the proven kanal direct-handoff cancellation hole where a completion can be dropped if a waiting recv() future is canceled before being polled to completion
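
The hazard is easier to see in a toy model. The sketch below uses only the standard library and does not reproduce kanal's or monoio's internals: Chan, Recv, poll_once, and both scenario functions are hypothetical names, and the "direct handoff" is modelled as the sender writing straight into a slot owned by the waiting receive future. It assumes a single-threaded setting and polls by hand instead of using a real waker.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Slot owned by one waiting receive future; the sender may write into it.
type Slot<T> = Rc<RefCell<Option<T>>>;

struct Shared<T> {
    queue: VecDeque<T>,
    waiter: Option<Slot<T>>,
}

// Toy single-threaded channel (hypothetical, not kanal).
struct Chan<T>(Rc<RefCell<Shared<T>>>);

impl<T> Chan<T> {
    fn new() -> Self {
        Chan(Rc::new(RefCell::new(Shared { queue: VecDeque::new(), waiter: None })))
    }

    fn send(&self, value: T) {
        let mut shared = self.0.borrow_mut();
        match shared.waiter.take() {
            // Direct handoff: the value bypasses the queue entirely.
            Some(slot) => *slot.borrow_mut() = Some(value),
            None => shared.queue.push_back(value),
        }
    }

    fn recv(&self) -> Recv<T> {
        Recv { chan: Chan(Rc::clone(&self.0)), slot: Rc::new(RefCell::new(None)) }
    }
}

struct Recv<T> {
    chan: Chan<T>,
    slot: Slot<T>,
}

impl<T> Future for Recv<T> {
    type Output = T;

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<T> {
        // A value handed off while we were waiting lives only in our slot.
        if let Some(value) = self.slot.borrow_mut().take() {
            return Poll::Ready(value);
        }
        let mut shared = self.chan.0.borrow_mut();
        if let Some(value) = shared.queue.pop_front() {
            return Poll::Ready(value);
        }
        // Nothing available: register as the waiting receiver.
        shared.waiter = Some(Rc::clone(&self.slot));
        Poll::Pending
    }
}

// Waker that does nothing; the scenarios below poll manually.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn poll_once<F: Future + Unpin>(fut: &mut F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    Pin::new(fut).poll(&mut cx)
}

// Old pattern: a fresh recv() future per loop iteration.
fn recreate_each_iteration() -> Option<u32> {
    let chan = Chan::new();
    let mut first = chan.recv();
    assert!(poll_once(&mut first).is_pending()); // registers as waiter
    chan.send(7); // handed directly into `first`'s slot
    drop(first); // another select branch won; the future is cancelled
    let mut second = chan.recv();
    match poll_once(&mut second) {
        Poll::Ready(value) => Some(value),
        Poll::Pending => None, // the 7 went down with `first`
    }
}

// New pattern: one receive future kept alive across iterations.
fn persistent_future() -> Option<u32> {
    let chan = Chan::new();
    let mut fut = chan.recv();
    assert!(poll_once(&mut fut).is_pending());
    chan.send(7);
    // Another branch may win here, but `fut` is not dropped.
    match poll_once(&mut fut) {
        Poll::Ready(value) => Some(value),
        Poll::Pending => None,
    }
}
```

In this model recreate_each_iteration() loses the message while persistent_future() still observes it on the next poll, which mirrors the two kanal tests listed under Tests.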

Tests

Added/updated tests for:

- garbling coordinator recovery on assignment failure
- garbling coordinator recovery on missing chunk report
- garbling coordinator recovery on missing finish report
- finish report forwarding plus retry preservation
- kanal waiting-receive drop loses direct-handoff message
- persistent receive future preserves direct-handoff message

Validation

Passed:

cargo test -p mosaic-sm-executor
cargo clippy -p mosaic-sm-executor --tests -- -D warnings
cargo test -p mosaic-job-scheduler
cargo clippy -p mosaic-job-scheduler --tests -- -D warnings
cargo test -p mosaic-job-api
cargo fmt --check --all

The full CI recipe (just -f .justfile ci) got through formatting, clippy, docs, and the scheduler/executor portions cleanly, but timed out in the unrelated long-running mosaic-storage-fdb tests in this environment.

Notes

I also updated issue #156 to clarify that the rare GeneratingPolynomialCommitments wedge is tracked separately in #166 and is not part of the coordinator worker-loss bug.

sapinb

LGTM

sapinb approved these changes Apr 9, 2026


Zk2u and others added 2 commits April 9, 2026 06:16

scheduler: recover garbling jobs and preserve completions

4665672

lint: cargo fmt

1d55280

sapinb force-pushed the fix/156 branch from 7525bf9 to 1d55280 Compare April 9, 2026 06:16

sapinb merged commit f786c88 into main Apr 9, 2026
14 checks passed

sapinb deleted the fix/156 branch April 9, 2026 06:21
