Summary
sm-executor can lose completed jobs because it awaits job_handle.recv() directly inside monoio::select!.
kanal::AsyncReceiver::recv() is not cancel-safe in the relevant direct-handoff case. If a worker sends a completion while sm-executor has a pending recv() future registered as a waiter, and some other select! branch wins before that recv() future is polled to completion, dropping the receive future can drop the completion payload itself.
This explains the remaining rare GeneratingPolynomialCommitments wedge that still occurs during staggered setup even after issue #153. The worker finishes the polynomial job, completion_tx.send(...).await returns Ok, but sm-executor never receives the completion and the garbler remains stuck waiting for that wire.
Severity
Recommended priority: P0
This loses internally completed work during normal operation without a worker crash, storage failure, or process crash. It can wedge setup and is difficult to reason about from logs unless the send/receive path is instrumented.
Recommended size: M
Evidence
Concrete failing run
In a reproduced staggered setup failure, the missing completion was:
GeneratePolynomialCommitments(..., Input(91))
For that completion:
- the worker finished it and logged the send
completion_tx.send(...).await returned Ok
sm-executor never logged receiving it
This was observed in the instrumented run under:
functional-tests/_dd/8-23-nhscc/_fn_mosaic_setup_staggered/mosaic-1/service.log
The important point is that the completion send succeeded, but the completion never reached STF application.
Code path
sm-executor recreates job_handle.recv() inside monoio::select! every loop iteration:
crates/sm-executor/src/lib.rs
monoio::select! cancels losing branches each iteration:
monoio-0.2.4/src/macros/select.rs
kanal::SendFuture directly hands off a value to a waiting receiver instead of queueing it:
kanal-0.1.1/src/future.rs
kanal::ReceiveFuture::drop() will wait for that handoff and then drop the received payload locally if the waiting receive future is dropped:
kanal-0.1.1/src/future.rs
Minimal standalone repro
A minimal standalone program that:
- polls
rx.recv() once so it registers as a waiter
- sends a value with
tx.send(1)
- drops the waiting receive future
- checks
rx.try_recv()
prints LOST_MESSAGE.
That proves the direct-handoff cancellation hole exists independently of Mosaic protocol/state-machine logic.
Exact failure sequence
sm-executor enters one iteration of its main loop and polls job_handle.recv().
- No completion is currently buffered, so
kanal registers the receive future as a waiting receiver.
- A worker completes a job and sends its
JobCompletion.
- Because a receiver is waiting,
kanal hands the completion directly to that waiting receive future instead of buffering it.
- In the same
monoio::select!, another branch such as inbound network traffic wins first.
monoio::select! cancels the losing job_handle.recv() future.
- Dropping that waiting receive future causes
kanal to drop the handed-off completion payload.
- The completion is now gone permanently.
- The SM remains stuck waiting for a completion that already happened.
Correct model
The executor must not lose a completion just because another branch in the main loop won the race for that iteration.
Any receive future used inside the main loop must be cancel-safe, or else it must be kept alive across loop iterations until it resolves.
Desired solution
Do not construct fresh cancel-unsafe receive futures inside select! each loop iteration.
Instead, keep persistent receive futures alive across loop iterations for all kanal-backed inputs used by sm-executor, especially:
job_handle.recv()
command_rx.recv()
- shutdown receive helper if it remains
kanal-backed
The concrete shape should be:
- create/pin the receive future outside the loop
select! on the pinned future by mutable reference
- when it resolves, handle the item and replace just that one future with a fresh receive future
- never drop a still-pending receive future merely because some other branch won the current
select!
It is also worth auditing net_client.recv() and any other receives used similarly, but the proven data-loss bug here is the job-completion path.
Acceptance criteria
sm-executor can no longer lose a completion because another select! branch won the current loop iteration.
- The staggered setup repro no longer wedges in
GeneratingPolynomialCommitments due to a missing locally completed wire.
- A regression test or minimal reproducer exists for the cancellation-safe receive pattern.
- The executor main loop no longer recreates cancel-unsafe
recv() futures for kanal channels on every iteration.
Related
Summary
sm-executorcan lose completed jobs because it awaitsjob_handle.recv()directly insidemonoio::select!.kanal::AsyncReceiver::recv()is not cancel-safe in the relevant direct-handoff case. If a worker sends a completion whilesm-executorhas a pendingrecv()future registered as a waiter, and some otherselect!branch wins before thatrecv()future is polled to completion, dropping the receive future can drop the completion payload itself.This explains the remaining rare
GeneratingPolynomialCommitmentswedge that still occurs during staggered setup even after issue #153. The worker finishes the polynomial job,completion_tx.send(...).awaitreturnsOk, butsm-executornever receives the completion and the garbler remains stuck waiting for that wire.Severity
Recommended priority:
P0This loses internally completed work during normal operation without a worker crash, storage failure, or process crash. It can wedge setup and is difficult to reason about from logs unless the send/receive path is instrumented.
Recommended size:
MEvidence
Concrete failing run
In a reproduced staggered setup failure, the missing completion was:
GeneratePolynomialCommitments(..., Input(91))For that completion:
completion_tx.send(...).awaitreturnedOksm-executornever logged receiving itThis was observed in the instrumented run under:
functional-tests/_dd/8-23-nhscc/_fn_mosaic_setup_staggered/mosaic-1/service.logThe important point is that the completion send succeeded, but the completion never reached STF application.
Code path
sm-executorrecreatesjob_handle.recv()insidemonoio::select!every loop iteration:crates/sm-executor/src/lib.rsmonoio::select!cancels losing branches each iteration:monoio-0.2.4/src/macros/select.rskanal::SendFuturedirectly hands off a value to a waiting receiver instead of queueing it:kanal-0.1.1/src/future.rskanal::ReceiveFuture::drop()will wait for that handoff and then drop the received payload locally if the waiting receive future is dropped:kanal-0.1.1/src/future.rsMinimal standalone repro
A minimal standalone program that:
rx.recv()once so it registers as a waitertx.send(1)rx.try_recv()prints
LOST_MESSAGE.That proves the direct-handoff cancellation hole exists independently of Mosaic protocol/state-machine logic.
Exact failure sequence
sm-executorenters one iteration of its main loop and pollsjob_handle.recv().kanalregisters the receive future as a waiting receiver.JobCompletion.kanalhands the completion directly to that waiting receive future instead of buffering it.monoio::select!, another branch such as inbound network traffic wins first.monoio::select!cancels the losingjob_handle.recv()future.kanalto drop the handed-off completion payload.Correct model
The executor must not lose a completion just because another branch in the main loop won the race for that iteration.
Any receive future used inside the main loop must be cancel-safe, or else it must be kept alive across loop iterations until it resolves.
Desired solution
Do not construct fresh cancel-unsafe receive futures inside
select!each loop iteration.Instead, keep persistent receive futures alive across loop iterations for all
kanal-backed inputs used bysm-executor, especially:job_handle.recv()command_rx.recv()kanal-backedThe concrete shape should be:
select!on the pinned future by mutable referenceselect!It is also worth auditing
net_client.recv()and any other receives used similarly, but the proven data-loss bug here is the job-completion path.Acceptance criteria
sm-executorcan no longer lose a completion because anotherselect!branch won the current loop iteration.GeneratingPolynomialCommitmentsdue to a missing locally completed wire.recv()futures forkanalchannels on every iteration.Related