Skip to content

sm-executor: avoid cancel-unsafe job completion receives in main select loop #166

@Zk2u

Description

@Zk2u

Summary

sm-executor can lose completed jobs because it awaits job_handle.recv() directly inside monoio::select!.

kanal::AsyncReceiver::recv() is not cancel-safe in the relevant direct-handoff case. If a worker sends a completion while sm-executor has a pending recv() future registered as a waiter, and some other select! branch wins before that recv() future is polled to completion, dropping the receive future can drop the completion payload itself.

This explains the remaining rare GeneratingPolynomialCommitments wedge that still occurs during staggered setup even after issue #153. The worker finishes the polynomial job, completion_tx.send(...).await returns Ok, but sm-executor never receives the completion and the garbler remains stuck waiting for that wire.

Severity

Recommended priority: P0

This loses internally completed work during normal operation without a worker crash, storage failure, or process crash. It can wedge setup and is difficult to reason about from logs unless the send/receive path is instrumented.

Recommended size: M

Evidence

Concrete failing run

In a reproduced staggered setup failure, the missing completion was:

  • GeneratePolynomialCommitments(..., Input(91))

For that completion:

  1. the worker finished it and logged the send
  2. completion_tx.send(...).await returned Ok
  3. sm-executor never logged receiving it

This was observed in the instrumented run under:

  • functional-tests/_dd/8-23-nhscc/_fn_mosaic_setup_staggered/mosaic-1/service.log

The important point is that the completion send succeeded, but the completion never reached STF application.

Code path

sm-executor recreates job_handle.recv() inside monoio::select! every loop iteration:

  • crates/sm-executor/src/lib.rs

monoio::select! cancels losing branches each iteration:

  • monoio-0.2.4/src/macros/select.rs

kanal::SendFuture directly hands off a value to a waiting receiver instead of queueing it:

  • kanal-0.1.1/src/future.rs

kanal::ReceiveFuture::drop() will wait for that handoff and then drop the received payload locally if the waiting receive future is dropped:

  • kanal-0.1.1/src/future.rs

Minimal standalone repro

A minimal standalone program that:

  1. polls rx.recv() once so it registers as a waiter
  2. sends a value with tx.send(1)
  3. drops the waiting receive future
  4. checks rx.try_recv()

prints LOST_MESSAGE.

That proves the direct-handoff cancellation hole exists independently of Mosaic protocol/state-machine logic.

Exact failure sequence

  1. sm-executor enters one iteration of its main loop and polls job_handle.recv().
  2. No completion is currently buffered, so kanal registers the receive future as a waiting receiver.
  3. A worker completes a job and sends its JobCompletion.
  4. Because a receiver is waiting, kanal hands the completion directly to that waiting receive future instead of buffering it.
  5. In the same monoio::select!, another branch such as inbound network traffic wins first.
  6. monoio::select! cancels the losing job_handle.recv() future.
  7. Dropping that waiting receive future causes kanal to drop the handed-off completion payload.
  8. The completion is now gone permanently.
  9. The SM remains stuck waiting for a completion that already happened.

Correct model

The executor must not lose a completion just because another branch in the main loop won the race for that iteration.

Any receive future used inside the main loop must be cancel-safe, or else it must be kept alive across loop iterations until it resolves.

Desired solution

Do not construct fresh cancel-unsafe receive futures inside select! each loop iteration.

Instead, keep persistent receive futures alive across loop iterations for all kanal-backed inputs used by sm-executor, especially:

  • job_handle.recv()
  • command_rx.recv()
  • shutdown receive helper if it remains kanal-backed

The concrete shape should be:

  1. create/pin the receive future outside the loop
  2. select! on the pinned future by mutable reference
  3. when it resolves, handle the item and replace just that one future with a fresh receive future
  4. never drop a still-pending receive future merely because some other branch won the current select!

It is also worth auditing net_client.recv() and any other receives used similarly, but the proven data-loss bug here is the job-completion path.

Acceptance criteria

  • sm-executor can no longer lose a completion because another select! branch won the current loop iteration.
  • The staggered setup repro no longer wedges in GeneratingPolynomialCommitments due to a missing locally completed wire.
  • A regression test or minimal reproducer exists for the cancellation-safe receive pattern.
  • The executor main loop no longer recreates cancel-unsafe recv() futures for kanal channels on every iteration.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions