Skip to content

Clean up in-flight dedup entries when the request owner is dropped#6577

Draft
ndr-ds wants to merge 1 commit into
mainfrom
ndr-ds/in-flight-tracker-drop-guard
Draft

Clean up in-flight dedup entries when the request owner is dropped#6577
ndr-ds wants to merge 1 commit into
mainfrom
ndr-ds/in-flight-tracker-drop-guard

Conversation

@ndr-ds

@ndr-ds ndr-ds commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Motivation

The client's request scheduler deduplicates concurrent requests for the same
data (RequestsScheduler::deduplicated_request): the first caller becomes the
owner and executes the request, while later callers subscribe and wait on a
broadcast channel for the result. InFlightTracker holds the map of in-flight
entries; only complete_and_broadcast ever removed an entry.

This is not cancel-safe. If the owning task is dropped before it completes —
which happens routinely: synchronize_up_to runs download_certificates_from
against every validator under communicate_with_quorum, and once a quorum
responds the remaining (losing) futures are dropped mid-flight — the entry is
never removed. Its broadcast sender stays alive, so every subscriber blocks
forever on receiver.recv().await (there is no timeout on that wait). A
subscriber living in a different task tree (e.g. the spawned download pipeline)
then starves the whole sync: the chain never advances, the client sits at 0%
CPU with no pending timers, and only a process restart recovers it.

The race is timing-dependent (the subscription join window is
max_request_ttl, 200 ms by default), so it is latent under today's request
timing. It is reliably reproducible under workloads that line the cert-download
tasks up in lockstep — see Test Plan.

Proposal

Make the in-flight tracker cancel-safe with an ownership guard:

  • insert_new now returns an InFlightGuard that owns the entry. The owner
    holds it across the cancellation-prone try_staggered_parallel().await.
  • Completing via InFlightGuard::complete_and_broadcast broadcasts the result
    and removes the entry (unchanged success path).
  • If the guard is instead dropped without completing (the owning task was
    cancelled), its Drop removes the entry. Dropping the entry drops the
    broadcast sender, so subscribers wake with RecvError::Closed and fall
    through to executing the request themselves — a path the code already had but
    could never reach while the entry leaked.
  • A per-entry generation (from an AtomicU64) means a guard only ever removes
    its own entry, never a newer owner's entry that reused the same key after
    this one went stale and was replaced. Both complete_and_broadcast and Drop
    are generation-checked, making them idempotent and race-safe.

To let Drop clean up synchronously, the entries map moves from
tokio::sync::RwLock to std::sync::Mutex; it is never held across an .await
(the per-entry alternative_peers list is cloned out from under the lock before
awaiting). This matches the existing std::sync::Mutex usage and discipline in
linera-core.

Client-side only; no protocol, storage, or validator change.

Test Plan

  • New unit regression test test_owner_drop_wakes_subscribers: an owner
    registers an in-flight request, a second caller subscribes, the owner guard is
    dropped without completing — the subscriber observes RecvError::Closed
    (rather than hanging) and the entry is gone. cargo test -p linera-core --lib requests_scheduler → 28 passed; cargo clippy --all-targets clean.

  • End-to-end, real workload. Reproduced the hang with a fresh-wallet,
    empty-DB cold sync of a blob-heavy testnet_conway market chain, and measured a
    client built with vs. without this fix, interleaved under identical network
    conditions (12 cold syncs each, per-run hang watchdog):

    Build Cold syncs Hung
    without this fix 12 5 (~42%)
    with this fix 12 0

    Every time the unfixed build hung, the very next fixed run on the same
    conditions completed in ~12 s. Stacks/logs from the hung runs confirm the
    root cause above (subscriber parked on recv().await, leaked entry, no
    pending timer).

Release Plan

  • These changes should be backported to the latest testnet branch (a companion
    backport PR targets testnet_conway). Client-side only, so it ships with the
    usual client/SDK release — no validator hotfix.

Links

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant