Skip to content

[testnet] Clean up in-flight dedup entries when the request owner is dropped#6579

Draft
ndr-ds wants to merge 1 commit into
testnet_conwayfrom
ndr-ds/in-flight-tracker-drop-guard-conway
Draft

[testnet] Clean up in-flight dedup entries when the request owner is dropped#6579
ndr-ds wants to merge 1 commit into
testnet_conwayfrom
ndr-ds/in-flight-tracker-drop-guard-conway

Conversation

@ndr-ds

@ndr-ds ndr-ds commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Motivation

Backport of #6577 to testnet_conway.

The client's request scheduler deduplicates concurrent requests for the same
data (RequestsScheduler::deduplicated_request): the first caller becomes the
owner and executes the request, while later callers subscribe and wait on a
broadcast channel. InFlightTracker holds the in-flight entries; only
complete_and_broadcast ever removed one.

This is not cancel-safe. If the owning task is dropped before completing —
which happens routinely: synchronize_up_to runs download_certificates_from
against every validator under communicate_with_quorum, and once a quorum
responds the losing futures are dropped mid-flight — the entry leaks. Its
broadcast sender stays alive, so every subscriber blocks forever on
receiver.recv().await (no timeout). A subscriber in a different task tree then
starves the whole sync: the chain never advances, the client sits at 0% CPU, and
only a process restart recovers it.

Latent under today's request timing (the subscription join window is
max_request_ttl, 200 ms), but reliably reproducible under workloads that line
the cert-download tasks up in lockstep.

Proposal

Make the tracker cancel-safe with an ownership guard (identical to #6577; cherry-pick):

  • insert_new returns an InFlightGuard that owns the entry, held across the
    cancellation-prone try_staggered_parallel().await.
  • InFlightGuard::complete_and_broadcast broadcasts + removes (success path).
  • Dropping the guard without completing removes the entry, dropping the sender so
    subscribers wake with RecvError::Closed and execute the request themselves —
    a fall-through path that already existed but could never be reached while the
    entry leaked.
  • A per-entry generation (AtomicU64) means a guard only removes its own
    entry, never a newer owner's after this one went stale. Both completion and
    Drop are generation-checked (idempotent, race-safe).
  • Entries map moves to std::sync::Mutex so Drop can clean synchronously;
    never held across an .await.

Client-side only; no protocol, storage, or validator change.

Test Plan

  • New unit regression test test_owner_drop_wakes_subscribers; cargo test -p linera-core --lib requests_scheduler → 28 passed; cargo clippy --all-targets
    and cargo fmt --check clean.

  • End-to-end, real workload (measured on a testnet_conway client build): a
    fresh-wallet/empty-DB cold sync of a blob-heavy market chain, interleaved with
    vs. without this fix under identical network conditions:

    Build Cold syncs Hung
    without this fix 12 5 (~42%)
    with this fix 12 0

    Every unfixed hang was immediately followed by a clean ~12 s fixed run on the
    same conditions.

Release Plan

  • These changes follow the usual testnet_conway release cycle. Client-side
    only — no validator hotfix.

Links

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant