[testnet] Clean up in-flight dedup entries when the request owner is dropped by ndr-ds · Pull Request #6579 · linera-io/linera-protocol

ndr-ds · 2026-07-01T22:23:21Z

Motivation

Backport of #6577 to testnet_conway.

The client's request scheduler deduplicates concurrent requests for the same
data (RequestsScheduler::deduplicated_request): the first caller becomes the
owner and executes the request, while later callers subscribe and wait on a
broadcast channel. InFlightTracker holds the in-flight entries; only
complete_and_broadcast ever removed one.

This is not cancel-safe. If the owning task is dropped before completing —
which happens routinely: synchronize_up_to runs download_certificates_from
against every validator under communicate_with_quorum, and once a quorum
responds the losing futures are dropped mid-flight — the entry leaks. Its
broadcast sender stays alive, so every subscriber blocks forever on
receiver.recv().await (no timeout). A subscriber in a different task tree then
starves the whole sync: the chain never advances, the client sits at 0% CPU, and
only a process restart recovers it.

Latent under today's request timing (the subscription join window is
max_request_ttl, 200 ms), but reliably reproducible under workloads that line
the cert-download tasks up in lockstep.

Proposal

Make the tracker cancel-safe with an ownership guard (identical to #6577; cherry-pick):

insert_new returns an InFlightGuard that owns the entry, held across the
cancellation-prone try_staggered_parallel().await.
InFlightGuard::complete_and_broadcast broadcasts + removes (success path).
Dropping the guard without completing removes the entry, dropping the sender so
subscribers wake with RecvError::Closed and execute the request themselves —
a fall-through path that already existed but could never be reached while the
entry leaked.
A per-entry generation (AtomicU64) means a guard only removes its own
entry, never a newer owner's after this one went stale. Both completion and
Drop are generation-checked (idempotent, race-safe).
Entries map moves to std::sync::Mutex so Drop can clean synchronously;
never held across an .await.

Client-side only; no protocol, storage, or validator change.

Test Plan

New unit regression test test_owner_drop_wakes_subscribers; cargo test -p linera-core --lib requests_scheduler → 28 passed; cargo clippy --all-targets
and cargo fmt --check clean.
End-to-end, real workload (measured on a testnet_conway client build): a
fresh-wallet/empty-DB cold sync of a blob-heavy market chain, interleaved with
vs. without this fix under identical network conditions:

Build Cold syncs Hung

without this fix 12 5 (~42%)

with this fix 12 0

Every unfixed hang was immediately followed by a clean ~12 s fixed run on the
same conditions.

Release Plan

These changes follow the usual testnet_conway release cycle. Client-side
only — no validator hotfix.

Links

Main PR: Clean up in-flight dedup entries when the request owner is dropped #6577
Exposed by, and a prerequisite for, the streaming DownloadBlobs caller
backport [testnet] Use streaming DownloadBlobs RPC for batch blob downloads #6476.
reviewer checklist

Clean up in-flight dedup entries when the request owner is dropped

348bc18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[testnet] Clean up in-flight dedup entries when the request owner is dropped#6579

[testnet] Clean up in-flight dedup entries when the request owner is dropped#6579
ndr-ds wants to merge 1 commit into
testnet_conwayfrom
ndr-ds/in-flight-tracker-drop-guard-conway

ndr-ds commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Build	Cold syncs	Hung
without this fix	12	5 (~42%)
with this fix	12	0

Uh oh!

Conversation

ndr-ds commented Jul 1, 2026

Motivation

Proposal

Test Plan

Release Plan

Links

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant