[testnet] Clean up in-flight dedup entries when the request owner is dropped #6579
Draft
ndr-ds wants to merge 1 commit into
Motivation
Backport of #6577 to `testnet_conway`.

The client's request scheduler deduplicates concurrent requests for the same data (`RequestsScheduler::deduplicated_request`): the first caller becomes the owner and executes the request, while later callers subscribe and wait on a broadcast channel. `InFlightTracker` holds the in-flight entries; only `complete_and_broadcast` ever removed one.

This is not cancel-safe. If the owning task is dropped before completing, the entry leaks. This happens routinely: `synchronize_up_to` runs `download_certificates_from` against every validator under `communicate_with_quorum`, and once a quorum responds the losing futures are dropped mid-flight. The leaked entry's broadcast sender stays alive, so every subscriber blocks forever on `receiver.recv().await` (no timeout). A subscriber in a different task tree then starves the whole sync: the chain never advances, the client sits at 0% CPU, and only a process restart recovers it.

The bug is latent under today's request timing (the subscription join window is `max_request_ttl`, 200 ms), but reliably reproducible under workloads that line the cert-download tasks up in lockstep.
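The leak can be reproduced in miniature with the standard library alone. The sketch below is an illustration under stated assumptions, not the linera-core code: `LeakyTracker`, `observe_leak`, and the key `"cert-42"` are hypothetical names, and `std::sync::mpsc` stands in for the async broadcast channel. It shows that a sender parked in the map keeps the channel open, so a waiting subscriber is only woken once the entry is removed.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::sync::Mutex;
use std::time::Duration;

/// Hypothetical stand-in for the tracker: one result sender per in-flight key.
struct LeakyTracker {
    entries: Mutex<HashMap<String, Sender<u64>>>,
}

/// Simulates an owner that registers an entry and is cancelled before it can
/// broadcast. Returns what the subscriber observes (a) while the entry is
/// leaked and (b) after the entry has been removed.
fn observe_leak() -> (RecvTimeoutError, RecvTimeoutError) {
    let tracker = LeakyTracker {
        entries: Mutex::new(HashMap::new()),
    };
    let (sender, receiver) = channel::<u64>();
    tracker
        .entries
        .lock()
        .unwrap()
        .insert("cert-42".to_string(), sender);

    // The owner is "dropped" here: nothing ever sends and nothing removes the
    // entry, so the sender stored in the map stays alive. A plain `recv()`
    // would block forever; the timeout only exists to make that observable.
    let while_leaked = receiver
        .recv_timeout(Duration::from_millis(50))
        .unwrap_err();

    // Removing the entry drops the sender; only now does the channel close.
    tracker.entries.lock().unwrap().remove("cert-42");
    let after_removal = receiver
        .recv_timeout(Duration::from_millis(50))
        .unwrap_err();

    (while_leaked, after_removal)
}
```

Calling `observe_leak()` yields `Timeout` first and `Disconnected` second, which mirrors the two states described above: a subscriber stuck behind a leaked entry, and a subscriber released once the entry is gone.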
Proposal
Make the tracker cancel-safe with an ownership guard (identical to #6577; cherry-pick):
- `insert_new` returns an `InFlightGuard` that owns the entry, held across the cancellation-prone `try_staggered_parallel().await`.
- `InFlightGuard::complete_and_broadcast` broadcasts + removes (success path).
- Dropping the guard without completing removes the entry (cancellation path): subscribers wake with `RecvError::Closed` and execute the request themselves, a fall-through path that already existed but could never be reached while the entry leaked.
- A `generation` counter (`AtomicU64`) means a guard only removes its own entry, never a newer owner's after this one went stale. Both completion and `Drop` are generation-checked (idempotent, race-safe).
- The tracker uses `std::sync::Mutex` so that `Drop` can clean up synchronously; the lock is never held across an `.await`.

Client-side only; no protocol, storage, or validator change.
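The guard pattern described above can be sketched with the standard library only. This is a minimal illustration, not the code in this PR: `SketchTracker`, `SketchGuard`, `Entry`, and `subscriber_after_owner_drop` are hypothetical names, `std::sync::mpsc` replaces the async broadcast channel, and letting `insert_new` overwrite an existing entry is a simplification made to show the generation check.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};

/// Hypothetical entry: the generation that created it plus the result sender.
struct Entry {
    generation: u64,
    sender: Sender<u64>,
}

#[derive(Default)]
struct SketchTracker {
    next_generation: AtomicU64,
    entries: Mutex<HashMap<String, Entry>>,
}

/// Owns one in-flight entry and removes it on completion or on drop.
struct SketchGuard {
    tracker: Arc<SketchTracker>,
    key: String,
    generation: u64,
}

impl SketchTracker {
    /// Registers `key` and returns the owning guard together with a receiver
    /// standing in for a subscriber. Assumption: an older entry is replaced.
    fn insert_new(tracker: &Arc<SketchTracker>, key: &str) -> (SketchGuard, Receiver<u64>) {
        let generation = tracker.next_generation.fetch_add(1, Ordering::Relaxed);
        let (sender, receiver) = channel();
        let entry = Entry { generation, sender };
        tracker.entries.lock().unwrap().insert(key.to_string(), entry);
        let guard = SketchGuard {
            tracker: Arc::clone(tracker),
            key: key.to_string(),
            generation,
        };
        (guard, receiver)
    }

    fn contains(&self, key: &str) -> bool {
        self.entries.lock().unwrap().contains_key(key)
    }
}

impl SketchGuard {
    /// Removes the entry only if this guard's generation still owns it.
    fn take_own_entry(&self) -> Option<Entry> {
        let mut entries = self.tracker.entries.lock().unwrap();
        let owns = entries
            .get(&self.key)
            .map_or(false, |entry| entry.generation == self.generation);
        if owns {
            entries.remove(&self.key)
        } else {
            None
        }
    }

    /// Success path: remove the entry, then send the result to the subscriber.
    fn complete_and_broadcast(self, value: u64) {
        if let Some(entry) = self.take_own_entry() {
            let _ = entry.sender.send(value);
        }
        // `self` is dropped here; `Drop` then finds nothing left to remove.
    }
}

impl Drop for SketchGuard {
    /// Cancellation path: removing the entry drops its sender, which closes
    /// the channel and wakes the subscriber instead of leaving it blocked.
    fn drop(&mut self) {
        drop(self.take_own_entry());
    }
}

/// Usage: the owner is cancelled, so the subscriber sees the channel close and
/// falls back to a placeholder for "run the request ourselves".
fn subscriber_after_owner_drop() -> u64 {
    let tracker = Arc::new(SketchTracker::default());
    let (owner, subscriber) = SketchTracker::insert_new(&tracker, "blob-1");
    drop(owner); // simulates the owning future being dropped mid-flight
    match subscriber.recv() {
        Ok(value) => value, // the owner completed and sent a result
        Err(_) => 99,       // channel closed: execute the request ourselves
    }
}
```

Because both `complete_and_broadcast` and `Drop` go through the same generation-checked removal, a stale guard dropped late leaves a newer owner's entry in place, and completing a guard followed by its implicit drop removes the entry exactly once. The lock is taken and released inside synchronous code, which is why a blocking `std::sync::Mutex` is sufficient in this sketch.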
Test Plan
- New unit regression test `test_owner_drop_wakes_subscribers`; `cargo test -p linera-core --lib requests_scheduler` → 28 passed; `cargo clippy --all-targets` and `cargo fmt --check` clean.
- End-to-end, real workload (measured on a testnet_conway client build): a fresh-wallet/empty-DB cold sync of a blob-heavy market chain, interleaved with vs. without this fix under identical network conditions:

Every unfixed hang was immediately followed by a clean ~12 s fixed run on the same conditions.
Release Plan
`testnet_conway` release cycle. Client-side only: no validator hotfix.
Links
`DownloadBlobs` caller backport: [testnet] Use streaming DownloadBlobs RPC for batch blob downloads #6476.