fix(kad-dht): emit FINAL_PEER events inline during getClosestPeers query#3420
fix(kad-dht): emit FINAL_PEER events inline during getClosestPeers query#3420paschal533 wants to merge 8 commits intolibp2p:mainfrom
Conversation
This doesn't sound correct. You only know what the closest peers were after the network has been traversed, you can't emit |
Thank you @achingbrain for pointing this out..... I just fixed it to go back to post-traversal emission using PeerDistanceList, contacted peers are accumulated during the query sorted by XOR distance, then FINAL_PEER is emitted for the K-closest after the query completes. The AbortError case is handled by catching it, emitting whatever was found, then re-throwing so the abort still propagates to callers |
I would be careful with this approach too. For the DHT to work there must be crossover in the set of closest peers that two independent nodes can resolve. It's not enough for one node to say "here are the closest peers I found in X amount of time". |
That's right, and I should have caught that earlier |
|
@paschal533 this is very technical code being touched. It maybe be better, for now, to see if a failing test can be built to show exactly how the code is getting stuck. Then we can target the issue and make sure the fix is inline with kademlia dht correctness. |
@tabcat yeah that makes sense. we actually do have the failing tests already in the fix in the AdaptiveTimeout and onPathComplete stuff is more of a performance improvement on top of that. happy to move those to a separate PR if it makes this easier to review |
Summary
Fixes
getClosestPeersconsistently returning zeroFINAL_PEERevents on the Amino DHT, even with 50+ connected peers and a 30s timeout. Three compounding issues caused this:Root cause 1: AdaptiveTimeout escalation (
network.ts)The defaults
maxTimeout=60sandfailureMultiplier=2caused per-dial timeouts to escalate rapidly after failures. With concurrent paths, individual peers could stall a path for a full minute before failing. Fixed by capping atmaxTimeout: 5_000ms,minTimeout: 2_000ms,failureMultiplier: 1.5.Also fixed a silent bug:
networkDialTimeoutconfig option existed onKadDHTInitbut was never wired to theNetworkconstructor. User-provided timeout config was silently ignored.Root cause 2: Merge iterator bottleneck (
query/manager.ts,query/query-path.ts)queryManager.run()usesit-merge(backed byit-queueless-pushable, an unbuffered single-item channel) to merge events from all disjoint paths.PATH_ENDEDevents, which signal path convergence, had to queue behind hundreds ofPEER_RESPONSE/QUERY_ERRORevents and could never flow through fast enough for the manager to terminate before the user's timeout fired.Fixed by adding an
onPathCompletecallback toQueryPathOptionsthat fires directly on queue idle, bypassing the merge iterator entirely. The query manager uses this out-of-band signal to abort the merge loop when ≥60% of paths complete, without waiting forPATH_ENDEDto propagate through the channel.Root cause 3: FINAL_PEER correctness (
peer-routing/index.ts)Per @achingbrain's review: the DHT requires crossover multiple independent paths resolving the same key must agree on the closest peers. Emitting
FINAL_PEERevents inline for every peer that responds (regardless of distance) breaks this invariant. Partially-completed queries should not emit results.Fixed by accumulating contacted peers in a
PeerDistanceListduring traversal and only emittingFINAL_PEERevents after the query fully converges. If the query is aborted (timeout),AbortErrorpropagates naturally and no partial results are emitted callers get an error, not incorrect peer suggestions.Stale value fix (
content-fetching/index.ts)The early termination optimization broke value retrieval: queries were cut short before all close peers responded, causing callers to receive older/stale records. Fixed by passing
disableEarlyTermination: truefor value-retrieval queries so all K closest peers are always consulted.Files changed
src/network.ts: capAdaptiveTimeoutdefaults; wirenetworkDialTimeoutconfigsrc/query/query-path.ts: addonPathComplete?(pathIndex: number): voidcallbacksrc/query/manager.ts: out-of-band early termination viaonPathComplete; adddisableEarlyTerminationoptionsrc/peer-routing/index.ts: accumulate inPeerDistanceList, emitFINAL_PEERonly after convergencesrc/content-fetching/index.ts:disableEarlyTermination: trueforgetManytest/peer-routing.spec.ts: 4 new unit tests forgetClosestPeerscorrectnessTest plan
should emit FINAL_PEER events for peers successfully contacted during a query✓should propagate AbortError from queryManager without emitting FINAL_PEER✓should propagate non-AbortError from queryManager✓should not emit FINAL_PEER for peers that returned a query error✓Real-world test results
Tested against the live Amino DHT with 88 connected peers:
FINAL_PEERevents in 8.6s ✓kBucketSize/prefixLengthsettings. This is a separate concern as noted by @achingbrain.Fixes #3419