
fix(kad-dht): emit FINAL_PEER events inline during getClosestPeers query#3420

Draft

paschal533 wants to merge 8 commits into libp2p:main from paschal533:fix/kad-dht-emit-FINAL_PEER-events

Conversation

@paschal533
Contributor

@paschal533 paschal533 commented Mar 24, 2026

Summary

Fixes getClosestPeers consistently returning zero FINAL_PEER events on the Amino DHT, even with 50+ connected peers and a 30s timeout. Three compounding issues caused this:

Root cause 1: AdaptiveTimeout escalation (network.ts)

The defaults maxTimeout=60s and failureMultiplier=2 caused per-dial timeouts to escalate rapidly after failures. With concurrent paths, individual peers could stall a path for a full minute before failing. Fixed by capping at maxTimeout: 5_000ms, minTimeout: 2_000ms, failureMultiplier: 1.5.

Also fixed a silent bug: networkDialTimeout config option existed on KadDHTInit but was never wired to the Network constructor. User-provided timeout config was silently ignored.
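To see why the old defaults stalled paths, here is a minimal model of the escalation behaviour described above. This is an illustrative stand-in, not the actual AdaptiveTimeout implementation; the `nextTimeout` helper and init shape are assumptions for the sketch.

```typescript
// Minimal model of adaptive per-dial timeout escalation (hypothetical
// sketch, not the real AdaptiveTimeout class).
interface AdaptiveTimeoutInit {
  minTimeout: number
  maxTimeout: number
  failureMultiplier: number
}

function nextTimeout (current: number, init: AdaptiveTimeoutInit): number {
  // After each failure the timeout grows by the multiplier, clamped to the bounds
  return Math.min(init.maxTimeout, Math.max(init.minTimeout, current * init.failureMultiplier))
}

// Old defaults: 2x multiplier, 60s cap - a few failures escalate fast
let t = 5_000
const old = { minTimeout: 5_000, maxTimeout: 60_000, failureMultiplier: 2 }
for (let i = 0; i < 4; i++) t = nextTimeout(t, old)
console.log(t) // 60000 - one slow peer can stall a path for a full minute

// Capped settings from this PR: worst-case per-dial wait stays bounded
let c = 2_000
const capped = { minTimeout: 2_000, maxTimeout: 5_000, failureMultiplier: 1.5 }
for (let i = 0; i < 4; i++) c = nextTimeout(c, capped)
console.log(c) // 5000 - escalation can never exceed 5s
```

With the cap in place, repeated dial failures raise the timeout to at most 5s instead of a minute, so a stalled peer releases its path slot quickly.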

Root cause 2: Merge iterator bottleneck (query/manager.ts, query/query-path.ts)

queryManager.run() uses it-merge (backed by it-queueless-pushable, an unbuffered single-item channel) to merge events from all disjoint paths. PATH_ENDED events, which signal path convergence, had to queue behind hundreds of PEER_RESPONSE/QUERY_ERROR events and could never flow through fast enough for the manager to terminate before the user's timeout fired.

Fixed by adding an onPathComplete callback to QueryPathOptions that fires directly on queue idle, bypassing the merge iterator entirely. The query manager uses this out-of-band signal to abort the merge loop when ≥60% of paths complete, without waiting for PATH_ENDED to propagate through the channel.
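A simplified sketch of that out-of-band signal, under the assumption that the manager only needs a completion count and an abort hook; the `createEarlyTermination` helper and its names are illustrative, not the PR's actual code.

```typescript
// Sketch of the out-of-band early-termination signal (illustrative names,
// simplified from the onPathComplete wiring described above).
interface EarlyTermination {
  onPathComplete(pathIndex: number): void
}

function createEarlyTermination (
  totalPaths: number,
  abort: () => void,
  threshold = 0.6 // abort the merge loop once >= 60% of paths have converged
): EarlyTermination {
  const completed = new Set<number>()
  return {
    onPathComplete (pathIndex: number): void {
      completed.add(pathIndex)
      // Fires directly on queue idle, so it never waits behind queued
      // PEER_RESPONSE/QUERY_ERROR events in the merge iterator's channel
      if (completed.size >= Math.ceil(totalPaths * threshold)) {
        abort()
      }
    }
  }
}

// Usage: with 20 disjoint paths, the 12th completion triggers the abort
let aborted = false
const et = createEarlyTermination(20, () => { aborted = true })
for (let i = 0; i < 12; i++) et.onPathComplete(i)
console.log(aborted) // true
```

Because the callback bypasses the single-item channel entirely, termination latency no longer depends on how many per-peer events are queued ahead of PATH_ENDED.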

Root cause 3: FINAL_PEER correctness (peer-routing/index.ts)

Per @achingbrain's review: the DHT requires crossover: multiple independent paths resolving the same key must agree on the closest peers. Emitting FINAL_PEER events inline for every peer that responds (regardless of distance) breaks this invariant. Partially-completed queries should not emit results.

Fixed by accumulating contacted peers in a PeerDistanceList during traversal and only emitting FINAL_PEER events after the query fully converges. If the query is aborted (timeout), AbortError propagates naturally and no partial results are emitted; callers get an error, not incorrect peer suggestions.
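The ordering that makes post-convergence emission correct is plain XOR distance. The sketch below is a minimal stand-in for PeerDistanceList using raw byte arrays as key digests; the helper names are assumptions, not the library's API.

```typescript
// Minimal stand-in for PeerDistanceList-style ordering (illustrative,
// using plain byte arrays in place of peer ID digests).
function xorDistance (a: Uint8Array, b: Uint8Array): Uint8Array {
  const out = new Uint8Array(a.length)
  for (let i = 0; i < a.length; i++) out[i] = a[i] ^ b[i]
  return out
}

function compareDistance (a: Uint8Array, b: Uint8Array): number {
  // Big-endian byte comparison: earlier differing byte decides
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return a[i] - b[i]
  }
  return 0
}

// Accumulate peers that actually returned a PEER_RESPONSE, then take the K
// closest only after the query has fully converged - never mid-query
function kClosest (target: Uint8Array, contacted: Uint8Array[], k: number): Uint8Array[] {
  return [...contacted]
    .sort((x, y) => compareDistance(xorDistance(target, x), xorDistance(target, y)))
    .slice(0, k)
}

const target = Uint8Array.from([0b0000])
const peers = [Uint8Array.from([0b1000]), Uint8Array.from([0b0001]), Uint8Array.from([0b0100])]
console.log(kClosest(target, peers, 2)) // peers at XOR distances 1 and 4, closest first
```

Emitting only this post-convergence K-closest set is what preserves the crossover invariant: two honest nodes resolving the same key sort the same candidates by the same metric.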

Stale value fix (content-fetching/index.ts)

The early termination optimization broke value retrieval: queries were cut short before all close peers responded, causing callers to receive older/stale records. Fixed by passing disableEarlyTermination: true for value-retrieval queries so all K closest peers are always consulted.
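A sketch of how such a flag might gate the early-termination check; the `QueryOptions` shape and `shouldAbortEarly` helper are illustrative assumptions, not the PR's exact code.

```typescript
// Illustrative gating of early termination behind a per-query flag
interface QueryOptions {
  disableEarlyTermination?: boolean
}

function shouldAbortEarly (completedPaths: number, totalPaths: number, opts: QueryOptions = {}): boolean {
  if (opts.disableEarlyTermination === true) {
    // Value retrieval must consult all K closest peers, otherwise callers
    // can receive stale records from peers holding older values
    return false
  }
  return completedPaths >= Math.ceil(totalPaths * 0.6)
}

console.log(shouldAbortEarly(15, 20)) // true - peer routing may stop early
console.log(shouldAbortEarly(15, 20, { disableEarlyTermination: true })) // false - getMany runs to completion
```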

Files changed

  • src/network.ts: cap AdaptiveTimeout defaults; wire networkDialTimeout config
  • src/query/query-path.ts: add onPathComplete?(pathIndex: number): void callback
  • src/query/manager.ts: out-of-band early termination via onPathComplete; add disableEarlyTermination option
  • src/peer-routing/index.ts: accumulate in PeerDistanceList, emit FINAL_PEER only after convergence
  • src/content-fetching/index.ts: disableEarlyTermination: true for getMany
  • test/peer-routing.spec.ts: 4 new unit tests for getClosestPeers correctness

Test plan

  • should emit FINAL_PEER events for peers successfully contacted during a query
  • should propagate AbortError from queryManager without emitting FINAL_PEER
  • should propagate non-AbortError from queryManager
  • should not emit FINAL_PEER for peers that returned a query error
  • All 153 existing tests pass

Real-world test results

Tested against the live Amino DHT with 88 connected peers:

  • Own peer ID lookup (control): 20 FINAL_PEER events in 8.6s
  • Random key lookup: still times out after 30s. This is a routing table density issue (pre-existing, not a regression): the routing table is too sparse to route to arbitrary keyspace regions with the current kBucketSize/prefixLength settings. This is a separate concern, as noted by @achingbrain.

Fixes #3419

@achingbrain
Member

peer-routing/index.ts: emit FINAL_PEER inline inside getCloserPeersQuery immediately after each successful
PEER_RESPONSE; peers are delivered as the query runs, not batched at the end; AbortError still propagates naturally

This doesn't sound correct. You only know what the closest peers were after the network has been traversed, you can't emit FINAL_PEER events before then.

@paschal533
Contributor Author

peer-routing/index.ts: emit FINAL_PEER inline inside getCloserPeersQuery immediately after each successful
PEER_RESPONSE; peers are delivered as the query runs, not batched at the end; AbortError still propagates naturally

This doesn't sound correct. You only know what the closest peers were after the network has been traversed, you can't emit FINAL_PEER events before then.

Thank you @achingbrain for pointing this out. I've fixed it to go back to post-traversal emission using PeerDistanceList: contacted peers are accumulated during the query and sorted by XOR distance, then FINAL_PEER is emitted for the K closest after the query completes. The AbortError case is handled by catching it, emitting whatever was found, then re-throwing so the abort still propagates to callers.

@achingbrain
Member

The AbortError case is handled by catching it, emitting whatever was found, then re-throwing so the abort still propagates to callers

I would be careful with this approach too. For the DHT to work there must be crossover in the set of closest peers that two independent nodes can resolve. It's not enough for one node to say "here are the closest peers I found in X amount of time".

@paschal533
Contributor Author

I would be careful with this approach too. For the DHT to work there must be crossover in the set of closest peers that two independent nodes can resolve. It's not enough for one node to say "here are the closest peers I found in X amount of time".

That's right, and I should have caught that earlier.
I've reverted the AbortError catch. FINAL_PEER events are now only emitted after the query fully converges. If the query times out, the error propagates as-is; callers get an honest failure rather than a subtly incorrect result. The real answer to the timeout issue is making convergence actually happen within the budget, which the AdaptiveTimeout cap and early termination are meant to address. If those aren't enough, that's where your routing table density suggestion comes in.

@tabcat
Collaborator

tabcat commented Mar 25, 2026

@paschal533 this is very technical code being touched. It may be better, for now, to see if a failing test can be built to show exactly how the code is getting stuck. Then we can target the issue and make sure the fix is in line with Kademlia DHT correctness.

@paschal533
Contributor Author

@paschal533 this is very technical code being touched. It may be better, for now, to see if a failing test can be built to show exactly how the code is getting stuck. Then we can target the issue and make sure the fix is in line with Kademlia DHT correctness.

@tabcat yeah, that makes sense. We actually do have the failing tests already in test/peer-routing.spec.ts: tests 1 and 4 would both fail on main. Test 4 catches that addWithKadId was being called unconditionally, so peers that never responded were still showing up as FINAL_PEER. Test 1 catches that peerStore.getInfo was being called twice in the emission loop and the result was used directly as the peer in finalPeerEvent, which gave wrong peer info.

The fix in peer-routing/index.ts is pretty targeted: only add a peer to the list when we actually got a PEER_RESPONSE, call peerStore.getInfo at most once, and only emit FINAL_PEER after the query fully converges (not on timeout) to preserve the crossover invariant.

The AdaptiveTimeout and onPathComplete changes are more of a performance improvement on top of that. Happy to move those to a separate PR if it makes this easier to review.



Development

Successfully merging this pull request may close these issues.

kad-dht: getClosestPeers consistently times out
