fix(network): fix three simultaneous-dial session bugs #3529
peilun-conflux wants to merge 3 commits
`deregister_stream` recorded `node_db.note_failure` for every expired session it removed, but the simultaneous-dial dedup kill passes `op = None` (since Conflux-Chain#3436 / 4ab4a79) precisely to avoid recording a failure — so the dropped redundant connection demoted the healthy peer anyway. The same blanket call also dragged a peer back down after it had already reconnected. Remove the blanket `note_failure` from `deregister_stream`. Peer reputation on disconnect is already recorded by the kill path via its `UpdateNodeOperation` (note_failure / demote / set_blacklisted for remote failures), and `set_expired` is only ever set by the two kill functions, so no genuine failure recording is lost. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
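The division of responsibility described in this commit can be sketched roughly as follows. This is a toy model, not the Conflux code: `Sessions`, `NodeDb`, the `u64` node ids, and the single-variant `UpdateNodeOperation` are simplified stand-ins, and only the names `deregister_stream`, `note_failure`, and `op` come from the commit message. The kill path is the only place that marks a session expired and the only place that touches reputation.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the reputation update a kill may carry.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum UpdateNodeOperation {
    Failure,
}

/// Toy node database that only counts failures per node id.
#[derive(Default)]
pub struct NodeDb {
    pub failures: HashMap<u64, u32>,
}

impl NodeDb {
    pub fn note_failure(&mut self, node_id: u64) {
        *self.failures.entry(node_id).or_insert(0) += 1;
    }
}

pub struct Session {
    pub node_id: u64,
    pub expired: bool,
}

#[derive(Default)]
pub struct Sessions {
    pub node_db: NodeDb,
    pub slab: HashMap<usize, Session>,
}

impl Sessions {
    pub fn insert(&mut self, token: usize, node_id: u64) {
        self.slab.insert(token, Session { node_id, expired: false });
    }

    /// Kill path: marks the session expired and applies `op`, if any.
    /// The dedup kill passes `None`, so no failure is recorded for it.
    pub fn kill(&mut self, token: usize, op: Option<UpdateNodeOperation>) {
        if let Some(session) = self.slab.get_mut(&token) {
            session.expired = true;
            match op {
                Some(UpdateNodeOperation::Failure) => {
                    self.node_db.note_failure(session.node_id)
                }
                None => {}
            }
        }
    }

    /// Removes an expired session without touching reputation.
    pub fn deregister_stream(&mut self, token: usize) {
        let expired = self.slab.get(&token).map_or(false, |s| s.expired);
        if expired {
            self.slab.remove(&token);
        }
    }
}
```

In this model, killing a redundant connection with `None` and then deregistering it leaves the peer's failure count untouched, while a kill with `Some(UpdateNodeOperation::Failure)` records exactly one failure.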
`update_ingress_node_id` ran before HELLO validation, so a simultaneous-dial `Replaced` overwrote `node_id_index` to the new ingress; if HELLO then failed validation only the new stream was killed, leaving the old session in the slab but absent from the index — unaddressable by node id. Validate HELLO into a candidate first and commit it (protocols, `node_db`, `had_hello`) only once the session is accepted, so a failed HELLO never touches the index. Surfaced via Conflux-Chain#3510. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
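The validate-then-commit ordering can be illustrated with a small sketch. Everything here is hypothetical except the field name `node_id_index`: the `Hello` payload, the `HelloCandidate` type, the `Host` struct, and the "empty protocol list is invalid" rule are assumptions made for illustration, and the real change also gates the commit on the session being accepted by the tie-break, which this sketch omits.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum HelloError {
    BadProtocol,
}

/// Hypothetical HELLO payload.
pub struct Hello {
    pub node_id: u64,
    pub protocols: Vec<String>,
}

/// A HELLO that passed validation but has not been committed yet.
pub struct HelloCandidate {
    node_id: u64,
    protocols: Vec<String>,
}

/// Pure validation: reads the payload, mutates no shared state.
/// The emptiness check is an assumed rule, used only for the example.
pub fn validate_hello(hello: &Hello) -> Result<HelloCandidate, HelloError> {
    if hello.protocols.is_empty() {
        return Err(HelloError::BadProtocol);
    }
    Ok(HelloCandidate {
        node_id: hello.node_id,
        protocols: hello.protocols.clone(),
    })
}

#[derive(Default)]
pub struct Host {
    pub node_id_index: HashMap<u64, usize>,
    pub protocols: HashMap<usize, Vec<String>>,
}

impl Host {
    /// Validate first; touch the index only after validation succeeds.
    pub fn on_hello(&mut self, token: usize, hello: &Hello) -> Result<(), HelloError> {
        let candidate = validate_hello(hello)?; // early return leaves state untouched
        self.node_id_index.insert(candidate.node_id, token);
        self.protocols.insert(token, candidate.protocols);
        Ok(())
    }
}
```

Because the early return happens before any insert, a failed HELLO on a new ingress stream leaves the existing entry for that node id pointing at the old session.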
A peer that completed HELLO and sent a second one re-entered `update_ingress_node_id` with its own token already in `node_id_index`, hitting a `panic!` — a remotely-triggerable node panic from the simultaneous-dial tie-break. Reject a second HELLO on an already-ready session with `BadProtocol` (disconnecting it like any protocol violation), and downgrade the now-unreachable index branch to a soft error. Surfaced via Conflux-Chain#3510. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
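A minimal sketch of the two guards described above, under stated assumptions: `Session`, `read_hello`, `update_ingress_index`, and the `String` error are illustrative names and types rather than the actual Conflux signatures. Only `had_hello`, `BadProtocol`, and the idea of replacing the `panic!` with a soft error come from the commit message.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum SessionError {
    BadProtocol,
}

#[derive(Default)]
pub struct Session {
    had_hello: bool,
}

impl Session {
    /// Accepts the first HELLO and rejects any later one as a protocol
    /// violation, so the caller can disconnect the peer.
    pub fn read_hello(&mut self) -> Result<(), SessionError> {
        if self.had_hello {
            return Err(SessionError::BadProtocol);
        }
        self.had_hello = true;
        Ok(())
    }
}

/// Index update that reports an unexpected state as an error value
/// instead of panicking when the token is already indexed.
pub fn update_ingress_index(
    index: &mut HashMap<u64, usize>,
    node_id: u64,
    token: usize,
) -> Result<(), String> {
    if index.get(&node_id) == Some(&token) {
        return Err(format!(
            "token {} already indexed for node {}",
            token, node_id
        ));
    }
    index.insert(node_id, token);
    Ok(())
}
```

With the first guard in place the second branch should not be reachable from the network, but returning an error keeps a logic slip from turning into a process crash.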
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Overview
This PR bundles three targeted, independent fixes to the simultaneous-dial session path. The logic is correct and well-reasoned. Below is a brief summary of what was verified.
Fix 1: Remove redundant
Fix 2: Validate HELLO before index update (
Fix 3: Duplicate HELLO guard and soft error (
Files Reviewed (3 files)
Reviewed by claude-sonnet-4.6 · 770,231 tokens
This PR bundles three independent fixes to the simultaneous-dial session path, all surfaced while auditing #3510. Each stands alone.
1. Stop demoting healthy peers on simultaneous-dial dedup.
`deregister_stream` recorded `node_db.note_failure` for every expired session it removed, but the simultaneous-dial dedup kill passes `op = None` (since #3436 / `4ab4a79b`) precisely to avoid recording a failure. As a result, the dropped redundant connection demoted the healthy peer anyway, and the same blanket call dragged a peer back down after it had already reconnected. Peer reputation on disconnect is already recorded by the kill path via its `UpdateNodeOperation`, so this removes the redundant blanket `note_failure` from `deregister_stream`, leaving the kill path's `op` as the single source of disconnect reputation. No genuine failure is lost: `set_expired` is only set by the kill paths, which already apply the `op`, and every real-failure kill passes `remote = true, Some(Failure)`; the only case the blanket call uniquely covered is the `op = None` dedup, which is not a failure.
2. Validate HELLO before the simultaneous-dial index update.
`update_ingress_node_id` ran before HELLO validation, so a simultaneous-dial `Replaced` overwrote `node_id_index` to the new ingress; if HELLO then failed validation, only the new stream was killed, leaving the old session in the slab but absent from the index and therefore unaddressable by node id. HELLO is now validated into a candidate first and committed (protocols, `node_db`, `had_hello`) only once the session is accepted, so a failed HELLO never touches the index.
3. Reject duplicate HELLO instead of panicking.
A peer that completed HELLO and sent a second one re-entered `update_ingress_node_id` with its own token already in `node_id_index`, hitting a `panic!`, which made this a remotely-triggerable node panic. A second HELLO on an already-ready session is now rejected with `BadProtocol` (disconnecting it like any protocol violation), and the now-unreachable index branch is downgraded to a soft error.
🤖 Generated with Claude Code