Skip to content

fix(network): fix three simultaneous-dial session bugs#3529

Open
peilun-conflux wants to merge 3 commits into
Conflux-Chain:masterfrom
peilun-conflux:pr0-net-notefailure
Open

fix(network): fix three simultaneous-dial session bugs#3529
peilun-conflux wants to merge 3 commits into
Conflux-Chain:masterfrom
peilun-conflux:pr0-net-notefailure

Conversation

@peilun-conflux

@peilun-conflux peilun-conflux commented May 29, 2026

Copy link
Copy Markdown
Contributor

This PR bundles three independent fixes to the simultaneous-dial session path, all surfaced while auditing #3510. Each stands alone.

1. Stop demoting healthy peers on simultaneous-dial dedup. deregister_stream recorded node_db.note_failure for every expired session it removed, but the simultaneous-dial dedup kill passes op = None (since #3436 / 4ab4a79b) precisely to avoid recording a failure — so the dropped redundant connection demoted the healthy peer anyway, and the same blanket call dragged a peer back down after it had already reconnected. Peer reputation on disconnect is already recorded by the kill path via its UpdateNodeOperation, so this removes the redundant blanket note_failure from deregister_stream, leaving the kill path's op as the single source of disconnect reputation. No genuine failure is lost: set_expired is only set by the kill paths, which already apply the op, and every real-failure kill passes remote = true, Some(Failure); the only case the blanket call uniquely covered is the op = None dedup, which is not a failure.

2. Validate HELLO before the simultaneous-dial index update. update_ingress_node_id ran before HELLO validation, so a simultaneous-dial Replaced overwrote node_id_index to the new ingress; if HELLO then failed validation only the new stream was killed, leaving the old session in the slab but absent from the index — unaddressable by node id. HELLO is now validated into a candidate first and committed (protocols, node_db, had_hello) only once the session is accepted, so a failed HELLO never touches the index.

3. Reject duplicate HELLO instead of panicking. A peer that completed HELLO and sent a second one re-entered update_ingress_node_id with its own token already in node_id_index, hitting a panic! — a remotely-triggerable node panic. A second HELLO on an already-ready session is now rejected with BadProtocol (disconnecting it like any protocol violation), and the now-unreachable index branch is downgraded to a soft error.

🤖 Generated with Claude Code


This change is Reviewable

`deregister_stream` recorded `node_db.note_failure` for every expired
session it removed, but the simultaneous-dial dedup kill passes
`op = None` (since Conflux-Chain#3436 / 4ab4a79) precisely to avoid recording a
failure — so the dropped redundant connection demoted the healthy peer
anyway. The same blanket call also dragged a peer back down after it had
already reconnected.

Remove the blanket `note_failure` from `deregister_stream`. Peer
reputation on disconnect is already recorded by the kill path via its
`UpdateNodeOperation` (note_failure / demote / set_blacklisted for remote
failures), and `set_expired` is only ever set by the two kill functions,
so no genuine failure recording is lost.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@peilun-conflux peilun-conflux requested a review from ChenxingLi May 29, 2026 14:15
@peilun-conflux peilun-conflux changed the title fix(network): stop demoting healthy peers on simultaneous-dial dedup fix(network): fix three simultaneous-dial session bugs Jun 1, 2026
peilun-conflux and others added 2 commits June 1, 2026 19:03
`update_ingress_node_id` ran before HELLO validation, so a simultaneous-dial
`Replaced` overwrote `node_id_index` to the new ingress; if HELLO then failed
validation only the new stream was killed, leaving the old session in the slab
but absent from the index — unaddressable by node id.

Validate HELLO into a candidate first and commit it (protocols, `node_db`,
`had_hello`) only once the session is accepted, so a failed HELLO never touches
the index. Surfaced via Conflux-Chain#3510.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A peer that completed HELLO and sent a second one re-entered
`update_ingress_node_id` with its own token already in `node_id_index`,
hitting a `panic!` — a remotely-triggerable node panic from the
simultaneous-dial tie-break.

Reject a second HELLO on an already-ready session with `BadProtocol`
(disconnecting it like any protocol violation), and downgrade the
now-unreachable index branch to a soft error. Surfaced via Conflux-Chain#3510.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kilo-code-bot

kilo-code-bot Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Overview

This PR bundles three targeted, independent fixes to the simultaneous-dial session path. The logic is correct and well-reasoned. Below is a brief summary of what was verified.

Fix 1 — Remove redundant note_failure in deregister_stream (service.rs)
The removed call blanketed every expired-session deregistration with note_failure, which incorrectly demoted healthy peers when a redundant simultaneous-dial connection was torn down (op = None). The kill path's UpdateNodeOperation already handles reputation accounting for all real-failure paths, and the only case uniquely covered by the blanket call was the op = None dedup path — which is not a failure. Removal is correct.

Fix 2 — Validate HELLO before index update (session.rs)
The old ordering was: update_ingress_node_id (writes the index) → read_hello (validates). If validation failed, the index entry from the replaced session was orphaned. The new ordering is: validate_hello (pure parsing, no state mutation) → update_ingress_node_id (writes the index) → commit (peer_protocols, node_db.insert_with_token, had_hello) only if the tie-break was won. This is the correct fix. The early-return for the tie-break loser (token_to_disconnect == self.token()) correctly skips the commit block, leaving no partial state.

Fix 3 — Duplicate HELLO guard and soft error (session.rs, session_manager.rs)
A second HELLO on a ready session re-entered update_ingress_node_id with its own token already in node_id_index, triggering a panic!. The new is_ready() guard in read_packet rejects it with BadProtocol before any index access. The panic! in SessionManager::update_ingress_node_id is correctly downgraded to a soft error (the comment "should be unreachable" is accurate given the guard); returning Err causes session.rs:395 to issue send_disconnect(UpdateNodeIdFailed) and propagate the error, which is appropriate.

Files Reviewed (3 files)
  • crates/network/src/service.rs
  • crates/network/src/session.rs
  • crates/network/src/session_manager.rs

Fix these issues in Kilo Cloud


Reviewed by claude-sonnet-4.6 · 770,231 tokens

@peilun-conflux peilun-conflux requested a review from Pana June 4, 2026 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant