fix(network): fix three simultaneous-dial session bugs by peilun-conflux · Pull Request #3529 · Conflux-Chain/conflux-rust

peilun-conflux · 2026-05-29T14:11:35Z

This PR bundles three independent fixes to the simultaneous-dial session path, all surfaced while auditing #3510. Each stands alone.

1. Stop demoting healthy peers on simultaneous-dial dedup. deregister_stream recorded node_db.note_failure for every expired session it removed, but the simultaneous-dial dedup kill passes op = None (since #3436 / 4ab4a79b) precisely to avoid recording a failure — so the dropped redundant connection demoted the healthy peer anyway, and the same blanket call dragged a peer back down after it had already reconnected. Peer reputation on disconnect is already recorded by the kill path via its UpdateNodeOperation, so this removes the redundant blanket note_failure from deregister_stream, leaving the kill path's op as the single source of disconnect reputation. No genuine failure is lost: set_expired is only set by the kill paths, which already apply the op, and every real-failure kill passes remote = true, Some(Failure); the only case the blanket call uniquely covered is the op = None dedup, which is not a failure.

2. Validate HELLO before the simultaneous-dial index update. update_ingress_node_id ran before HELLO validation, so a simultaneous-dial Replaced overwrote node_id_index to the new ingress; if HELLO then failed validation only the new stream was killed, leaving the old session in the slab but absent from the index — unaddressable by node id. HELLO is now validated into a candidate first and committed (protocols, node_db, had_hello) only once the session is accepted, so a failed HELLO never touches the index.

3. Reject duplicate HELLO instead of panicking. A peer that completed HELLO and sent a second one re-entered update_ingress_node_id with its own token already in node_id_index, hitting a panic! — a remotely-triggerable node panic. A second HELLO on an already-ready session is now rejected with BadProtocol (disconnecting it like any protocol violation), and the now-unreachable index branch is downgraded to a soft error.

🤖 Generated with Claude Code

This change is

`deregister_stream` recorded `node_db.note_failure` for every expired session it removed, but the simultaneous-dial dedup kill passes `op = None` (since Conflux-Chain#3436 / 4ab4a79) precisely to avoid recording a failure — so the dropped redundant connection demoted the healthy peer anyway. The same blanket call also dragged a peer back down after it had already reconnected. Remove the blanket `note_failure` from `deregister_stream`. Peer reputation on disconnect is already recorded by the kill path via its `UpdateNodeOperation` (note_failure / demote / set_blacklisted for remote failures), and `set_expired` is only ever set by the two kill functions, so no genuine failure recording is lost. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`update_ingress_node_id` ran before HELLO validation, so a simultaneous-dial `Replaced` overwrote `node_id_index` to the new ingress; if HELLO then failed validation only the new stream was killed, leaving the old session in the slab but absent from the index — unaddressable by node id. Validate HELLO into a candidate first and commit it (protocols, `node_db`, `had_hello`) only once the session is accepted, so a failed HELLO never touches the index. Surfaced via Conflux-Chain#3510. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A peer that completed HELLO and sent a second one re-entered `update_ingress_node_id` with its own token already in `node_id_index`, hitting a `panic!` — a remotely-triggerable node panic from the simultaneous-dial tie-break. Reject a second HELLO on an already-ready session with `BadProtocol` (disconnecting it like any protocol violation), and downgrade the now-unreachable index branch to a soft error. Surfaced via Conflux-Chain#3510. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

kilo-code-bot · 2026-06-01T11:35:13Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Overview

This PR bundles three targeted, independent fixes to the simultaneous-dial session path. The logic is correct and well-reasoned. Below is a brief summary of what was verified.

Fix 1 — Remove redundant note_failure in deregister_stream (service.rs)
The removed call blanketed every expired-session deregistration with note_failure, which incorrectly demoted healthy peers when a redundant simultaneous-dial connection was torn down (op = None). The kill path's UpdateNodeOperation already handles reputation accounting for all real-failure paths, and the only case uniquely covered by the blanket call was the op = None dedup path — which is not a failure. Removal is correct.

Fix 2 — Validate HELLO before index update (session.rs)
The old ordering was: update_ingress_node_id (writes the index) → read_hello (validates). If validation failed, the index entry from the replaced session was orphaned. The new ordering is: validate_hello (pure parsing, no state mutation) → update_ingress_node_id (writes the index) → commit (peer_protocols, node_db.insert_with_token, had_hello) only if the tie-break was won. This is the correct fix. The early-return for the tie-break loser (token_to_disconnect == self.token()) correctly skips the commit block, leaving no partial state.

Fix 3 — Duplicate HELLO guard and soft error (session.rs, session_manager.rs)
A second HELLO on a ready session re-entered update_ingress_node_id with its own token already in node_id_index, triggering a panic!. The new is_ready() guard in read_packet rejects it with BadProtocol before any index access. The panic! in SessionManager::update_ingress_node_id is correctly downgraded to a soft error (the comment "should be unreachable" is accurate given the guard); returning Err causes session.rs:395 to issue send_disconnect(UpdateNodeIdFailed) and propagate the error, which is appropriate.

Files Reviewed (3 files)

crates/network/src/service.rs
crates/network/src/session.rs
crates/network/src/session_manager.rs

Fix these issues in Kilo Cloud

_{Reviewed by claude-sonnet-4.6 · 770,231 tokens}

peilun-conflux mentioned this pull request May 29, 2026

Centralize network session state transitions and make node_id_index a managed invariant #3510

Open

peilun-conflux requested a review from ChenxingLi May 29, 2026 14:15

peilun-conflux changed the title ~~fix(network): stop demoting healthy peers on simultaneous-dial dedup~~ fix(network): fix three simultaneous-dial session bugs Jun 1, 2026

peilun-conflux force-pushed the pr0-net-notefailure branch from 20bf932 to 7402cab Compare June 1, 2026 05:31

peilun-conflux and others added 2 commits June 1, 2026 19:03

peilun-conflux force-pushed the pr0-net-notefailure branch from 7402cab to e3b70df Compare June 1, 2026 11:32

peilun-conflux requested a review from Pana June 4, 2026 08:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(network): fix three simultaneous-dial session bugs#3529

fix(network): fix three simultaneous-dial session bugs#3529
peilun-conflux wants to merge 3 commits into
Conflux-Chain:masterfrom
peilun-conflux:pr0-net-notefailure

peilun-conflux commented May 29, 2026 •

edited

Loading

Uh oh!

kilo-code-bot Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

peilun-conflux commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kilo-code-bot Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Summary

Overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

peilun-conflux commented May 29, 2026 •

edited

Loading

kilo-code-bot Bot commented Jun 1, 2026 •

edited

Loading