fix(net): classify Invalid-block penalty + peer lifecycle metrics#333
fix(net): classify Invalid-block penalty + peer lifecycle metrics#333constwz wants to merge 3 commits into
Conversation
Two coupled changes targeting reth-bsc #320 ("peers drop to zero after sync"): 1. Categorized Invalid handling in block_import/service.rs PR #297 fixed the immediate symptom by suppressing the BadBlock reputation penalty for ALL `PayloadStatusEnum::Invalid` outcomes. That's too blunt — a peer sending truly malformed blocks (bad Parlia signature, wrong turn difficulty, structural hash mismatch) now also escapes the penalty mechanism. We classify the error string instead. Race-shaped failures ("state root", "parent block", "unknown parent", "missing trie", "receipts root", "insert_canonical", "reorg") skip the penalty path (no Outcome emitted, reth's NetworkManager::on_block_imported never sees an Err, no BadBlock applied). Structural failures keep the original behaviour so BadBlock still catches real offenders. Aligns with geth's fetcher semantics: drop on header verification failure, not on execution failure. Test `can_handle_invalid_new_payload` (error = "test error") is restored to assert the structural branch still penalizes. Added `reorg_race_invalid_does_not_penalize` covering the skip branch. 2. Peer lifecycle summary metrics (node/network/peer_metrics.rs) Subscribes to NetworkEventListenerProvider's PeerEvent stream and emits ONE INFO-level line per 60-second window, aggregated: INFO bsc::peer_metrics window_secs=60 opened=3 closed=15 closed_by=disconnect_requested:8,too_many_peers:4,ping_timeout:3 Silent windows produce no output. DisconnectReason is bucketed into a stable snake-case label so logs stay diffable across reth upgrades and are easily Grafana-panelable later. Purpose — in production we had no visibility into WHY peers were disconnecting; rep-penalty, protocol-breach, ping-timeout and ordinary churn looked identical from the outside. This collector turns the drain pattern into something readable without spamming. Spawned once from main.rs after node.launch() alongside the other post-launch setup. Refs #320, #297 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s counters Responding to @zhk101's review on PR #333: reth already exposes per- DisconnectReason counters (`network_disconnect_requested`, `network_too_many_peers`, etc.). The custom peer_metrics log aggregator duplicates that surface for nodes that already scrape `/metrics`. Replace it with a more useful diagnostic surface: per-outcome counters for `engine.new_payload` results. This fills the actual observability gap — there were NO metrics on the block-import path, so the race-Invalid vs. structural-Invalid ratio (the central tuning knob for the classifier added in this PR) was invisible at the Prometheus layer. New metrics (scope `bsc.block_import`, derived from the standard `#[derive(Metrics)]` pattern in `src/metrics.rs`): bsc_block_import_invalid_race # race-shaped Invalid, no penalty bsc_block_import_invalid_structural # structural Invalid, BadBlock applied bsc_block_import_valid # accepted by engine bsc_block_import_syncing # engine syncing, FCU forced bsc_block_import_hash_mismatch # rejected before reaching engine Diagnostic value: a peer-drop incident driven by structural rejects will show `invalid_structural` rate spike just before peer count declines, while a race-driven scenario will show `invalid_race` dominating with peer count stable. Either way the trace is now in the metrics layer instead of grep'd log lines. Removed: src/node/network/peer_metrics.rs and its main.rs / mod.rs wiring. The block-import classification logic (race vs. structural) and the two unit tests added in the previous commit are unchanged. Refs #320, #297, #333 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…alid-peer-penalty # Conflicts: # src/node/network/block_import/service.rs
|
@zhk101 thanks for catching this — you're right that reth's I've reworked the PR (commit `23d7248`): Removed: Added instead: a ``` This fills the actual observability gap (the block-import path had zero metrics before this), and makes the race-vs-structural ratio — which is the central tuning knob for the classifier in this PR — directly visible in Prometheus instead of grep'd from logs. Also rebased onto current `develop` to clear the merge conflict (the `fork_recover` refactor in #334 reshaped `ImportService` significantly; resolved by keeping the new `recovering_heads` / `failed_heads` / `announce_interval` fields and adding the `metrics` field alongside them). All 14 `block_import::service::tests` pass locally including the two new classification tests. |
…s counters Responding to @zhk101's review on PR #333: reth already exposes per- DisconnectReason counters (`network_disconnect_requested`, `network_too_many_peers`, etc.). The custom peer_metrics log aggregator duplicates that surface for nodes that already scrape `/metrics`. Replace it with a more useful diagnostic surface: per-outcome counters for `engine.new_payload` results. This fills the actual observability gap — there were NO metrics on the block-import path, so the race-Invalid vs. structural-Invalid ratio (the central tuning knob for the classifier added in this PR) was invisible at the Prometheus layer. New metrics (scope `bsc.block_import`, derived from the standard `#[derive(Metrics)]` pattern in `src/metrics.rs`): bsc_block_import_invalid_race # race-shaped Invalid, no penalty bsc_block_import_invalid_structural # structural Invalid, BadBlock applied bsc_block_import_valid # accepted by engine bsc_block_import_syncing # engine syncing, FCU forced bsc_block_import_hash_mismatch # rejected before reaching engine Diagnostic value: a peer-drop incident driven by structural rejects will show `invalid_structural` rate spike just before peer count declines, while a race-driven scenario will show `invalid_race` dominating with peer count stable. Either way the trace is now in the metrics layer instead of grep'd log lines. Removed: src/node/network/peer_metrics.rs and its main.rs / mod.rs wiring. The block-import classification logic (race vs. structural) and the two unit tests added in the previous commit are unchanged. Refs #320, #297, #333 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s counters Responding to @zhk101's review on PR #333: reth already exposes per- DisconnectReason counters (`network_disconnect_requested`, `network_too_many_peers`, etc.). The custom peer_metrics log aggregator duplicates that surface for nodes that already scrape `/metrics`. Replace it with a more useful diagnostic surface: per-outcome counters for `engine.new_payload` results. This fills the actual observability gap — there were NO metrics on the block-import path, so the race-Invalid vs. structural-Invalid ratio (the central tuning knob for the classifier added in this PR) was invisible at the Prometheus layer. New metrics (scope `bsc.block_import`, derived from the standard `#[derive(Metrics)]` pattern in `src/metrics.rs`): bsc_block_import_invalid_race # race-shaped Invalid, no penalty bsc_block_import_invalid_structural # structural Invalid, BadBlock applied bsc_block_import_valid # accepted by engine bsc_block_import_syncing # engine syncing, FCU forced bsc_block_import_hash_mismatch # rejected before reaching engine Diagnostic value: a peer-drop incident driven by structural rejects will show `invalid_structural` rate spike just before peer count declines, while a race-driven scenario will show `invalid_race` dominating with peer count stable. Either way the trace is now in the metrics layer instead of grep'd log lines. Removed: src/node/network/peer_metrics.rs and its main.rs / mod.rs wiring. The block-import classification logic (race vs. structural) and the two unit tests added in the previous commit are unchanged. Refs #320, #297, #333 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Summary
Two coupled changes targeting #320 (peers drop to zero after sync):
block_import/service.rs— replace fix: do not penalize peers for Invalid blocks from new_payload #297's blanket skip with error-string classification. Race-shaped failures (state root,parent block,unknown parent,missing trie,receipts root,insert_canonical,reorg) skip the penalty path; structural failures keep theBadBlockmechanism so real offenders are still caught. Aligns with geth's fetcher semantics (drop on header verify failure, not execution failure).node/network/peer_metrics.rs) — subscribes toNetworkEventListenerProvider'sPeerEventstream and emits oneINFOline per 60 s window with opens/closes bucketed byDisconnectReason. Idle windows stay silent. Purpose: in prod we had no visibility into why peers were disconnecting.Why re-tighten after #297?
#297 suppressed the
BadBlockpenalty for everyPayloadStatusEnum::Invalid. That fixes the drain caused by concurrent-reorg races (which is real — 4 race-Invalids bans a peer, recovery +1/sec ≈ 4.5 h to re-earn), but also lets genuinely malformed blocks from malicious or broken peers escape the reputation system. Classifying by error string lets us keep the race-race immunity without surrendering the structural safety net.Example metrics output
Test plan
cargo check— passcargo clippy --tests -- -D warnings— passcargo test --lib node::network::block_import::service::tests::can_handle_invalid_new_payload— pass (structural branch still emits Err, BadBlock path preserved)cargo test --lib node::network::block_import::service::tests::reorg_race_invalid_does_not_penalize— new test, pass (race branch emits no Err outcome)bsc::peer_metricsINFO line on a running testnet/mainnet node; confirm one line per minute only when there is activity, silent during steady state.Refs #320, #297.
🤖 Generated with Claude Code