Skip to content

fix(net): classify Invalid-block penalty + peer lifecycle metrics#333

Open
constwz wants to merge 3 commits into
developfrom
fix/classified-invalid-peer-penalty
Open

fix(net): classify Invalid-block penalty + peer lifecycle metrics#333
constwz wants to merge 3 commits into
developfrom
fix/classified-invalid-peer-penalty

Conversation

@constwz
Copy link
Copy Markdown
Contributor

@constwz constwz commented Apr 24, 2026

Summary

Two coupled changes targeting #320 (peers drop to zero after sync):

  1. Categorized Invalid handling in block_import/service.rs — replace fix: do not penalize peers for Invalid blocks from new_payload #297's blanket skip with error-string classification. Race-shaped failures (state root, parent block, unknown parent, missing trie, receipts root, insert_canonical, reorg) skip the penalty path; structural failures keep the BadBlock mechanism so real offenders are still caught. Aligns with geth's fetcher semantics (drop on header verify failure, not execution failure).
  2. Peer lifecycle summary metrics (node/network/peer_metrics.rs) — subscribes to NetworkEventListenerProvider's PeerEvent stream and emits one INFO line per 60 s window with opens/closes bucketed by DisconnectReason. Idle windows stay silent. Purpose: in prod we had no visibility into why peers were disconnecting.

Why re-tighten after #297?

#297 suppressed the BadBlock penalty for every PayloadStatusEnum::Invalid. That fixes the drain caused by concurrent-reorg races (which is real — 4 race-Invalids bans a peer, recovery +1/sec ≈ 4.5 h to re-earn), but also lets genuinely malformed blocks from malicious or broken peers escape the reputation system. Classifying by error string lets us keep the race-race immunity without surrendering the structural safety net.

Example metrics output

INFO bsc::peer_metrics window_secs=60 opened=3 closed=15 \
     closed_by=disconnect_requested:8,too_many_peers:4,ping_timeout:3 \
     "peer lifecycle summary"

Test plan

  • cargo check — pass
  • cargo clippy --tests -- -D warnings — pass
  • cargo test --lib node::network::block_import::service::tests::can_handle_invalid_new_payload — pass (structural branch still emits Err, BadBlock path preserved)
  • cargo test --lib node::network::block_import::service::tests::reorg_race_invalid_does_not_penalize — new test, pass (race branch emits no Err outcome)
  • Observe bsc::peer_metrics INFO line on a running testnet/mainnet node; confirm one line per minute only when there is activity, silent during steady state.

Refs #320, #297.

🤖 Generated with Claude Code

Two coupled changes targeting reth-bsc #320 ("peers drop to zero after
sync"):

1. Categorized Invalid handling in block_import/service.rs

   PR #297 fixed the immediate symptom by suppressing the BadBlock
   reputation penalty for ALL `PayloadStatusEnum::Invalid` outcomes.
   That's too blunt — a peer sending truly malformed blocks (bad Parlia
   signature, wrong turn difficulty, structural hash mismatch) now also
   escapes the penalty mechanism.

   We classify the error string instead. Race-shaped failures
   ("state root", "parent block", "unknown parent", "missing trie",
   "receipts root", "insert_canonical", "reorg") skip the penalty path
   (no Outcome emitted, reth's NetworkManager::on_block_imported never
   sees an Err, no BadBlock applied). Structural failures keep the
   original behaviour so BadBlock still catches real offenders.

   Aligns with geth's fetcher semantics: drop on header verification
   failure, not on execution failure.

   Test `can_handle_invalid_new_payload` (error = "test error") is
   restored to assert the structural branch still penalizes. Added
   `reorg_race_invalid_does_not_penalize` covering the skip branch.

2. Peer lifecycle summary metrics (node/network/peer_metrics.rs)

   Subscribes to NetworkEventListenerProvider's PeerEvent stream and
   emits ONE INFO-level line per 60-second window, aggregated:

     INFO bsc::peer_metrics window_secs=60 opened=3 closed=15
          closed_by=disconnect_requested:8,too_many_peers:4,ping_timeout:3

   Silent windows produce no output. DisconnectReason is bucketed into
   a stable snake-case label so logs stay diffable across reth upgrades
   and are easily Grafana-panelable later.

   Purpose — in production we had no visibility into WHY peers were
   disconnecting; rep-penalty, protocol-breach, ping-timeout and
   ordinary churn looked identical from the outside. This collector
   turns the drain pattern into something readable without spamming.

   Spawned once from main.rs after node.launch() alongside the other
   post-launch setup.

Refs #320, #297

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@constwz constwz requested a review from joey0612 as a code owner April 24, 2026 05:56
@zhk101
Copy link
Copy Markdown

zhk101 commented Apr 24, 2026

reth is already logging the peer disconnect reason as metrics
image

reth_network_useless_peer
reth_network_subprotocol_specific
reth_network_already_connected
reth_network_client_quitting
reth_network_unexpected_identity
reth_network_disconnect_requested
reth_network_null_node_identity
reth_network_tcp_subsystem_error
reth_network_incompatible
reth_network_protocol_breach
reth_network_too_many_peers

cbh876 and others added 2 commits April 29, 2026 14:10
…s counters

Responding to @zhk101's review on PR #333: reth already exposes per-
DisconnectReason counters (`network_disconnect_requested`,
`network_too_many_peers`, etc.). The custom peer_metrics log aggregator
duplicates that surface for nodes that already scrape `/metrics`.

Replace it with a more useful diagnostic surface: per-outcome counters
for `engine.new_payload` results. This fills the actual observability
gap — there were NO metrics on the block-import path, so the
race-Invalid vs. structural-Invalid ratio (the central tuning knob for
the classifier added in this PR) was invisible at the Prometheus layer.

New metrics (scope `bsc.block_import`, derived from the standard
`#[derive(Metrics)]` pattern in `src/metrics.rs`):

  bsc_block_import_invalid_race        # race-shaped Invalid, no penalty
  bsc_block_import_invalid_structural  # structural Invalid, BadBlock applied
  bsc_block_import_valid               # accepted by engine
  bsc_block_import_syncing             # engine syncing, FCU forced
  bsc_block_import_hash_mismatch       # rejected before reaching engine

Diagnostic value: a peer-drop incident driven by structural rejects
will show `invalid_structural` rate spike just before peer count
declines, while a race-driven scenario will show `invalid_race`
dominating with peer count stable. Either way the trace is now in
the metrics layer instead of grep'd log lines.

Removed: src/node/network/peer_metrics.rs and its main.rs / mod.rs
wiring. The block-import classification logic (race vs. structural)
and the two unit tests added in the previous commit are unchanged.

Refs #320, #297, #333

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…alid-peer-penalty

# Conflicts:
#	src/node/network/block_import/service.rs
@constwz
Copy link
Copy Markdown
Contributor Author

constwz commented Apr 29, 2026

@zhk101 thanks for catching this — you're right that reth's DisconnectMetrics already covers per-DisconnectReason counters as Prometheus metrics, so a separate log aggregator on top of the same data is redundant for any node that scrapes /metrics.

I've reworked the PR (commit `23d7248`):

Removed: src/node/network/peer_metrics.rs and its main.rs / mod.rs wiring.

Added instead: a BscBlockImportMetrics struct in src/metrics.rs (same #[derive(Metrics)] pattern as the other BSC metric scopes) that exposes per-outcome counters for engine.new_payload results:

```
bsc_block_import_invalid_race # race-shaped Invalid → no penalty
bsc_block_import_invalid_structural # structural Invalid → BadBlock applied
bsc_block_import_valid # accepted by engine
bsc_block_import_syncing # engine syncing, FCU forced
bsc_block_import_hash_mismatch # rejected before reaching engine
```

This fills the actual observability gap (the block-import path had zero metrics before this), and makes the race-vs-structural ratio — which is the central tuning knob for the classifier in this PR — directly visible in Prometheus instead of grep'd from logs.

Also rebased onto current `develop` to clear the merge conflict (the `fork_recover` refactor in #334 reshaped `ImportService` significantly; resolved by keeping the new `recovering_heads` / `failed_heads` / `announce_interval` fields and adding the `metrics` field alongside them). All 14 `block_import::service::tests` pass locally including the two new classification tests.

constwz pushed a commit that referenced this pull request May 7, 2026
…s counters

Responding to @zhk101's review on PR #333: reth already exposes per-
DisconnectReason counters (`network_disconnect_requested`,
`network_too_many_peers`, etc.). The custom peer_metrics log aggregator
duplicates that surface for nodes that already scrape `/metrics`.

Replace it with a more useful diagnostic surface: per-outcome counters
for `engine.new_payload` results. This fills the actual observability
gap — there were NO metrics on the block-import path, so the
race-Invalid vs. structural-Invalid ratio (the central tuning knob for
the classifier added in this PR) was invisible at the Prometheus layer.

New metrics (scope `bsc.block_import`, derived from the standard
`#[derive(Metrics)]` pattern in `src/metrics.rs`):

  bsc_block_import_invalid_race        # race-shaped Invalid, no penalty
  bsc_block_import_invalid_structural  # structural Invalid, BadBlock applied
  bsc_block_import_valid               # accepted by engine
  bsc_block_import_syncing             # engine syncing, FCU forced
  bsc_block_import_hash_mismatch       # rejected before reaching engine

Diagnostic value: a peer-drop incident driven by structural rejects
will show `invalid_structural` rate spike just before peer count
declines, while a race-driven scenario will show `invalid_race`
dominating with peer count stable. Either way the trace is now in
the metrics layer instead of grep'd log lines.

Removed: src/node/network/peer_metrics.rs and its main.rs / mod.rs
wiring. The block-import classification logic (race vs. structural)
and the two unit tests added in the previous commit are unchanged.

Refs #320, #297, #333

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chee-chyuan pushed a commit that referenced this pull request May 7, 2026
…s counters

Responding to @zhk101's review on PR #333: reth already exposes per-
DisconnectReason counters (`network_disconnect_requested`,
`network_too_many_peers`, etc.). The custom peer_metrics log aggregator
duplicates that surface for nodes that already scrape `/metrics`.

Replace it with a more useful diagnostic surface: per-outcome counters
for `engine.new_payload` results. This fills the actual observability
gap — there were NO metrics on the block-import path, so the
race-Invalid vs. structural-Invalid ratio (the central tuning knob for
the classifier added in this PR) was invisible at the Prometheus layer.

New metrics (scope `bsc.block_import`, derived from the standard
`#[derive(Metrics)]` pattern in `src/metrics.rs`):

  bsc_block_import_invalid_race        # race-shaped Invalid, no penalty
  bsc_block_import_invalid_structural  # structural Invalid, BadBlock applied
  bsc_block_import_valid               # accepted by engine
  bsc_block_import_syncing             # engine syncing, FCU forced
  bsc_block_import_hash_mismatch       # rejected before reaching engine

Diagnostic value: a peer-drop incident driven by structural rejects
will show `invalid_structural` rate spike just before peer count
declines, while a race-driven scenario will show `invalid_race`
dominating with peer count stable. Either way the trace is now in
the metrics layer instead of grep'd log lines.

Removed: src/node/network/peer_metrics.rs and its main.rs / mod.rs
wiring. The block-import classification logic (race vs. structural)
and the two unit tests added in the previous commit are unchanged.

Refs #320, #297, #333

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants