
feat(net): expose per-kind reputation-change and ban counters#180

Open
constwz wants to merge 1 commit into develop from metrics/reputation-changes-on-develop



@constwz constwz commented Apr 29, 2026

Summary

Adds per-`ReputationChangeKind` and per-outcome counters to the network's `PeersManager`, so reputation-driven peer drops are visible at the Prometheus layer.

Why

Today, a peer-pool drain induced by reputation bans is invisible at the metrics layer:

  • Banned peers go into the `ban_list`, which has no exposed gauge or counter.
  • The existing `DisconnectMetrics` counters cannot tell a graceful close apart from a rep-driven kick — both increment `network_disconnect_requested`.
  • `apply_reputation_change` only emits a `debug`/`info` log line per call (since #175, "fix: resolve some issues in cross region test"), which is fine for forensics but useless for Grafana.

Concretely, this affects BSC node operators investigating the "peers drop to zero after sync" pattern (bnb-chain/reth-bsc#320). Without these counters, distinguishing "we are banning peers because of repeated `BadBlock` penalties" from "peers are leaving us for unrelated reasons" requires log inspection. With them, a single PromQL query, `rate(network_reputation_changes_bad_block[5m])`, correlated against `network_connected_peers`, turns the diagnosis into a Grafana panel.

New metrics

```
network_reputation_changes_bad_message
network_reputation_changes_bad_block
network_reputation_changes_bad_transactions
network_reputation_changes_bad_announcement
network_reputation_changes_already_seen_transaction
network_reputation_changes_timeout
network_reputation_changes_bad_protocol
network_reputation_changes_failed_to_connect
network_reputation_changes_dropped
network_reputation_changes_reset
network_reputation_changes_other
network_bans_total
network_disconnect_and_bans_total
network_unbans_total
```
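The naming scheme in the list above (scope `network`, then `reputation_changes_`, then the kind in snake case) can be sketched as follows. This is only an illustration: the `Kind` enum is a hypothetical stand-in inferred from the metric names, not the actual `ReputationChangeKind` definition, and `metric_name` is not a function in this PR.

```rust
/// Hypothetical mirror of the reputation-change kinds, inferred from the
/// metric names listed in this PR (not the real `ReputationChangeKind`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    BadMessage,
    BadBlock,
    BadTransactions,
    BadAnnouncement,
    AlreadySeenTransaction,
    Timeout,
    BadProtocol,
    FailedToConnect,
    Dropped,
    Reset,
    Other,
}

impl Kind {
    /// Every kind that gets its own counter.
    const ALL: [Kind; 11] = [
        Kind::BadMessage,
        Kind::BadBlock,
        Kind::BadTransactions,
        Kind::BadAnnouncement,
        Kind::AlreadySeenTransaction,
        Kind::Timeout,
        Kind::BadProtocol,
        Kind::FailedToConnect,
        Kind::Dropped,
        Kind::Reset,
        Kind::Other,
    ];
}

/// Builds the exported metric name: `network_reputation_changes_<kind>`.
/// Illustrative helper only; assumed, not part of the PR.
fn metric_name(kind: Kind) -> String {
    let suffix = match kind {
        Kind::BadMessage => "bad_message",
        Kind::BadBlock => "bad_block",
        Kind::BadTransactions => "bad_transactions",
        Kind::BadAnnouncement => "bad_announcement",
        Kind::AlreadySeenTransaction => "already_seen_transaction",
        Kind::Timeout => "timeout",
        Kind::BadProtocol => "bad_protocol",
        Kind::FailedToConnect => "failed_to_connect",
        Kind::Dropped => "dropped",
        Kind::Reset => "reset",
        Kind::Other => "other",
    };
    format!("network_reputation_changes_{suffix}")
}
```

Iterating `Kind::ALL` and calling `metric_name` on each entry reproduces the eleven per-kind names; the three outcome counters (`network_bans_total`, `network_disconnect_and_bans_total`, `network_unbans_total`) sit outside this scheme.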

Behaviour

  • The per-kind counter increments before the trusted-peer / unknown-peer guards, so it answers "what's hitting us"; the outcome counters answer "did we punish for it".
  • Trusted-peer exemption is preserved.
  • No behaviour change beyond the new counters.
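The ordering described above can be sketched roughly as below. This is a simplified model under stated assumptions, not the actual `PeersManager` code: the `Peers`, `Peer`, `Outcome`, and `ReputationCounters` types are hypothetical, counters are plain integers rather than Prometheus counters, trusted peers are treated as fully exempt, and the split between `bans_total` and `disconnect_and_bans_total` (disconnected vs. connected peer at ban time) is assumed.

```rust
use std::collections::HashMap;

/// Plain-integer stand-ins for the new counters (hypothetical).
#[derive(Default)]
struct ReputationCounters {
    kind_events: HashMap<&'static str, u64>,
    bans_total: u64,
    disconnect_and_bans_total: u64,
}

struct Peer {
    reputation: i32,
    trusted: bool,
    connected: bool,
}

struct Peers {
    peers: HashMap<u64, Peer>,
    ban_threshold: i32,
    counters: ReputationCounters,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Ignored,
    Applied,
    Banned,
    DisconnectedAndBanned,
}

impl Peers {
    fn apply_reputation_change(
        &mut self,
        peer_id: u64,
        kind: &'static str,
        penalty: i32,
    ) -> Outcome {
        // 1. Count the kind first, so it is recorded even when a guard
        //    below returns early ("what's hitting us").
        *self.counters.kind_events.entry(kind).or_insert(0) += 1;

        // 2. Guards: unknown and trusted peers are not penalised
        //    (full trusted-peer exemption is an assumption of this sketch).
        let Some(peer) = self.peers.get_mut(&peer_id) else {
            return Outcome::Ignored;
        };
        if peer.trusted {
            return Outcome::Ignored;
        }

        // 3. Apply the penalty and record the outcome ("did we punish for it").
        peer.reputation = peer.reputation.saturating_sub(penalty);
        if peer.reputation > self.ban_threshold {
            return Outcome::Applied;
        }
        if peer.connected {
            peer.connected = false;
            self.counters.disconnect_and_bans_total += 1;
            Outcome::DisconnectedAndBanned
        } else {
            self.counters.bans_total += 1;
            Outcome::Banned
        }
    }
}
```

With this ordering, a burst of `bad_block` reports against trusted or unknown peers still shows up in the per-kind counter while the outcome counters stay flat, which is the distinction the two counter families are meant to expose.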

Test plan

  • `cargo check -p reth-network` — pass
  • `cargo +nightly clippy -p reth-network --tests --all-features` — no new warnings
  • `cargo +nightly fmt --check` — clean
  • `cargo nextest run -p reth-network` — 177 passed, 4 skipped

Refs bnb-chain/reth-bsc#320.

Adds a `ReputationMetrics` struct (scope `network`) with a `Counter`
per `ReputationChangeKind` plus three outcome counters
(`bans_total`, `disconnect_and_bans_total`, `unbans_total`), and
instruments `PeersManager::apply_reputation_change` to increment them.

New Prometheus metrics:

  network_reputation_changes_bad_message
  network_reputation_changes_bad_block
  network_reputation_changes_bad_transactions
  network_reputation_changes_bad_announcement
  network_reputation_changes_already_seen_transaction
  network_reputation_changes_timeout
  network_reputation_changes_bad_protocol
  network_reputation_changes_failed_to_connect
  network_reputation_changes_dropped
  network_reputation_changes_reset
  network_reputation_changes_other
  network_bans_total
  network_disconnect_and_bans_total
  network_unbans_total

Diagnostic motivation: today, a peer-pool drain induced by reputation
bans is invisible at the metrics layer. Banned peers go into the
`ban_list`, which has no exposed gauge or counter, and the existing
`DisconnectMetrics` counters cannot tell a graceful close apart from
a rep-driven disconnect — both increment `disconnect_requested`.

Concretely this affects BSC node operators investigating the
"peers drop to zero after sync" pattern (bnb-chain/reth-bsc#320):
without these counters, distinguishing "we are banning peers because
of repeated `BadBlock` penalties" from "peers are leaving us for
unrelated reasons" requires log inspection. With them, a single
PromQL `rate(network_reputation_changes_bad_block[5m])` correlated
against `network_connected_peers` makes the diagnosis a Grafana
panel.

The kind-counter increments before the trusted-peer / unknown-peer
guards so it answers "what's hitting us" — outcome counters answer
"did we punish for it". Trusted-peer exemption is preserved.

No behaviour change beyond the new counters.
@constwz constwz requested a review from joey0612 as a code owner April 29, 2026 06:38