bera-reth stalls while still appearing healthy

We hit the same failure mode twice on Berachain mainnet archive nodes, and it looks like an execution-side problem rather than a BeaconKit-only one.

The `reth` nodes look healthy (`eth_syncing=false`, peers connected), while BeaconKit reports Engine API timeouts and `eth_getLogs` degrades.

## Summary

On two separate occasions, `bera-reth` stopped making forward progress while still appearing superficially healthy:

- peers still connected
- `eth_syncing` returned `false`
- basic RPCs like `eth_blockNumber` still responded
- but BeaconKit stopped receiving successful EL responses and eventually flatlined
- on at least one node, local `eth_getLogs` became hung/broken until `reth` was restarted

Restarting `reth` restored progress. Restarting BeaconKit alone did not reliably recover the node.

## Incident windows

We have seen this occasionally since around 2026-03-22, including two specific incidents at:

- 2026-03-22 06:56 UTC
- 2026-03-22 23:39 UTC

We saw the same or very similar pattern on two different nodes.

## Main symptoms

Execution-layer side:
- `reth` stopped advancing the canonical chain height
- `eth_syncing` still returned `false`
- peers remained connected
- basic RPCs still answered
- local `eth_getLogs` either failed heavily or stopped responding
- in one case, `reth` warned that the beacon client was online but no consensus updates were being received
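Because `eth_syncing` and simple RPCs stay green during the stall, a check that only looks at those will miss it. Below is a minimal watchdog sketch that flags a node whose `eth_blockNumber` has not moved for a while even though `eth_syncing` returns `false`. The RPC URL, polling interval, and 60-second threshold are assumptions for illustration, not values taken from our setup, and `is_stalled` and `watch` are hypothetical helper names.

```python
import json
import time
import urllib.request

RPC_URL = "http://127.0.0.1:8545"  # assumption: local reth HTTP RPC endpoint


def rpc(method, params=None, url=RPC_URL, timeout=5.0):
    """Send one JSON-RPC request and return its "result" field."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["result"]


def is_stalled(samples, max_stall_seconds):
    """Return True if the height has been flat for max_stall_seconds.

    samples is a list of (unix_time, block_number, syncing) tuples,
    oldest first. A node that reports syncing=True is never flagged,
    since a flat height is expected there.
    """
    if len(samples) < 2:
        return False
    last_time, last_height, last_syncing = samples[-1]
    if last_syncing:
        return False
    # Walk backwards to find when the current height was first observed.
    flat_since = last_time
    for sample_time, height, _ in reversed(samples):
        if height != last_height:
            break
        flat_since = sample_time
    return last_time - flat_since >= max_stall_seconds


def watch(interval=10.0, max_stall_seconds=60.0):
    """Poll the node forever and print a warning when it looks stalled."""
    samples = []
    while True:
        height = int(rpc("eth_blockNumber"), 16)
        syncing = rpc("eth_syncing") is not False  # false means "not syncing"
        samples.append((time.time(), height, syncing))
        del samples[:-100]  # keep only the most recent samples
        if is_stalled(samples, max_stall_seconds):
            print(f"height {height} flat while eth_syncing=false")
        time.sleep(interval)
```

Running `watch()` next to the node would print a warning once the height has been flat for the chosen window. The threshold should be set well above the chain's normal block interval to avoid false alarms.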

Consensus-layer side:
- BeaconKit started timing out on local Engine API calls to `reth`
- BeaconKit replay / deposit-reading path got stuck
- after `reth` restart, BeaconKit sometimes needed one or more restarts to fully recover

Example BeaconKit-side errors we saw:
- `engine API call timed out`
- deposit-reading / filter-related errors during recovery
- later forkchoice / replay stalls

## Recovery

What consistently helped was:

1. restart `reth`
2. if BeaconKit does not resume by itself, restart BeaconKit
3. in some cases BeaconKit needed a second restart after `reth` was healthy again

Restarting BeaconKit without restarting `reth` was not sufficient on the affected node where `eth_getLogs` was wedged.
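The restart order above can be sketched as a small decision helper, paired with an `eth_getLogs` probe that uses a hard timeout so a wedged call is reported instead of hanging the checker. This is an illustration under assumptions: the one-block `latest` range, the 10-second timeout, and the names `probe_get_logs` and `next_action` are hypothetical, and treating any non-`ok` probe result as a reason to restart `reth` is our reading of the incidents, not a confirmed rule.

```python
import json
import socket
import urllib.error
import urllib.request


def probe_get_logs(url, timeout=10.0):
    """Return "ok", "error", or "timeout" for a one-block eth_getLogs call."""
    payload = json.dumps(
        {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "eth_getLogs",
            # Assumption: a single-block range is enough to expose a wedge.
            "params": [{"fromBlock": "latest", "toBlock": "latest"}],
        }
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.loads(resp.read())
    except urllib.error.URLError as exc:
        # Connect-phase timeouts arrive wrapped in URLError.
        return "timeout" if isinstance(exc.reason, socket.timeout) else "error"
    except socket.timeout:
        # Read-phase timeouts are raised directly.
        return "timeout"
    except (OSError, ValueError):
        return "error"
    return "ok" if "result" in body else "error"


def next_action(reth_advancing, get_logs_status, beaconkit_advancing):
    """Pick the next recovery step, always dealing with reth first."""
    if not reth_advancing or get_logs_status != "ok":
        return "restart reth"
    if not beaconkit_advancing:
        # May need to be repeated; we saw BeaconKit need a second restart.
        return "restart BeaconKit"
    return "none"
```

For example, `next_action(True, probe_get_logs("http://127.0.0.1:8545"), False)` would suggest restarting BeaconKit only when the probe comes back `ok`, and otherwise point at `reth` first.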

## Why this looks like a `reth` issue

The strongest signal was:

- `reth` remained up and still answered simple RPC calls
- but archive/read-heavy RPC behavior degraded sharply before the stall
- `eth_getLogs` was a recurring failure signal
- BeaconKit then lost effective EL/CL coordination
- restarting `reth` cleared the condition

This makes it look like `reth` entered a degraded internal state rather than crashing outright.

## Metrics / patterns seen before the stall

When we looked more closely at metrics from before at least one of the stall windows, we saw:

- sharp latency increase across multiple RPC methods, not just one:
  - `eth_getLogs`
  - `eth_call`
  - `eth_getBlockReceipts`
  - `debug_traceBlockByNumber`
  - even `eth_getBlockByNumber` got slower
- repeated `eth_getLogs` failures before the node flatlined
- archive-style read traffic present before both incidents
- no clear evidence of classic host resource exhaustion:
  - no obvious peer collapse
  - no obvious bandwidth saturation
  - no confirmed RAM exhaustion
  - no confirmed FD exhaustion

So to us it looks like:
- some internal `reth` degradation is triggered or exercised by archive-heavy traffic, especially around log/filter-style paths
- that degraded state eventually breaks CL/EL coordination with BeaconKit
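To catch the "latency rises across several methods at once" pattern earlier, per-method latency samples could be compared against a baseline. The sketch below is one way to do that, assuming latencies are already collected in milliseconds per method; the 3x factor, the three-method cutoff, and the function names are illustrative assumptions rather than tuned values.

```python
import statistics


def degraded_methods(baseline_ms, recent_ms, factor=3.0):
    """Return {method: slowdown_ratio} for methods whose median latency
    grew by at least `factor` relative to the baseline.

    Both arguments map an RPC method name to a list of latency samples
    in milliseconds. Methods missing from either side are skipped.
    """
    flagged = {}
    for method, recent in recent_ms.items():
        base = baseline_ms.get(method)
        if not base or not recent:
            continue
        base_median = statistics.median(base)
        recent_median = statistics.median(recent)
        if base_median > 0 and recent_median >= factor * base_median:
            flagged[method] = recent_median / base_median
    return flagged


def is_broad_degradation(flagged, min_methods=3):
    """True when several methods slowed down together, not just one."""
    return len(flagged) >= min_methods
```

A single slow method (for example one heavy `eth_getLogs` query) would not trip `is_broad_degradation`, while simultaneous slowdowns in `eth_getLogs`, `eth_call`, and `eth_getBlockByNumber` would.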

## Questions

- Is this a known failure mode in `bera-reth` or upstream `reth`?
- Does this line up with any known issues around:
  - `eth_getLogs`
  - filter/log handling
  - long-lived read transactions
  - Engine API responsiveness while the node still reports `eth_syncing=false`
- Are there metrics or logs we should capture next time to narrow this further?
- Is there a recommended mitigation for archive nodes besides process restart?

## Additional note

We can provide more exact logs / timings if useful, but I wanted to first report the pattern because it repeated across nodes and required `reth` restart to recover.

bera-reth stalls while still appearing healthy #232
