bera-reth stalls while still appearing healthy #232

@jonathanudd

We hit the same failure mode twice on Berachain mainnet archive nodes, and it looks like an execution-side problem rather than a BeaconKit-only one.

The reth nodes look healthy (eth_syncing=false, peers connected), yet BeaconKit's Engine API calls time out and eth_getLogs degrades.

Summary

On two separate occasions, bera-reth stopped making forward progress while still appearing superficially healthy:

  • peers still connected
  • eth_syncing returned false
  • basic RPCs like eth_blockNumber still responded
  • but BeaconKit stopped receiving successful EL responses and eventually flatlined
  • on at least one node, local eth_getLogs became hung/broken until reth was restarted

Restarting reth restored progress. Restarting BeaconKit alone did not reliably recover the node.
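Since eth_syncing=false and a responsive eth_blockNumber did not reveal the stall, a liveness check probably has to compare the head height over time. A minimal sketch of such a watchdog follows; the class name, the threshold, and the endpoint URL in the usage comment are our own assumptions, not anything provided by bera-reth.

```python
import json
import urllib.request
from typing import Optional


def rpc_call(url: str, method: str, params: list, timeout: float = 5.0):
    """POST a JSON-RPC 2.0 request and return its 'result' field."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]


class HeadWatchdog:
    """Flags a node whose head height stops advancing (hypothetical helper)."""

    def __init__(self, max_stall_s: float):
        self.max_stall_s = max_stall_s
        self.last_height: Optional[int] = None
        self.last_change: Optional[float] = None

    def observe(self, height: int, now: float) -> bool:
        """Record one sample; return True once the head has been flat too long."""
        if self.last_height is None or height > self.last_height:
            self.last_height = height
            self.last_change = now
            return False
        return (now - self.last_change) > self.max_stall_s


# Usage sketch (URL and 60 s threshold are assumptions for illustration):
#   import time
#   dog = HeadWatchdog(max_stall_s=60)
#   height = int(rpc_call("http://127.0.0.1:8545", "eth_blockNumber", []), 16)
#   stalled = dog.observe(height, time.monotonic())
```

The point of the design is that the alert depends only on head progress, so it still fires when every other health signal listed above looks normal.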

Incident windows

We have seen this occasionally since around 2026-03-22. Two specific occurrences were:

  • 2026-03-22 06:56 UTC
  • 2026-03-22 23:39 UTC

We saw the same or very similar pattern on two different nodes.

Main symptoms

Execution-layer side:

  • reth stopped advancing the canonical chain height
  • eth_syncing still returned false
  • peers remained connected
  • basic RPCs still answered
  • local eth_getLogs either failed heavily or stopped responding
  • in one case, reth warned that the beacon client was online but no consensus updates were being received

Consensus-layer side:

  • BeaconKit started timing out on local Engine API calls to reth
  • BeaconKit replay / deposit-reading path got stuck
  • after reth restart, BeaconKit sometimes needed one or more restarts to fully recover

Example BeaconKit-side errors we saw:

  • engine API call timed out
  • deposit-reading / filter-related errors during recovery
  • later forkchoice / replay stalls

Recovery

What consistently helped was:

  1. restart reth
  2. if BeaconKit does not resume by itself, restart BeaconKit
  3. in some cases BeaconKit needed a second restart after reth was healthy again

Restarting BeaconKit without restarting reth was not sufficient on the affected node where eth_getLogs was wedged.
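For anyone wanting to script the same order of operations, here is a rough sketch of the sequence described above. The restart and progress-check callables are placeholders that the operator has to supply (for example wrappers around their own service manager); nothing here is a bera-reth or BeaconKit interface.

```python
from typing import Callable, List, Tuple


def recover(
    restart_el: Callable[[], None],
    restart_cl: Callable[[], None],
    cl_is_advancing: Callable[[], bool],
    max_cl_restarts: int = 2,
) -> Tuple[List[str], bool]:
    """Restart reth first, then BeaconKit only if it does not resume.

    `cl_is_advancing` is expected to wait long enough to judge progress
    before returning; how it does that is left to the operator.
    """
    actions = ["restart reth"]
    restart_el()
    for _ in range(max_cl_restarts):
        if cl_is_advancing():
            break
        restart_cl()
        actions.append("restart BeaconKit")
    # Final check so the caller knows whether manual follow-up is needed.
    return actions, cl_is_advancing()
```

The default of two BeaconKit restarts mirrors what we observed; it is not a recommended limit.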

Why this looks like a reth issue

The strongest signal was:

  • reth remained up and still answered simple RPC calls
  • but archive/read-heavy RPC behavior degraded sharply before the stall
  • eth_getLogs was a recurring failure signal
  • BeaconKit then lost effective EL/CL coordination
  • restarting reth cleared the condition

This makes it look like reth entered a degraded internal state rather than crashing outright.

Metrics / patterns seen before the stall

When we looked more closely at metrics before at least one of the stall windows, we saw:

  • sharp latency increase across multiple RPC methods, not just one:
    • eth_getLogs
    • eth_call
    • eth_getBlockReceipts
    • debug_traceBlockByNumber
    • even eth_getBlockByNumber got slower
  • repeated eth_getLogs failures before the node flatlined
  • archive-style read traffic present before both incidents
  • no clear evidence of classic host resource exhaustion:
    • no obvious peer collapse
    • no obvious bandwidth saturation
    • no confirmed RAM exhaustion
    • no confirmed FD exhaustion

So to us it looks like:

  • some internal reth degradation is triggered or exercised by archive-heavy traffic, especially around log/filter-style paths
  • that degraded state eventually breaks CL/EL coordination with BeaconKit
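One way to turn the "latency rises across several methods at once" pattern into an early warning is to compare recent per-method latency against a baseline. The sketch below assumes latency samples are already being collected per RPC method; the function names, the 3x factor, and the three-method threshold are arbitrary choices for illustration, not tuned values.

```python
from statistics import median
from typing import Dict, List


def degraded_methods(
    baseline: Dict[str, List[float]],
    recent: Dict[str, List[float]],
    factor: float = 3.0,
) -> List[str]:
    """Return methods whose recent median latency exceeds factor x baseline."""
    flagged = []
    for method, base_samples in baseline.items():
        recent_samples = recent.get(method)
        if not base_samples or not recent_samples:
            continue  # not enough data to compare this method
        if median(recent_samples) > factor * median(base_samples):
            flagged.append(method)
    return sorted(flagged)


def is_broad_degradation(flagged: List[str], min_methods: int = 3) -> bool:
    """True when several methods degrade together rather than just one."""
    return len(flagged) >= min_methods
```

Using the median rather than the mean keeps a single slow archive query from triggering the alert on its own.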

Questions

  • Is this a known failure mode in bera-reth or upstream reth?
  • Does this line up with any known issues around:
    • eth_getLogs
    • filter/log handling
    • long-lived read transactions
    • Engine API responsiveness while the node still reports eth_syncing=false
  • Are there metrics or logs we should capture next time to narrow this further?
  • Is there a recommended mitigation for archive nodes besides process restart?

Additional note

We can provide more exact logs / timings if useful, but I wanted to first report the pattern because it repeated across nodes and required reth restart to recover.
