feat(consensus): report catching_up when fallen behind peers#40
Merged
Conversation
The catching_up field in /status is derived solely from the consensus
reactor's WaitSync() latch. WaitSync is set once at startup and cleared the
first time the node switches from block-sync to the consensus reactor, after
which it stays false for the remainder of the process. As a result the field
only ever reflects the initial block-sync and never reflects a node that later
stops making progress relative to its peers, so such a node keeps reporting
catching_up=false while operators and downstream consumers treat it as fully
synced.
Add Reactor.IsBehind(), evaluated from peer-reported heights, and OR it into
catching_up:
CatchingUp = WaitSync() || IsBehind()
IsBehind reports true when a connected peer is more than catchup_lag_threshold
blocks ahead of our consensus height, or when the connected-peer count is below
min_expected_peers. The decision uses peer heights rather than local block-time
staleness on purpose: staleness cannot distinguish a node that is behind its
peers (some peer is ahead) from a network in which every node has legitimately
stopped at the same height (no peer is ahead). Only the former is catching up;
reporting the latter as catching up would be a false positive.
The lag condition must hold continuously for catchup_debounce_duration before
catching_up flips to true, which damps flapping at the threshold boundary; the
transition back to false is immediate.
New [consensus] config (with defaults):
- catchup_lag_threshold = 5 (0 disables; must be >=2 to absorb the
one-height round skew between synced peers)
- min_expected_peers = 0 (off by default so single-node deployments
stay healthy)
- catchup_debounce_duration = 10s
WaitSync is left untouched: it gates consensus message handling, so IsBehind is
an independent read-only signal rather than a re-arming of that latch. The
offline inspect server's reactor stub returns IsBehind()=false.
Tests cover the lag decision (peer ahead beyond/at/just over the threshold,
equal-height, one-height skew, no peer height learned, threshold disabled,
single-node zero-peer, and the min-peers guard on/off) and the debounce state
machine.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cffls
reviewed
Jun 2, 2026
Member
|
Should this kind of behaviour reported upstream? They would have faced this issue somewhere no? Worth confirming why they have deliberately chosen this behaviour. |
vbhattaccmu
reviewed
Jun 3, 2026
A node that has finished initial block-sync but later ends up with no peers cannot observe whether the network has advanced past its height. The earlier approach gated this on a `min_expected_peers` config value (default 0), which had to be raised per network to have any effect. That encoded network-specific assumptions in configuration, did not reach already-initialized nodes on a binary upgrade (the key is absent from their config.toml), and could not be exercised on a local devnet whose chain id differs from the production networks. Replace the config knob with a self-determining rule: a node with zero peers reports catching_up=true unless it is the sole validator, in which case it can finalize blocks on its own and needs no peers to make progress. Sole-validator status uses the same local-validator membership test as the signing path (Validators.HasAddress on the private validator's address) and is read live from consensus state on each call, so it follows validator-set changes the node has committed into local round state without any cached value or operator action. A non-validator node has no private validator and so always needs peers, and is reported accordingly. This requires no configuration and is network-agnostic: the behavior is identical on mainnet, testnet, and local devnets, and ships entirely in the binary with no per-network value to deploy. Details: - consensus/reactor.go: evaluateBehind takes an isSoleValidator bool instead of a peer-count threshold; isBehindRaw sources it from State.isLocalSoleValidator. - consensus/state.go: add isLocalSoleValidator, read under the state lock. - config: remove the MinExpectedPeers field, its default, its ValidateBasic check, and the config.toml template entry. ReactorCatchupConfig and its node wiring drop the corresponding argument. - config: clarify that catchup_lag_threshold = 0 disables peer-height lag detection only; the zero-peer rule applies independently of it. - consensus/reactor_catchup_test.go: table updated for the sole-validator input. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Member
Author
|
@claude review |
vbhattaccmu
approved these changes
Jun 5, 2026
…p lag IsBehind() previously compared our height to the single highest peer height. Peer heights come from NewRoundStepMessage gossip, which is only shape-checked by ValidateBasic and is not proof a peer has committed or verified blocks at that height, so one peer reporting an inflated height could push catching_up=true on a healthy node. The signal is fail-safe for safety (it can only over-report "catching up", never forge progress or admit invalid blocks), but it feeds readiness/health probes and the execution client's sync gate, so a false positive is an availability/liveness concern. Decide the lag from a majority of connected peers instead of the maximum, and require at least minCorroboratingPeers (2) of them before trusting the signal: with a single peer that peer is a trivial majority, so a lone peer can no longer drive the field. Genuine lag is unaffected — a node that has truly fallen behind sees a majority of its peers ahead. A node with fewer than two peers abstains from the height-lag check; the separate zero-peer / sole-validator rule still covers the no-peer case. The residual is an attacker controlling a majority of a node's direct peers, which is the eclipse threat that already underlies p2p; a verified or voting-power-weighted height would be stronger but amounts to re-deriving block-sync and is out of scope here. Details: - consensus/reactor.go: evaluateBehind takes the per-peer heights and trips only when 2*ahead > len(peerHeights) with len >= minCorroboratingPeers; isBehindRaw collects the per-peer heights. - config: clarify that catchup_lag_threshold gates a majority of at least two peers, and that 0 disables peer-height lag detection only. - consensus/reactor_catchup_test.go: cover single-peer-minority, two-peer split, both-of-two ahead, and lone-peer-cannot-corroborate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
marcello33
reviewed
Jun 5, 2026
marcello33
left a comment
Collaborator
There was a problem hiding this comment.
Left some inline comments.
Also, as per primer doc: in phase 2 (not this PR) we mention to deliver a network-aware min_expected_peers = 1 default via Heimdall’s initCometBFTConfig. AFAICT we removed that field no? Shall we update the doc or apply any code change?
Member
Author
|
@marcello33 - updated the primer doc with a new version. Check here. Thanks! |
… kill Extends coverage of the IsBehind / catching_up path so the decision logic is pinned by fast, deterministic unit tests rather than relying on the heavy consensus suite. - Extract the per-peer height-gathering loop out of isBehindRaw into a pure collectPeerHeights helper, mirroring the existing evaluateBehind / applyDebounceLocked pattern, so the aggregation loop and the PeerState type assertion are testable without p2p plumbing. - consensus: add TestCollectPeerHeights (normal heights, peer with no PeerState, peer with a wrong-type value at the key), TestIsLocalSoleValidator (sole validator, sole set but not us, larger set, nil key, nil set), and TestReactorCatchupConfig. - config: cover the new catchup_lag_threshold / catchup_debounce_duration validation rules in ValidateBasic and assert their defaults. - inspect/rpc: pin the offline checker (WaitSync / IsBehind report synced). diffguard --base develop mutation testing goes from 78.7% to 100% (47/47, all tiers 100%). Adding the direct ReactorCatchupConfig test also removed a flake: with no direct test, that option setter's mutant kill depended on an unrelated flaky test in the consensus suite and flipped between runs. Fix the stale "max peer heights" comment on isBehindRaw left over from the raw-max implementation; it gathers all peer heights now. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Covers the live wiring that the pure-helper tests don't: IsBehind driving a real (non-started) Switch's peer set through collectPeerHeights into evaluateBehind, plus the sole-validator and debounce paths, end to end. Builds a Reactor with mock peers at chosen heights, a round state at a chosen height, and catchUpDebounce=0 so the result is the raw decision with no timing. Cases mirror the operational topologies: - majority of peers ahead -> behind (the partition / fell-behind case), - synced multi-peer network -> not behind (no false positive on a healthy multi-node network), - lone peer ahead -> not behind (cannot corroborate), - no peers and not the sole validator -> behind (isolated node), - no peers but the sole validator -> not behind (finalizes alone). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reword the ValidateBasic error for catchup_lag_threshold == 1 from "to absorb round skew" to "(margin beyond the one-height round skew)". Rejecting 1 isn't required to absorb a one-height skew; the point is to keep the smallest enabled threshold above the routine one-height proposer/commit difference between synced peers. Message text only; no logic change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Collaborator
|
LGTM |
marcello33
approved these changes
Jun 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Summary
/status'scatching_upis derived solely from the consensus reactor'sWaitSync()latch.WaitSyncis set once at startup and cleared the first time the node switches from block-sync to the consensus reactor, after which it staysfalsefor the rest of the process. So the field only ever reflects the initial block-sync and never reflects a node that later stops making progress relative to its peers — that node keeps answeringcatching_up=falseeven while it has fallen well behind, and anything that reads the field as a readiness/health signal (load balancers, k8s probes, dashboards, and in-process consumers) treats it as fully synced.This PR adds
Reactor.IsBehind(), evaluated from peer-reported heights, and folds it into the field:IsBehind()istruewhen a majority of at least two connected peers report a height more thancatchup_lag_thresholdblocks ahead of our consensus height, or when the node has no peers and is not the sole validator (it cannot make progress on its own, so it cannot establish that it is current).It uses peer heights, not local block-time staleness, on purpose. A bare staleness check cannot distinguish:
Reporting the second case as catching up would be a false positive, so height comparison is the discriminator rather than "my last block is old".
Peer heights are unverified gossip (
NewRoundStepMessage;ValidateBasicchecks shape only), so the lag decision requires a majority of at least two connected peers to corroborate it rather than trusting a single peer's max — no single peer (and no lone peer, which can't corroborate) advertising an inflated height can drive the signal (see Known limitations).The zero-peer case is self-determining (no configuration): a node with no peers reports
catching_up=trueunless it is the only validator in the set, in which case it finalizes blocks on its own and needs no peers. Sole-validator status uses the same local-validator membership test as the signing path and is read live from consensus state, so it follows validator-set changes the node has committed into local round state and behaves identically on mainnet, testnet, and local devnets — no network-specific config, and the scenario is testable on any kurtosis network id.A short debounce (
catchup_debounce_duration) requires the lag condition to hold continuously beforecatching_upflips totrue, damping flapping at the threshold boundary; the transition back tofalseis immediate.New
[consensus]config (defaults)catchup_lag_threshold50disables peer-height lag detection only — the zero-peer rule still applies. Must be>=2when enabled, to absorb the normal one-height round skew between synced peers.catchup_debounce_duration10strue.Both are additive with their defaults; behaviour only changes once a majority of (≥2) peers are
>thresholdahead, or the node has no peers while other validators exist. The zero-peer rule needs no config.Implementation notes / design decisions
status.go.IsBehind()is a new read-only method on*Reactor; the RPCconsensusReactorinterface gains it andStatusORs it intoCatchingUp.WaitSyncis left untouched. It also gates consensus message handling (reactor.gochannel handlers), so re-arming it to express "behind" would disable consensus rather than just relabel health.IsBehind()is therefore a separate signal.evaluateBehindtrips only when more than half of the connected peers are beyond the threshold and there are at leastminCorroboratingPeers(2) of them — a lone peer is a trivial "majority", so it's excluded. This is fail-safe (the signal can only over-report "catching up", never forge progress) and removes the single-peer lever.State.isLocalSoleValidator, the same membership test as the signing path). No per-network configuration; testable on any devnet regardless of chain id; a non-validator node has no private validator and so always needs peers.evaluateBehind,applyDebounceLocked,collectPeerHeights) so the branching and the peer-height aggregation are unit-testable without p2p plumbing;isBehindRawonly wires them to the liveSwitchpeers and the sole-validator flag.inspectserver stubsIsBehind() => false.Executed tests
Unit (
consensus/reactor_catchup_test.go,config/config_test.go,inspect/rpc/rpc_test.go)evaluateBehind— majority far ahead, single peer is a minority, majority of three, two-peer split, both-of-two ahead, lone peer cannot corroborate (incl. far ahead), within threshold, one over threshold, equal height (network halt), one-height round skew, no peer height learned, threshold disabled, sole-validator zero-peer, non-sole zero-peer, zero-peer behind with lag disabled.applyDebounceLocked— within-debounce stays false, past-debounce flips true, transient blip resets the timer, zero-debounce immediate, not-behind resets state.collectPeerHeights— collects every height in order, no peers → empty, skips a peer with noPeerState, skips a peer with a wrong-type value at the key (covers the aggregation loop + thepeer.Get(...)assertion).isLocalSoleValidator— sole & we're it, sole but not us, larger set, nil key, nil validator set.ReactorCatchupConfig— option setter assigns lag threshold + debounce.ConsensusConfig.ValidateBasic—catchup_lag_threshold0 / 1 / 2 / negative,catchup_debounce_durationzero / negative; plus the shipped defaults (5,10s).inspectchecker —WaitSync/IsBehindboth report synced.Integration
TestIsBehindIntegrationdrives the fullIsBehind()path through a (non-started)Switchpopulated with mock peers + a minimalState(debounce 0): majority of peers ahead → behind; synced multi-peer → not behind; lone peer ahead → not behind; zero-peer non-sole → behind; zero-peer sole → not behind.Mutation
diffguard --base develop(diff mode — mutates only the changed functions across the touched files, includingstatus.go): 100% (47/47), tier-1 logic 100%, tier-2 100%, tier-3 100%. Adding the directReactorCatchupConfigtest also removed a flake where that mutant's kill depended on a flaky consensus-suite test and flip-flopped between runs. (diffguard's overall verdict still reports FAIL on pre-existing items not introduced here:consensus/reactor.gois a 2066-line upstream file over the 500-line cap, plus churn/SDP warnings on upstream code.)End-to-end (Kurtosis, 5-validator devnet)
v0.7.1image and spawned a 5-validator devnet. Isolated one validator's Heimdall P2P: after cometbft evicted the now-dead peers to 0,catching_upflippedtrue(zero-peer rule); on reconnect the node reportedtruewhile blocksyncing ~100 blocks behind its peers (lag branch), then returned tofalseonce it caught up. The control node reportedcatching_up=falsethroughout — no false positive. Recovery was automatic on heal.Build / lint
go build ./consensus/... ./config/... ./rpc/... ./node/... ./inspect/...— clean.go test ./consensus/ ./config/ ./node/ ./inspect/rpc/— pass.gofmt -l— clean.golangci-lint/gofumpt/ diffguard mutation enforced by CI.Rollout notes
/statusfield; it does not change block validity, state transition, app hash, or wire format.catching_upunchanged for synced nodes; single-validator / single-node deployments are unaffected (the sole validator advances without peers).catchup_lag_threshold/catchup_debounce_durationaway from the defaults.Known limitations
NewRoundStepMessage(ValidateBasicchecks shape only), so they are not proof a peer committed/verified those blocks, and a peer that stays connected but stops sending round-step updates (e.g. a half-open TCP connection before p2p eviction) leaves a stale height until eviction. The signal is fail-safe (it can only pushcatching_uptowardtrue/ not-ready, never forge progress or admit invalid blocks), and the majority-of-≥2 requirement means a single (or minority of) misbehaving peer — or a lone peer — cannot drive it. The residual is an attacker controlling a majority of a node's direct peers (eclipse-level) — but a validator's direct peers in the recommended sentry topology are its own operator-run sentries; the exposed surface is open-peering (pex=true) RPC/full nodes, where the impact is availability only. A verified / quorum-of-voting-power height would be stronger but amounts to re-deriving block-sync, so it's out of scope here.catching_upflipped ~1.5 min after the cut (eviction + the 10s debounce), not seconds. This is fail-safe, but it bounds how quickly a downstream consumer (e.g. a Bor seal-gate) can react to an abrupt isolation; tightening it would mean tuning the p2p liveness timeouts, which is out of scope here.IsBehind()is read-triggered. The debounce state advances on/statusreads, so a sparse poller can under-report sustained lag until a second poll occurring at leastcatchup_debounce_durationafter the first lagging observation. Reactor-side sampling/caching is a possible follow-up if/statussemantics need to be wall-clock accurate independent of callers.Open design decisions (feedback wanted)
catchup_lag_thresholddefault (5). Right value? And should the threshold be height-based (current) or time-derived (e.g. "behind by more than N block-times")? Height-based is deterministic and simpler; time-based is more portable across block-time configs.Resolved — replaced with the self-determining sole-validator rule above (no config knob), chosen so the zero-peer scenario is testable on any kurtosis network id rather than hardcoding network names.min_expected_peersdefault.catchup_debounce_durationdefault (10s). Worth tuning against typical block time; also whether the debounce should be symmetric (currently fast-to-recover, slow-to-assert).IsBehind()optionally accept that as an additional input, or is that better layered downstream of this method?/statusfield and/or a metric, in addition to folding intocatching_up, so consumers can distinguish "initial block-sync" from "fell behind after sync"?minCorroboratingPeers(2). Hardcoded minimum peers before peer-height lag is trusted. Is2right, or should it be higher (e.g.3) for stronger corroboration on well-connected nodes?