[KLC-1920] fix: force not-synced when gossip is ahead of probable highest nonce #69
nickgs1337 wants to merge 4 commits into
No actionable comments were generated in the recent review. 🎉
Walkthrough
Adds `HighestNonceReceived()` to the `ForkDetector` API and mock, implements it in `baseForkDetector` with a guarded setter, updates `computeNodeState` to mark the node not synced when gossip is ahead by more than `BlockFinality`, and adds regression tests and a test harness reproducing KLC-1920.
Changes
Fork detector highest nonce tracking
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 8 passed
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
core/process/sync/baseSync.go (1)
295-301: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Apply the gossip-gap guard when `currentHeader` is nil too.
Line 296 can still mark `boot.hasLastBlock=true` using only `ProbableHighestNonce`, so a genesis-only node can be misclassified as synced while gossip is already ahead by more than `BlockFinality`.
Proposed fix

     lastNonce := genesisNonce
     lastSlot := boot.chainHandler.GetGenesisHeader().GetSlot()
    -if check.IfNil(currentHeader) {
    -	boot.hasLastBlock = boot.forkDetector.ProbableHighestNonce() == genesisNonce
    -	log.Debug("computeNodeState",
    -		"probableHighestNonce", boot.forkDetector.ProbableHighestNonce(),
    -		"currentBlockNonce", nil,
    -		"boot.hasLastBlock", boot.hasLastBlock)
    -} else {
    +currentBlockNonce := genesisNonce
    +if check.IfNil(currentHeader) {
    +	// keep genesis nonce
    +} else {
     	lastNonce = currentHeader.GetNonce()
     	lastSlot = currentHeader.GetSlot()
    -	currentBlockNonce := boot.chainHandler.GetCurrentBlockHeader().GetNonce()
    -	probableHighestNonce := boot.forkDetector.ProbableHighestNonce()
    -	highestNonceReceived := boot.forkDetector.HighestNonceReceived()
    -	boot.hasLastBlock = probableHighestNonce <= currentBlockNonce
    -	// KLC-1920: gossip-derived ceiling is the source of truth that
    -	// probableHighestNonce can lag behind when the BHReceived path is
    -	// disrupted (peer churn after an election, fallback observer not
    -	// receiving fetched headers). If gossip reports the network ahead
    -	// by more than the normal proposal/commit window, the node is not
    -	// really synced even if probableHighestNonce equals currentBlockNonce.
    -	if highestNonceReceived > currentBlockNonce+process.BlockFinality {
    -		boot.hasLastBlock = false
    -	}
    -	log.Debug("computeNodeState",
    -		"probableHighestNonce", probableHighestNonce,
    -		"highestNonceReceived", highestNonceReceived,
    -		"currentBlockNonce", currentBlockNonce,
    -		"boot.hasLastBlock", boot.hasLastBlock)
    +	currentBlockNonce = currentHeader.GetNonce()
     }
    +probableHighestNonce := boot.forkDetector.ProbableHighestNonce()
    +highestNonceReceived := boot.forkDetector.HighestNonceReceived()
    +boot.hasLastBlock = probableHighestNonce <= currentBlockNonce
    +if highestNonceReceived > currentBlockNonce &&
    +	highestNonceReceived-currentBlockNonce > uint64(process.BlockFinality) {
    +	boot.hasLastBlock = false
    +}
    +log.Debug("computeNodeState",
    +	"probableHighestNonce", probableHighestNonce,
    +	"highestNonceReceived", highestNonceReceived,
    +	"currentBlockNonce", currentBlockNonce,
    +	"boot.hasLastBlock", boot.hasLastBlock)

Also applies to: 304-316
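The proposed fix above swaps the additive comparison (`highest > current+BlockFinality`) for a subtraction guarded by `highest > current`. A minimal standalone sketch of why the two forms can differ is shown below; `gossipAhead` and `naiveGossipAhead` are hypothetical helper names, not functions from the repository, and the tolerance of 1 is only an illustrative value.

```go
package main

import "fmt"

// gossipAhead reports whether highest exceeds current by more than tolerance.
// The subtraction only runs when highest > current, so it cannot underflow,
// and no addition is performed that could wrap around.
// Hypothetical helper, not part of the klever-go codebase.
func gossipAhead(current, highest, tolerance uint64) bool {
	return highest > current && highest-current > tolerance
}

// naiveGossipAhead is the additive form. current+tolerance wraps to a small
// number when current is close to the uint64 maximum.
func naiveGossipAhead(current, highest, tolerance uint64) bool {
	return highest > current+tolerance
}

func main() {
	fmt.Println(gossipAhead(0, 1, 1))  // gap equals tolerance: guard does not trip
	fmt.Println(gossipAhead(0, 2, 1))  // gap exceeds tolerance: guard trips
	fmt.Println(gossipAhead(10, 5, 1)) // gossip behind the tip: guard does not trip

	maxNonce := ^uint64(0)
	// The additive form wraps and trips; the subtraction form does not.
	fmt.Println(naiveGossipAhead(maxNonce, 5, 1), gossipAhead(maxNonce, 5, 1))
}
```

Real nonces are nowhere near the `uint64` maximum, so the difference is defensive rather than practical; the more relevant effect of the proposed fix is that the same check also covers the genesis-only case, where `current` is the genesis nonce.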
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/process/sync/baseSync.go` around lines 295 - 301, When currentHeader is nil you still set boot.hasLastBlock based only on boot.forkDetector.ProbableHighestNonce() which can misclassify a genesis-only node as synced; change the check so boot.hasLastBlock is set true only if the gossip gap is within BlockFinality (e.g. boot.forkDetector.ProbableHighestNonce() - genesisNonce <= BlockFinality). Apply the same gossip-gap guard to the existing branch that handles non-nil currentHeader (the logic around check.IfNil(currentHeader), boot.hasLastBlock, boot.forkDetector.ProbableHighestNonce(), and genesisNonce) so both paths use the BlockFinality comparison before marking hasLastBlock.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@core/process/sync/baseForkDetector.go`:
- Around line 307-314: The highestNonceReceived updates can regress under
concurrency; update the implementation so reads/writes are atomic and monotonic:
replace the mutable field usage in baseForkDetector with an atomic uint64 (use
atomic.LoadUint64 for HighestNonceReceived() and an atomic CAS loop to set the
new value only if greater) or protect the update path in highestNonceReceived
(and anywhere that assigns to that field, e.g., the writer path currently doing
a compare outside the lock) by moving the compare-and-set inside the same mutex.
In short, ensure highestNonceReceived uses atomic.LoadUint64 for reads and an
atomic CompareAndSwap/Store-max loop (or a locked section) to only increase the
stored value.
In `@core/process/sync/klc1920_repro_test.go`:
- Line 86: The test currently ignores the error returned by bfd.AddHeader in the
repro loop; update the loop to check and fail on that error (e.g., using
t.Fatalf/t.Fatal or the test assertion helper in use) so BHProposed insertion
failures cause the test to fail. Locate the call to
bfd.AddHeader(&block.Block{Header: hdr}, hash, process.BHProposed, nil, nil) and
replace the silent discard with error handling that reports the error and aborts
the test.
---
Outside diff comments:
In `@core/process/sync/baseSync.go`:
- Around line 295-301: When currentHeader is nil you still set boot.hasLastBlock
based only on boot.forkDetector.ProbableHighestNonce() which can misclassify a
genesis-only node as synced; change the check so boot.hasLastBlock is set true
only if the gossip gap is within BlockFinality (e.g.
boot.forkDetector.ProbableHighestNonce() - genesisNonce <= BlockFinality). Apply
the same gossip-gap guard to the existing branch that handles non-nil
currentHeader (the logic around check.IfNil(currentHeader), boot.hasLastBlock,
boot.forkDetector.ProbableHighestNonce(), and genesisNonce) so both paths use
the BlockFinality comparison before marking hasLastBlock.
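The monotonic-update concern raised for `highestNonceReceived` can be illustrated with a compare-and-swap loop. This is a sketch only, assuming Go 1.19+ for `atomic.Uint64`; `nonceCeiling`, `Observe`, and `Highest` are hypothetical names, and the PR itself may instead keep the compare-and-set under the existing mutex.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// nonceCeiling tracks the highest nonce observed so far.
// Hypothetical type for illustration; not the baseForkDetector field layout.
type nonceCeiling struct {
	highest atomic.Uint64
}

// Observe raises the stored value to nonce if nonce is larger.
// The CAS loop retries when another goroutine changed the value in between,
// so the stored value can only increase.
func (c *nonceCeiling) Observe(nonce uint64) {
	for {
		cur := c.highest.Load()
		if nonce <= cur {
			return // nothing to do: stored value is already at least nonce
		}
		if c.highest.CompareAndSwap(cur, nonce) {
			return
		}
	}
}

// Highest returns the current ceiling with an atomic load.
func (c *nonceCeiling) Highest() uint64 {
	return c.highest.Load()
}

func main() {
	var c nonceCeiling
	var wg sync.WaitGroup
	for n := uint64(1); n <= 100; n++ {
		wg.Add(1)
		go func(v uint64) {
			defer wg.Done()
			c.Observe(v)
		}(n)
	}
	wg.Wait()
	fmt.Println(c.Highest()) // 100, regardless of goroutine scheduling order
}
```

Holding one mutex across both the comparison and the assignment gives the same guarantee and may fit better when the field is already protected by a lock for other reasons.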
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e6680200-113a-4926-8790-4fe62c354ba8
📒 Files selected for processing (5)
- common/mock/forkDetectorMock.go
- core/process/interface.go
- core/process/sync/baseForkDetector.go
- core/process/sync/baseSync.go
- core/process/sync/klc1920_repro_test.go
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (Custom checks)
**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.
Files:
- common/mock/forkDetectorMock.go
- core/process/sync/baseForkDetector.go
- core/process/interface.go
- core/process/sync/klc1920_repro_test.go
- core/process/sync/baseSync.go
**/*_test.go
⚙️ CodeRabbit configuration file
**/*_test.go: Test files. Review for:
- Adequate coverage of edge cases and error paths
- Proper use of test helpers and assertions
- Race condition coverage (tests should use -race flag patterns)
- No hardcoded sleep for synchronization (use channels or sync primitives)
- Test isolation (no shared mutable state between tests)
Files:
core/process/sync/klc1920_repro_test.go
🧠 Learnings (7)
📓 Common learnings
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 879
File: core/statistics/tpsBenchmark.go:167-170
Timestamp: 2024-11-19T20:43:36.454Z
Learning: In the `core/statistics/tpsBenchmark.go` file, within the `updateStatistics` method, note that `Header.Nonce` can be zero (e.g., for the genesis block), so adding a zero nonce check is unnecessary.
📚 Learning: 2026-04-21T20:12:22.959Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 38
File: indexer/eventsProcessor.go:188-211
Timestamp: 2026-04-21T20:12:22.959Z
Learning: In Go structs that are JSON-marshaled, if a field is a `bool` and has the `json:"...,omitempty"` tag, then leaving that field at its zero value (`false`) is functionally equivalent (in the resulting JSON) to explicitly setting `Foundation: false`. Reviewers should not flag struct literals that omit such `bool` fields as an inconsistency; they will serialize identically because `omitempty` suppresses `false` values.
Applied to files:
- common/mock/forkDetectorMock.go
- core/process/sync/baseForkDetector.go
- core/process/interface.go
- core/process/sync/klc1920_repro_test.go
- core/process/sync/baseSync.go
📚 Learning: 2026-05-23T22:52:58.065Z
Learnt from: fbsobreira
Repo: klever-io/klever-go PR: 65
File: data/blockchain/blockchain.go:170-172
Timestamp: 2026-05-23T22:52:58.065Z
Learning: In Go, the pattern `append([]byte(nil), src...)` should be treated as preserving nil identity when `src` is a nil `[]byte`: spreading a nil slice contributes zero variadic arguments, so `append` performs no allocation and returns the original nil destination slice unchanged (i.e., result is nil, not an empty non-nil slice). Do not flag this as an incorrect empty-slice conversion; it intentionally maintains `nil`.
Applied to files:
- common/mock/forkDetectorMock.go
- core/process/sync/baseForkDetector.go
- core/process/interface.go
- core/process/sync/klc1920_repro_test.go
- core/process/sync/baseSync.go
📚 Learning: 2024-11-19T20:43:36.454Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 879
File: core/statistics/tpsBenchmark.go:167-170
Timestamp: 2024-11-19T20:43:36.454Z
Learning: In the `core/statistics/tpsBenchmark.go` file, within the `updateStatistics` method, note that `Header.Nonce` can be zero (e.g., for the genesis block), so adding a zero nonce check is unnecessary.
Applied to files:
- core/process/sync/baseForkDetector.go
- core/process/sync/klc1920_repro_test.go
- core/process/sync/baseSync.go
📚 Learning: 2026-04-07T14:36:46.394Z
Learnt from: RomuloSiebra
Repo: klever-io/klever-go PR: 35
File: network/p2p/libp2p/peerid_stability_test.go:100-116
Timestamp: 2026-04-07T14:36:46.394Z
Learning: In `network/p2p/libp2p/peerid_stability_test.go` (Go, klever-go repo), the empty-seed tests (`TestCreateP2PPrivKey_EmptySeed_NoError` and `TestCreateP2PPrivKey_EmptySeed_LegacySeed_NoError`) intentionally only assert `NoError/NotNil`. A previous `NotEqual` assertion across two `crypto/rand`-backed calls was deliberately removed because it is a probabilistic assertion that re-verifies OS/stdlib entropy rather than project logic. Do not suggest adding `NotEqual` comparisons for empty-seed / `crypto/rand` paths in this codebase.
Applied to files:
core/process/sync/klc1920_repro_test.go
📚 Learning: 2024-11-18T18:23:56.020Z
Learnt from: fbsobreira
Repo: klever-io/klever-go PR: 877
File: core/process/transaction/txProcess_test.go:2513-2516
Timestamp: 2024-11-18T18:23:56.020Z
Learning: In the test function `TestTxProcessor_ProcessClaimAllowanceKFIShouldFail` in `core/process/transaction/txProcess_test.go`, both validation and processing steps should be tested even when validation is expected to fail, to ensure that the processing correctly handles invalid inputs.
Applied to files:
core/process/sync/klc1920_repro_test.go
📚 Learning: 2024-11-18T18:25:28.208Z
Learnt from: fbsobreira
Repo: klever-io/klever-go PR: 877
File: core/process/transaction/txProcess_test.go:2481-2484
Timestamp: 2024-11-18T18:25:28.208Z
Learning: In the `TestTxProcessor_ProcessClaimAllowanceOkValsShouldWork` function in `core/process/transaction/txProcess_test.go`, the transaction processing and assertions are already included.
Applied to files:
core/process/sync/klc1920_repro_test.go
🔇 Additional comments (2)
core/process/interface.go (1)
65-65: LGTM!
common/mock/forkDetectorMock.go (1)
16-16: LGTM! Also applies to: 65-70
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@core/process/sync/klc1920_node_state_test.go`:
- Around line 122-141: The test
TestKLC1920_ComputeNodeState_GossipWithinFinalityStaysSynced currently hardcodes
the `highest` value (51) to model `gap == BlockFinality`; change the test to
compute `highest` from the real constant by using `process.BlockFinality` (e.g.
set `highest := probable + process.BlockFinality`) when calling
buildKLC1920Bootstrap so the boundary remains correct if BlockFinality changes,
and optionally add a sibling test that sets `highest := probable +
process.BlockFinality + 1` to exercise the `gap == BlockFinality+1` case; update
references around TestKLC1920_ComputeNodeState_GossipWithinFinalityStaysSynced,
buildKLC1920Bootstrap and ComputeNodeState accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 767bcc76-5b47-468f-bb17-fe7a51543940
📒 Files selected for processing (2)
- core/process/sync/export_test.go
- core/process/sync/klc1920_node_state_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: setup-and-lint / setup-and-lint
- GitHub Check: Analyze (go)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (Custom checks)
**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.
Files:
- core/process/sync/export_test.go
- core/process/sync/klc1920_node_state_test.go
**/*_test.go
⚙️ CodeRabbit configuration file
**/*_test.go: Test files. Review for:
- Adequate coverage of edge cases and error paths
- Proper use of test helpers and assertions
- Race condition coverage (tests should use -race flag patterns)
- No hardcoded sleep for synchronization (use channels or sync primitives)
- Test isolation (no shared mutable state between tests)
Files:
- core/process/sync/export_test.go
- core/process/sync/klc1920_node_state_test.go
🧠 Learnings (3)
📚 Learning: 2026-04-21T20:12:22.959Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 38
File: indexer/eventsProcessor.go:188-211
Timestamp: 2026-04-21T20:12:22.959Z
Learning: In Go structs that are JSON-marshaled, if a field is a `bool` and has the `json:"...,omitempty"` tag, then leaving that field at its zero value (`false`) is functionally equivalent (in the resulting JSON) to explicitly setting `Foundation: false`. Reviewers should not flag struct literals that omit such `bool` fields as an inconsistency; they will serialize identically because `omitempty` suppresses `false` values.
Applied to files:
- core/process/sync/export_test.go
- core/process/sync/klc1920_node_state_test.go
📚 Learning: 2026-05-23T22:52:58.065Z
Learnt from: fbsobreira
Repo: klever-io/klever-go PR: 65
File: data/blockchain/blockchain.go:170-172
Timestamp: 2026-05-23T22:52:58.065Z
Learning: In Go, the pattern `append([]byte(nil), src...)` should be treated as preserving nil identity when `src` is a nil `[]byte`: spreading a nil slice contributes zero variadic arguments, so `append` performs no allocation and returns the original nil destination slice unchanged (i.e., result is nil, not an empty non-nil slice). Do not flag this as an incorrect empty-slice conversion; it intentionally maintains `nil`.
Applied to files:
- core/process/sync/export_test.go
- core/process/sync/klc1920_node_state_test.go
📚 Learning: 2026-04-07T14:36:46.394Z
Learnt from: RomuloSiebra
Repo: klever-io/klever-go PR: 35
File: network/p2p/libp2p/peerid_stability_test.go:100-116
Timestamp: 2026-04-07T14:36:46.394Z
Learning: In `network/p2p/libp2p/peerid_stability_test.go` (Go, klever-go repo), the empty-seed tests (`TestCreateP2PPrivKey_EmptySeed_NoError` and `TestCreateP2PPrivKey_EmptySeed_LegacySeed_NoError`) intentionally only assert `NoError/NotNil`. A previous `NotEqual` assertion across two `crypto/rand`-backed calls was deliberately removed because it is a probabilistic assertion that re-verifies OS/stdlib entropy rather than project logic. Do not suggest adding `NotEqual` comparisons for empty-seed / `crypto/rand` paths in this codebase.
Applied to files:
core/process/sync/klc1920_node_state_test.go
🔇 Additional comments (3)
core/process/sync/klc1920_node_state_test.go (1)
28-53: LGTM!
core/process/sync/export_test.go (2)
36-46: ⚡ Quick win
Remove/adjust concern: no accessor name collision on `baseBootstrap`.
`func (boot *baseBootstrap) IsNodeSynchronized()` and `func (boot *baseBootstrap) HasLastBlock()` only exist in `core/process/sync/export_test.go`; there are no non-`_test.go` receiver methods on `baseBootstrap` with those names, so this won't trigger a duplicate-method compile error.
25-34: ⚡ Quick win
Resolve: `slotManager.BeforeGenesis()` gates the spawned goroutine, so it won't nil-deref in these tests.
`computeNodeState()` only starts `go boot.requestHeadersIfSyncIsStuck()` when `shouldTryToRequestHeaders()` is true, and `shouldTryToRequestHeaders()` returns `false` immediately when `boot.slotManager.BeforeGenesis()` is true. In `core/process/sync/klc1920_node_state_test.go`, the `consensusMock.SlotManagerMock` passed into `NewBaseBootstrapForKLC1920Test` has `BeforeGenesisCalled: func() bool { return true }` (explicitly suppressing the `requestHeadersIfSyncIsStuck` path), so the partially-initialized `baseBootstrap` won't hit nil dependencies in `requestHeaders()`.
fbsobreira left a comment
Check the comments. The tests file is an outlier relative to the package conventions. Please fold the tests into the existing metaForkDetector_test.go, keeping the TestKLC1920_* names as regression markers. The computeNodeState boundary test (coverage comment) targets baseSync.go, so it should go in a new baseSync_test.go, next to the code it actually guards.
    lastNonce = currentHeader.GetNonce()
    lastSlot = currentHeader.GetSlot()
    boot.hasLastBlock = boot.forkDetector.ProbableHighestNonce() <= boot.chainHandler.GetCurrentBlockHeader().GetNonce()
    currentBlockNonce := boot.chainHandler.GetCurrentBlockHeader().GetNonce()

This re-fetches the same value already assigned to `lastNonce` at L#302.
    assert.Equal(t, uint64(10), bfd.ProbableHighestNonce(),
        "probableHighestNonce intentionally stays at last processed — BHProposed must not advance it (would break consensus during proposal rounds)")

    gap := bfd.HighestNonceReceived() - bfd.ProbableHighestNonce()

The test pins `HighestNonceReceived − ProbableHighestNonce`, but the fix compares against `currentBlockNonce`. The two coincide here only because no `BHReceived` headers were added.
Reframe the assertion around `HighestNonceReceived − currentBlockNonce` so the guard tracks what the fix actually evaluates.
    // receiving fetched headers). If gossip reports the network ahead
    // by more than the normal proposal/commit window, the node is not
    // really synced even if probableHighestNonce equals currentBlockNonce.
    if highestNonceReceived > currentBlockNonce+process.BlockFinality {

`BlockFinality` is hardcoded to 1, so this guard trips whenever the gossiped nonce ceiling (`highestNonceReceived`) runs 2 or more blocks ahead of the committed tip. That gap is reached during normal propagation/commit latency or a single missed round (that is, when the node is briefly behind while the next proposal is already gossiping in), which would cause transient false not-synced flapping. Suggest widening the tolerance (e.g. tie it to the existing "max rounds without a new block" value, or `BlockFinality + k`) so benign short-lived lag doesn't flip the state.
Both KLC-1920 (fallback desync after election period, 2+ years open) and KLC-2389 (fallback stalls in ~15-block bursts after mid-epoch restart) stem from the same invariant violation in the fork detector / synced-state computation. This PR fixes both with one change.
The fallback's heartbeat uses a freshly-generated observer key (KLC-2388 sibling work). When peer churn after an election (KLC-1920) or a mid-epoch restart (KLC-2389) breaks the `BHReceived` path while gossip (`BHProposed`) keeps flowing, `probableHighestNonce` freezes at the last processed nonce while `highestNonceReceived` climbs. `computeNodeState` then compares `probableHighestNonce == currentBlockNonce` and reports `isNodeSynchronized=true`, even though the chain is silently advancing past the node. The Slack-thread log captured this exactly: `setHighestNonceReceived` firing continuously, zero `forkDetector.AddHeader state=0` lines, node declaring itself synced while 7+ blocks behind.

Fix
`computeNodeState` already uses `probableHighestNonce <= currentBlockNonce` for `hasLastBlock`. We add one extra condition: if the gossip-derived `HighestNonceReceived` is more than `BlockFinality` ahead of `currentBlockNonce`, force `hasLastBlock=false`. The fork detector's update semantics (`processReceivedBlock`, `bfd.headers`, `probableHighestNonce`) are untouched; only the synced-state interpretation changes. Consensus rounds, proposal/commit timing, and downstream listeners behave identically in healthy operation.

What changed
- `core/process/sync/baseSync.go::computeNodeState`: one new `if` after the existing `hasLastBlock` line, plus updated debug log.
- `core/process/sync/baseForkDetector.go`: expose `HighestNonceReceived() uint64` (the private getter was already there).
- `core/process/interface.go`: add `HighestNonceReceived() uint64` to the `ForkDetector` interface.
- `common/mock/forkDetectorMock.go`: corresponding mock impl.
- `core/process/sync/klc1920_repro_test.go`: two regression tests pinning down the invariant the fix relies on.

5 files, 130 / 3 lines.

Validation
Unit tests: `go test ./core/process/sync/... -run KLC` green.

Empirical reproduction. Because the bug needs the asymmetric gossip-vs-fetch pattern, a healthy 3-node localnet cannot fire it organically. We use a controlled `-infected` build that drops `BHReceived` events on the fallback after 60s (env-var-gated, byte-equivalent production binaries when unset). Two clean 6-minute runs, same image base (`cf9f612c`), same infection patch, only the fix differs:

[Results table: pre-fix vs. post-fix runs, including the `klv_is_syncing` metric; values not preserved.]

Same infection pressure, same scenario. Pre-fix exhibits the production bug shape (fallback stuck, false-synced). Post-fix the fallback keeps up despite the same disrupted-fetch pressure, and the synced metric is honest.

Full artifacts at `sprint-97/KLC-1920/validation-artifacts/`:
- `README.md`: narrative, repro instructions, ruled-out alternatives
- `AB-A-prefix-infected/`: bug-reproduction run (poll.tsv, snapshots, full container logs)
- `AB-B-postfix-infected/`: fix-prevents-bug run (same shape)
- `run-AB-infected.sh`: scenario runner (full docker reset → fresh boot → poll → capture)

Caveats
- The empirical runs use `-infected` images because the production failure conditions don't occur naturally at 3-node scale. The infection is env-var-gated and only built into `-infected` tags; production binaries are byte-equivalent.
- The Slack-thread log (`sprint-97/KLC-1920/slack-thread/log.txt`) matches the pre-fix infected run line-for-line: same symptom pattern, same metric lies.

Why not other shapes
Earlier iterations of this work bumped `probableHighestNonce` inline from `BHProposed` events in `processReceivedBlock` itself. That shape also catches the bug but mutates fork-detector internals and changes a counter's update timing. This PR keeps the fork detector alone and answers the synced-state question where it's actually asked. The discarded alternative is preserved on branch `klc-1920-klc-2389-old-broken-attempt` for reference.

Draft status: opened for review before squashing the regression-tests commit history. Will mark ready once a maintainer signs off on the approach.
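The failure mode and the fix described in this PR can be sketched as two pure functions. This is a minimal illustration, not the repository code: `syncedBefore` and `syncedAfter` are hypothetical names, the constant `blockFinality` is assumed to be 1 (the value stated in the review discussion), and the nonce values are invented.

```go
package main

import "fmt"

// blockFinality stands in for process.BlockFinality; the value 1 is an
// assumption taken from the review discussion, not read from the codebase.
const blockFinality uint64 = 1

// syncedBefore models the pre-fix rule: only the probable highest nonce is
// compared against the committed tip.
func syncedBefore(probable, current uint64) bool {
	return probable <= current
}

// syncedAfter adds the gossip guard: if gossip reports the network more than
// blockFinality ahead of the committed tip, the node is treated as not synced.
func syncedAfter(probable, current, highest uint64) bool {
	if highest > current+blockFinality {
		return false
	}
	return probable <= current
}

func main() {
	// probable is frozen at the tip while the gossip ceiling keeps climbing.
	const current, probable = 10, 10
	for highest := uint64(10); highest <= 13; highest++ {
		fmt.Printf("highest=%d before=%v after=%v\n",
			highest, syncedBefore(probable, current), syncedAfter(probable, current, highest))
	}
}
```

In this trace the pre-fix rule reports synced for every sample, while the guarded rule turns false once the gossip ceiling reaches 12, two blocks past the tip.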
Consensus and Synchronization Stability Fix
Blockchain-Critical Impact: Fixes a sync-state invariant in the bootstrap/fork-detection layer that could let a node report itself synchronized while gossiped headers (BHProposed) advanced the chain. Affects consensus/synchronization, networking/gossip handling, and node sync metrics; does not change transaction processing, KVM, block acceptance, or persistent chain data.
Root Cause: computeNodeState determined hasLastBlock/isNodeSynchronized solely from ProbableHighestNonce vs currentBlockNonce. If BHReceived/BHProcessed stalled but BHProposed gossip continued, highestNonceReceived (from gossip) could advance while probableHighestNonce stayed frozen, allowing a false synced state.
What changed
Node Stability & Data Integrity: Improves correctness of sync-state reporting and related metrics. No changes to block/header acceptance, consensus rules, transaction execution, or persistent chain data.
Concurrency & Error Handling: Small concurrency adjustment in setHighestNonceReceived (holding mutFork during compare/update to prevent missed early-return). No broad error-handling changes.
Validation: Unit tests for the sync package (including new regression tests) passed. An empirical repro using an instrumented build shows pre-fix nodes could report synced while falling ~75 blocks behind; post-fix behavior keeps fallback close (2-block gap) and reports sync accurately. Repro artifacts and runner script included in the PR.
Cross-cutting Notes: Monitoring and tooling that depend on probable/highest-nonce or isNodeSynchronized metrics will observe more accurate sync state. The change is defensive and backwards-compatible with consensus/state integrity.