[KLC-2382] deflake TestConsensus_RevertBlockAndTransactions#46
[KLC-2382] deflake TestConsensus_RevertBlockAndTransactions#46fbsobreira wants to merge 1 commit into
Conversation
Three independent root causes — fixing all three brings the test to 44 consecutive PASS in a 50-iteration fresh-process loop at the mainnet 4-second slot budget, up from ~45% failure rate beforehand. 1. delayedBlockBroadcaster.SetHeaderForValidator was the only function in the file that appended to valHeaderBroadcastData without acquiring mutDataForBroadcast, while interceptedHeader iterated it under lock. The unsynchronized append could let interceptedHeader cancel the wrong header alarm, validators then missed the leader header, fail to sign within the slot, and the cluster stalls below quorum (visible as "worker statistics: small consensus quorum sigsNum=3"). Wrap the slice mutation in Lock/Unlock matching the pattern of SetValidatorData. 2. SlotManagerMock.SlotIndex was a plain int64 read by baseBootstrap sync goroutines via Index() and written by the test goroutine via the UpdateSlotCalled closure. Convert to atomic.Int64; update the five external call sites (utils.go, transaction/common.go log statements, dupTransaction_test.go assertion) to use Load / Store. Test-side currentSlot is also captured by closures invoked from the chronology goroutine and becomes atomic.Int64. 3. waitForBlockConditionOrTimeOut used to log.Error and return on timeout, so the next assertion fired against an out-of-sync cluster and produced misleading errors like "expected 0x10 / actual 0xf" (the original CI symptom). Thread *testing.T and call t.Fatalf with the per-node nodesComplete map at the actual point of divergence. The per-step budget is intentionally pinned to the mainnet slot duration (4 s) via simulatedSlotDuration / blockWaitTimeoutSeconds — the test is also a regression guard for production cadence.
WalkthroughThis PR introduces atomic slot index tracking for integration tests with standardized timing constants for mainnet-aligned simulation, and tightens broadcast synchronization by extending mutex coverage during validator-header scheduling. ChangesIntegration Test Slot Synchronization
Broadcast Synchronization Improvement
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Suggested labels
🚥 Pre-merge checks | ✅ 6 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Pull request overview
This PR deflakes TestConsensus_RevertBlockAndTransactions by addressing a production concurrency bug in delayed header broadcasting, eliminating test-side data races around slot indexing, and making block-wait helpers fail fast with clearer diagnostics so timeouts don’t cascade into misleading assertions.
Changes:
- Fix a mutex omission in
delayedBlockBroadcaster.SetHeaderForValidatorto prevent racy slice appends while another goroutine iterates/cancels alarms. - Replace integration test slot index reads/writes with
atomic.Int64(SlotManagerMock.SlotIndex) and update call sites to useLoad/Store/Add. - Improve block wait helpers to accept
*testing.Tandt.Fatalfon timeout with per-node completion state.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| integrationTest/utils.go | Updates slot propagation to use atomic.Int64.Store for race-free writes. |
| integrationTest/transaction/dupTransaction/dupTransaction_test.go | Adjusts slot assertions to use SlotIndex.Load(). |
| integrationTest/transaction/common.go | Updates log fields to use SlotIndex.Load() when reporting failures. |
| integrationTest/mock/slotManagerMock.go | Converts SlotIndex to atomic.Int64 and makes Index/UpdateSlot use atomic operations. |
| integrationTest/consensus/insertDup_test.go | Makes the test’s currentSlot atomic and threads *testing.T into wait helpers. |
| integrationTest/consensus/consensus_test.go | Adds slot-budget constants, makes slot tracking atomic, and upgrades wait helpers to fail fast with diagnostics. |
| core/consensus/broadcast/delayedBroadcast.go | Wraps valHeaderBroadcastData mutation/logging in mutDataForBroadcast to prevent concurrent slice corruption. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if txResult.Block != blockNonce { | ||
| finalErr = WrapError(finalErr, ErrNotInBlock) | ||
| log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex, "nodeHight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce) | ||
| log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex.Load(), "nodeHight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce) |
| slot = currentSlot.Add(1) | ||
| WaitForBlockSlotOrTimeOut(t, nodes, uint64(slot), blockWaitTimeoutSeconds) | ||
|
|
||
| // TX should be processed on next blok after revert |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@integrationTest/transaction/common.go`:
- Line 235: Update the structured log key in the log.Warn call to fix the typo:
change the "nodeHight" key to "nodeHeight" so dashboards and filters use a
consistent field name; locate the log.Warn invocation that includes "nodeIdx",
"slot", "nodeHight", "foundBlock", and "expectedBlock" and replace the
misspelled key while keeping all other arguments unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 60c5ae98-e86e-4f57-b28c-1fb8bd888295
📒 Files selected for processing (7)
core/consensus/broadcast/delayedBroadcast.gointegrationTest/consensus/consensus_test.gointegrationTest/consensus/insertDup_test.gointegrationTest/mock/slotManagerMock.gointegrationTest/transaction/common.gointegrationTest/transaction/dupTransaction/dupTransaction_test.gointegrationTest/utils.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Agent
- GitHub Check: Analyze (go)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go
📄 CodeRabbit inference engine (Custom checks)
**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.
Files:
integrationTest/transaction/dupTransaction/dupTransaction_test.gointegrationTest/transaction/common.gointegrationTest/consensus/insertDup_test.gointegrationTest/utils.gointegrationTest/consensus/consensus_test.gointegrationTest/mock/slotManagerMock.gocore/consensus/broadcast/delayedBroadcast.go
**/*_test.go
⚙️ CodeRabbit configuration file
**/*_test.go: Test files. Review for: - Adequate coverage of edge cases and error paths - Proper use of test helpers and assertions - Race condition coverage (tests should use -race flag patterns) - No hardcoded sleep for synchronization (use channels or sync primitives) - Test isolation (no shared mutable state between tests)
Files:
integrationTest/transaction/dupTransaction/dupTransaction_test.gointegrationTest/consensus/insertDup_test.gointegrationTest/consensus/consensus_test.go
core/consensus/**
⚙️ CodeRabbit configuration file
core/consensus/**: This is the consensus engine. Review with extreme care: - Check for race conditions in concurrent block processing - Verify correct mutex/lock ordering to prevent deadlocks - Ensure deterministic behavior (no maps iteration without sorting, no random) - Validate message signing and verification logic - Flag any changes that could cause consensus forks or chain splits
Files:
core/consensus/broadcast/delayedBroadcast.go
🧠 Learnings (1)
📚 Learning: 2026-04-21T20:12:22.959Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 38
File: indexer/eventsProcessor.go:188-211
Timestamp: 2026-04-21T20:12:22.959Z
Learning: In Go structs that are JSON-marshaled, if a field is a `bool` and has the `json:"...,omitempty"` tag, then leaving that field at its zero value (`false`) is functionally equivalent (in the resulting JSON) to explicitly setting `Foundation: false`. Reviewers should not flag struct literals that omit such `bool` fields as an inconsistency; they will serialize identically because `omitempty` suppresses `false` values.
Applied to files:
integrationTest/transaction/dupTransaction/dupTransaction_test.gointegrationTest/transaction/common.gointegrationTest/consensus/insertDup_test.gointegrationTest/utils.gointegrationTest/consensus/consensus_test.gointegrationTest/mock/slotManagerMock.gocore/consensus/broadcast/delayedBroadcast.go
🔇 Additional comments (8)
core/consensus/broadcast/delayedBroadcast.go (1)
218-233: Race condition fix is correct and properly scoped.The mutex now protects the append to
d.valHeaderBroadcastData(line 231), which races with the locked iteration ininterceptedHeader(lines 520-538). Lock/unlock pairing is sound with no early returns or panic paths between them. The pattern matchesSetLeaderDataandSetValidatorData.The comment at lines 218-222 clearly documents the root cause and consequences of the race—good practice for consensus-critical code.
integrationTest/consensus/insertDup_test.go (1)
30-40: Same loop-capture verification as inconsensus_test.go.The callback closure capture pattern is identical to the one already flagged for Go-version verification.
integrationTest/mock/slotManagerMock.go (1)
4-4: Atomic slot-index migration looks correct.Using
atomic.Int64withLoad()/Add()here is consistent and fixes the test/consensus read-write race on slot state.Also applies to: 16-16, 51-51, 79-79
integrationTest/utils.go (1)
209-212: Slot update now uses the correct atomic write path.
SlotIndex.Store(...)is the right write primitive for this shared value.integrationTest/transaction/common.go (1)
227-227: Race-safe slot diagnostics are correctly applied.Switching these log paths to
SlotIndex.Load()is the right fix for concurrent reads.Also applies to: 235-235, 243-243, 264-264
integrationTest/transaction/dupTransaction/dupTransaction_test.go (1)
159-159: Assertion now reads slot index safely.Using
SlotIndex.Load()in this cross-goroutine assertion is the correct change.integrationTest/consensus/consensus_test.go (2)
184-197: Fail-fast timeout behavior is a strong reliability improvement.Switching timeout handling to
t.Fatalf(...)here surfaces the root cause immediately instead of leaking into misleading downstream assertions.
65-76: Go version confirmed in go.mod: 1.25.7. Range-loop closures capturingnare safe—Go 1.22+ guarantees per-iteration copies of loop variables, eliminating the classic gotcha. The code is correct.
| if txResult.Block != blockNonce { | ||
| finalErr = WrapError(finalErr, ErrNotInBlock) | ||
| log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex, "nodeHight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce) | ||
| log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex.Load(), "nodeHight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce) |
There was a problem hiding this comment.
Fix structured log key typo: nodeHight → nodeHeight.
The typo makes log filters/dashboards inconsistent across node-height fields.
Suggested patch
- log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex.Load(), "nodeHight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce)
+ log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex.Load(), "nodeHeight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex.Load(), "nodeHight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce) | |
| log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex.Load(), "nodeHeight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@integrationTest/transaction/common.go` at line 235, Update the structured log
key in the log.Warn call to fix the typo: change the "nodeHight" key to
"nodeHeight" so dashboards and filters use a consistent field name; locate the
log.Warn invocation that includes "nodeIdx", "slot", "nodeHight", "foundBlock",
and "expectedBlock" and replace the misspelled key while keeping all other
arguments unchanged.
|
Closing in favour of widened-scope PR — branch renamed to bugfix/KLC-2382-consensus-data-races to reflect the broader race-fix work uncovered during this investigation. New PR forthcoming on the renamed branch. |
Summary
Three independent root causes were behind the intermittent failure of
TestConsensus_RevertBlockAndTransactionsin CI. Fixing all three brings thetest to 44 consecutive passes in a 50-iteration fresh-process loop at the
mainnet 4 s slot budget, up from a ~45% failure rate at the same budget
beforehand.
delayedBlockBroadcaster.SetHeaderForValidator(
core/consensus/broadcast/delayedBroadcast.go). It was the only functionin the file that appended to
valHeaderBroadcastDatawithout acquiringmutDataForBroadcast, whileinterceptedHeaderiterated the same sliceunder the lock. The unsynchronized append could let
interceptedHeadercancel the wrong header alarm — validators then missed the leader header,
failed to sign within the slot, and the cluster stalled below quorum
(
worker statistics: small consensus quorum sigsNum=3). Fixed by wrappingthe slice mutation in
Lock/Unlock, matching the pattern ofSetValidatorData. This is the proximate cause of the CI flake.SlotManagerMock.SlotIndex. Plainint64readby
baseBootstrap.GetNodeStatesync goroutines viaIndex(), written bythe test goroutine via the
UpdateSlotCalledclosure. Converted toatomic.Int64; updated the five external call sites and promoted thetest-local
currentSlottoatomic.Int64for the same reason.waitForBlockConditionOrTimeOutused to
log.Errorandreturn, so the next assertion (header.GetSlot())fired against an out-of-sync cluster — producing the original
expected 0x10 / actual 0xfCI symptom. Now threads*testing.Tand callst.Fatalfwith the per-nodenodesCompletemap at the point ofdivergence. The per-step budget is intentionally pinned to the mainnet
slot duration (4 s) via
simulatedSlotDuration/blockWaitTimeoutSeconds— this test is also a regression guard for production cadence and must not
be relaxed.
Verification
Local stress on macOS / Go 1.25.7 / M-series CPU, 50 fresh
go testprocessesat the original 4 s budget (CI-equivalent):
The two failures in the post-fix loop both occurred within the first ~60 s after
killing a previous stress run; they show the same single-node-lag pattern and
are consistent with libp2p port
TIME_WAITresidue rather than intrinsic flake.AC of "20+ consecutive runs" is exceeded by 2×.
Out of scope (follow-up needed)
Running with `-race` after the `SetHeaderForValidator` fix still surfaces a
separate production race on `ConsensusState` shared fields:
`Data`, `Header`, `SlotCanceled`, `ExtendedCalled`,
`WaitingAllSignaturesTimeOut` on `ConsensusState` lack a mutex and are
touched from many call sites across `core/consensus/slot/...`. CI does not
enable `-race`, so this does not affect KLC-2382 directly — but it is a real
production race and a follow-up Jira should be filed.
Test plan
these changes (2 pre-existing `gosec` findings in `utils.go` are on
unrelated lines)
cleanly via `t.Fatalf` on the remaining `ConsensusState` race, with
the new diagnostic message instead of misleading downstream assertions
44 consecutive passes (iters 7–50)