Skip to content

[KLC-2382] deflake TestConsensus_RevertBlockAndTransactions#46

Closed
fbsobreira wants to merge 1 commit into
developfrom
bugfix/KLC-2382-flaky-revert-block-test
Closed

[KLC-2382] deflake TestConsensus_RevertBlockAndTransactions#46
fbsobreira wants to merge 1 commit into
developfrom
bugfix/KLC-2382-flaky-revert-block-test

Conversation

@fbsobreira
Copy link
Copy Markdown
Member

@fbsobreira fbsobreira commented May 12, 2026

Summary

Three independent root causes were behind the intermittent failure of
TestConsensus_RevertBlockAndTransactions in CI. Fixing all three brings the
test to 44 consecutive passes in a 50-iteration fresh-process loop at the
mainnet 4 s slot budget, up from a ~45% failure rate at the same budget
beforehand.

  • Production race in delayedBlockBroadcaster.SetHeaderForValidator
    (core/consensus/broadcast/delayedBroadcast.go). It was the only function
    in the file that appended to valHeaderBroadcastData without acquiring
    mutDataForBroadcast, while interceptedHeader iterated the same slice
    under the lock. The unsynchronized append could let interceptedHeader
    cancel the wrong header alarm — validators then missed the leader header,
    failed to sign within the slot, and the cluster stalled below quorum
    (worker statistics: small consensus quorum sigsNum=3). Fixed by wrapping
    the slice mutation in Lock/Unlock, matching the pattern of
    SetValidatorData. This is the proximate cause of the CI flake.
  • Test-side data race on SlotManagerMock.SlotIndex. Plain int64 read
    by baseBootstrap.GetNodeState sync goroutines via Index(), written by
    the test goroutine via the UpdateSlotCalled closure. Converted to
    atomic.Int64; updated the five external call sites and promoted the
    test-local currentSlot to atomic.Int64 for the same reason.
  • Silent timeouts hiding real failures. waitForBlockConditionOrTimeOut
    used to log.Error and return, so the next assertion (header.GetSlot())
    fired against an out-of-sync cluster — producing the original
    expected 0x10 / actual 0xf CI symptom. Now threads *testing.T and calls
    t.Fatalf with the per-node nodesComplete map at the point of
    divergence. The per-step budget is intentionally pinned to the mainnet
    slot duration (4 s) via simulatedSlotDuration / blockWaitTimeoutSeconds
    — this test is also a regression guard for production cadence and must not
    be relaxed.

Verification

Local stress on macOS / Go 1.25.7 / M-series CPU, 50 fresh go test processes
at the original 4 s budget (CI-equivalent):

Pre-fix Post-fix
Pass / fail 5 / 4 (~45% fail) 48 / 2 (96% pass)
Max consecutive PASS n/a 44 (iters 7–50)

The two failures in the post-fix loop both occurred within the first ~60 s after
killing a previous stress run; they show the same single-node-lag pattern and
are consistent with libp2p port TIME_WAIT residue rather than intrinsic flake.
AC of "20+ consecutive runs" is exceeded by 2×.

Out of scope (follow-up needed)

Running with `-race` after the `SetHeaderForValidator` fix still surfaces a
separate production race on `ConsensusState` shared fields:

  • Write at `core/consensus/slot/consensusState.go:61` (`ResetConsensusState`)
  • Read at `core/consensus/slot/consensusState.go:156` (`IsConsensusDataSet`)

`Data`, `Header`, `SlotCanceled`, `ExtendedCalled`,
`WaitingAllSignaturesTimeOut` on `ConsensusState` lack a mutex and are
touched from many call sites across `core/consensus/slot/...`. CI does not
enable `-race`, so this does not affect KLC-2382 directly — but it is a real
production race and a follow-up Jira should be filed.

Test plan

  • Build clean (`go build ./...`)
  • `golangci-lint run` on all changed packages — 0 issues attributable to
    these changes (2 pre-existing `gosec` findings in `utils.go` are on
    unrelated lines)
  • Single race-detector sanity (`go test -race -count=1`): test now fails
    cleanly via `t.Fatalf` on the remaining `ConsensusState` race, with
    the new diagnostic message instead of misleading downstream assertions
  • 50-iteration fresh-process stress at the original 4 s budget: 48/50 PASS,
    44 consecutive passes (iters 7–50)
  • CI pipeline green on this PR

Review Change Stack

Three independent root causes — fixing all three brings the test to
44 consecutive PASS in a 50-iteration fresh-process loop at the mainnet
4-second slot budget, up from ~45% failure rate beforehand.

1. delayedBlockBroadcaster.SetHeaderForValidator was the only function in
   the file that appended to valHeaderBroadcastData without acquiring
   mutDataForBroadcast, while interceptedHeader iterated it under lock.
   The unsynchronized append could let interceptedHeader cancel the wrong
   header alarm, validators then missed the leader header, fail to sign
   within the slot, and the cluster stalls below quorum (visible as
   "worker statistics: small consensus quorum sigsNum=3"). Wrap the slice
   mutation in Lock/Unlock matching the pattern of SetValidatorData.

2. SlotManagerMock.SlotIndex was a plain int64 read by baseBootstrap sync
   goroutines via Index() and written by the test goroutine via the
   UpdateSlotCalled closure. Convert to atomic.Int64; update the five
   external call sites (utils.go, transaction/common.go log statements,
   dupTransaction_test.go assertion) to use Load / Store. Test-side
   currentSlot is also captured by closures invoked from the chronology
   goroutine and becomes atomic.Int64.

3. waitForBlockConditionOrTimeOut used to log.Error and return on
   timeout, so the next assertion fired against an out-of-sync cluster
   and produced misleading errors like "expected 0x10 / actual 0xf"
   (the original CI symptom). Thread *testing.T and call t.Fatalf with
   the per-node nodesComplete map at the actual point of divergence.
   The per-step budget is intentionally pinned to the mainnet slot
   duration (4 s) via simulatedSlotDuration / blockWaitTimeoutSeconds —
   the test is also a regression guard for production cadence.
Copilot AI review requested due to automatic review settings May 12, 2026 13:57
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

Walkthrough

This PR introduces atomic slot index tracking for integration tests with standardized timing constants for mainnet-aligned simulation, and tightens broadcast synchronization by extending mutex coverage during validator-header scheduling.

Changes

Integration Test Slot Synchronization

Layer / File(s) Summary
Atomic slot index mock infrastructure
integrationTest/mock/slotManagerMock.go
SlotManagerMock converts SlotIndex from plain int64 to atomic.Int64, updates Index() to call Load(), and updates UpdateSlot() to call Add(1) for race-free concurrent access during integration tests.
Consensus test timing constants and helper refactoring
integrationTest/consensus/consensus_test.go
Introduces simulatedSlotDuration and blockWaitTimeoutSeconds constants; adds atomic package import; refactors WaitForBlockSlotOrTimeOut to accept *testing.T parameter; replaces shared wait logic with waitForBlockConditionOrTimeOut that calls t.Fatalf on timeout with cluster diagnostics instead of logging.
TestConsensus_RevertBlockAndTransactions atomic slot adoption
integrationTest/consensus/consensus_test.go
Replaces mutable currentSlot integer with atomic.Int64; wires node SlotManager callbacks (UpdateSlotCalled, TimestampCalled) to compute slot index and timestamps from atomic slot using simulated slot duration; updates block-wait calls to pass *testing.T and derived timeout.
Test suite adoption of atomic slot patterns
integrationTest/consensus/insertDup_test.go, integrationTest/transaction/common.go, integrationTest/transaction/dupTransaction/dupTransaction_test.go, integrationTest/utils.go
Updates insertDup_test to use atomic slot counter and new wait signature; modifies common.go CheckTXInBlock and CheckTXIsPending to call SlotIndex.Load(); updates dupTransaction_test assertion to call SlotIndex.Load(); updates utils.go UpdateSlot to call SlotIndex.Store().

Broadcast Synchronization Improvement

Layer / File(s) Summary
SetHeaderForValidator mutex coverage expansion
core/consensus/broadcast/delayedBroadcast.go
SetHeaderForValidator now acquires mutDataForBroadcast before trace logging and holds it through delay computation, valHeaderBroadcastData append, and alarmID derivation, preventing concurrent access to scheduling state during the entire setup sequence.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

consensus-critical

🚥 Pre-merge checks | ✅ 6 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Title check ⚠️ Warning Title follows required format KLC-2382 type: description, but 'deflake' is not an approved type; approved types are: feat, fix, refactor, perf, test, docs, chore, ci. Replace 'deflake' with 'fix' since the PR fixes flaky test failures: KLC-2382 fix: TestConsensus_RevertBlockAndTransactions
Docstring Coverage ⚠️ Warning Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Concurrency Safety ✅ Passed Mutex lock/unlock pairs properly balanced with correct scopes. Atomic.Int64 conversion eliminates data races. Closure captures safe in Go 1.25.7. No goroutine leaks or missing cleanup detected.
Error Handling ✅ Passed All error returns properly handled. No unchecked errors, silent failures, or bare panic() calls. Production code checks with logging, test code with require/assert, error wrapping via WrapError().
State Consistency ✅ Passed PR does not modify blockchain state (accounts, balances, storage). Changes are limited to consensus broadcasting infrastructure race fix and test synchronization utilities. Check not applicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch bugfix/KLC-2382-flaky-revert-block-test

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR deflakes TestConsensus_RevertBlockAndTransactions by addressing a production concurrency bug in delayed header broadcasting, eliminating test-side data races around slot indexing, and making block-wait helpers fail fast with clearer diagnostics so timeouts don’t cascade into misleading assertions.

Changes:

  • Fix a mutex omission in delayedBlockBroadcaster.SetHeaderForValidator to prevent racy slice appends while another goroutine iterates/cancels alarms.
  • Replace integration test slot index reads/writes with atomic.Int64 (SlotManagerMock.SlotIndex) and update call sites to use Load/Store/Add.
  • Improve block wait helpers to accept *testing.T and t.Fatalf on timeout with per-node completion state.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
integrationTest/utils.go Updates slot propagation to use atomic.Int64.Store for race-free writes.
integrationTest/transaction/dupTransaction/dupTransaction_test.go Adjusts slot assertions to use SlotIndex.Load().
integrationTest/transaction/common.go Updates log fields to use SlotIndex.Load() when reporting failures.
integrationTest/mock/slotManagerMock.go Converts SlotIndex to atomic.Int64 and makes Index/UpdateSlot use atomic operations.
integrationTest/consensus/insertDup_test.go Makes the test’s currentSlot atomic and threads *testing.T into wait helpers.
integrationTest/consensus/consensus_test.go Adds slot-budget constants, makes slot tracking atomic, and upgrades wait helpers to fail fast with diagnostics.
core/consensus/broadcast/delayedBroadcast.go Wraps valHeaderBroadcastData mutation/logging in mutDataForBroadcast to prevent concurrent slice corruption.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if txResult.Block != blockNonce {
finalErr = WrapError(finalErr, ErrNotInBlock)
log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex, "nodeHight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce)
log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex.Load(), "nodeHight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce)
slot = currentSlot.Add(1)
WaitForBlockSlotOrTimeOut(t, nodes, uint64(slot), blockWaitTimeoutSeconds)

// TX should be processed on next blok after revert
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@integrationTest/transaction/common.go`:
- Line 235: Update the structured log key in the log.Warn call to fix the typo:
change the "nodeHight" key to "nodeHeight" so dashboards and filters use a
consistent field name; locate the log.Warn invocation that includes "nodeIdx",
"slot", "nodeHight", "foundBlock", and "expectedBlock" and replace the
misspelled key while keeping all other arguments unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 60c5ae98-e86e-4f57-b28c-1fb8bd888295

📥 Commits

Reviewing files that changed from the base of the PR and between 3b06f69 and 9b3a362.

📒 Files selected for processing (7)
  • core/consensus/broadcast/delayedBroadcast.go
  • integrationTest/consensus/consensus_test.go
  • integrationTest/consensus/insertDup_test.go
  • integrationTest/mock/slotManagerMock.go
  • integrationTest/transaction/common.go
  • integrationTest/transaction/dupTransaction/dupTransaction_test.go
  • integrationTest/utils.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Agent
  • GitHub Check: Analyze (go)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (Custom checks)

**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.

Files:

  • integrationTest/transaction/dupTransaction/dupTransaction_test.go
  • integrationTest/transaction/common.go
  • integrationTest/consensus/insertDup_test.go
  • integrationTest/utils.go
  • integrationTest/consensus/consensus_test.go
  • integrationTest/mock/slotManagerMock.go
  • core/consensus/broadcast/delayedBroadcast.go
**/*_test.go

⚙️ CodeRabbit configuration file

**/*_test.go: Test files. Review for: - Adequate coverage of edge cases and error paths - Proper use of test helpers and assertions - Race condition coverage (tests should use -race flag patterns) - No hardcoded sleep for synchronization (use channels or sync primitives) - Test isolation (no shared mutable state between tests)

Files:

  • integrationTest/transaction/dupTransaction/dupTransaction_test.go
  • integrationTest/consensus/insertDup_test.go
  • integrationTest/consensus/consensus_test.go
core/consensus/**

⚙️ CodeRabbit configuration file

core/consensus/**: This is the consensus engine. Review with extreme care: - Check for race conditions in concurrent block processing - Verify correct mutex/lock ordering to prevent deadlocks - Ensure deterministic behavior (no maps iteration without sorting, no random) - Validate message signing and verification logic - Flag any changes that could cause consensus forks or chain splits

Files:

  • core/consensus/broadcast/delayedBroadcast.go
🧠 Learnings (1)
📚 Learning: 2026-04-21T20:12:22.959Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 38
File: indexer/eventsProcessor.go:188-211
Timestamp: 2026-04-21T20:12:22.959Z
Learning: In Go structs that are JSON-marshaled, if a field is a `bool` and has the `json:"...,omitempty"` tag, then leaving that field at its zero value (`false`) is functionally equivalent (in the resulting JSON) to explicitly setting `Foundation: false`. Reviewers should not flag struct literals that omit such `bool` fields as an inconsistency; they will serialize identically because `omitempty` suppresses `false` values.

Applied to files:

  • integrationTest/transaction/dupTransaction/dupTransaction_test.go
  • integrationTest/transaction/common.go
  • integrationTest/consensus/insertDup_test.go
  • integrationTest/utils.go
  • integrationTest/consensus/consensus_test.go
  • integrationTest/mock/slotManagerMock.go
  • core/consensus/broadcast/delayedBroadcast.go
🔇 Additional comments (8)
core/consensus/broadcast/delayedBroadcast.go (1)

218-233: Race condition fix is correct and properly scoped.

The mutex now protects the append to d.valHeaderBroadcastData (line 231), which races with the locked iteration in interceptedHeader (lines 520-538). Lock/unlock pairing is sound with no early returns or panic paths between them. The pattern matches SetLeaderData and SetValidatorData.

The comment at lines 218-222 clearly documents the root cause and consequences of the race—good practice for consensus-critical code.

integrationTest/consensus/insertDup_test.go (1)

30-40: Same loop-capture verification as in consensus_test.go.

The callback closure capture pattern is identical to the one already flagged for Go-version verification.

integrationTest/mock/slotManagerMock.go (1)

4-4: Atomic slot-index migration looks correct.

Using atomic.Int64 with Load()/Add() here is consistent and fixes the test/consensus read-write race on slot state.

Also applies to: 16-16, 51-51, 79-79

integrationTest/utils.go (1)

209-212: Slot update now uses the correct atomic write path.

SlotIndex.Store(...) is the right write primitive for this shared value.

integrationTest/transaction/common.go (1)

227-227: Race-safe slot diagnostics are correctly applied.

Switching these log paths to SlotIndex.Load() is the right fix for concurrent reads.

Also applies to: 235-235, 243-243, 264-264

integrationTest/transaction/dupTransaction/dupTransaction_test.go (1)

159-159: Assertion now reads slot index safely.

Using SlotIndex.Load() in this cross-goroutine assertion is the correct change.

integrationTest/consensus/consensus_test.go (2)

184-197: Fail-fast timeout behavior is a strong reliability improvement.

Switching timeout handling to t.Fatalf(...) here surfaces the root cause immediately instead of leaking into misleading downstream assertions.


65-76: Go version confirmed in go.mod: 1.25.7. Range-loop closures capturing n are safe—Go 1.22+ guarantees per-iteration copies of loop variables, eliminating the classic gotcha. The code is correct.

if txResult.Block != blockNonce {
finalErr = WrapError(finalErr, ErrNotInBlock)
log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex, "nodeHight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce)
log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex.Load(), "nodeHight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix structured log key typo: nodeHightnodeHeight.

The typo makes log filters/dashboards inconsistent across node-height fields.

Suggested patch
- log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex.Load(), "nodeHight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce)
+ log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex.Load(), "nodeHeight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex.Load(), "nodeHight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce)
log.Warn("TX block does not match", "nodeIdx", i, "slot", n.SlotManager.SlotIndex.Load(), "nodeHeight", n.Blkc.GetCurrentBlockHeader().GetNonce(), "foundBlock", txResult.Block, "expectedBlock", blockNonce)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@integrationTest/transaction/common.go` at line 235, Update the structured log
key in the log.Warn call to fix the typo: change the "nodeHight" key to
"nodeHeight" so dashboards and filters use a consistent field name; locate the
log.Warn invocation that includes "nodeIdx", "slot", "nodeHight", "foundBlock",
and "expectedBlock" and replace the misspelled key while keeping all other
arguments unchanged.

@fbsobreira fbsobreira marked this pull request as draft May 12, 2026 18:28
@fbsobreira
Copy link
Copy Markdown
Member Author

Closing in favour of widened-scope PR — branch renamed to bugfix/KLC-2382-consensus-data-races to reflect the broader race-fix work uncovered during this investigation. New PR forthcoming on the renamed branch.

@fbsobreira fbsobreira closed this May 13, 2026
@fbsobreira fbsobreira deleted the bugfix/KLC-2382-flaky-revert-block-test branch May 13, 2026 09:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants