[KLC-2382] fix data races in consensus, sharding, and libp2p layers #48
Three independent root causes. Fixing all three brings the test to 44 consecutive PASS in a 50-iteration fresh-process loop at the mainnet 4-second slot budget, up from a ~45% failure rate beforehand.

1. delayedBlockBroadcaster.SetHeaderForValidator was the only function in the file that appended to valHeaderBroadcastData without acquiring mutDataForBroadcast, while interceptedHeader iterated it under lock. The unsynchronized append could let interceptedHeader cancel the wrong header alarm; validators then missed the leader header and failed to sign within the slot, and the cluster stalled below quorum (visible as "worker statistics: small consensus quorum sigsNum=3"). Wrap the slice mutation in Lock/Unlock, matching the pattern of SetValidatorData.

2. SlotManagerMock.SlotIndex was a plain int64 read by baseBootstrap sync goroutines via Index() and written by the test goroutine via the UpdateSlotCalled closure. Convert to atomic.Int64; update the five external call sites (utils.go, transaction/common.go log statements, dupTransaction_test.go assertion) to use Load / Store. Test-side currentSlot is also captured by closures invoked from the chronology goroutine and becomes atomic.Int64.

3. waitForBlockConditionOrTimeOut used to log.Error and return on timeout, so the next assertion fired against an out-of-sync cluster and produced misleading errors like "expected 0x10 / actual 0xf" (the original CI symptom). Thread *testing.T and call t.Fatalf with the per-node nodesComplete map at the actual point of divergence.

The per-step budget is intentionally pinned to the mainnet slot duration (4 s) via simulatedSlotDuration / blockWaitTimeoutSeconds: the test is also a regression guard for production cadence.
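The first root cause above (an append that skipped the lock every other accessor takes) can be illustrated with a minimal sketch. The struct layout, the `headerBroadcastData` type, the `runDemo` helper, and the choice of `sync.RWMutex` are assumptions made for illustration; only the names `SetHeaderForValidator`, `interceptedHeader`, `valHeaderBroadcastData`, and `mutDataForBroadcast` come from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// headerBroadcastData is a hypothetical stand-in for the entries stored in
// valHeaderBroadcastData; the real type is not shown in this PR text.
type headerBroadcastData struct {
	headerHash string
	cancelled  bool
}

// delayedBroadcaster is a simplified sketch, not the real delayedBlockBroadcaster.
type delayedBroadcaster struct {
	mutDataForBroadcast    sync.RWMutex // assumed mutex kind
	valHeaderBroadcastData []*headerBroadcastData
}

// SetHeaderForValidator appends under the write lock, so a concurrent
// iteration never observes the slice while it is being grown.
func (dbb *delayedBroadcaster) SetHeaderForValidator(hash string) {
	dbb.mutDataForBroadcast.Lock()
	dbb.valHeaderBroadcastData = append(dbb.valHeaderBroadcastData, &headerBroadcastData{headerHash: hash})
	dbb.mutDataForBroadcast.Unlock()
}

// interceptedHeader marks the entry with a matching hash as cancelled and
// reports whether such an entry was found.
func (dbb *delayedBroadcaster) interceptedHeader(hash string) bool {
	dbb.mutDataForBroadcast.Lock()
	defer dbb.mutDataForBroadcast.Unlock()
	for _, d := range dbb.valHeaderBroadcastData {
		if d.headerHash == hash {
			d.cancelled = true
			return true
		}
	}
	return false
}

// runDemo lets n goroutines each append a header and then cancel it,
// and returns how many entries ended up cancelled.
func runDemo(n int) int {
	dbb := &delayedBroadcaster{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			hash := fmt.Sprintf("hdr-%d", i)
			dbb.SetHeaderForValidator(hash)
			dbb.interceptedHeader(hash)
		}(i)
	}
	wg.Wait()

	dbb.mutDataForBroadcast.RLock()
	defer dbb.mutDataForBroadcast.RUnlock()
	cancelled := 0
	for _, d := range dbb.valHeaderBroadcastData {
		if d.cancelled {
			cancelled++
		}
	}
	return cancelled
}

func main() {
	fmt.Println("cancelled headers:", runDemo(8))
}
```

Running the sketch with `go run -race` should stay silent; removing the `Lock`/`Unlock` pair from `SetHeaderForValidator` is the kind of change the race detector is expected to flag.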
Investigation into the flaky TestConsensus_RevertBlockAndTransactions uncovered ~400 race-detector warnings beyond the test scaffolding fixed in the previous commit. The deflake target only stabilises once the underlying state-sharing hazards in the consensus protocol are closed.

Consensus slot state:
- ConsensusState slot-scoped fields (Data, Header, SlotIndex, SlotTimestamp, SlotCanceled, ExtendedCalled, WaitingAllSignaturesTimeOut) now sit behind mutSlotState; every reader/writer acquires the lock at handler boundary.
- Introduce BeginNewSlot for atomic slot transitions. Splitting flag reset from SlotIndex/SlotTimestamp install allowed stale waitAllSignatures goroutines from slot N-1 to fire in the gap, pass the spawnSlot freshness guard, and write WaitingAllSignaturesTimeOut=true onto cleared state, causing the next leader to commit with minimum quorum prematurely.
- Extract resetSlotStateLocked so ResetConsensusState (construction) and BeginNewSlot (slot transition) share the field list without duplication.
- subslotBlock: snapshot ExtendedCalled *after* SetProcessingBlock(true) to preserve the original Extend happens-before invariant. Snapshotting first reopened a window where Extend could flip the flag, observe ProcessingBlock=false, and revert state mid-processing.
- subslotEndSlot: header snapshot pattern in doEndSlotJobByLeader; signBlockHeader, createAndBroadcastHeaderFinalInfo, and broadcastBlockDataLeader now take the header as a parameter so the goroutine fan-out does not re-read sr.Header without the lock.
- subslotSignature: spawnSlot guard around waitAllSignatures so a goroutine spawned in slot N-1 does not corrupt slot N flags.
- subslotStartSlot: cancellation paths use SetSlotCanceled; slot transition uses BeginNewSlot.
- worker: Extend uses SetExtendedCalled; checkSelfState and executeMessage take snapshots under RLock.
- consensusMessageValidator: snapshot SlotIndex under lock before comparing against message slot.
- subslot: timer path uses SetSlotCanceled instead of unguarded write.

slotConsensus split-mutex:
- Split rcns.mut into mutConsensusGroup (consensusGroup slice header) and mut (validatorSlotStates map). JobDone / SetJobDone is on the hot signature-collection path and was contending with every ConsensusGroup() reader. SetConsensusGroup intentionally does not hold both locks simultaneously: ComputeSize may briefly undercount for one poll cycle during a swap, which is benign because the next poll observes consistent state.
- ConsensusGroup() captures the slice header under RLock so callers see a stable view; elements are read-only.

BlockProcessor:
- metaProcessor: clone header before launching async getMetricsFromHeader goroutine; the original code passed the live pointer that was mutated by the commit path on the calling goroutine.

Sharding:
- nodesCoordinator: convert currentEpoch and stateReady to atomic.Uint32 / atomic.Bool; update registry load site.
- hashValidatorShuffler: snapshot NodesToShuffle under RLock instead of ranging the live slice.

libp2p:
- netMessenger.connMonitor: check ctx.Done() at the top of each loop iteration to prevent the goroutine from leaking after Close.

Integration test:
- consensus_test.go: polling-fairness fix so the waitForBlock condition is re-evaluated on a uniform tick instead of biased toward the earliest-completing node.

Race detector now clean across the consensus, sharding, and libp2p suites at -count=1; integration test passes 15/15 fresh-process runs and 75% mean across -count=30 in-process stress (vs ~68% baseline).
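The atomic slot transition described above can be sketched as follows. This is a reduced model under stated assumptions: the struct keeps only a few of the listed fields, the `timedOut` accessor is hypothetical, and `SetWaitingAllSignaturesTimeOutIfSlot` is the helper name a later commit in this PR introduces; the bodies are illustrative rather than the repository's code.

```go
package main

import (
	"fmt"
	"sync"
)

// consensusState is a reduced sketch of ConsensusState with a subset of the
// slot-scoped fields guarded by mutSlotState.
type consensusState struct {
	mutSlotState                sync.RWMutex
	Data                        []byte
	SlotIndex                   int64
	SlotCanceled                bool
	ExtendedCalled              bool
	WaitingAllSignaturesTimeOut bool
}

// resetSlotStateLocked clears the slot-scoped fields.
// The caller must already hold the mutSlotState write lock.
func (cs *consensusState) resetSlotStateLocked() {
	cs.Data = nil
	cs.SlotCanceled = false
	cs.ExtendedCalled = false
	cs.WaitingAllSignaturesTimeOut = false
}

// BeginNewSlot resets the flags and installs the new slot index inside one
// critical section, leaving no gap for a stale goroutine to write into.
func (cs *consensusState) BeginNewSlot(index int64) {
	cs.mutSlotState.Lock()
	defer cs.mutSlotState.Unlock()
	cs.resetSlotStateLocked()
	cs.SlotIndex = index
}

// SetWaitingAllSignaturesTimeOutIfSlot sets the timeout flag only when the
// current slot still equals the slot in which the caller was spawned.
// The check and the write happen under the same write lock.
func (cs *consensusState) SetWaitingAllSignaturesTimeOutIfSlot(spawnSlot int64) bool {
	cs.mutSlotState.Lock()
	defer cs.mutSlotState.Unlock()
	if cs.SlotIndex != spawnSlot {
		return false
	}
	cs.WaitingAllSignaturesTimeOut = true
	return true
}

// timedOut is a hypothetical read accessor used by the demo.
func (cs *consensusState) timedOut() bool {
	cs.mutSlotState.RLock()
	defer cs.mutSlotState.RUnlock()
	return cs.WaitingAllSignaturesTimeOut
}

func main() {
	cs := &consensusState{}
	cs.BeginNewSlot(7)
	spawnSlot := int64(7) // captured by a goroutine started in slot 7
	cs.BeginNewSlot(8)    // slot advances before that goroutine fires

	fmt.Println("stale write applied:", cs.SetWaitingAllSignaturesTimeOutIfSlot(spawnSlot))
	fmt.Println("fresh write applied:", cs.SetWaitingAllSignaturesTimeOutIfSlot(8))
}
```

The design point is that the freshness check and the flag write share one critical section with the reset, so a goroutine spawned in slot N-1 either writes before the transition (and is cleared by it) or is rejected after it.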
Walkthrough
System-wide concurrency hardening: adds slot-scoped and consensus-group locks, snapshots and passes headers under locks, pins signature timeouts, converts coordinator/tests to atomics, synchronizes delayed broadcast mutations, clones headers for async metrics, and makes the network monitor loop cancellable.
Changes
Consensus Concurrency Hardening and Slot State Synchronization
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 6 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (6 passed)
Pull request overview
This PR addresses widespread race-detector findings (and related flakiness in TestConsensus_RevertBlockAndTransactions) by introducing explicit synchronization across consensus slot state, sharding nodes-coordinator state, and a libp2p connection-monitor goroutine, plus updating integration test scaffolding to avoid unsynchronized shared state.
Changes:
- Add slot-scoped locking/snapshot patterns in consensus (slot state mutex, atomic slot transition via `BeginNewSlot`, safer goroutine fan-out using header snapshots, and multiple reader/writer fixes).
- Make sharding state concurrency-safe (atomics for `currentEpoch`/`stateReady`, safer reads of shuffler parameters, safer config display snapshot).
- Deflake integration tests by removing shared mutable state races (atomic slot index in mocks/tests and improved waiting helpers), and bound a libp2p monitor loop to `ctx.Done()`.
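As a rough illustration of the atomics mentioned in the overview, the sketch below uses the typed atomics from `sync/atomic` (available since Go 1.19). The `slotManagerMock` and `coordinatorState` structs and the `advanceSlots` helper are hypothetical reductions; only the field names `SlotIndex`, `currentEpoch`, and `stateReady` and the method `Index()` are taken from the PR description.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// slotManagerMock is a hypothetical reduction of SlotManagerMock:
// SlotIndex is written by the test goroutine and read by background goroutines.
type slotManagerMock struct {
	SlotIndex atomic.Int64
}

// Index reads the slot index without a data race.
func (m *slotManagerMock) Index() int64 {
	return m.SlotIndex.Load()
}

// coordinatorState mirrors the two nodes-coordinator flags named in the PR.
type coordinatorState struct {
	currentEpoch atomic.Uint32
	stateReady   atomic.Bool
}

// advanceSlots runs one writer and one reader concurrently and returns the
// final slot index once both goroutines have finished.
func advanceSlots(m *slotManagerMock, n int) int64 {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // writer: plays the role of the test goroutine
		defer wg.Done()
		for i := 1; i <= n; i++ {
			m.SlotIndex.Store(int64(i))
		}
	}()
	go func() { // reader: plays the role of a sync goroutine
		defer wg.Done()
		for i := 0; i < n; i++ {
			_ = m.Index()
		}
	}()
	wg.Wait()
	return m.Index()
}

func main() {
	m := &slotManagerMock{}
	fmt.Println("final slot:", advanceSlots(m, 100))

	cs := &coordinatorState{}
	cs.currentEpoch.Store(1)
	cs.stateReady.Store(true)
	fmt.Println("epoch:", cs.currentEpoch.Load(), "ready:", cs.stateReady.Load())
}
```

Structs holding typed atomics should be passed by pointer, since copying them would duplicate the counter rather than share it.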
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| sharding/nodesCoordinator.go | Convert epoch/ready flags to atomics; avoid races on savedStateKey and config display reads. |
| sharding/nodesCoordinator_test.go | Update tests for atomic currentEpoch. |
| sharding/indexHashedNodesCoordinatorRegistry.go | Read currentEpoch via Load() for registry export. |
| sharding/hashValidatorShuffler.go | Snapshot shuffler params under RLock to avoid concurrent epoch-update races. |
| network/p2p/libp2p/netMessenger.go | Add ctx-bound termination path for connection monitor loop. |
| core/process/block/block.go | Clone header before async metrics collection to avoid shared pointer races. |
| core/consensus/broadcast/delayedBroadcast.go | Serialize slice access in SetHeaderForValidator to prevent append/iterate races. |
| core/consensus/slot/consensusState.go | Introduce mutSlotState, setter helpers, and BeginNewSlot for atomic slot transitions. |
| core/consensus/slot/slotConsensus.go | Split mutexes and snapshot consensus group reads to reduce contention and data races. |
| core/consensus/slot/worker.go | Snapshot slot fields under lock before comparisons/logging; use setter for ExtendedCalled; header snapshot for revert. |
| core/consensus/slot/consensusMessageValidator.go | Snapshot SlotIndex under lock before validating message slot. |
| core/consensus/slot/subslot.go | Use SetSlotCanceled instead of an unguarded field write on timeout path. |
| core/consensus/slot/bls/subslotStartSlot.go | Use BeginNewSlot; lock-protected reads for slot state; cancellation via setter. |
| core/consensus/slot/bls/subslotBlock.go | Lock-protected Data/Header writes; safer ExtendedCalled snapshot ordering; cancellation via setter. |
| core/consensus/slot/bls/subslotSignature.go | Lock-protected snapshots for data/header and flags; add spawn-slot guard for timeout goroutine. |
| core/consensus/slot/bls/subslotSignature_test.go | Avoid sharing a mock container across t.Parallel() subtests. |
| core/consensus/slot/bls/subslotEndSlot.go | Snapshot header pointer for leader fan-out; lock-protected reads; cancellation via setter; pass header explicitly to async calls. |
| core/consensus/slot/bls/subslotEndSlot_test.go | Update tests for CreateAndBroadcastHeaderFinalInfo(header) signature. |
| core/consensus/slot/bls/export_test.go | Export updated helper signature for tests. |
| integrationTest/mock/slotManagerMock.go | Make SlotIndex atomic to avoid test goroutine vs background goroutine races. |
| integrationTest/utils.go | Use SlotIndex.Store() when advancing slots in tests. |
| integrationTest/consensus/consensus_test.go | Make test slot counter atomic; improve wait helper to fail at source and reduce polling bias. |
| integrationTest/consensus/insertDup_test.go | Align with new atomic slot counter + updated wait helper signature/constants. |
| integrationTest/transaction/common.go | Read atomic slot index in logs. |
| integrationTest/transaction/dupTransaction/dupTransaction_test.go | Read atomic slot index in assertions. |
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
core/consensus/slot/consensusState.go (1)
16-43: 🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff
Public fields with mutex protection require caller discipline.
The slot-scoped fields (`Data`, `Header`, `SlotIndex`, etc.) remain public while `mutSlotState` guards them. This relies on callers using `LockSlotState()`/`RLockSlotState()` at handler boundaries. Consider adding a brief doc comment on each public field noting it requires `mutSlotState`, or provide getter/setter methods for all guarded fields to enforce the contract at compile time.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/consensus/slot/consensusState.go` around lines 16 - 43, ConsensusState exposes slot-scoped public fields (Data, Header, SlotIndex, SlotTimestamp, SlotCanceled, ExtendedCalled, WaitingAllSignaturesTimeOut) that are protected by mutSlotState; add a short doc comment on each of those fields stating they must be accessed under mutSlotState (or alternatively implement and use explicit getters/setters) so callers know to call LockSlotState/RLockSlotState; update ConsensusState struct comments to mention the locking discipline and, if you choose getters/setters, add methods like GetData/SetData, GetHeader/SetHeader, GetSlotIndex/SetSlotIndex that acquire the appropriate lock and operate on the underlying fields to enforce safety at compile time.
core/consensus/slot/slotConsensus.go (1)
77-108: 🧹 Nitpick | 🔵 Trivial | 💤 Low value
Acknowledged race window is acceptable but worth unit testing.
The comment correctly documents that a concurrent `ComputeSize` call can briefly see the new `consensusGroup` against the old `validatorSlotStates`, causing `JobDone` to return `ErrInvalidKey`. This self-heals on the next poll cycle. Consider adding a unit test that exercises this edge case to ensure the `ErrInvalidKey` path in `ComputeSize` logs a debug message and continues gracefully (lines 192-194).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/consensus/slot/slotConsensus.go` around lines 77 - 108, Add a unit test that reproduces the acknowledged race window between SetConsensusGroup and ComputeSize: spawn concurrent goroutines that repeatedly call SetConsensusGroup(...) and a client path invoking ComputeSize()/JobDone until you observe JobDone returning ErrInvalidKey; assert that ComputeSize handles this by logging the debug message and continuing (no panic/stop) and that subsequent polls recover and count sizes correctly; target the slotConsensus methods SetConsensusGroup, ComputeSize, JobDone, and the validatorSlotStates/consensusGroup interaction, using retries/waits to make the transient race observable and capturing the logger output to verify the debug log path is exercised.
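The split-mutex behaviour discussed in the second comment can be modelled with a small sketch. Everything here is an assumption for illustration: validator state is reduced to a `map[string]bool`, keys are plain strings, and where the review says the real `ComputeSize` logs a debug message this sketch simply continues. Only the names `SetConsensusGroup`, `ConsensusGroup`, `JobDone`, `SetJobDone`, `ComputeSize`, `ErrInvalidKey`, `mutConsensusGroup`, `mut`, and `validatorSlotStates` come from the PR text.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrInvalidKey is returned when a key has no entry in validatorSlotStates.
var ErrInvalidKey = errors.New("invalid key")

// slotConsensus is a simplified model of the split-mutex layout.
type slotConsensus struct {
	mutConsensusGroup   sync.RWMutex
	consensusGroup      []string
	mut                 sync.RWMutex
	validatorSlotStates map[string]bool // simplified: key -> job done
}

// SetConsensusGroup performs two separate writes under two separate locks,
// so a reader may briefly pair the new group with the old state map.
func (rcns *slotConsensus) SetConsensusGroup(group []string) {
	rcns.mutConsensusGroup.Lock()
	rcns.consensusGroup = append([]string(nil), group...)
	rcns.mutConsensusGroup.Unlock()

	states := make(map[string]bool, len(group))
	for _, key := range group {
		states[key] = false
	}
	rcns.mut.Lock()
	rcns.validatorSlotStates = states
	rcns.mut.Unlock()
}

// ConsensusGroup captures the slice header under the read lock.
func (rcns *slotConsensus) ConsensusGroup() []string {
	rcns.mutConsensusGroup.RLock()
	defer rcns.mutConsensusGroup.RUnlock()
	return rcns.consensusGroup
}

// SetJobDone records the job status for a known key.
func (rcns *slotConsensus) SetJobDone(key string, done bool) error {
	rcns.mut.Lock()
	defer rcns.mut.Unlock()
	if _, ok := rcns.validatorSlotStates[key]; !ok {
		return ErrInvalidKey
	}
	rcns.validatorSlotStates[key] = done
	return nil
}

// JobDone reports the job status for a key, or ErrInvalidKey if unknown.
func (rcns *slotConsensus) JobDone(key string) (bool, error) {
	rcns.mut.RLock()
	defer rcns.mut.RUnlock()
	done, ok := rcns.validatorSlotStates[key]
	if !ok {
		return false, ErrInvalidKey
	}
	return done, nil
}

// ComputeSize counts finished jobs and skips keys that are missing from the
// state map, which is the transient mismatch the review comment describes.
func (rcns *slotConsensus) ComputeSize() int {
	n := 0
	for _, key := range rcns.ConsensusGroup() {
		done, err := rcns.JobDone(key)
		if err != nil {
			continue // undercount for this poll; the next poll self-heals
		}
		if done {
			n++
		}
	}
	return n
}

func main() {
	rcns := &slotConsensus{}
	rcns.SetConsensusGroup([]string{"a", "b"})
	_ = rcns.SetJobDone("a", true)
	fmt.Println("size:", rcns.ComputeSize())
}
```

A unit test along the lines the reviewer suggests could force the mismatch deterministically by installing a new group without its state map and asserting that `ComputeSize` returns a lower count instead of failing.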
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@core/consensus/slot/bls/subslotSignature_test.go`:
- Line 417: The test calls sr.IsJobDone(pubKey, bls.SrSignature) but only
asserts the boolean, ignoring the returned error; update the test to capture
both (got, err := sr.IsJobDone(...)) and fail the test if err != nil (e.g.,
require.NoError/if err != nil { t.Fatalf(...) }) before asserting equality
against tc.shouldJobBeDone so validator key lookup errors are surfaced instead
of silently treated as "not done".
In `@sharding/nodesCoordinator_test.go`:
- Around line 593-594: The test ends after setting ihgs.currentEpoch.Store(1)
without calling the function under test; update the test to call
allValidatorsInfo() on the coordinator instance and assert it returns
ErrEpochNodesConfigDoesNotExist (and no validators) — locate the
coordinator/ihgs test setup, invoke coordinator.allValidatorsInfo() (or the
method name used in the diff), capture (resp, err) and assert err ==
ErrEpochNodesConfigDoesNotExist (and resp is empty/nil), or if this test is
obsolete remove the test entirely.
In `@sharding/nodesCoordinator.go`:
- Around line 787-800: The code reads fields from ihgs.nodesConfig[newEpoch]
while only holding ihgs.mutNodesConfig.RLock(); to be consistent with other
readers (e.g., GetAllElectedValidatorsKeys) and to prevent future race windows,
also acquire the inner nodesConfig lock before accessing its maps: call
ihgs.nodesConfig.mutNodesMaps.RLock() after ihgs.mutNodesConfig.RLock(), read
displayCfg.electedList/eligibleList/waitingList/leavingList, then unlock
nodesConfig.mutNodesMaps with RUnlock() and finally
ihgs.mutNodesConfig.RUnlock(); keep the call to displayNodesConfiguration(...)
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: d1415408-77f5-4f35-845c-fbb931873267
📒 Files selected for processing (25)
- core/consensus/broadcast/delayedBroadcast.go
- core/consensus/slot/bls/export_test.go
- core/consensus/slot/bls/subslotBlock.go
- core/consensus/slot/bls/subslotEndSlot.go
- core/consensus/slot/bls/subslotEndSlot_test.go
- core/consensus/slot/bls/subslotSignature.go
- core/consensus/slot/bls/subslotSignature_test.go
- core/consensus/slot/bls/subslotStartSlot.go
- core/consensus/slot/consensusMessageValidator.go
- core/consensus/slot/consensusState.go
- core/consensus/slot/slotConsensus.go
- core/consensus/slot/subslot.go
- core/consensus/slot/worker.go
- core/process/block/block.go
- integrationTest/consensus/consensus_test.go
- integrationTest/consensus/insertDup_test.go
- integrationTest/mock/slotManagerMock.go
- integrationTest/transaction/common.go
- integrationTest/transaction/dupTransaction/dupTransaction_test.go
- integrationTest/utils.go
- network/p2p/libp2p/netMessenger.go
- sharding/hashValidatorShuffler.go
- sharding/indexHashedNodesCoordinatorRegistry.go
- sharding/nodesCoordinator.go
- sharding/nodesCoordinator_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Agent
- GitHub Check: setup-and-lint / setup-and-lint
- GitHub Check: Analyze (go)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.go
📄 CodeRabbit inference engine (Custom checks)
**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.
Files:
- core/process/block/block.go
- integrationTest/utils.go
- core/consensus/slot/subslot.go
- integrationTest/transaction/dupTransaction/dupTransaction_test.go
- sharding/indexHashedNodesCoordinatorRegistry.go
- core/consensus/slot/bls/export_test.go
- sharding/hashValidatorShuffler.go
- core/consensus/slot/consensusMessageValidator.go
- sharding/nodesCoordinator_test.go
- network/p2p/libp2p/netMessenger.go
- integrationTest/consensus/insertDup_test.go
- integrationTest/mock/slotManagerMock.go
- core/consensus/slot/bls/subslotEndSlot_test.go
- core/consensus/slot/worker.go
- sharding/nodesCoordinator.go
- core/consensus/broadcast/delayedBroadcast.go
- integrationTest/consensus/consensus_test.go
- integrationTest/transaction/common.go
- core/consensus/slot/bls/subslotBlock.go
- core/consensus/slot/bls/subslotSignature.go
- core/consensus/slot/bls/subslotSignature_test.go
- core/consensus/slot/slotConsensus.go
- core/consensus/slot/bls/subslotStartSlot.go
- core/consensus/slot/bls/subslotEndSlot.go
- core/consensus/slot/consensusState.go
core/consensus/**
⚙️ CodeRabbit configuration file
core/consensus/**: This is the consensus engine. Review with extreme care: - Check for race conditions in concurrent block processing - Verify correct mutex/lock ordering to prevent deadlocks - Ensure deterministic behavior (no maps iteration without sorting, no random) - Validate message signing and verification logic - Flag any changes that could cause consensus forks or chain splits
Files:
- core/consensus/slot/subslot.go
- core/consensus/slot/bls/export_test.go
- core/consensus/slot/consensusMessageValidator.go
- core/consensus/slot/bls/subslotEndSlot_test.go
- core/consensus/slot/worker.go
- core/consensus/broadcast/delayedBroadcast.go
- core/consensus/slot/bls/subslotBlock.go
- core/consensus/slot/bls/subslotSignature.go
- core/consensus/slot/bls/subslotSignature_test.go
- core/consensus/slot/slotConsensus.go
- core/consensus/slot/bls/subslotStartSlot.go
- core/consensus/slot/bls/subslotEndSlot.go
- core/consensus/slot/consensusState.go
**/*_test.go
⚙️ CodeRabbit configuration file
**/*_test.go: Test files. Review for: - Adequate coverage of edge cases and error paths - Proper use of test helpers and assertions - Race condition coverage (tests should use -race flag patterns) - No hardcoded sleep for synchronization (use channels or sync primitives) - Test isolation (no shared mutable state between tests)
Files:
- integrationTest/transaction/dupTransaction/dupTransaction_test.go
- core/consensus/slot/bls/export_test.go
- sharding/nodesCoordinator_test.go
- integrationTest/consensus/insertDup_test.go
- core/consensus/slot/bls/subslotEndSlot_test.go
- integrationTest/consensus/consensus_test.go
- core/consensus/slot/bls/subslotSignature_test.go
network/**
⚙️ CodeRabbit configuration file
network/**: Peer-to-peer networking layer. - Check for proper input validation on all received messages - Verify rate limiting and DoS protection mechanisms - Ensure connection handling is goroutine-safe - Look for potential message amplification attacks - Verify TLS/authentication on peer connections
Files:
network/p2p/libp2p/netMessenger.go
🧠 Learnings (1)
📚 Learning: 2026-04-21T20:12:22.959Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 38
File: indexer/eventsProcessor.go:188-211
Timestamp: 2026-04-21T20:12:22.959Z
Learning: In Go structs that are JSON-marshaled, if a field is a `bool` and has the `json:"...,omitempty"` tag, then leaving that field at its zero value (`false`) is functionally equivalent (in the resulting JSON) to explicitly setting `Foundation: false`. Reviewers should not flag struct literals that omit such `bool` fields as an inconsistency; they will serialize identically because `omitempty` suppresses `false` values.
Applied to files:
- core/process/block/block.go
- integrationTest/utils.go
- core/consensus/slot/subslot.go
- integrationTest/transaction/dupTransaction/dupTransaction_test.go
- sharding/indexHashedNodesCoordinatorRegistry.go
- core/consensus/slot/bls/export_test.go
- sharding/hashValidatorShuffler.go
- core/consensus/slot/consensusMessageValidator.go
- sharding/nodesCoordinator_test.go
- network/p2p/libp2p/netMessenger.go
- integrationTest/consensus/insertDup_test.go
- integrationTest/mock/slotManagerMock.go
- core/consensus/slot/bls/subslotEndSlot_test.go
- core/consensus/slot/worker.go
- sharding/nodesCoordinator.go
- core/consensus/broadcast/delayedBroadcast.go
- integrationTest/consensus/consensus_test.go
- integrationTest/transaction/common.go
- core/consensus/slot/bls/subslotBlock.go
- core/consensus/slot/bls/subslotSignature.go
- core/consensus/slot/bls/subslotSignature_test.go
- core/consensus/slot/slotConsensus.go
- core/consensus/slot/bls/subslotStartSlot.go
- core/consensus/slot/bls/subslotEndSlot.go
- core/consensus/slot/consensusState.go
🔇 Additional comments (42)
network/p2p/libp2p/netMessenger.go (1)
437-450: LGTM!
core/process/block/block.go (1)
159-173: LGTM!
core/consensus/broadcast/delayedBroadcast.go (1)
218-233: LGTM!
core/consensus/slot/slotConsensus.go (2)
7-27: LGTM!
48-74: LGTM!
Also applies to: 186-225
core/consensus/slot/consensusState.go (3)
119-137: LGTM!
177-189: LGTM!
222-238: LGTM!
Also applies to: 270-277, 362-368
core/consensus/slot/subslot.go (1)
161-168: LGTM!
core/consensus/slot/consensusMessageValidator.go (1)
109-139: LGTM!
core/consensus/slot/bls/subslotSignature_test.go (1)
378-419: LGTM!
core/consensus/slot/worker.go (4)
390-402: LGTM!
554-568: LGTM!
596-617: LGTM!
650-692: LGTM!
core/consensus/slot/bls/subslotBlock.go (1)
143-145: LGTM!
Also applies to: 194-197, 269-283, 299-327, 369-369, 399-404
core/consensus/slot/bls/subslotStartSlot.go (1)
79-79: LGTM!
Also applies to: 88-93, 116-116, 136-136, 169-169, 175-177, 184-184, 211-216
core/consensus/slot/bls/subslotSignature.go (1)
78-81: LGTM!
Also applies to: 93-107, 128-131, 215-219, 235-235, 246-246, 308-350
core/consensus/slot/bls/subslotEndSlot.go (2)
130-138: LGTM!
Also applies to: 200-217, 246-258, 267-301, 309-336, 342-352, 371-384, 428-458, 484-490, 500-506, 555-568, 572-592, 594-601
614-627: LGTM!
core/consensus/slot/bls/export_test.go (1)
253-255: LGTM!
core/consensus/slot/bls/subslotEndSlot_test.go (1)
763-766: LGTM!
sharding/hashValidatorShuffler.go (1)
75-91: LGTM!
sharding/nodesCoordinator.go (5)
75-78: LGTM!
118-120: LGTM!
211-216: LGTM!
275-276: LGTM!
Also applies to: 288-288
482-484: LGTM!
sharding/indexHashedNodesCoordinatorRegistry.go (1)
31-31: LGTM!
integrationTest/mock/slotManagerMock.go (3)
11-16: LGTM!
46-52: LGTM!
73-80: LGTM!
integrationTest/utils.go (1)
209-213: LGTM!
integrationTest/transaction/dupTransaction/dupTransaction_test.go (1)
152-163: LGTM!
integrationTest/transaction/common.go (2)
221-256: LGTM!
258-270: LGTM!
integrationTest/consensus/consensus_test.go (4)
19-30: LGTM!
59-89: LGTM!
153-176: LGTM!
184-232: LGTM!
integrationTest/consensus/insertDup_test.go (2)
27-49: LGTM!
51-78: LGTM!
- nodesCoordinator: guard EpochStartPrepare display block against a nil nodesConfig[newEpoch] when SetNodes returns an error, and acquire mutNodesMaps.RLock() for consistency with other readers.
- slotConsensus: rewrite the SetConsensusGroup doc comment so it matches the body: the two writes are intentionally non-atomic to keep the JobDone hot path off the ConsensusGroup lock; the brief inconsistency self-heals on the next poll.
- consensusState: add SetWaitingAllSignaturesTimeOutIfSlot to perform the slot-equality check and the flag write under a single mutSlotState write lock.
- subslotSignature: use the new helper in waitAllSignatures so a stale goroutine cannot set WaitingAllSignaturesTimeOut after BeginNewSlot has cleared state.
- libp2p netMessenger: replace per-iteration time.After with a reusable time.Ticker so the connection-monitor sweep loop no longer allocates a fresh timer on every iteration.
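The last item (a reusable `time.Ticker` combined with a `ctx.Done()` exit) follows a common Go loop shape, sketched below. The `monitorLoop` and `runForMillis` functions and the sweep callback are hypothetical; the actual connection-monitor code in netMessenger.go is not shown in this conversation.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// monitorLoop is a hypothetical sweep loop. One ticker is reused for the
// lifetime of the loop, and cancellation of ctx ends the goroutine.
func monitorLoop(ctx context.Context, interval time.Duration, sweep func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop() // release the ticker when the loop exits

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			sweep()
		}
	}
}

// runForMillis starts the loop, cancels it after totalMs milliseconds, and
// reports the number of sweeps and whether the goroutine actually exited.
func runForMillis(totalMs, intervalMs int) (int64, bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var sweeps atomic.Int64
	done := make(chan struct{})
	go func() {
		defer close(done)
		monitorLoop(ctx, time.Duration(intervalMs)*time.Millisecond, func() { sweeps.Add(1) })
	}()

	time.Sleep(time.Duration(totalMs) * time.Millisecond)
	cancel()

	select {
	case <-done:
		return sweeps.Load(), true
	case <-time.After(time.Second):
		return sweeps.Load(), false // loop did not stop: goroutine leak
	}
}

func main() {
	sweeps, stopped := runForMillis(50, 5)
	fmt.Println("sweeps:", sweeps, "stopped:", stopped)
}
```

Compared with calling `time.After` inside the loop, the ticker avoids creating a new timer on every iteration, and `defer ticker.Stop()` releases it once the context is cancelled.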
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@core/consensus/slot/slotConsensus.go`:
- Around line 69-78: ConsensusGroup currently returns the internal slice header
(rcns.consensusGroup) under mutConsensusGroup.RLock, which lets callers mutate
the underlying array and corrupt internal state; update the ConsensusGroup
method to return a defensive copy (e.g., using append to copy into a nil slice)
so callers receive a snapshot that cannot modify rcns.consensusGroup, preserving
the lock/snapshot semantics around the slotConsensus.consensusGroup field
guarded by mutConsensusGroup.
In `@sharding/nodesCoordinator.go`:
- Around line 213-216: The slice header savedStateKey is copied while holding
ihgs.mutSavedStateKey.RLock() but released before calling ihgs.saveState, which
allows races on the backing bytes; while still holding the lock, make a deep
copy of the key (e.g. use bytes.Clone or equivalent) into a new variable and
pass that cloned slice to ihgs.saveState so saveState sees an immutable copy;
update the code around ihgs.mutSavedStateKey.RLock()/RUnlock(), savedStateKey
and the ihgs.saveState(savedStateKey) call to clone while locked and then
release the lock before calling saveState.
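Both inline comments above ask for a copy taken while the lock is held. A minimal sketch, assuming string keys for the consensus group and hypothetical `group`/`coordinator` structs and a hypothetical `savedStateKeyCopy` helper (`bytes.Clone` requires Go 1.20 or later):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// group is a hypothetical reduction of slotConsensus.
type group struct {
	mutConsensusGroup sync.RWMutex
	consensusGroup    []string
}

// ConsensusGroup returns a defensive copy, so callers cannot modify the
// backing array of the internal slice.
func (g *group) ConsensusGroup() []string {
	g.mutConsensusGroup.RLock()
	defer g.mutConsensusGroup.RUnlock()
	return append([]string(nil), g.consensusGroup...)
}

// coordinator is a hypothetical reduction of the nodes coordinator.
type coordinator struct {
	mutSavedStateKey sync.RWMutex
	savedStateKey    []byte
}

// savedStateKeyCopy clones the key bytes while the read lock is held; the
// caller can then use the copy after the lock has been released.
func (c *coordinator) savedStateKeyCopy() []byte {
	c.mutSavedStateKey.RLock()
	defer c.mutSavedStateKey.RUnlock()
	return bytes.Clone(c.savedStateKey)
}

func main() {
	g := &group{consensusGroup: []string{"a", "b"}}
	snapshot := g.ConsensusGroup()
	snapshot[0] = "mutated" // only the caller's copy changes
	fmt.Println("internal first element:", g.ConsensusGroup()[0])

	c := &coordinator{savedStateKey: []byte("key")}
	key := c.savedStateKeyCopy()
	key[0] = 'X' // only the clone changes
	fmt.Println("internal key:", string(c.savedStateKeyCopy()))
}
```

The copy costs an allocation per call, which is the trade-off against returning the slice header directly; whether that matters on a hot path is a judgment for the PR author.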
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 76345190-e85c-4298-aec2-739118c99bef
📒 Files selected for processing (5)
- core/consensus/slot/bls/subslotSignature.go
- core/consensus/slot/consensusState.go
- core/consensus/slot/slotConsensus.go
- network/p2p/libp2p/netMessenger.go
- sharding/nodesCoordinator.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test
- GitHub Check: Analyze (go)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go
📄 CodeRabbit inference engine (Custom checks)
**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.
Files:
- network/p2p/libp2p/netMessenger.go
- core/consensus/slot/slotConsensus.go
- core/consensus/slot/consensusState.go
- sharding/nodesCoordinator.go
- core/consensus/slot/bls/subslotSignature.go
network/**
⚙️ CodeRabbit configuration file
network/**: Peer-to-peer networking layer. - Check for proper input validation on all received messages - Verify rate limiting and DoS protection mechanisms - Ensure connection handling is goroutine-safe - Look for potential message amplification attacks - Verify TLS/authentication on peer connections
Files:
network/p2p/libp2p/netMessenger.go
core/consensus/**
⚙️ CodeRabbit configuration file
core/consensus/**: This is the consensus engine. Review with extreme care: - Check for race conditions in concurrent block processing - Verify correct mutex/lock ordering to prevent deadlocks - Ensure deterministic behavior (no maps iteration without sorting, no random) - Validate message signing and verification logic - Flag any changes that could cause consensus forks or chain splits
Files:
- core/consensus/slot/slotConsensus.go
- core/consensus/slot/consensusState.go
- core/consensus/slot/bls/subslotSignature.go
🧠 Learnings (1)
📚 Learning: 2026-04-21T20:12:22.959Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 38
File: indexer/eventsProcessor.go:188-211
Timestamp: 2026-04-21T20:12:22.959Z
Learning: In Go structs that are JSON-marshaled, if a field is a `bool` and has the `json:"...,omitempty"` tag, then leaving that field at its zero value (`false`) is functionally equivalent (in the resulting JSON) to explicitly setting `Foundation: false`. Reviewers should not flag struct literals that omit such `bool` fields as an inconsistency; they will serialize identically because `omitempty` suppresses `false` values.
Applied to files:
- network/p2p/libp2p/netMessenger.go
- core/consensus/slot/slotConsensus.go
- core/consensus/slot/consensusState.go
- sharding/nodesCoordinator.go
- core/consensus/slot/bls/subslotSignature.go
🔇 Additional comments (6)
network/p2p/libp2p/netMessenger.go (1)
438-450: LGTM!
core/consensus/slot/slotConsensus.go (2)
7-31: LGTM!
Also applies to: 54-62, 90-112, 193-207, 211-229
114-132: ⚡ Quick win
No data race in `consensusGroupSize` and `selfPubKey` accessors; review incorrect.
Both fields are initialized once in `NewSlotConsensus` and never modified during consensus processing; the setters exist only in test code. After initialization, these are read-only in production, making concurrent reads safe regardless of synchronization. The locking documentation correctly omits them because they don't require mutex protection.
If clarity is desired, consider removing the unused `SetConsensusGroupSize` and `SetSelfPubKey` methods entirely and document these as set-once fields.
core/consensus/slot/consensusState.go (1)
16-56: LGTM!
Also applies to: 57-90, 110-131, 133-151, 192-203, 235-252, 284-291, 376-382
core/consensus/slot/bls/subslotSignature.go (1)
69-135: LGTM!
Also applies to: 214-275, 308-344
sharding/nodesCoordinator.go (1)
75-78: LGTM!
Also applies to: 118-120, 275-288, 482-482, 787-800
…ter tests
- Reduces ProcessBlock cognitive complexity from 24 to 14 (Sonar limit 15) by extracting four helpers: validateBlockAndRequestMissing, dispatchAsyncHeaderMetrics, handleEpochStartBlock, verifyBlockTrieRoots. Behavior unchanged; existing tests still pass.
- Adds unit tests for the new slot-state primitives on ConsensusState (LockSlotState/RLockSlotState, GetData, SetSlotCanceled, SetExtendedCalled, SetWaitingAllSignaturesTimeOut, SetWaitingAllSignaturesTimeOutIfSlot for both slot-match and slot-mismatch, BeginNewSlot). Coverage on new code rises from 76.9% to 84.1% across the PR.
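The slot-match and slot-mismatch cases listed above can be illustrated with a minimal sketch. The `slotState` type, its fields, and the method names `SetWaitingTimeOutIfSlot` and `WaitingTimeOut` are hypothetical stand-ins chosen for this example; they are not the PR's `ConsensusState` code, which holds more fields.

```go
package main

import (
	"fmt"
	"sync"
)

// slotState is a hypothetical, reduced stand-in for the slot-scoped
// part of ConsensusState.
type slotState struct {
	mut            sync.RWMutex
	slotIndex      int64
	waitingTimeOut bool
}

// BeginNewSlot resets the flag and installs the new index under a single
// lock, so no goroutine can observe the reset without the new index.
func (s *slotState) BeginNewSlot(index int64) {
	s.mut.Lock()
	defer s.mut.Unlock()
	s.waitingTimeOut = false
	s.slotIndex = index
}

// SetWaitingTimeOutIfSlot sets the flag only when the caller's spawn slot
// still matches the current slot. It reports whether the write happened.
func (s *slotState) SetWaitingTimeOutIfSlot(spawnSlot int64) bool {
	s.mut.Lock()
	defer s.mut.Unlock()
	if s.slotIndex != spawnSlot {
		return false // stale goroutine from an earlier slot: ignore
	}
	s.waitingTimeOut = true
	return true
}

// WaitingTimeOut reads the flag under the read lock.
func (s *slotState) WaitingTimeOut() bool {
	s.mut.RLock()
	defer s.mut.RUnlock()
	return s.waitingTimeOut
}

func main() {
	s := &slotState{}
	s.BeginNewSlot(7)
	spawnSlot := int64(7) // captured when the waiter goroutine is spawned
	s.BeginNewSlot(8)     // the slot advances before the waiter fires

	fmt.Println(s.SetWaitingTimeOutIfSlot(spawnSlot), s.WaitingTimeOut())
	fmt.Println(s.SetWaitingTimeOutIfSlot(8), s.WaitingTimeOut())
}
```

Under these assumptions, the slot comparison and the flag write happen inside the same critical section, which is what keeps a late waiter from writing onto a freshly reset slot.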
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@core/consensus/slot/consensusState_test.go`:
- Around line 555-557: The test currently never sets Header before calling
BeginNewSlot so the reset path isn't exercised; modify the setup to assign a
non-nil header (e.g., set cns.Header = &Header{...} or call the code path that
populates Header) before calling cns.LockSlotState()/cns.Data =
[]byte("stale")/cns.UnlockSlotState() and then call cns.BeginNewSlot(...) to
verify Header is cleared; apply the same change for the second occurrence around
lines noted (the block at 566-572) so both assertions that expect header to be
nil actually validate a reset from a previously non-nil Header.
In `@core/process/block/block.go`:
- Line 296: Fix the typo in the Bugsnag error message inside the bugsnag.Notify
call in block.go: change the string "process epoch valdiator state: %w" to
"process epoch validator state: %w" (the call that composes the error with
fmt.Errorf and passes err and header metadata); update the message text only so
the variable names (err, header) and the bugsnag.Notify invocation remain
unchanged.
- Around line 275-280: The code calls mp.verifyFees(header) and logs the error
via bugsnag but always returns nil; update the epoch-start handling to mirror
the non-epoch-start path by returning the error instead of swallowing it: after
calling verifyFees(header) check err and both notify bugsnag and return
fmt.Errorf("process verify fees: %w", err) (or simply return err) from the
function; ensure this change is applied around the epoch-start processing block
that invokes verifyFees and keep the bugsnag.Notify call as-is so failures are
both reported and propagated.
- Line 233: The code currently compares the returned error from
checkBlockValidity using a direct equality (if err ==
process.ErrBlockHashDoesNotMatch); change this to use errors.Is(err,
process.ErrBlockHashDoesNotMatch) so wrapped errors are detected, and add the
"errors" import if it's not already present; update the conditional in the
block.go handling around checkBlockValidity to use errors.Is with the sentinel
ErrBlockHashDoesNotMatch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0a2e5858-6b54-41ea-9183-9850006c6ac2
📒 Files selected for processing (2)
- core/consensus/slot/consensusState_test.go
- core/process/block/block.go
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go
📄 CodeRabbit inference engine (Custom checks)
**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.
Files:
- core/consensus/slot/consensusState_test.go
- core/process/block/block.go
core/consensus/**
⚙️ CodeRabbit configuration file
core/consensus/**: This is the consensus engine. Review with extreme care: - Check for race conditions in concurrent block processing - Verify correct mutex/lock ordering to prevent deadlocks - Ensure deterministic behavior (no maps iteration without sorting, no random) - Validate message signing and verification logic - Flag any changes that could cause consensus forks or chain splits
Files:
core/consensus/slot/consensusState_test.go
**/*_test.go
⚙️ CodeRabbit configuration file
**/*_test.go: Test files. Review for: - Adequate coverage of edge cases and error paths - Proper use of test helpers and assertions - Race condition coverage (tests should use -race flag patterns) - No hardcoded sleep for synchronization (use channels or sync primitives) - Test isolation (no shared mutable state between tests)
Files:
core/consensus/slot/consensusState_test.go
🔇 Additional comments (3)
core/consensus/slot/consensusState_test.go (1)
6-546: LGTM!
Also applies to: 559-565, 573-576
core/process/block/block.go (2)
243-263: LGTM!
126-129: LGTM!
Also applies to: 148-148, 190-191, 216-222
- block.go: handleEpochStartBlock now returns the verifyFees error instead of nil (regression introduced by a linter pass after the refactor); matches the docstring and the non-epoch-start path.
- block.go: use errors.Is for the ErrBlockHashDoesNotMatch sentinel comparison; fix pre-existing "valdiator" typo in bugsnag message.
- nodesCoordinator: bytes.Clone savedStateKey while holding the RLock so saveState reads a stable buffer that no future writer can mutate.
- consensusState_test: seed Header to a non-nil value before BeginNewSlot so the post-reset nil assertion is meaningful.
- slotConsensus_test: remove stray ConsensusGroup()[1] = "X" mutation from TestSlotConsensus_ResetValidationMap; it never affected any assertion and modeled an antipattern against the documented read-only contract.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
sharding/nodesCoordinator.go (1)
877-883: 🧹 Nitpick | 🔵 Trivial | 💤 Low value
Consider `bytes.Clone` for consistency with line 214.
`GetSavedStateKey` returns a slice header without cloning. While this is safe given that `savedStateKey` is only reassigned (never mutated in place), using `bytes.Clone` here would provide the same defensive guarantee applied at line 214 in `EpochStartAction`.
♻️ Suggested consistency improvement
```diff
 func (ihgs *indexHashedNodesCoordinator) GetSavedStateKey() []byte {
 	ihgs.mutSavedStateKey.RLock()
-	key := ihgs.savedStateKey
+	key := bytes.Clone(ihgs.savedStateKey)
 	ihgs.mutSavedStateKey.RUnlock()
 	return key
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@sharding/nodesCoordinator.go` around lines 877 - 883, GetSavedStateKey currently returns ihgs.savedStateKey directly under lock, exposing the same slice header; change it to return a defensive copy using bytes.Clone of ihgs.savedStateKey while still holding the read lock. In other words, inside GetSavedStateKey (for type indexHashedNodesCoordinator) acquire the RLock, set keyCopy := bytes.Clone(ihgs.savedStateKey), release the RLock, and return keyCopy, matching the defensive pattern used in EpochStartAction at the earlier referenced location.
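The defensive-copy suggestion can be tried in isolation with a minimal sketch. The `coordinator` type and its `SetKey` / `Key` methods are hypothetical stand-ins for `indexHashedNodesCoordinator` and `GetSavedStateKey`; `bytes.Clone` is the standard-library function available since Go 1.20.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// coordinator is a hypothetical stand-in for indexHashedNodesCoordinator,
// reduced to the one guarded field discussed in the review.
type coordinator struct {
	mut sync.RWMutex
	key []byte
}

// SetKey reassigns the stored key under the write lock.
func (c *coordinator) SetKey(k []byte) {
	c.mut.Lock()
	c.key = k
	c.mut.Unlock()
}

// Key copies the bytes while the read lock is held, so the caller never
// shares a backing array with the stored key.
func (c *coordinator) Key() []byte {
	c.mut.RLock()
	defer c.mut.RUnlock()
	return bytes.Clone(c.key)
}

func main() {
	c := &coordinator{}
	c.SetKey([]byte("abc"))

	k := c.Key()
	k[0] = 'X' // mutates only the caller's copy

	fmt.Printf("copy=%s stored=%s\n", k, c.Key())
}
```

As the review notes, the copy is not strictly required while the field is only ever reassigned; the sketch just shows what the extra guarantee buys if a caller later mutates the returned slice.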
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b2be6c23-7c20-467e-ae61-19e662dadda7
📒 Files selected for processing (4)
- core/consensus/slot/consensusState_test.go
- core/consensus/slot/slotConsensus_test.go
- core/process/block/block.go
- sharding/nodesCoordinator.go
💤 Files with no reviewable changes (1)
- core/consensus/slot/slotConsensus_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test
- GitHub Check: Analyze (go)
🔇 Additional comments (17)
core/consensus/slot/consensusState_test.go (2)
6-6: LGTM!
465-577: LGTM!
core/process/block/block.go (9)
7-7: LGTM!
127-130: LGTM!
149-149: LGTM!
191-191: LGTM!
217-223: LGTM!
225-242: LGTM!
244-264: LGTM!
266-282: LGTM!
284-308: LGTM!
sharding/nodesCoordinator.go (6)
75-78: LGTM!
118-120: LGTM!
211-216: LGTM!
275-288: LGTM!
482-484: LGTM!
787-800: LGTM!
Summary
Investigation into the flaky `TestConsensus_RevertBlockAndTransactions` integration test uncovered ~400 race-detector warnings across the consensus protocol, sharding nodes-coordinator, and libp2p connection monitor. The test deflake only stabilises once the underlying state-sharing hazards are closed, so this PR widens the original ticket scope to cover all required race fixes.
Two commits:
- `[KLC-2382] deflake TestConsensus_RevertBlockAndTransactions`: three independent root causes in the test scaffolding (delayedBroadcast unsynchronised append, mock slot index race, polling fairness + `*testing.T` threading).
- `[KLC-2382] fix data races across consensus, sharding, and libp2p`: the broader protocol/library fixes.
What changed
Consensus slot state
- Add `mutSlotState` on `ConsensusState` to guard slot-scoped fields (`Data`, `Header`, `SlotIndex`, `SlotTimestamp`, `SlotCanceled`, `ExtendedCalled`, `WaitingAllSignaturesTimeOut`). Every reader/writer acquires the lock at handler boundary.
- Introduce `BeginNewSlot` for atomic slot transition (flag reset + index/timestamp install under a single lock). Closes the `waitAllSignatures` spawnSlot guard hazard where a stale goroutine could write `WaitingAllSignaturesTimeOut=true` onto freshly-cleared state, causing the next leader to commit with minimum quorum prematurely.
- Extract `resetSlotStateLocked` so `ResetConsensusState` (construction) and `BeginNewSlot` (slot transition) share the field list without duplication.
- `subslotBlock.processReceivedBlock`: snapshot `ExtendedCalled` after `SetProcessingBlock(true)` to preserve the original `Extend` happens-before invariant. Snapshotting first reopened a window where `Extend` could flip the flag, observe `ProcessingBlock=false`, and revert state mid-processing.
- `subslotEndSlot.doEndSlotJobByLeader`: header snapshot pattern. `signBlockHeader`, `createAndBroadcastHeaderFinalInfo`, and `broadcastBlockDataLeader` take the header as a parameter so goroutine fan-out does not re-read `sr.Header` without the lock.
- `subslotSignature.waitAllSignatures`: spawnSlot guard so a goroutine from slot N-1 does not corrupt slot N flags.
- `subslotStartSlot`: cancellation uses `SetSlotCanceled`; slot transition uses `BeginNewSlot`.
- `worker.Extend`: uses `SetExtendedCalled`; snapshot patterns in `checkSelfState` / `executeMessage`.
- `consensusMessageValidator`: snapshot `SlotIndex` under lock before comparing against the message slot.
- `subslot`: timer path uses `SetSlotCanceled` instead of an unguarded write.
slotConsensus split-mutex
- Split `rcns.mut` into `mutConsensusGroup` (consensus group slice) and `mut` (validator slot states map). The hot `JobDone` signature-collection path no longer contends with `ConsensusGroup()` readers.
- `SetConsensusGroup` intentionally does not hold both locks simultaneously; `ComputeSize` may briefly undercount for one poll cycle during a swap, which is benign because the next poll observes consistent state. A comment in the source spells this out so a future reviewer doesn't "fix" it by widening the lock and reintroducing the contention.
- `ConsensusGroup()` captures the slice header under RLock so callers see a stable view; elements are read-only.
BlockProcessor
- `metaProcessor`: clone the header before launching the async `getMetricsFromHeader` goroutine to avoid sharing a live pointer with the commit path on the calling goroutine.
Sharding
- `nodesCoordinator`: convert `currentEpoch` and `stateReady` to `atomic.Uint32` / `atomic.Bool`; update registry load site.
- `hashValidatorShuffler`: snapshot `NodesToShuffle` under RLock instead of ranging the live slice.
libp2p
- `netMessenger.connMonitor`: check `ctx.Done()` at the top of each loop iteration so the goroutine doesn't leak after `Close()`.
Integration test scaffolding (in the deflake commit)
- `delayedBroadcast.SetHeaderForValidator`: acquire `mutDataForBroadcast` before appending. The unsynchronised append was racing with `interceptedHeader`'s iteration under lock and could silently steal the wrong header alarm, dropping the cluster below quorum.
- `SlotManagerMock.SlotIndex`: convert to `atomic.Int64`; update five external call sites.
- `waitForBlockConditionOrTimeOut`: thread `*testing.T` and call `t.Fatalf` at the actual point of divergence; per-step budget pinned to the mainnet 4-second slot duration (the test doubles as a regression guard for production cadence).
Verification
- `go test -race -count=1 ./core/consensus/slot/... ./core/consensus/slot/bls/... ./sharding/... ./network/p2p/libp2p/...`: clean
- `go test -race -count=1 ./integrationTest/consensus/... -run TestConsensusBLSFullTestSingleKeys`: 15/15 fresh-process runs PASS
- `go test -race -count=30 ./integrationTest/consensus/... -run TestConsensusBLSFullTestSingleKeys`: 75% mean PASS (vs ~68% baseline HEAD-of-develop)
- `TestConsensus_RevertBlockAndTransactions`: 44 consecutive PASS over a 50-iteration fresh-process loop at the mainnet 4-second slot budget (was ~45% failure rate)
Test plan
- `TestConsensus_RevertBlockAndTransactions` × 20 in CI
Data Race Fixes Across Consensus, Sharding, and Networking (focus: consensus, tx processing, state, KVM, networking)
This PR eliminates ~400 race-detector warnings found while deflaking integration tests by removing data races and closing ordering gaps across consensus, block/tx processing, sharding, and libp2p. Changes prioritize node stability and data integrity by enforcing proper slot-scoped synchronization, safe goroutine fan-out, and careful atomic usage; KVM was not modified.
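The "safe goroutine fan-out" idea mentioned above can be sketched as a snapshot taken once under the lock and then passed to each goroutine as a parameter. The `header` and `state` types and the `fanOut` helper are hypothetical; the sketch also assumes a header is replaced as a whole rather than mutated in place after it is installed.

```go
package main

import (
	"fmt"
	"sync"
)

// header is a hypothetical, reduced block header. It is assumed to be
// replaced as a whole, never mutated in place once installed.
type header struct{ Nonce uint64 }

// state is a hypothetical holder of the lock-guarded header pointer.
type state struct {
	mut    sync.RWMutex
	header *header
}

// setHeader installs a new header under the write lock.
func (s *state) setHeader(h *header) {
	s.mut.Lock()
	s.header = h
	s.mut.Unlock()
}

// snapshotHeader reads the pointer once under the read lock.
func (s *state) snapshotHeader() *header {
	s.mut.RLock()
	defer s.mut.RUnlock()
	return s.header
}

// fanOut takes one snapshot and hands it to every worker as a parameter,
// so no worker re-reads s.header while another goroutine may swap it.
func fanOut(s *state, workers int) []uint64 {
	hdr := s.snapshotHeader()
	out := make([]uint64, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int, h *header) {
			defer wg.Done()
			out[i] = h.Nonce // each worker writes only its own index
		}(i, hdr)
	}
	wg.Wait()
	return out
}

func main() {
	s := &state{}
	s.setHeader(&header{Nonce: 16})
	fmt.Println(fanOut(s, 3))
}
```

Passing the snapshot explicitly makes the data dependency visible in each function signature, which is the same shape the PR description gives for the leader's end-of-slot helpers.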
Affected blockchain‑critical components
Key concurrency and correctness changes
Consensus slot-scoped locking
Safe fan-out / stale-goroutine prevention
Reduced contention and disciplined locking
Networking and delayed-broadcast fixes
Sharding and coordinator atomics
Test scaffolding and integration tests
Impact on node stability, data integrity, and performance
Public API / behavioral changes