Skip to content

[KLC-2382] fix data races in consensus, sharding, and libp2p layers#48

Merged
fbsobreira merged 5 commits into
developfrom
bugfix/KLC-2382-consensus-data-races
May 13, 2026
Merged

[KLC-2382] fix data races in consensus, sharding, and libp2p layers#48
fbsobreira merged 5 commits into
developfrom
bugfix/KLC-2382-consensus-data-races

Conversation

@fbsobreira

@fbsobreira fbsobreira commented May 13, 2026

Copy link
Copy Markdown
Member

Summary

Investigation into the flaky TestConsensus_RevertBlockAndTransactions integration test uncovered ~400 race-detector warnings across the consensus protocol, sharding nodes-coordinator, and libp2p connection monitor. The test deflake only stabilises once the underlying state-sharing hazards are closed, so this PR widens the original ticket scope to cover all required race fixes.

Two commits:

  1. [KLC-2382] deflake TestConsensus_RevertBlockAndTransactions — three independent root causes in the test scaffolding (delayedBroadcast unsynchronised append, mock slot index race, polling fairness + *testing.T threading).
  2. [KLC-2382] fix data races across consensus, sharding, and libp2p — the broader protocol/library fixes.

What changed

Consensus slot state

  • Introduce mutSlotState on ConsensusState to guard slot-scoped fields (Data, Header, SlotIndex, SlotTimestamp, SlotCanceled, ExtendedCalled, WaitingAllSignaturesTimeOut). Every reader/writer acquires at handler boundary.
  • Add BeginNewSlot for atomic slot transition (flag reset + index/timestamp install under a single lock). Closes the waitAllSignatures spawnSlot guard hazard where a stale goroutine could write WaitingAllSignaturesTimeOut=true onto freshly-cleared state, causing the next leader to commit with minimum quorum prematurely.
  • Extract resetSlotStateLocked so ResetConsensusState (construction) and BeginNewSlot (slot transition) share the field list without duplication.
  • subslotBlock.processReceivedBlock: snapshot ExtendedCalled after SetProcessingBlock(true) to preserve the original Extend happens-before invariant. Snapshotting first reopened a window where Extend could flip the flag, observe ProcessingBlock=false, and revert state mid-processing.
  • subslotEndSlot.doEndSlotJobByLeader: header snapshot pattern — signBlockHeader, createAndBroadcastHeaderFinalInfo, broadcastBlockDataLeader take the header as a parameter so goroutine fan-out does not re-read sr.Header without the lock.
  • subslotSignature.waitAllSignatures: spawnSlot guard so a goroutine from slot N-1 does not corrupt slot N flags.
  • subslotStartSlot: cancellation uses SetSlotCanceled; slot transition uses BeginNewSlot.
  • worker.Extend: uses SetExtendedCalled; snapshot patterns in checkSelfState / executeMessage.
  • consensusMessageValidator: snapshot SlotIndex under lock before comparing against the message slot.
  • subslot: timer path uses SetSlotCanceled instead of unguarded write.

slotConsensus split-mutex

  • Split rcns.mut into mutConsensusGroup (consensus group slice) and mut (validator slot states map). The hot JobDone signature-collection path no longer contends with ConsensusGroup() readers.
  • SetConsensusGroup intentionally does not hold both locks simultaneously; ComputeSize may briefly undercount for one poll cycle during a swap — benign because the next poll observes consistent state. Comment in the source spells this out so a future reviewer doesn't "fix" it by widening the lock and reintroducing the contention.
  • ConsensusGroup() captures the slice header under RLock so callers see a stable view; elements are read-only.

BlockProcessor

  • metaProcessor: clone header before launching async getMetricsFromHeader goroutine to avoid sharing a live pointer with the commit path on the calling goroutine.

Sharding

  • nodesCoordinator: convert currentEpoch and stateReady to atomic.Uint32 / atomic.Bool; update registry load site.
  • hashValidatorShuffler: snapshot NodesToShuffle under RLock instead of ranging the live slice.

libp2p

  • netMessenger.connMonitor: check ctx.Done() at the top of each loop iteration so the goroutine doesn't leak after Close().

Integration test scaffolding (in the deflake commit)

  • delayedBroadcast.SetHeaderForValidator: acquire mutDataForBroadcast before appending — the unsynchronised append was racing with interceptedHeader's iteration under lock and could silently steal the wrong header alarm, dropping the cluster below quorum.
  • SlotManagerMock.SlotIndex: convert to atomic.Int64; update five external call sites.
  • waitForBlockConditionOrTimeOut: thread *testing.T and call t.Fatalf at the actual point of divergence; per-step budget pinned to the mainnet 4-second slot duration (the test doubles as a regression guard for production cadence).
  • Polling-fairness fix so the wait condition is evaluated on a uniform tick instead of biased toward the earliest-completing node.

Verification

  • go test -race -count=1 ./core/consensus/slot/... ./core/consensus/slot/bls/... ./sharding/... ./network/p2p/libp2p/... — clean
  • go test -race -count=1 ./integrationTest/consensus/... -run TestConsensusBLSFullTestSingleKeys — 15/15 fresh-process runs PASS
  • go test -race -count=30 ./integrationTest/consensus/... -run TestConsensusBLSFullTestSingleKeys — 75% mean PASS (vs ~68% baseline HEAD-of-develop)
  • TestConsensus_RevertBlockAndTransactions: 44 consecutive PASS over a 50-iteration fresh-process loop at the mainnet 4-second slot budget (was ~45% failure rate)

Test plan

  • CI race lane stays green on this branch
  • Re-run TestConsensus_RevertBlockAndTransactions × 20 in CI
  • Confirm no regression in consensus throughput on the staging cluster

Data Race Fixes Across Consensus, Sharding, and Networking (focus: consensus, tx processing, state, KVM, networking)

This PR eliminates ~400 race-detector warnings found while deflaking integration tests by removing data races and closing ordering gaps across consensus, block/tx processing, sharding, and libp2p. Changes prioritize node stability and data integrity by enforcing proper slot-scoped synchronization, safe goroutine fan-out, and careful atomic usage; KVM was not modified.

Affected blockchain‑critical components

  • Consensus core: slot lifecycle, ConsensusState, subslot flows (start/extend/end), signature generation/timeouts, worker message processing, and consensus-group handling.
  • Block/transaction processing: ProcessBlock validation path, async header-metrics dispatch, epoch-start handling, and post-tx trie-root verification (consolidated with Bugsnag reporting).
  • State management: slot-scoped fields (SlotIndex, SlotTimestamp, SlotCanceled, ExtendedCalled, WaitingAllSignaturesTimeOut) and sharding coordinator epoch/readiness.
  • Networking (libp2p): delayed broadcast synchronization and messenger connection-monitor lifecycle.
  • KVM: not touched.

Key concurrency and correctness changes

  • Consensus slot-scoped locking

    • Adds mutSlotState (sync.RWMutex) on ConsensusState with Lock/RLock helpers and new setters (SetSlotCanceled, SetExtendedCalled, SetWaitingAllSignaturesTimeOut, SetWaitingAllSignaturesTimeOutIfSlot) to enforce slot-pin semantics and prevent stale-goroutine writes.
    • BeginNewSlot/resetSlotStateLocked perform atomic slot transitions under a single write lock.
  • Safe fan-out / stale-goroutine prevention

    • Snapshot guarded fields under RLock before spawning workers or timers.
    • Pass cloned header copies or spawnSlot identifiers into goroutines (ProcessBlock clones headers before async metrics; subslot timeouts receive spawnSlot; waitAllSignatures uses slot-aware setter).
  • Reduced contention and disciplined locking

    • slotConsensus splits a hot-path mutex into mutConsensusGroup (consensusGroup slice) and mut (validator slot states) to lower contention. ConsensusGroup() returns a stable slice-header snapshot under RLock. SetConsensusGroup rebuilds validatorSlotStates without holding both locks; a brief documented transient undercount is an intentional trade to reduce contention.
  • Networking and delayed-broadcast fixes

    • createConnectionMonitor uses a ticker/select and checks ctx.Done() at loop top so monitor goroutine exits with messenger shutdown.
    • delayedBroadcast append and alarm-id computation synchronized under mutDataForBroadcast.
  • Sharding and coordinator atomics

    • indexHashedNodesCoordinator currentEpoch → atomic.Uint32 and stateReady → atomic.Bool (loads/stores used).
    • hashValidatorShuffler snapshots NodesToShuffle and nodes under the same RLock to avoid races.
    • Registry population reads epoch atomically.
  • Test scaffolding and integration tests

    • SlotManagerMock.SlotIndex → atomic.Int64; Index/UpdateSlot use atomic ops.
    • Polling helpers now accept *testing.T, use deadline-based exponential backoff and t.Fatalf for clear failure diagnostics.
    • Tests refactored to avoid shared mutable state in parallel subtests and to better simulate production slot timing.

Impact on node stability, data integrity, and performance

  • Node stability: Removes numerous concurrency bugs that could cause crashes, hangs, goroutine leaks, or incorrect consensus progression by enforcing locking, pinning slot-specific goroutines, and ensuring connection monitors exit cleanly.
  • Data integrity: Synchronized access to slot state, headers, and consensus group membership prevents stale writes and races that could corrupt consensus decisions or block processing.
  • Performance trade-offs: Adds RW locks and snapshotting but reduces contention through mutex splitting and avoids expensive deep copies on hot paths. A documented, brief non-atomic visibility window in SetConsensusGroup is an intentional trade to lower contention.
  • Verification: go test -race runs for modified packages are reported clean. Multiple integration runs and a 44/50 fresh-process pass for the deflaked test are reported; CI/staging verification tasks remain outstanding.

Public API / behavioral changes

  • Primarily internal: new lock/setter helpers on ConsensusState and internal refactors. No KVM changes. A few test helpers changed signatures to accept *testing.T.

Review Change Stack

Three independent root causes — fixing all three brings the test to
44 consecutive PASS in a 50-iteration fresh-process loop at the mainnet
4-second slot budget, up from ~45% failure rate beforehand.

1. delayedBlockBroadcaster.SetHeaderForValidator was the only function in
   the file that appended to valHeaderBroadcastData without acquiring
   mutDataForBroadcast, while interceptedHeader iterated it under lock.
   The unsynchronized append could let interceptedHeader cancel the wrong
   header alarm, validators then missed the leader header, fail to sign
   within the slot, and the cluster stalls below quorum (visible as
   "worker statistics: small consensus quorum sigsNum=3"). Wrap the slice
   mutation in Lock/Unlock matching the pattern of SetValidatorData.

2. SlotManagerMock.SlotIndex was a plain int64 read by baseBootstrap sync
   goroutines via Index() and written by the test goroutine via the
   UpdateSlotCalled closure. Convert to atomic.Int64; update the five
   external call sites (utils.go, transaction/common.go log statements,
   dupTransaction_test.go assertion) to use Load / Store. Test-side
   currentSlot is also captured by closures invoked from the chronology
   goroutine and becomes atomic.Int64.

3. waitForBlockConditionOrTimeOut used to log.Error and return on
   timeout, so the next assertion fired against an out-of-sync cluster
   and produced misleading errors like "expected 0x10 / actual 0xf"
   (the original CI symptom). Thread *testing.T and call t.Fatalf with
   the per-node nodesComplete map at the actual point of divergence.
   The per-step budget is intentionally pinned to the mainnet slot
   duration (4 s) via simulatedSlotDuration / blockWaitTimeoutSeconds —
   the test is also a regression guard for production cadence.
Investigation into the flaky TestConsensus_RevertBlockAndTransactions
uncovered ~400 race-detector warnings beyond the test scaffolding fixed
in the previous commit. The deflake target only stabilises once the
underlying state-sharing hazards in the consensus protocol are closed.

Consensus slot state:
- ConsensusState slot-scoped fields (Data, Header, SlotIndex,
  SlotTimestamp, SlotCanceled, ExtendedCalled, WaitingAllSignaturesTimeOut)
  now sit behind mutSlotState; every reader/writer acquires the lock at
  handler boundary.
- Introduce BeginNewSlot for atomic slot transitions. Splitting flag
  reset from SlotIndex/SlotTimestamp install allowed stale
  waitAllSignatures goroutines from slot N-1 to fire in the gap, pass
  the spawnSlot freshness guard, and write
  WaitingAllSignaturesTimeOut=true onto cleared state — causing the
  next leader to commit with minimum quorum prematurely.
- Extract resetSlotStateLocked so ResetConsensusState (construction)
  and BeginNewSlot (slot transition) share the field list without
  duplication.
- subslotBlock: snapshot ExtendedCalled *after* SetProcessingBlock(true)
  to preserve the original Extend happens-before invariant. Snapshotting
  first reopened a window where Extend could flip the flag, observe
  ProcessingBlock=false, and revert state mid-processing.
- subslotEndSlot: header snapshot pattern in doEndSlotJobByLeader;
  signBlockHeader, createAndBroadcastHeaderFinalInfo, and
  broadcastBlockDataLeader now take the header as a parameter so the
  goroutine fan-out does not re-read sr.Header without the lock.
- subslotSignature: spawnSlot guard around waitAllSignatures so a
  goroutine spawned in slot N-1 does not corrupt slot N flags.
- subslotStartSlot: cancellation paths use SetSlotCanceled; slot
  transition uses BeginNewSlot.
- worker: Extend uses SetExtendedCalled; checkSelfState and
  executeMessage take snapshots under RLock.
- consensusMessageValidator: snapshot SlotIndex under lock before
  comparing against message slot.
- subslot: timer path uses SetSlotCanceled instead of unguarded write.

slotConsensus split-mutex:
- Split rcns.mut into mutConsensusGroup (consensusGroup slice header)
  and mut (validatorSlotStates map). JobDone / SetJobDone is on the
  hot signature-collection path and was contending with every
  ConsensusGroup() reader. SetConsensusGroup intentionally does not
  hold both locks simultaneously — ComputeSize may briefly undercount
  for one poll cycle during a swap, which is benign because the next
  poll observes consistent state.
- ConsensusGroup() captures the slice header under RLock so callers
  see a stable view; elements are read-only.

BlockProcessor:
- metaProcessor: clone header before launching async
  getMetricsFromHeader goroutine; the original code passed the live
  pointer that was mutated by the commit path on the calling
  goroutine.

Sharding:
- nodesCoordinator: convert currentEpoch and stateReady to
  atomic.Uint32 / atomic.Bool; update registry load site.
- hashValidatorShuffler: snapshot NodesToShuffle under RLock instead of
  ranging the live slice.

libp2p:
- netMessenger.connMonitor: check ctx.Done() at the top of each loop
  iteration to prevent the goroutine from leaking after Close.

Integration test:
- consensus_test.go: polling-fairness fix so the waitForBlock condition
  is re-evaluated on a uniform tick instead of biased toward the
  earliest-completing node.

Race detector now clean across the consensus, sharding, and libp2p
suites at -count=1; integration test passes 15/15 fresh-process runs
and 75% mean across -count=30 in-process stress (vs ~68% baseline).
Copilot AI review requested due to automatic review settings May 13, 2026 09:26
@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown

Walkthrough

System-wide concurrency hardening: adds slot-scoped and consensus-group locks, snapshots and passes headers under locks, pins signature timeouts, converts coordinator/tests to atomics, synchronizes delayed broadcast mutations, clones headers for async metrics, and makes the network monitor loop cancellable.

Changes

Consensus Concurrency Hardening and Slot State Synchronization

Layer / File(s) Summary
Consensus state locking infrastructure
core/consensus/slot/consensusState.go
Adds mutSlotState RWMutex, exported Lock/RLock helpers and setters (SetSlotCanceled, SetExtendedCalled, SetWaitingAllSignaturesTimeOutIfSlot), resetSlotStateLocked/BeginNewSlot changes, and read helpers acquiring locks; tests added for lock/setter semantics and GetData contract.
Consensus group concurrent access locking
core/consensus/slot/slotConsensus.go
Adds mutConsensusGroup RWMutex; ConsensusGroup() returns a snapshot view; SetConsensusGroup and ComputeSize/ResetSlotState refactored to use snapshots and document non-atomic cross-lock visibility.
Subslot block processing with slot-state locking
core/consensus/slot/bls/subslotBlock.go
Snapshots SlotTimestamp/Header under read lock, writes Data/Header under write lock, uses snapshotted header for ProcessBlock and consensus checks, and uses SetSlotCanceled(true) setter on cancellation paths.
End-slot handler header snapshotting and parameter passing
core/consensus/slot/bls/subslotEndSlot.go, core/consensus/slot/bls/export_test.go, core/consensus/slot/bls/subslotEndSlot_test.go
Snapshots sr.Header/ExtendedCalled under locks, clones/passes header into signBlockHeader(header), createAndBroadcastHeaderFinalInfo(header), CommitBlock(header), and broadcast helpers; updates related tests to pass header.
Signature job execution and goroutine lifecycle hardening
core/consensus/slot/bls/subslotSignature.go, core/consensus/slot/bls/subslotSignature_test.go
doSignatureJob snapshots Data/Header under read lock; leader captures spawn slot and passes it to waitAllSignatures; timeout goroutine sets timeout via SetWaitingAllSignaturesTimeOutIfSlot, avoiding stale goroutine effects; tests reworked to avoid shared mutable state.
Start-slot initialization with atomic BeginNewSlot
core/consensus/slot/bls/subslotStartSlot.go
Uses BeginNewSlot(index,timestamp) for atomic initialization; reads SlotCanceled under lock; error/timeout paths set cancellation via setter; generateNextConsensusGroup snapshots SlotIndex.
Consensus message validation and worker message processing
core/consensus/slot/consensusMessageValidator.go, core/consensus/slot/worker.go, core/consensus/slot/subslot.go
Wraps SlotIndex reads with RLockSlotState() for validation and mismatch handling; worker snapshots slot state once per operation; Extend uses SetExtendedCalled(true) and snapshots Header before reverting state.
Delayed broadcast slice mutation synchronization
core/consensus/broadcast/delayedBroadcast.go
Moves duration calculation, slice append, trace logging, and alarm ID creation inside mutDataForBroadcast write-locked region to remove an unsynchronized slice mutation window.
ProcessBlock helpers and header cloning
core/process/block/block.go
Introduces validateBlockAndRequestMissing, dispatchAsyncHeaderMetrics, handleEpochStartBlock, and verifyBlockTrieRoots; clones block header before spawning async metrics goroutine to avoid races and requests missing parent headers on hash mismatches.
Integration test atomic slot tracking and helper refactoring
integrationTest/consensus/consensus_test.go, integrationTest/consensus/insertDup_test.go, integrationTest/mock/slotManagerMock.go, integrationTest/transaction/common.go, integrationTest/transaction/dupTransaction/dupTransaction_test.go, integrationTest/utils.go
Tests and mocks moved to atomic.Int64 for slot tracking; introduced simulatedSlotDuration and blockWaitTimeoutSeconds; polling helpers now accept *testing.T and perform deadline final-check with t.Fatalf; SlotManagerMock changed to atomic SlotIndex and utilities updated.
Sharding coordinator and validator atomics
sharding/nodesCoordinator.go, sharding/nodesCoordinator_test.go, sharding/hashValidatorShuffler.go, sharding/indexHashedNodesCoordinatorRegistry.go
Converts coordinator fields to atomics (currentEpoch, stateReady), snapshots epoch config under RLock, uses .Load()/.Store(), and avoids TOCTOU in shuffler by reading params inside same RLock.
Network messenger goroutine lifetime management
network/p2p/libp2p/netMessenger.go
Replaces unconditional Sleep loop with a ticker/select tied to netMes.ctx.Done() so connection-monitor goroutine exits on context cancellation.
ConsensusState unit tests
core/consensus/slot/consensusState_test.go
Adds tests for lock/reentry, setter toggles, conditional timeout setter behavior, GetData contract, and BeginNewSlot reset semantics.
SlotConsensus test minor cleanup
core/consensus/slot/slotConsensus_test.go
Removes an unnecessary direct ConsensusGroup mutation prior to ResetSlotState in a unit test.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 6 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Concurrency Safety ❓ Inconclusive No result was produced after verification. Marking as INCONCLUSIVE. Re-run the check or adjust instructions to produce a final result.
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title follows the required format [KLC-XXXX] type: description with KLC-2382 as the JIRA key and 'fix' as the type. The description clearly summarizes the main change: addressing data races in consensus, sharding, and libp2p layers.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Error Handling ✅ Passed All errors checked and handled correctly. No silently discarded errors, proper wrapping, no bare panic calls. Async operations use appropriate void/bool returns.
State Consistency ✅ Passed Slot transitions atomic via mutSlotState. Block mutations have error-rollback defer. SetSlotCanceled atomic. SetWaitingAllSignaturesTimeOutIfSlot prevents stale. SetConsensusGroup window documented.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch bugfix/KLC-2382-consensus-data-races

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses widespread race-detector findings (and related flakiness in TestConsensus_RevertBlockAndTransactions) by introducing explicit synchronization across consensus slot state, sharding nodes-coordinator state, and a libp2p connection-monitor goroutine, plus updating integration test scaffolding to avoid unsynchronized shared state.

Changes:

  • Add slot-scoped locking/snapshot patterns in consensus (slot state mutex, atomic slot transition via BeginNewSlot, safer goroutine fan-out using header snapshots, and multiple reader/writer fixes).
  • Make sharding state concurrency-safe (atomics for currentEpoch / stateReady, safer reads of shuffler parameters, safer config display snapshot).
  • Deflake integration tests by removing shared mutable state races (atomic slot index in mocks/tests and improved waiting helpers), and bound a libp2p monitor loop to ctx.Done().

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sharding/nodesCoordinator.go Convert epoch/ready flags to atomics; avoid races on savedStateKey and config display reads.
sharding/nodesCoordinator_test.go Update tests for atomic currentEpoch.
sharding/indexHashedNodesCoordinatorRegistry.go Read currentEpoch via Load() for registry export.
sharding/hashValidatorShuffler.go Snapshot shuffler params under RLock to avoid concurrent epoch-update races.
network/p2p/libp2p/netMessenger.go Add ctx-bound termination path for connection monitor loop.
core/process/block/block.go Clone header before async metrics collection to avoid shared pointer races.
core/consensus/broadcast/delayedBroadcast.go Serialize slice access in SetHeaderForValidator to prevent append/iterate races.
core/consensus/slot/consensusState.go Introduce mutSlotState, setter helpers, and BeginNewSlot for atomic slot transitions.
core/consensus/slot/slotConsensus.go Split mutexes and snapshot consensus group reads to reduce contention and data races.
core/consensus/slot/worker.go Snapshot slot fields under lock before comparisons/logging; use setter for ExtendedCalled; header snapshot for revert.
core/consensus/slot/consensusMessageValidator.go Snapshot SlotIndex under lock before validating message slot.
core/consensus/slot/subslot.go Use SetSlotCanceled instead of an unguarded field write on timeout path.
core/consensus/slot/bls/subslotStartSlot.go Use BeginNewSlot; lock-protected reads for slot state; cancellation via setter.
core/consensus/slot/bls/subslotBlock.go Lock-protected Data/Header writes; safer ExtendedCalled snapshot ordering; cancellation via setter.
core/consensus/slot/bls/subslotSignature.go Lock-protected snapshots for data/header and flags; add spawn-slot guard for timeout goroutine.
core/consensus/slot/bls/subslotSignature_test.go Avoid sharing a mock container across t.Parallel() subtests.
core/consensus/slot/bls/subslotEndSlot.go Snapshot header pointer for leader fan-out; lock-protected reads; cancellation via setter; pass header explicitly to async calls.
core/consensus/slot/bls/subslotEndSlot_test.go Update tests for CreateAndBroadcastHeaderFinalInfo(header) signature.
core/consensus/slot/bls/export_test.go Export updated helper signature for tests.
integrationTest/mock/slotManagerMock.go Make SlotIndex atomic to avoid test goroutine vs background goroutine races.
integrationTest/utils.go Use SlotIndex.Store() when advancing slots in tests.
integrationTest/consensus/consensus_test.go Make test slot counter atomic; improve wait helper to fail at source and reduce polling bias.
integrationTest/consensus/insertDup_test.go Align with new atomic slot counter + updated wait helper signature/constants.
integrationTest/transaction/common.go Read atomic slot index in logs.
integrationTest/transaction/dupTransaction/dupTransaction_test.go Read atomic slot index in assertions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread sharding/nodesCoordinator.go
Comment thread core/consensus/slot/slotConsensus.go Outdated
Comment thread core/consensus/slot/bls/subslotSignature.go Outdated
Comment thread network/p2p/libp2p/netMessenger.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
core/consensus/slot/consensusState.go (1)

16-43: 🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff

Public fields with mutex protection require caller discipline.

The slot-scoped fields (Data, Header, SlotIndex, etc.) remain public while mutSlotState guards them. This relies on callers using LockSlotState()/RLockSlotState() at handler boundaries. Consider adding a brief doc comment on each public field noting it requires mutSlotState, or provide getter/setter methods for all guarded fields to enforce the contract at compile time.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/consensus/slot/consensusState.go` around lines 16 - 43, ConsensusState
exposes slot-scoped public fields (Data, Header, SlotIndex, SlotTimestamp,
SlotCanceled, ExtendedCalled, WaitingAllSignaturesTimeOut) that are protected by
mutSlotState; add a short doc comment on each of those fields stating they must
be accessed under mutSlotState (or alternatively implement and use explicit
getters/setters) so callers know to call LockSlotState/RLockSlotState; update
ConsensusState struct comments to mention the locking discipline and, if you
choose getters/setters, add methods like GetData/SetData, GetHeader/SetHeader,
GetSlotIndex/SetSlotIndex that acquire the appropriate lock and operate on the
underlying fields to enforce safety at compile time.
core/consensus/slot/slotConsensus.go (1)

77-108: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Acknowledged race window is acceptable but worth unit testing.

The comment correctly documents that a concurrent ComputeSize call can briefly see the new consensusGroup against the old validatorSlotStates, causing JobDone to return ErrInvalidKey. This self-heals on the next poll cycle. Consider adding a unit test that exercises this edge case to ensure the ErrInvalidKey path in ComputeSize logs a debug message and continues gracefully (lines 192-194).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/consensus/slot/slotConsensus.go` around lines 77 - 108, Add a unit test
that reproduces the acknowledged race window between SetConsensusGroup and
ComputeSize: spawn concurrent goroutines that repeatedly call
SetConsensusGroup(...) and a client path invoking ComputeSize()/JobDone until
you observe JobDone returning ErrInvalidKey; assert that ComputeSize handles
this by logging the debug message and continuing (no panic/stop) and that
subsequent polls recover and count sizes correctly; target the slotConsensus
methods SetConsensusGroup, ComputeSize, JobDone, and the
validatorSlotStates/consensusGroup interaction, using retries/waits to make the
transient race observable and capturing the logger output to verify the debug
log path is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@core/consensus/slot/bls/subslotSignature_test.go`:
- Line 417: The test calls sr.IsJobDone(pubKey, bls.SrSignature) but only
asserts the boolean, ignoring the returned error; update the test to capture
both (got, err := sr.IsJobDone(...)) and fail the test if err != nil (e.g.,
require.NoError/if err != nil { t.Fatalf(...) }) before asserting equality
against tc.shouldJobBeDone so validator key lookup errors are surfaced instead
of silently treated as "not done".

In `@sharding/nodesCoordinator_test.go`:
- Around line 593-594: The test ends after setting ihgs.currentEpoch.Store(1)
without calling the function under test; update the test to call
allValidatorsInfo() on the coordinator instance and assert it returns
ErrEpochNodesConfigDoesNotExist (and no validators) — locate the
coordinator/ihgs test setup, invoke coordinator.allValidatorsInfo() (or the
method name used in the diff), capture (resp, err) and assert err ==
ErrEpochNodesConfigDoesNotExist (and resp is empty/nil), or if this test is
obsolete remove the test entirely.

In `@sharding/nodesCoordinator.go`:
- Around line 787-800: The code reads fields from ihgs.nodesConfig[newEpoch]
while only holding ihgs.mutNodesConfig.RLock(); to be consistent with other
readers (e.g., GetAllElectedValidatorsKeys) and to prevent future race windows,
also acquire the inner nodesConfig lock before accessing its maps: call
ihgs.nodesConfig.mutNodesMaps.RLock() after ihgs.mutNodesConfig.RLock(), read
displayCfg.electedList/eligibleList/waitingList/leavingList, then unlock
nodesConfig.mutNodesMaps with RUnlock() and finally
ihgs.mutNodesConfig.RUnlock(); keep the call to displayNodesConfiguration(...)
unchanged.

---

Outside diff comments:
In `@core/consensus/slot/consensusState.go`:
- Around line 16-43: ConsensusState exposes slot-scoped public fields (Data,
Header, SlotIndex, SlotTimestamp, SlotCanceled, ExtendedCalled,
WaitingAllSignaturesTimeOut) that are protected by mutSlotState; add a short doc
comment on each of those fields stating they must be accessed under mutSlotState
(or alternatively implement and use explicit getters/setters) so callers know to
call LockSlotState/RLockSlotState; update ConsensusState struct comments to
mention the locking discipline and, if you choose getters/setters, add methods
like GetData/SetData, GetHeader/SetHeader, GetSlotIndex/SetSlotIndex that
acquire the appropriate lock and operate on the underlying fields to enforce
safety at compile time.

In `@core/consensus/slot/slotConsensus.go`:
- Around line 77-108: Add a unit test that reproduces the acknowledged race
window between SetConsensusGroup and ComputeSize: spawn concurrent goroutines
that repeatedly call SetConsensusGroup(...) and a client path invoking
ComputeSize()/JobDone until you observe JobDone returning ErrInvalidKey; assert
that ComputeSize handles this by logging the debug message and continuing (no
panic/stop) and that subsequent polls recover and count sizes correctly; target
the slotConsensus methods SetConsensusGroup, ComputeSize, JobDone, and the
validatorSlotStates/consensusGroup interaction, using retries/waits to make the
transient race observable and capturing the logger output to verify the debug
log path is exercised.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d1415408-77f5-4f35-845c-fbb931873267

📥 Commits

Reviewing files that changed from the base of the PR and between 3b06f69 and 15f04c5.

📒 Files selected for processing (25)
  • core/consensus/broadcast/delayedBroadcast.go
  • core/consensus/slot/bls/export_test.go
  • core/consensus/slot/bls/subslotBlock.go
  • core/consensus/slot/bls/subslotEndSlot.go
  • core/consensus/slot/bls/subslotEndSlot_test.go
  • core/consensus/slot/bls/subslotSignature.go
  • core/consensus/slot/bls/subslotSignature_test.go
  • core/consensus/slot/bls/subslotStartSlot.go
  • core/consensus/slot/consensusMessageValidator.go
  • core/consensus/slot/consensusState.go
  • core/consensus/slot/slotConsensus.go
  • core/consensus/slot/subslot.go
  • core/consensus/slot/worker.go
  • core/process/block/block.go
  • integrationTest/consensus/consensus_test.go
  • integrationTest/consensus/insertDup_test.go
  • integrationTest/mock/slotManagerMock.go
  • integrationTest/transaction/common.go
  • integrationTest/transaction/dupTransaction/dupTransaction_test.go
  • integrationTest/utils.go
  • network/p2p/libp2p/netMessenger.go
  • sharding/hashValidatorShuffler.go
  • sharding/indexHashedNodesCoordinatorRegistry.go
  • sharding/nodesCoordinator.go
  • sharding/nodesCoordinator_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Agent
  • GitHub Check: setup-and-lint / setup-and-lint
  • GitHub Check: Analyze (go)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.go

📄 CodeRabbit inference engine (Custom checks)

**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.

Files:

  • core/process/block/block.go
  • integrationTest/utils.go
  • core/consensus/slot/subslot.go
  • integrationTest/transaction/dupTransaction/dupTransaction_test.go
  • sharding/indexHashedNodesCoordinatorRegistry.go
  • core/consensus/slot/bls/export_test.go
  • sharding/hashValidatorShuffler.go
  • core/consensus/slot/consensusMessageValidator.go
  • sharding/nodesCoordinator_test.go
  • network/p2p/libp2p/netMessenger.go
  • integrationTest/consensus/insertDup_test.go
  • integrationTest/mock/slotManagerMock.go
  • core/consensus/slot/bls/subslotEndSlot_test.go
  • core/consensus/slot/worker.go
  • sharding/nodesCoordinator.go
  • core/consensus/broadcast/delayedBroadcast.go
  • integrationTest/consensus/consensus_test.go
  • integrationTest/transaction/common.go
  • core/consensus/slot/bls/subslotBlock.go
  • core/consensus/slot/bls/subslotSignature.go
  • core/consensus/slot/bls/subslotSignature_test.go
  • core/consensus/slot/slotConsensus.go
  • core/consensus/slot/bls/subslotStartSlot.go
  • core/consensus/slot/bls/subslotEndSlot.go
  • core/consensus/slot/consensusState.go
core/consensus/**

⚙️ CodeRabbit configuration file

core/consensus/**: This is the consensus engine. Review with extreme care: - Check for race conditions in concurrent block processing - Verify correct mutex/lock ordering to prevent deadlocks - Ensure deterministic behavior (no maps iteration without sorting, no random) - Validate message signing and verification logic - Flag any changes that could cause consensus forks or chain splits

Files:

  • core/consensus/slot/subslot.go
  • core/consensus/slot/bls/export_test.go
  • core/consensus/slot/consensusMessageValidator.go
  • core/consensus/slot/bls/subslotEndSlot_test.go
  • core/consensus/slot/worker.go
  • core/consensus/broadcast/delayedBroadcast.go
  • core/consensus/slot/bls/subslotBlock.go
  • core/consensus/slot/bls/subslotSignature.go
  • core/consensus/slot/bls/subslotSignature_test.go
  • core/consensus/slot/slotConsensus.go
  • core/consensus/slot/bls/subslotStartSlot.go
  • core/consensus/slot/bls/subslotEndSlot.go
  • core/consensus/slot/consensusState.go
**/*_test.go

⚙️ CodeRabbit configuration file

**/*_test.go: Test files. Review for: - Adequate coverage of edge cases and error paths - Proper use of test helpers and assertions - Race condition coverage (tests should use -race flag patterns) - No hardcoded sleep for synchronization (use channels or sync primitives) - Test isolation (no shared mutable state between tests)

Files:

  • integrationTest/transaction/dupTransaction/dupTransaction_test.go
  • core/consensus/slot/bls/export_test.go
  • sharding/nodesCoordinator_test.go
  • integrationTest/consensus/insertDup_test.go
  • core/consensus/slot/bls/subslotEndSlot_test.go
  • integrationTest/consensus/consensus_test.go
  • core/consensus/slot/bls/subslotSignature_test.go
network/**

⚙️ CodeRabbit configuration file

network/**: Peer-to-peer networking layer. - Check for proper input validation on all received messages - Verify rate limiting and DoS protection mechanisms - Ensure connection handling is goroutine-safe - Look for potential message amplification attacks - Verify TLS/authentication on peer connections

Files:

  • network/p2p/libp2p/netMessenger.go
🧠 Learnings (1)
📚 Learning: 2026-04-21T20:12:22.959Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 38
File: indexer/eventsProcessor.go:188-211
Timestamp: 2026-04-21T20:12:22.959Z
Learning: In Go structs that are JSON-marshaled, if a field is a `bool` and has the `json:"...,omitempty"` tag, then leaving that field at its zero value (`false`) is functionally equivalent (in the resulting JSON) to explicitly setting `Foundation: false`. Reviewers should not flag struct literals that omit such `bool` fields as an inconsistency; they will serialize identically because `omitempty` suppresses `false` values.

Applied to files:

  • core/process/block/block.go
  • integrationTest/utils.go
  • core/consensus/slot/subslot.go
  • integrationTest/transaction/dupTransaction/dupTransaction_test.go
  • sharding/indexHashedNodesCoordinatorRegistry.go
  • core/consensus/slot/bls/export_test.go
  • sharding/hashValidatorShuffler.go
  • core/consensus/slot/consensusMessageValidator.go
  • sharding/nodesCoordinator_test.go
  • network/p2p/libp2p/netMessenger.go
  • integrationTest/consensus/insertDup_test.go
  • integrationTest/mock/slotManagerMock.go
  • core/consensus/slot/bls/subslotEndSlot_test.go
  • core/consensus/slot/worker.go
  • sharding/nodesCoordinator.go
  • core/consensus/broadcast/delayedBroadcast.go
  • integrationTest/consensus/consensus_test.go
  • integrationTest/transaction/common.go
  • core/consensus/slot/bls/subslotBlock.go
  • core/consensus/slot/bls/subslotSignature.go
  • core/consensus/slot/bls/subslotSignature_test.go
  • core/consensus/slot/slotConsensus.go
  • core/consensus/slot/bls/subslotStartSlot.go
  • core/consensus/slot/bls/subslotEndSlot.go
  • core/consensus/slot/consensusState.go
🔇 Additional comments (42)
network/p2p/libp2p/netMessenger.go (1)

437-450: LGTM!

core/process/block/block.go (1)

159-173: LGTM!

core/consensus/broadcast/delayedBroadcast.go (1)

218-233: LGTM!

core/consensus/slot/slotConsensus.go (2)

7-27: LGTM!


48-74: LGTM!

Also applies to: 186-225

core/consensus/slot/consensusState.go (3)

119-137: LGTM!


177-189: LGTM!


222-238: LGTM!

Also applies to: 270-277, 362-368

core/consensus/slot/subslot.go (1)

161-168: LGTM!

core/consensus/slot/consensusMessageValidator.go (1)

109-139: LGTM!

core/consensus/slot/bls/subslotSignature_test.go (1)

378-419: LGTM!

core/consensus/slot/worker.go (4)

390-402: LGTM!


554-568: LGTM!


596-617: LGTM!


650-692: LGTM!

core/consensus/slot/bls/subslotBlock.go (1)

143-145: LGTM!

Also applies to: 194-197, 269-283, 299-327, 369-369, 399-404

core/consensus/slot/bls/subslotStartSlot.go (1)

79-79: LGTM!

Also applies to: 88-93, 116-116, 136-136, 169-169, 175-177, 184-184, 211-216

core/consensus/slot/bls/subslotSignature.go (1)

78-81: LGTM!

Also applies to: 93-107, 128-131, 215-219, 235-235, 246-246, 308-350

core/consensus/slot/bls/subslotEndSlot.go (2)

130-138: LGTM!

Also applies to: 200-217, 246-258, 267-301, 309-336, 342-352, 371-384, 428-458, 484-490, 500-506, 555-568, 572-592, 594-601


614-627: LGTM!

core/consensus/slot/bls/export_test.go (1)

253-255: LGTM!

core/consensus/slot/bls/subslotEndSlot_test.go (1)

763-766: LGTM!

sharding/hashValidatorShuffler.go (1)

75-91: LGTM!

sharding/nodesCoordinator.go (5)

75-78: LGTM!


118-120: LGTM!


211-216: LGTM!


275-276: LGTM!

Also applies to: 288-288


482-484: LGTM!

sharding/indexHashedNodesCoordinatorRegistry.go (1)

31-31: LGTM!

integrationTest/mock/slotManagerMock.go (3)

11-16: LGTM!


46-52: LGTM!


73-80: LGTM!

integrationTest/utils.go (1)

209-213: LGTM!

integrationTest/transaction/dupTransaction/dupTransaction_test.go (1)

152-163: LGTM!

integrationTest/transaction/common.go (2)

221-256: LGTM!


258-270: LGTM!

integrationTest/consensus/consensus_test.go (4)

19-30: LGTM!


59-89: LGTM!


153-176: LGTM!


184-232: LGTM!

integrationTest/consensus/insertDup_test.go (2)

27-49: LGTM!


51-78: LGTM!

Comment thread core/consensus/slot/bls/subslotSignature_test.go
Comment thread sharding/nodesCoordinator_test.go
Comment thread sharding/nodesCoordinator.go Outdated
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 13, 2026
- nodesCoordinator: guard EpochStartPrepare display block against a nil
  nodesConfig[newEpoch] when SetNodes returns an error, and acquire
  mutNodesMaps.RLock() for consistency with other readers.
- slotConsensus: rewrite the SetConsensusGroup doc comment so it matches
  the body — the two writes are intentionally non-atomic to keep the
  JobDone hot path off the ConsensusGroup lock; the brief inconsistency
  self-heals on the next poll.
- consensusState: add SetWaitingAllSignaturesTimeOutIfSlot to perform
  the slot-equality check and the flag write under a single mutSlotState
  write lock.
- subslotSignature: use the new helper in waitAllSignatures so a stale
  goroutine cannot set WaitingAllSignaturesTimeOut after BeginNewSlot
  has cleared state.
- libp2p netMessenger: replace per-iteration time.After with a reusable
  time.Ticker so the connection-monitor sweep loop no longer allocates
  a fresh timer on every iteration.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@core/consensus/slot/slotConsensus.go`:
- Around line 69-78: ConsensusGroup currently returns the internal slice header
(rcns.consensusGroup) under mutConsensusGroup.RLock, which lets callers mutate
the underlying array and corrupt internal state; update the ConsensusGroup
method to return a defensive copy (e.g., using append to copy into a nil slice)
so callers receive a snapshot that cannot modify rcns.consensusGroup, preserving
the lock/snapshot semantics around the slotConsensus.consensusGroup field
guarded by mutConsensusGroup.

In `@sharding/nodesCoordinator.go`:
- Around line 213-216: The slice header savedStateKey is copied while holding
ihgs.mutSavedStateKey.RLock() but released before calling ihgs.saveState, which
allows races on the backing bytes; while still holding the lock, make a deep
copy of the key (e.g. use bytes.Clone or equivalent) into a new variable and
pass that cloned slice to ihgs.saveState so saveState sees an immutable copy;
update the code around ihgs.mutSavedStateKey.RLock()/RUnlock(), savedStateKey
and the ihgs.saveState(savedStateKey) call to clone while locked and then
release the lock before calling saveState.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 76345190-e85c-4298-aec2-739118c99bef

📥 Commits

Reviewing files that changed from the base of the PR and between 15f04c5 and 69f262a.

📒 Files selected for processing (5)
  • core/consensus/slot/bls/subslotSignature.go
  • core/consensus/slot/consensusState.go
  • core/consensus/slot/slotConsensus.go
  • network/p2p/libp2p/netMessenger.go
  • sharding/nodesCoordinator.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test
  • GitHub Check: Analyze (go)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (Custom checks)

**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.

Files:

  • network/p2p/libp2p/netMessenger.go
  • core/consensus/slot/slotConsensus.go
  • core/consensus/slot/consensusState.go
  • sharding/nodesCoordinator.go
  • core/consensus/slot/bls/subslotSignature.go
network/**

⚙️ CodeRabbit configuration file

network/**: Peer-to-peer networking layer. - Check for proper input validation on all received messages - Verify rate limiting and DoS protection mechanisms - Ensure connection handling is goroutine-safe - Look for potential message amplification attacks - Verify TLS/authentication on peer connections

Files:

  • network/p2p/libp2p/netMessenger.go
core/consensus/**

⚙️ CodeRabbit configuration file

core/consensus/**: This is the consensus engine. Review with extreme care: - Check for race conditions in concurrent block processing - Verify correct mutex/lock ordering to prevent deadlocks - Ensure deterministic behavior (no maps iteration without sorting, no random) - Validate message signing and verification logic - Flag any changes that could cause consensus forks or chain splits

Files:

  • core/consensus/slot/slotConsensus.go
  • core/consensus/slot/consensusState.go
  • core/consensus/slot/bls/subslotSignature.go
🧠 Learnings (1)
📚 Learning: 2026-04-21T20:12:22.959Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 38
File: indexer/eventsProcessor.go:188-211
Timestamp: 2026-04-21T20:12:22.959Z
Learning: In Go structs that are JSON-marshaled, if a field is a `bool` and has the `json:"...,omitempty"` tag, then leaving that field at its zero value (`false`) is functionally equivalent (in the resulting JSON) to explicitly setting `Foundation: false`. Reviewers should not flag struct literals that omit such `bool` fields as an inconsistency; they will serialize identically because `omitempty` suppresses `false` values.

Applied to files:

  • network/p2p/libp2p/netMessenger.go
  • core/consensus/slot/slotConsensus.go
  • core/consensus/slot/consensusState.go
  • sharding/nodesCoordinator.go
  • core/consensus/slot/bls/subslotSignature.go
🔇 Additional comments (6)
network/p2p/libp2p/netMessenger.go (1)

438-450: LGTM!

core/consensus/slot/slotConsensus.go (2)

7-31: LGTM!

Also applies to: 54-62, 90-112, 193-207, 211-229


114-132: ⚡ Quick win

No data race in consensusGroupSize and selfPubKey accessors — review incorrect.

Both fields are initialized once in NewSlotConsensus and never modified during consensus processing; the setters exist only in test code. After initialization, these are read-only in production, making concurrent reads safe regardless of synchronization. The locking documentation correctly omits them because they don't require mutex protection.

If clarity is desired, consider removing the unused SetConsensusGroupSize and SetSelfPubKey methods entirely and document these as set-once fields.

core/consensus/slot/consensusState.go (1)

16-56: LGTM!

Also applies to: 57-90, 110-131, 133-151, 192-203, 235-252, 284-291, 376-382

core/consensus/slot/bls/subslotSignature.go (1)

69-135: LGTM!

Also applies to: 214-275, 308-344

sharding/nodesCoordinator.go (1)

75-78: LGTM!

Also applies to: 118-120, 275-288, 482-482, 787-800

Comment thread core/consensus/slot/slotConsensus.go
Comment thread sharding/nodesCoordinator.go
…ter tests

- Reduces ProcessBlock cognitive complexity from 24 to 14 (Sonar limit 15)
  by extracting four helpers: validateBlockAndRequestMissing,
  dispatchAsyncHeaderMetrics, handleEpochStartBlock, verifyBlockTrieRoots.
  Behavior unchanged; existing tests still pass.
- Adds unit tests for the new slot-state primitives on ConsensusState
  (LockSlotState/RLockSlotState, GetData, SetSlotCanceled, SetExtendedCalled,
  SetWaitingAllSignaturesTimeOut, SetWaitingAllSignaturesTimeOutIfSlot for
  both slot-match and slot-mismatch, BeginNewSlot). Coverage on new code
  rises from 76.9% to 84.1% across the PR.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@core/consensus/slot/consensusState_test.go`:
- Around line 555-557: The test currently never sets Header before calling
BeginNewSlot so the reset path isn't exercised; modify the setup to assign a
non-nil header (e.g., set cns.Header = &Header{...} or call the code path that
populates Header) before calling cns.LockSlotState()/cns.Data =
[]byte("stale")/cns.UnlockSlotState() and then call cns.BeginNewSlot(...) to
verify Header is cleared; apply the same change for the second occurrence around
lines noted (the block at 566-572) so both assertions that expect header to be
nil actually validate a reset from a previously non-nil Header.

In `@core/process/block/block.go`:
- Line 296: Fix the typo in the Bugsnag error message inside the bugsnag.Notify
call in block.go: change the string "process epoch valdiator state: %w" to
"process epoch validator state: %w" (the call that composes the error with
fmt.Errorf and passes err and header metadata); update the message text only so
the variable names (err, header) and the bugsnag.Notify invocation remain
unchanged.
- Around line 275-280: The code calls mp.verifyFees(header) and logs the error
via bugsnag but always returns nil; update the epoch-start handling to mirror
the non-epoch-start path by returning the error instead of swallowing it: after
calling verifyFees(header) check err and both notify bugsnag and return
fmt.Errorf("process verify fees: %w", err) (or simply return err) from the
function; ensure this change is applied around the epoch-start processing block
that invokes verifyFees and keep the bugsnag.Notify call as-is so failures are
both reported and propagated.
- Line 233: The code currently compares the returned error from
checkBlockValidity using a direct equality (if err ==
process.ErrBlockHashDoesNotMatch); change this to use errors.Is(err,
process.ErrBlockHashDoesNotMatch) so wrapped errors are detected, and add the
"errors" import if it's not already present; update the conditional in the
block.go handling around checkBlockValidity to use errors.Is with the sentinel
ErrBlockHashDoesNotMatch.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0a2e5858-6b54-41ea-9183-9850006c6ac2

📥 Commits

Reviewing files that changed from the base of the PR and between 69f262a and dd22602.

📒 Files selected for processing (2)
  • core/consensus/slot/consensusState_test.go
  • core/process/block/block.go
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (Custom checks)

**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.

Files:

  • core/consensus/slot/consensusState_test.go
  • core/process/block/block.go
core/consensus/**

⚙️ CodeRabbit configuration file

core/consensus/**: This is the consensus engine. Review with extreme care: - Check for race conditions in concurrent block processing - Verify correct mutex/lock ordering to prevent deadlocks - Ensure deterministic behavior (no maps iteration without sorting, no random) - Validate message signing and verification logic - Flag any changes that could cause consensus forks or chain splits

Files:

  • core/consensus/slot/consensusState_test.go
**/*_test.go

⚙️ CodeRabbit configuration file

**/*_test.go: Test files. Review for: - Adequate coverage of edge cases and error paths - Proper use of test helpers and assertions - Race condition coverage (tests should use -race flag patterns) - No hardcoded sleep for synchronization (use channels or sync primitives) - Test isolation (no shared mutable state between tests)

Files:

  • core/consensus/slot/consensusState_test.go
🧠 Learnings (1)
📚 Learning: 2026-04-21T20:12:22.959Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 38
File: indexer/eventsProcessor.go:188-211
Timestamp: 2026-04-21T20:12:22.959Z
Learning: In Go structs that are JSON-marshaled, if a field is a `bool` and has the `json:"...,omitempty"` tag, then leaving that field at its zero value (`false`) is functionally equivalent (in the resulting JSON) to explicitly setting `Foundation: false`. Reviewers should not flag struct literals that omit such `bool` fields as an inconsistency; they will serialize identically because `omitempty` suppresses `false` values.

Applied to files:

  • core/consensus/slot/consensusState_test.go
  • core/process/block/block.go
🔇 Additional comments (3)
core/consensus/slot/consensusState_test.go (1)

6-546: LGTM!

Also applies to: 559-565, 573-576

core/process/block/block.go (2)

243-263: LGTM!


126-129: LGTM!

Also applies to: 148-148, 190-191, 216-222

Comment thread core/consensus/slot/consensusState_test.go
Comment thread core/process/block/block.go Outdated
Comment thread core/process/block/block.go Outdated
Comment thread core/process/block/block.go Outdated
- block.go: handleEpochStartBlock now returns the verifyFees error
  instead of nil (regression introduced by a linter pass after the
  refactor); matches the docstring and the non-epoch-start path.
- block.go: use errors.Is for the ErrBlockHashDoesNotMatch sentinel
  comparison; fix pre-existing "valdiator" typo in bugsnag message.
- nodesCoordinator: bytes.Clone savedStateKey while holding the RLock
  so saveState reads a stable buffer that no future writer can mutate.
- consensusState_test: seed Header to a non-nil value before
  BeginNewSlot so the post-reset nil assertion is meaningful.
- slotConsensus_test: remove stray ConsensusGroup()[1] = "X" mutation
  from TestSlotConsensus_ResetValidationMap; it never affected any
  assertion and modeled an antipattern against the documented
  read-only contract.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
sharding/nodesCoordinator.go (1)

877-883: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider bytes.Clone for consistency with line 214.

GetSavedStateKey returns a slice header without cloning. While this is safe given that savedStateKey is only reassigned (never mutated in place), using bytes.Clone here would provide the same defensive guarantee applied at line 214 in EpochStartAction.

♻️ Suggested consistency improvement
 func (ihgs *indexHashedNodesCoordinator) GetSavedStateKey() []byte {
 	ihgs.mutSavedStateKey.RLock()
-	key := ihgs.savedStateKey
+	key := bytes.Clone(ihgs.savedStateKey)
 	ihgs.mutSavedStateKey.RUnlock()

 	return key
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@sharding/nodesCoordinator.go` around lines 877 - 883, GetSavedStateKey
currently returns ihgs.savedStateKey directly under lock, exposing the same
slice header; change it to return a defensive copy using bytes.Clone of
ihgs.savedStateKey while still holding the read lock. In other words, inside
GetSavedStateKey (for type indexHashedNodesCoordinator) acquire the RLock, set
keyCopy := bytes.Clone(ihgs.savedStateKey), release the RLock, and return
keyCopy — matching the defensive pattern used in EpochStartAction at the earlier
referenced location.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@sharding/nodesCoordinator.go`:
- Around line 877-883: GetSavedStateKey currently returns ihgs.savedStateKey
directly under lock, exposing the same slice header; change it to return a
defensive copy using bytes.Clone of ihgs.savedStateKey while still holding the
read lock. In other words, inside GetSavedStateKey (for type
indexHashedNodesCoordinator) acquire the RLock, set keyCopy :=
bytes.Clone(ihgs.savedStateKey), release the RLock, and return keyCopy —
matching the defensive pattern used in EpochStartAction at the earlier
referenced location.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b2be6c23-7c20-467e-ae61-19e662dadda7

📥 Commits

Reviewing files that changed from the base of the PR and between dd22602 and e4b6499.

📒 Files selected for processing (4)
  • core/consensus/slot/consensusState_test.go
  • core/consensus/slot/slotConsensus_test.go
  • core/process/block/block.go
  • sharding/nodesCoordinator.go
💤 Files with no reviewable changes (1)
  • core/consensus/slot/slotConsensus_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test
  • GitHub Check: Analyze (go)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (Custom checks)

**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.

Files:

  • sharding/nodesCoordinator.go
  • core/process/block/block.go
  • core/consensus/slot/consensusState_test.go
core/consensus/**

⚙️ CodeRabbit configuration file

core/consensus/**: This is the consensus engine. Review with extreme care: - Check for race conditions in concurrent block processing - Verify correct mutex/lock ordering to prevent deadlocks - Ensure deterministic behavior (no maps iteration without sorting, no random) - Validate message signing and verification logic - Flag any changes that could cause consensus forks or chain splits

Files:

  • core/consensus/slot/consensusState_test.go
**/*_test.go

⚙️ CodeRabbit configuration file

**/*_test.go: Test files. Review for: - Adequate coverage of edge cases and error paths - Proper use of test helpers and assertions - Race condition coverage (tests should use -race flag patterns) - No hardcoded sleep for synchronization (use channels or sync primitives) - Test isolation (no shared mutable state between tests)

Files:

  • core/consensus/slot/consensusState_test.go
🧠 Learnings (1)
📚 Learning: 2026-04-21T20:12:22.959Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 38
File: indexer/eventsProcessor.go:188-211
Timestamp: 2026-04-21T20:12:22.959Z
Learning: In Go structs that are JSON-marshaled, if a field is a `bool` and has the `json:"...,omitempty"` tag, then leaving that field at its zero value (`false`) is functionally equivalent (in the resulting JSON) to explicitly setting `Foundation: false`. Reviewers should not flag struct literals that omit such `bool` fields as an inconsistency; they will serialize identically because `omitempty` suppresses `false` values.

Applied to files:

  • sharding/nodesCoordinator.go
  • core/process/block/block.go
  • core/consensus/slot/consensusState_test.go
🔇 Additional comments (17)
core/consensus/slot/consensusState_test.go (2)

6-6: LGTM!


465-577: LGTM!

core/process/block/block.go (9)

7-7: LGTM!


127-130: LGTM!


149-149: LGTM!


191-191: LGTM!


217-223: LGTM!


225-242: LGTM!


244-264: LGTM!


266-282: LGTM!


284-308: LGTM!

sharding/nodesCoordinator.go (6)

75-78: LGTM!


118-120: LGTM!


211-216: LGTM!


275-288: LGTM!


482-484: LGTM!


787-800: LGTM!

@fbsobreira fbsobreira merged commit ee255e8 into develop May 13, 2026
12 of 14 checks passed
@fbsobreira fbsobreira deleted the bugfix/KLC-2382-consensus-data-races branch May 13, 2026 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants