Skip to content

[KLC-2395] fix parallel subtest data race in HonestyScore test#47

Closed
fbsobreira wants to merge 1 commit into
developfrom
bugfix/KLC-2395-honesty-score-parallel-race
Closed

[KLC-2395] fix parallel subtest data race in HonestyScore test#47
fbsobreira wants to merge 1 commit into
developfrom
bugfix/KLC-2395-honesty-score-parallel-race

Conversation

@fbsobreira
Copy link
Copy Markdown
Member

@fbsobreira fbsobreira commented May 12, 2026

Summary

TestSubslotSignature_ReceivedSignature_HonestyScore
(core/consensus/slot/bls/subslotSignature_test.go) failed intermittently
in CI. The parent test calls t.Parallel() and both of its table-driven
subtests also call t.Parallel() while sharing the outer-scope container
and subslotSignature. Each subtest re-registers the container's
PeerHonestyHandler with a closure that asserts against its own table row,
so when the subtests run concurrently one subtest's handler fires inside
the other's sr.ReceivedSignature call — the wrong-row assertions trip and
the eventual t.Fail on a *testing.T whose subtest has already returned
produces panic: Fail in goroutine after [subtest] has completed. That
recovered panic shows up in CI as the unnamed null FAIL.

The fix constructs container and subslotSignature inside each
subtest so the parallel runs no longer share mutable state. The table now
stores a ConsensusGroup() index instead of a pre-resolved pubKey, since
the pubKey is derived from each subtest's own fresh sr.

Verification

Reliably reproduced on develop:

go test -run TestSubslotSignature_ReceivedSignature_HonestyScore \
  ./core/consensus/slot/bls/... -count=500 -timeout 5m              # fails within ~200 iters
go test -run TestSubslotSignature_ReceivedSignature_HonestyScore \
  ./core/consensus/slot/bls/... -count=200 -race -timeout 5m         # WARNING: DATA RACE + panic

Post-fix on this branch:

go test -run TestSubslotSignature_ReceivedSignature_HonestyScore \
  ./core/consensus/slot/bls/... -count=1000 -timeout 5m              # ok, 0.344s
go test -run TestSubslotSignature_ReceivedSignature_HonestyScore \
  ./core/consensus/slot/bls/... -count=500  -race -timeout 5m        # ok, 1.512s

No panics, no race-detector warnings.

Related

  • KLC-2382 (similar flake class in integrationTest/consensus; same null
    CI artifact pattern from a recovered panic during test teardown).

Test plan

  • go build ./...
  • golangci-lint run ./core/consensus/slot/bls/... — 0 issues
  • 1000-iteration stress (plain): pass
  • 500-iteration stress (-race): pass
  • CI pipeline green on this PR

Summary

This PR fixes a critical data race in the TestSubslotSignature_ReceivedSignature_HonestyScore test within the BLS (Boneh-Lynn-Shacham) consensus slot signature verification component. The fix ensures proper test isolation for parallel execution and restores reliable validation of peer honesty score updates.

Blockchain-Critical Components Affected

The test validates the BLS subslot signature consensus component (core/consensus/slot/bls/subslotSignature.go), which is responsible for:

  • Verifying signatures from consensus nodes
  • Updating peer honesty scores via PeerHonestyHandler callbacks (defined in core/consensus/interface.go)
  • Determining when consensus signature collection is complete

These are core to consensus stability and validator reputation tracking, which affects node peer selection and consensus reliability.

Issue & Resolution

The Problem: The test and its subtests both called t.Parallel() while sharing mutable state (a container and subslotSignature instance) across parallel test cases. Each subtest registered a PeerHonestyHandler closure that captured test case-specific assertions. When subtests ran concurrently, a handler registered by one subtest could execute during another subtest's sr.ReceivedSignature() call, triggering assertion failures against wrong test data and causing panics like "Fail in goroutine after [subtest] has completed."

The Fix: Each parallel subtest now creates its own fresh container and subslotSignature instance, eliminating shared mutable state. The test cases now store a pubKeyIndex (index into the consensus group) instead of a precomputed pubKey string; each subtest derives its public key dynamically from its isolated instance.

Impact on Node Stability & Data Integrity

  • Stability: Fixes intermittent CI test failures that masked potential bugs in consensus signature handling and honesty scoring
  • Data Integrity: Ensures the honesty scoring mechanism (peer reputation tracking) is properly validated with clean test isolation
  • Concurrency: Resolves a concurrency anti-pattern in the test harness itself; verified with stress runs (1000 iterations plain, 500 with -race flag) showing no panics or race detector warnings

Testing Validation

  • Reproduced original failures on develop with go test -count=500 and -race -count=200
  • Verified fix passes stress tests with no concurrent execution issues
  • Related to KLC-2382 (similar test flake class)

Review Change Stack

TestSubslotSignature_ReceivedSignature_HonestyScore declared the parent test
parallel and ran its two table-driven subtests in parallel as well, while
they all shared the outer-scope container and subslotSignature. Each subtest
re-registered the container's PeerHonestyHandler with a closure that asserts
against its own table row. When the subtests run concurrently, one subtest's
handler fires inside the other's sr.ReceivedSignature call, the wrong-row
assertions trip, and the eventual t.Fail on a *testing.T whose subtest has
already returned produces "panic: Fail in goroutine after [subtest] has
completed". That recovered panic shows up in CI as the unnamed null FAIL.

Construct container + subslotSignature inside each subtest so the parallel
runs no longer share mutable state. Table now stores a ConsensusGroup index
so each subtest resolves its own pubKey from its own sr.

Reproduced reliably on develop with `go test -count=500` (or `-race -count=200`)
within the first ~200 iterations; with this change the same stress completes
cleanly at 1000 iterations plain and 500 iterations under -race.
Copilot AI review requested due to automatic review settings May 12, 2026 14:52
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an intermittent CI flake in TestSubslotSignature_ReceivedSignature_HonestyScore caused by parallel subtests sharing mutable test state (a consensus container and SubslotSignature) and overwriting the PeerHonestyHandler closure.

Changes:

  • Move container := mock.InitConsensusCore() and sr := *initSubslotSignatureWithContainer(container) initialization into each parallel subtest to avoid shared mutable state.
  • Replace table-driven test’s stored pubKey with pubKeyIndex (resolved from each subtest’s own freshly created sr).
  • Update assertions/message construction to use the per-subtest pubKey.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 60537b28-7e23-4d57-81a0-f91b3af843e9

📥 Commits

Reviewing files that changed from the base of the PR and between 3b06f69 and 6278e7a.

📒 Files selected for processing (1)
  • core/consensus/slot/bls/subslotSignature_test.go
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (Custom checks)

**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.

Files:

  • core/consensus/slot/bls/subslotSignature_test.go
core/consensus/**

⚙️ CodeRabbit configuration file

core/consensus/**: This is the consensus engine. Review with extreme care: - Check for race conditions in concurrent block processing - Verify correct mutex/lock ordering to prevent deadlocks - Ensure deterministic behavior (no maps iteration without sorting, no random) - Validate message signing and verification logic - Flag any changes that could cause consensus forks or chain splits

Files:

  • core/consensus/slot/bls/subslotSignature_test.go
**/*_test.go

⚙️ CodeRabbit configuration file

**/*_test.go: Test files. Review for: - Adequate coverage of edge cases and error paths - Proper use of test helpers and assertions - Race condition coverage (tests should use -race flag patterns) - No hardcoded sleep for synchronization (use channels or sync primitives) - Test isolation (no shared mutable state between tests)

Files:

  • core/consensus/slot/bls/subslotSignature_test.go
🧠 Learnings (1)
📚 Learning: 2026-04-21T20:12:22.959Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 38
File: indexer/eventsProcessor.go:188-211
Timestamp: 2026-04-21T20:12:22.959Z
Learning: In Go structs that are JSON-marshaled, if a field is a `bool` and has the `json:"...,omitempty"` tag, then leaving that field at its zero value (`false`) is functionally equivalent (in the resulting JSON) to explicitly setting `Foundation: false`. Reviewers should not flag struct literals that omit such `bool` fields as an inconsistency; they will serialize identically because `omitempty` suppresses `false` values.

Applied to files:

  • core/consensus/slot/bls/subslotSignature_test.go
🔇 Additional comments (2)
core/consensus/slot/bls/subslotSignature_test.go (2)

382-423: Good fix: per-subtest container and subslotSignature removes the shared-state race.

This directly addresses the flaky cross-subtest handler mutation and restores isolation under t.Parallel().


378-380: ⚡ Quick win

No changes needed. The code safely uses parallel subtests with range-variable capture under Go 1.25.7, which has built-in per-iteration rebinding since Go 1.22. The tc := tc pattern is unnecessary for this toolchain version.

			> Likely an incorrect or invalid review comment.

Walkthrough

This PR eliminates a data race in TestSubslotSignature_ReceivedSignature_HonestyScore by refactoring the test to compute validator public keys dynamically per parallel subtest rather than storing precomputed shared strings. The consensus container, subslot signature, and peer honesty handler are now freshly instantiated for each subtest execution.

Changes

Honesty Score Test Data Race Elimination

Layer / File(s) Summary
Test case structure and per-subtest computation
core/consensus/slot/bls/subslotSignature_test.go
Test case data now uses pubKeyIndex (index into ConsensusGroup()) instead of a precomputed pubKey string. The subtest runner creates a fresh consensus container and subslot signature per parallel execution with comments documenting the prior race condition. The SetPeerHonestyHandler closure, ConsensusMessage pubkey argument, and IsJobDone assertion all use the dynamically computed per-subtest pubKey instead of referencing shared test case data.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 7 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (7 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title follows the required format [KLC-XXXX] type: description with KLC-2395 as the ticket, 'fix' as the type, and a clear description of the parallel subtest data race fix.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Concurrency Safety ✅ Passed Data race fixed via isolation: container/sr moved into each subtest scope, handler closures capture local pubKey, no shared mutable state. No goroutines, proper channel scoping, zero resource leaks.
Error Handling ✅ Passed No new error handling issues introduced. Pre-existing error discarding in test helpers. No panic() calls. Data race fix does not add error issues.
State Consistency ✅ Passed This PR modifies only a test file to fix a data race. No production code affecting blockchain state (accounts, balances, storage) was changed. The state consistency check is not applicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch bugfix/KLC-2395-honesty-score-parallel-race

Comment @coderabbitai help to get the list of available commands and usage tips.

@fbsobreira
Copy link
Copy Markdown
Member Author

n/a - to be handle in KLC-2382

@fbsobreira fbsobreira closed this May 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants