fix(consensus): stabilize TestByzantinePrevoteEquivocation flake by rootulp · Pull Request #2950 · celestiaorg/celestia-core

rootulp · 2026-04-17T18:20:57Z

Summary

Stabilize TestByzantinePrevoteEquivocation — the most frequent flake in celestia-core CI (8 of the last 19 failed Test workflow runs on main between 2026-02-09 and 2026-04-15). Five small, independently-valuable changes:

1. Send both conflicting prevotes to every peer (consensus/byzantine_test.go). Splitting the votes across peers (one variant to half, the other to the rest) does not work reliably: the consensus reactor's HasVote gossip optimization (consensus/reactor.go:PickVoteToSend) excludes a validator's index from gossip selection once any peer reports holding any vote from that validator, regardless of which BlockID the vote is for. Once the byz-bit is set in each peer's HasVote bitarray, no peer ever forwards its own variant to peers holding the other variant, so no peer sees both conflicting votes and DuplicateVoteEvidence cannot form via gossip. Sending both votes directly to every peer makes the conflict detectable on first receipt without relying on gossip. Upstream maintainer cason identified this exact mechanism in cometbft/cometbft#1917, "consensus: TestByzantinePrevoteEquivocation is flaky" ("a node does not send one of the conflicting Prevote messages to a peer because it realized that that peer does not need it anymore").
2. Use the upstream mock ticker (consensus/byzantine_test.go). Upstream's TestByzantinePrevoteEquivocation uses newMockTickerFunc(true) (cometbft/cometbft internal/consensus/byzantine_test.go:45); the celestia-core fork was using a real NewTimeoutTicker(). With a real ticker, TimeoutPropose can fire on the byzantine before the proposal arrives via gossip, causing doPrevote to run with rs.ProposalBlock = nil; prevote1 then collapses to a vote for nil identical to prevote2 and no equivocation evidence forms. The mock ticker fires the new-height timer once at startup and then never again; consensus advances purely on +2/3 thresholds, eliminating that race.
3. Peer-mesh wait (consensus/byzantine_test.go). MakeConnectedSwitches dials synchronously, but the p2p handshake and peer registration complete asynchronously. Without a wait, the byzantine node could reach its doPrevote override with a partial peer set and fire conflicting votes at fewer than nValidators-1 peers, which the evidence pool cannot reconstruct into DuplicateVoteEvidence. A new require.Eventually call polls each reactor's peer set (10s budget, 50ms tick) until every validator sees nValidators-1 peers.
4. Two-stage polling replaces the 120s deadline. Stage 1 polls each validator's evidence pool (30s / 100ms) for a DuplicateVoteEvidence at the expected height. Stage 2 polls each validator's block store (60s / 200ms) for any block containing evidence. The split isolates "evidence detected" from "evidence committed" so that future failures point at the correct failing stage. Test wall time drops from up to 120s to ~0.5s on a passing run.
5. Structured instrumentation. [byz] log lines capture the peer count at prevote time and the peer.Send return values, surfacing in CI logs so future flakes can be diagnosed without a local repro.
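
The two-stage polling described above can be sketched with the standard library alone. This is a minimal illustration, not the test's code: the real change uses require.Eventually against evidence-pool and block-store handles, whereas here `eventually`, `waitForEvidence`, `trueAfter`, and the budget constants are hypothetical names, and the budgets are scaled down from 30s/100ms and 60s/200ms so the demo runs quickly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative budgets, scaled down from the PR's 30s/100ms and 60s/200ms.
const (
	detectBudget = 300 * time.Millisecond
	detectTick   = 1 * time.Millisecond
	commitBudget = 600 * time.Millisecond
	commitTick   = 2 * time.Millisecond
)

var (
	errNotDetected  = errors.New("stage 1: evidence never reached the pool")
	errNotCommitted = errors.New("stage 2: evidence never committed in a block")
)

// eventually polls cond every tick until it returns true or budget elapses.
func eventually(cond func() bool, budget, tick time.Duration) bool {
	deadline := time.Now().Add(budget)
	for !cond() {
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
	return true
}

// waitForEvidence runs the two stages in order and reports which one failed.
// detected and committed stand in for "pool holds DuplicateVoteEvidence" and
// "some stored block contains evidence" (assumed predicates).
func waitForEvidence(detected, committed func() bool) error {
	if !eventually(detected, detectBudget, detectTick) {
		return errNotDetected
	}
	if !eventually(committed, commitBudget, commitTick) {
		return errNotCommitted
	}
	return nil
}

// trueAfter returns a condition that becomes true on its nth call.
func trueAfter(n int) func() bool {
	calls := 0
	return func() bool {
		calls++
		return calls >= n
	}
}

func main() {
	// Passing run: both stages succeed after a few polls.
	fmt.Println(waitForEvidence(trueAfter(3), trueAfter(2)))
	// Failing run: the error names the stage that timed out.
	fmt.Println(waitForEvidence(func() bool { return false }, trueAfter(1)))
}
```

Because each stage has its own error, a timeout says whether the conflict was never detected or was detected but never committed, which is the diagnostic split the PR is after.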

This PR supersedes the timeout-bump approaches in #2810 and #2850 — the old block-watch-limit and bare time.After are deleted entirely.

Test plan

go test -tags deadlock -v -run '^TestByzantinePrevoteEquivocation$' ./consensus/ — passes locally in ~0.5s
Local stress test: 50/50 pass with this PR (vs. 5/20 fails on `main` — 25% flake rate on macOS)
CI is green on this PR
Re-run CI a few times to observe flake rate over multiple runs (target: 0 failures)

Closes #2200.
Closes https://linear.app/celestia/issue/PROTOCO-580/flaky-test-testbyzantineprevoteequivocation-in-consensusbyzantine

🤖 Generated with Claude Code

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

evan-forbes

We were going to do something like this before. Generally not opposed; however, this test is notoriously complex yet important.

I had codex run locally to confirm and actually still hit a failure, but I'm unsure if this is separate.

Failure was at /home/evan/src/celestiaorg/worktrees/celestia-core-pr-2950/consensus/byzantine_test.go:310: evidence pool never received DuplicateVoteEvidence at height 2. The log showed the new peer wait succeeded in the important way: peer count was 3, and all three equivocation sends returned sent=true. So either the peer-mesh wait is insufficient, or the new PendingEvidence polling is racing a transient pending state. Either way, this PR does not yet justify “stabilize” for this test.

rootulp · 2026-05-05T14:41:01Z

[Claude generated]

Thanks for the careful review @evan-forbes. You were right that the original three changes weren't sufficient: the failure you hit locally is the same flake the PR was supposed to fix. Digging deeper revealed a fourth root cause that the peer-mesh wait alone could not address:

The consensus reactor's HasVote gossip optimization (consensus/reactor.go:PickVoteToSend) subtracts each peer's HasVote bitarray from the local vote-set bitarray when picking a vote to gossip. The bitarray is indexed by validator, not by (validator, BlockID). So once any peer reports "byz has voted at h/r/type", every other peer's gossip routine sees that index as already covered and never forwards its own variant.

When the byzantine splits the votes (one variant to peer A, the other to peers B and C), peer A holds prevote1 (block) and peers B/C hold prevote2 (nil). Peer A broadcasts HasVote saying "byz has voted"; peers B/C do the same; everyone marks the byz-bit set in their peer-state for everyone else. Now peer A's gossip won't send prevote1 to B or C (their peer-state says they already have byz's vote), and B/C's gossip won't send prevote2 to A. Gossip cannot propagate the conflict between honest peers. The split-vote attack only forms DuplicateVoteEvidence when peer A's HasVote for prevote1 happens to outrun B/C's HasVote for prevote2; otherwise no peer sees both votes and the evidence pool stays empty, which is exactly your failure.
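
A toy model of that selection rule may make the blind spot easier to see. This is a simplified sketch, not the reactor's PickVoteToSend: the `vote` type, the `pickVoteToSend` function, and the plain `[]bool` bitarray are assumptions made for illustration. The only property it reproduces is that eligibility is decided per validator index, with BlockID ignored.

```go
package main

import "fmt"

// vote is a simplified prevote: who signed it and for which block.
// An empty blockID stands for a nil prevote.
type vote struct {
	validator int
	blockID   string
}

// pickVoteToSend returns the first local vote whose validator index is not
// yet set in the peer's HasVote bits. BlockID is never consulted, which is
// the property described above.
func pickVoteToSend(ours []vote, peerHasVote []bool) (vote, bool) {
	for _, v := range ours {
		if !peerHasVote[v.validator] {
			return v, true
		}
	}
	return vote{}, false
}

func main() {
	const byz = 3 // index of the byzantine validator (illustrative)

	// Peer A holds the block variant of the byzantine prevote.
	peerA := []vote{{validator: byz, blockID: "block"}}

	// Before B announces anything, A would forward its variant to B.
	bBits := make([]bool, 4)
	_, ok := pickVoteToSend(peerA, bBits)
	fmt.Println("A forwards before B's HasVote:", ok)

	// B received the nil variant and announced HasVote for index byz.
	bBits[byz] = true
	_, ok = pickVoteToSend(peerA, bBits)
	fmt.Println("A forwards after B's HasVote:", ok)
}
```

In this model, once B's bit for the byzantine index is set, A has nothing left to send even though B holds a different vote from the same validator.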

4d64bb1b2 sends both prevote1 and prevote2 directly to every peer, so each peer sees the conflict on first receipt without relying on gossip. The peer.Send returns are still logged so any future failures point cleanly at delivery vs. consensus-state issues. Locally on macOS this drops the stress-script failure rate from 5/20 on main to ~0/20 in the common case (some residual races remain when the receivers commit block 2 before processing both votes — happy to chase those in a follow-up if CI surfaces them).

rootulp · 2026-05-07T02:46:15Z

Quick update: I investigated the test on upstream cometbft and the same flake exists there too (cometbft/cometbft#1917, #2353). Upstream maintainer @cason identified the same root causes I found — the HasVote gossip optimization preventing conflicting-vote propagation, and receivers committing the next height before processing the second conflicting vote. They closed #1917 as deprioritized, and #2353 ("Evidence may not work consistently") is still open.

One concrete difference between upstream and the celestia-core fork: upstream uses newMockTickerFunc(true) for this test (reference) — the fork was using a real NewTimeoutTicker(). With a real ticker, TimeoutPropose can race with proposal gossip and cause doPrevote to run before rs.ProposalBlock is set; both signed prevotes then collapse to votes for nil and no equivocation evidence forms.
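
The collapse can be illustrated with a small model of the byzantine's prevote step. This is a sketch under assumptions: `prevote`, `signPrevotes`, and `equivocates` are hypothetical names, and the real doPrevote override signs full votes rather than bare hashes.

```go
package main

import "fmt"

// prevote carries only the block hash it votes for; "" means a nil prevote.
type prevote struct{ blockHash string }

// signPrevotes models the byzantine step: prevote1 votes for the proposal
// block if one is known, prevote2 always votes nil.
func signPrevotes(proposalBlock *string) (prevote, prevote) {
	p1 := prevote{}
	if proposalBlock != nil {
		p1.blockHash = *proposalBlock
	}
	p2 := prevote{} // nil prevote
	return p1, p2
}

// equivocates reports whether the two prevotes actually conflict.
func equivocates(a, b prevote) bool { return a.blockHash != b.blockHash }

func main() {
	hash := "B1"

	// Proposal arrived before the prevote step: the votes conflict.
	p1, p2 := signPrevotes(&hash)
	fmt.Println("proposal known, equivocation:", equivocates(p1, p2))

	// TimeoutPropose fired first: both votes are nil, nothing to detect.
	p1, p2 = signPrevotes(nil)
	fmt.Println("proposal missing, equivocation:", equivocates(p1, p2))
}
```

With no proposal block, both votes are identical, so honest peers receive a duplicate rather than a conflict.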

14e55357f switches to the upstream mock ticker. Local stress is now 50/50 on macOS (was 30/30 with real ticker, where Linux CI still flaked once at that rate). Re-running CI to confirm.

Captures git state, task statuses, stress-harness wall-clock estimates, and an embedded resume prompt so a fresh Claude Code session on a different machine can pick up at Task 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Captures per-validator evpool and blockStore handles into slices during setup, and emits grep-friendly [byz] and [val N] log lines tracking peer count at prevote time, peer.Send return values, and per-block evidence pool state. These surface in CI logs so future flakes can be diagnosed without a local repro. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MakeConnectedSwitches dials synchronously but p2p handshake and peer registration happen asynchronously. Without this wait, the byzantine validator could reach doPrevote with a partial peer set, firing conflicting prevotes at fewer than (nValidators-1) peers — which the evidence pool cannot reconstruct into DuplicateVoteEvidence. Poll each reactor's peer set with require.Eventually (10s budget, 50ms interval) until every validator sees (nValidators-1) peers. Addresses #2200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
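
A rough sketch of such a mesh wait, using goroutines to stand in for asynchronous peer registration. Everything here is hypothetical scaffolding (`fakeReactor`, `fullMesh`, `waitForFullMesh`, and the scaled-down 500ms/5ms budget); the actual test polls the real reactors' peer sets with require.Eventually.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Illustrative budget, scaled down from the 10s/50ms used in the test.
const (
	meshBudget = 500 * time.Millisecond
	meshTick   = 5 * time.Millisecond
)

// fakeReactor stands in for a consensus reactor; it only counts peers.
type fakeReactor struct{ peers int32 }

func (r *fakeReactor) addPeer()      { atomic.AddInt32(&r.peers, 1) }
func (r *fakeReactor) numPeers() int { return int(atomic.LoadInt32(&r.peers)) }

func newReactors(n int) []*fakeReactor {
	rs := make([]*fakeReactor, n)
	for i := range rs {
		rs[i] = &fakeReactor{}
	}
	return rs
}

// fullMesh reports whether every reactor sees all the other reactors.
func fullMesh(reactors []*fakeReactor) bool {
	want := len(reactors) - 1
	for _, r := range reactors {
		if r.numPeers() != want {
			return false
		}
	}
	return true
}

// waitForFullMesh polls fullMesh until it holds or the budget runs out.
func waitForFullMesh(reactors []*fakeReactor) bool {
	deadline := time.Now().Add(meshBudget)
	for !fullMesh(reactors) {
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(meshTick)
	}
	return true
}

func main() {
	reactors := newReactors(4)
	// Dialing has "returned", but registration finishes later, per reactor.
	for i, r := range reactors {
		go func(i int, r *fakeReactor) {
			for p := 0; p < len(reactors)-1; p++ {
				time.Sleep(time.Duration(i+1) * 5 * time.Millisecond)
				r.addPeer()
			}
		}(i, r)
	}
	fmt.Println("full mesh right after dialing:", fullMesh(reactors))
	fmt.Println("full mesh after waiting:", waitForFullMesh(reactors))
}
```

The point of the wait is ordering: the byzantine's prevote step should only run once the second line would print true.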

…antine test Stage 1 polls each validator's evidence pool for a DuplicateVoteEvidence at the expected height with a 30s budget. Stage 2 polls each validator's block store for any block containing evidence with a 60s budget. The split isolates evidence detection from evidence commit so that future failures point at the correct failing stage. Also removes the now-unused blocksSubs event subscriptions and the goroutine block-watchers in favor of direct polling against the pool and block-store handles captured in the previous commit. Addresses #2200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the incremental timeout-bump entries from #2810 and #2850 with a single entry describing the final fix: mesh wait + two-stage require.Eventually polling. The timeout-bump entries describe work whose effect this PR undoes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Splitting the byzantine equivocation prevotes across peers (one variant to half, the other to the rest) is unreliable: the consensus reactor's HasVote gossip optimization (consensus/reactor.go:PickVoteToSend) excludes a validator's index from gossip selection once any peer reports holding any vote from that validator, regardless of which BlockID the vote is for. Once each peer's HasVote bitarray marks "byz has voted at h/r/type", no peer ever forwards its own variant to peers that hold the other variant, so no peer sees both conflicting votes and DuplicateVoteEvidence cannot form via gossip. Send both prevote1 (block) and prevote2 (nil) directly to every peer so the conflict is detectable on first receipt without relying on gossip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Even with both conflicting prevotes sent directly to every peer, the test still flakes on Linux CI because each honest peer must process both votes while still at the height they were signed for: with three honest validators, consensus already has +2/3 prevotes for the proposed block without the byzantine's vote, so a peer can commit and advance to the next height between processing the byzantine's first and second votes — at which point addVote silently drops the late vote (height mismatch) and no peer sees both conflicting votes. Have the byzantine equivocate at each of the next 5 heights and accept evidence at any of them; we only need *one* round to land cleanly. With independent attempts the residual flake rate drops sharply.
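
The commit-between-votes race can be modeled in a few lines. This is a deliberately simplified sketch: the `peer` type, its `addVote` and `commit` methods, and the evidence counter are assumptions for illustration, and the real addVote and evidence pool do considerably more. It only reproduces the rule stated above, that a vote for a height other than the peer's current one is dropped.

```go
package main

import "fmt"

// vote is a simplified prevote; an empty blockID stands for nil.
type vote struct {
	height    int64
	validator int
	blockID   string
}

// peer is a toy honest validator tracking one height at a time.
type peer struct {
	height      int64
	byValidator map[int]string // first blockID seen per validator
	evidence    int            // conflicts detected so far
}

func newPeer(height int64) *peer {
	return &peer{height: height, byValidator: map[int]string{}}
}

// addVote drops votes for other heights and counts a conflict when the same
// validator votes for two different blocks at the current height.
func (p *peer) addVote(v vote) bool {
	if v.height != p.height {
		return false // late or early vote: silently dropped
	}
	if prev, seen := p.byValidator[v.validator]; seen {
		if prev != v.blockID {
			p.evidence++
		}
		return false
	}
	p.byValidator[v.validator] = v.blockID
	return true
}

// commit advances to the next height and forgets the old height's votes.
func (p *peer) commit() {
	p.height++
	p.byValidator = map[int]string{}
}

func main() {
	const byz = 3

	inTime := newPeer(2)
	inTime.addVote(vote{height: 2, validator: byz, blockID: "B"})
	inTime.addVote(vote{height: 2, validator: byz, blockID: ""})
	fmt.Println("both votes processed at height 2, evidence:", inTime.evidence)

	late := newPeer(2)
	late.addVote(vote{height: 2, validator: byz, blockID: "B"})
	late.commit() // honest +2/3 already reached without the byzantine
	late.addVote(vote{height: 2, validator: byz, blockID: ""})
	fmt.Println("commit between the two votes, evidence:", late.evidence)
}
```

In the second scenario the peer has moved to height 3 by the time the nil prevote arrives, so the conflict is never observed.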

…Equivocation Upstream cometbft uses newMockTickerFunc(true) for this exact test (internal/consensus/byzantine_test.go:45). Real timers race with proposal gossip — if TimeoutPropose fires on the byzantine before the proposal arrives, doPrevote runs without a complete proposal block, prevote1 collapses to a vote for nil identical to prevote2, and no equivocation evidence forms. Switching to the upstream mock ticker eliminates that race. The mock fires the new-height timer once at startup, then never; consensus advances purely on +2/3 thresholds. Local stress: 50/50 pass with mock ticker (was 30/30 with real ticker; CI Linux flaked once even at that rate). Upstream still has known residual flakiness for the same test (cometbft/cometbft#2353 — "Evidence may not work consistently"). The upstream maintainer attributes it to the consensus reactor's HasVote gossip optimization not propagating conflicting votes — addressed in the previous commit on this PR by sending both votes to every peer instead of splitting them.

rootulp self-assigned this Apr 17, 2026

rootulp requested a review from evan-forbes April 30, 2026 21:15

rootulp marked this pull request as ready for review April 30, 2026 21:15

rootulp requested a review from a team as a code owner April 30, 2026 21:15

devin-ai-integration Bot reviewed Apr 30, 2026

evan-forbes reviewed May 4, 2026

Comment thread .changelog/unreleased/bug-fixes/2200-fix-byzantine-prevote-equivocation-flake.md Outdated

Comment thread scripts/stress_test_flake.sh Outdated

rootulp marked this pull request as draft May 5, 2026 13:39

rootulp marked this pull request as ready for review May 7, 2026 02:56

rootulp enabled auto-merge (squash) May 7, 2026 02:56

rootulp marked this pull request as draft May 7, 2026 03:43

auto-merge was automatically disabled May 7, 2026 03:43
Pull request was converted to draft

rootulp mentioned this pull request May 8, 2026

flaky: TestNodeNewNodeCustomReactors in node #3024

Open

rootulp and others added 14 commits May 11, 2026 21:26

docs(plan): add plan to stabilize TestByzantinePrevoteEquivocation

b31a87e

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(test): add stress harness for flaky test repro

76f300f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(test): harden stress harness shell-safety

4c7f400

Replace while-read subshell with a for-loop glob to avoid missing -r/IFS=, add a fallback message when no log matches FAIL|panic, and guard mktemp -d. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: remove docs/superpowers planning artifacts

de194b2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Revert "test(consensus): equivocate at multiple heights to mitigate C…

701c35f

…I race" This reverts commit b9933b5.

chore: remove changelog entry

7821c40

This repo no longer maintains a .changelog/ directory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rootulp force-pushed the worktree-lazy-wobbling-steele branch from dda95aa to 7821c40 Compare May 12, 2026 04:26

chore: remove stress_test_flake.sh

edf39c1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rootulp marked this pull request as ready for review May 12, 2026 04:32

rootulp wants to merge 15 commits into main from worktree-lazy-wobbling-steele
