fix(consensus): stabilize TestByzantinePrevoteEquivocation flake#2950

Open
rootulp wants to merge 15 commits into main from worktree-lazy-wobbling-steele

Conversation

Collaborator

@rootulp rootulp commented Apr 17, 2026

Summary

Stabilize TestByzantinePrevoteEquivocation — the most frequent flake in celestia-core CI (8 of the last 19 failed Test workflow runs on main between 2026-02-09 and 2026-04-15). Five small, independently-valuable changes:

  1. Send both conflicting prevotes to every peer (consensus/byzantine_test.go). Splitting the votes across peers (one variant to half, the other to the rest) does not work reliably: the consensus reactor's HasVote gossip optimization (consensus/reactor.go:PickVoteToSend) excludes a validator's index from gossip selection once any peer reports holding any vote from that validator, regardless of which BlockID the vote is for. Once the byz-bit is set in each peer's HasVote bitarray, no peer ever forwards its own variant to peers holding the other variant, so no peer sees both conflicting votes and DuplicateVoteEvidence cannot form via gossip. Sending both votes directly to every peer makes the conflict detectable on first receipt without relying on gossip. Upstream maintainer cason identified this exact mechanism in consensus: TestByzantinePrevoteEquivocation is flaky cometbft/cometbft#1917 ("a node does not send one of the conflicting Prevote messages to a peer because it realized that that peer does not need it anymore").

  2. Use the upstream mock ticker (consensus/byzantine_test.go). Upstream's TestByzantinePrevoteEquivocation uses newMockTickerFunc(true) (cometbft/cometbft internal/consensus/byzantine_test.go:45) — the celestia-core fork was using a real NewTimeoutTicker(). With a real ticker, TimeoutPropose can fire on the byzantine before the proposal arrives via gossip, causing doPrevote to run with rs.ProposalBlock = nil; prevote1 then collapses to a vote for nil identical to prevote2 and no equivocation evidence forms. The mock ticker fires the new-height timer once at startup and then never; consensus advances purely on +2/3 thresholds, eliminating that race.

  3. Peer-mesh wait (consensus/byzantine_test.go). MakeConnectedSwitches dials synchronously but p2p handshake and peer registration complete asynchronously. Without a wait, the byzantine node could reach its doPrevote override with a partial peer set and fire conflicting votes at fewer than nValidators-1 peers — which the evidence pool cannot reconstruct into DuplicateVoteEvidence. New require.Eventually polls each reactor's peer set (10s budget, 50ms tick) until every validator sees nValidators-1 peers.

  4. Two-stage polling replaces the 120s deadline. Stage 1 polls each validator's evidence pool (30s / 100ms) for a DuplicateVoteEvidence at the expected height. Stage 2 polls each validator's block store (60s / 200ms) for any block containing evidence. The split isolates "evidence detected" from "evidence committed" so that future failures point at the correct failing stage. Test wall time drops from up-to-120s to ~0.5s on a passing run.

  5. Structured instrumentation. [byz] log lines capture peer count at prevote time and peer.Send return values, surfacing in CI logs so future flakes can be diagnosed without a local repro.

This PR supersedes the timeout-bump approaches in #2810 and #2850 — the old block-watch-limit and bare time.After are deleted entirely.

Test plan

  • go test -tags deadlock -v -run '^TestByzantinePrevoteEquivocation$' ./consensus/ — passes locally in ~0.5s
  • Local stress test: 50/50 pass with this PR (vs. 5/20 fails on `main` — 25% flake rate on macOS)
  • CI is green on this PR
  • Re-run CI a few times to observe flake rate over multiple runs (target: 0 failures)

Closes #2200.
Closes https://linear.app/celestia/issue/PROTOCO-580/flaky-test-testbyzantineprevoteequivocation-in-consensusbyzantine

🤖 Generated with Claude Code

@rootulp rootulp self-assigned this Apr 17, 2026
@rootulp rootulp requested a review from evan-forbes April 30, 2026 21:15
@rootulp rootulp marked this pull request as ready for review April 30, 2026 21:15
@rootulp rootulp requested a review from a team as a code owner April 30, 2026 21:15

@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.


Member

@evan-forbes evan-forbes left a comment


we were going to do something like this before, so generally not opposed. However, this test is notoriously complex yet important.

I had codex run locally to confirm, and actually still hit a failure, but I'm unsure if this is separate.

Failure was at /home/evan/src/celestiaorg/worktrees/celestia-core-pr-2950/consensus/byzantine_test.go:310: evidence pool never received DuplicateVoteEvidence at height 2. The log showed the new peer wait succeeded in the important way: peer count was 3, and all three equivocation sends returned sent=true. So either the peer-mesh wait is insufficient, or the new PendingEvidence polling is racing a transient pending state. Either way, this PR does not yet justify “stabilize” for this test.

Comment thread .changelog/unreleased/bug-fixes/2200-fix-byzantine-prevote-equivocation-flake.md Outdated
Comment thread scripts/stress_test_flake.sh Outdated
@rootulp rootulp marked this pull request as draft May 5, 2026 13:39
Collaborator Author

rootulp commented May 5, 2026

[Claude generated]

Thanks for the careful review @evan-forbes. You were right that the original three changes weren't sufficient — the failure you hit locally is the same flake the PR was supposed to fix. Digging deeper revealed a fourth root cause that the peer-mesh wait alone could not address:

The consensus reactor's HasVote gossip optimization (consensus/reactor.go:PickVoteToSend) subtracts each peer's HasVote bitarray from the local vote-set bitarray when picking a vote to gossip. The bitarray is indexed by validator, not by (validator, BlockID). So once any peer reports "byz has voted at h/r/type", every other peer's gossip routine sees that index already covered and never forwards its own variant.

When the byzantine splits the votes (one variant to peer A, the other to peers B and C), peer A holds prevote1 (block) and peers B/C hold prevote2 (nil). Peer A broadcasts HasVote saying "byz has voted"; peers B/C do the same; everyone marks the byz-bit set in their peer-state for everyone else. Now peer A's gossip won't send prevote1 to B or C (their peer-state says they already have byz's vote), and B/C's gossip won't send prevote2 to A. Gossip cannot propagate the conflict between honest peers. The split-vote attack only forms DuplicateVoteEvidence when peer A's HasVote for prevote1 happens to outrun B/C's HasVote for prevote2; otherwise no peer sees both votes and the evidence pool stays empty — exactly your failure.

4d64bb1b2 sends both prevote1 and prevote2 directly to every peer, so each peer sees the conflict on first receipt without relying on gossip. The peer.Send returns are still logged so any future failures point cleanly at delivery vs. consensus-state issues. Locally on macOS this drops the stress-script failure rate from 5/20 on main to ~0/20 in the common case (some residual races remain when the receivers commit block 2 before processing both votes — happy to chase those in a follow-up if CI surfaces them).

Collaborator Author

rootulp commented May 7, 2026

Quick update: I investigated the test on upstream cometbft and the same flake exists there too (cometbft/cometbft#1917, #2353). Upstream maintainer @cason identified the same root causes I found — the HasVote gossip optimization preventing conflicting-vote propagation, and receivers committing the next height before processing the second conflicting vote. They closed #1917 as deprioritized, and #2353 ("Evidence may not work consistently") is still open.

One concrete difference between upstream and the celestia-core fork: upstream uses newMockTickerFunc(true) for this test (reference) — the fork was using a real NewTimeoutTicker(). With a real ticker, TimeoutPropose can race with proposal gossip and cause doPrevote to run before rs.ProposalBlock is set; both signed prevotes then collapse to votes for nil and no equivocation evidence forms.

14e55357f switches to the upstream mock ticker. Local stress is now 50/50 on macOS (was 30/30 with real ticker, where Linux CI still flaked once at that rate). Re-running CI to confirm.

@rootulp rootulp marked this pull request as ready for review May 7, 2026 02:56
@rootulp rootulp enabled auto-merge (squash) May 7, 2026 02:56
@rootulp rootulp marked this pull request as draft May 7, 2026 03:43
auto-merge was automatically disabled May 7, 2026 03:43

Pull request was converted to draft

rootulp and others added 14 commits May 11, 2026 21:26
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace while-read subshell with a for-loop glob to avoid missing -r/IFS=,
add a fallback message when no log matches FAIL|panic, and guard mktemp -d.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures git state, task statuses, stress-harness wall-clock estimates,
and an embedded resume prompt so a fresh Claude Code session on a
different machine can pick up at Task 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures per-validator evpool and blockStore handles into slices during
setup, and emits grep-friendly [byz] and [val N] log lines tracking peer
count at prevote time, peer.Send return values, and per-block evidence
pool state. These surface in CI logs so future flakes can be diagnosed
without a local repro.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MakeConnectedSwitches dials synchronously but p2p handshake and peer
registration happen asynchronously. Without this wait, the byzantine
validator could reach doPrevote with a partial peer set, firing
conflicting prevotes at fewer than (nValidators-1) peers — which the
evidence pool cannot reconstruct into DuplicateVoteEvidence. Poll each
reactor's peer set with require.Eventually (10s budget, 50ms interval)
until every validator sees (nValidators-1) peers.

Addresses #2200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…antine test

Stage 1 polls each validator's evidence pool for a DuplicateVoteEvidence
at the expected height with a 30s budget. Stage 2 polls each validator's
block store for any block containing evidence with a 60s budget. The
split isolates evidence detection from evidence commit so that future
failures point at the correct failing stage.

Also removes the now-unused blocksSubs event subscriptions and the
goroutine block-watchers in favor of direct polling against the pool
and block-store handles captured in the previous commit.

Addresses #2200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the incremental timeout-bump entries from #2810 and #2850
with a single entry describing the final fix: mesh wait + two-stage
require.Eventually polling. The timeout-bump entries describe work
whose effect this PR undoes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splitting the byzantine equivocation prevotes across peers (one variant
to half, the other to the rest) is unreliable: the consensus reactor's
HasVote gossip optimization (consensus/reactor.go:PickVoteToSend)
excludes a validator's index from gossip selection once any peer
reports holding any vote from that validator, regardless of which
BlockID the vote is for. Once each peer's HasVote bitarray marks "byz
has voted at h/r/type", no peer ever forwards its own variant to peers
that hold the other variant, so no peer sees both conflicting votes
and DuplicateVoteEvidence cannot form via gossip.

Send both prevote1 (block) and prevote2 (nil) directly to every peer
so the conflict is detectable on first receipt without relying on
gossip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Even with both conflicting prevotes sent directly to every peer, the
test still flakes on Linux CI because each honest peer must process
both votes while still at the height they were signed for: with three
honest validators, consensus already has +2/3 prevotes for the
proposed block without the byzantine's vote, so a peer can commit and
advance to the next height between processing the byzantine's first
and second votes — at which point addVote silently drops the late
vote (height mismatch) and no peer sees both conflicting votes.

Have the byzantine equivocate at each of the next 5 heights and accept
evidence at any of them; we only need *one* round to land cleanly.
With independent attempts the residual flake rate drops sharply.
…Equivocation

Upstream cometbft uses newMockTickerFunc(true) for this exact test
(internal/consensus/byzantine_test.go:45). Real timers race with
proposal gossip — if TimeoutPropose fires on the byzantine before the
proposal arrives, doPrevote runs without a complete proposal block,
prevote1 collapses to a vote for nil identical to prevote2, and no
equivocation evidence forms.

Switching to the upstream mock ticker eliminates that race. The mock
fires the new-height timer once at startup, then never; consensus
advances purely on +2/3 thresholds. Local stress: 50/50 pass with
mock ticker (was 30/30 with real ticker; CI Linux flaked once even at
that rate).

Upstream still has known residual flakiness for the same test
(cometbft/cometbft#2353 — "Evidence may not work consistently"). The
upstream maintainer attributes it to the consensus reactor's HasVote
gossip optimization not propagating conflicting votes — addressed in
the previous commit on this PR by sending both votes to every peer
instead of splitting them.
This repo no longer maintains a .changelog/ directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rootulp rootulp force-pushed the worktree-lazy-wobbling-steele branch from dda95aa to 7821c40 Compare May 12, 2026 04:26
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rootulp rootulp marked this pull request as ready for review May 12, 2026 04:32


Development

Successfully merging this pull request may close these issues.

Flaky test: TestByzantinePrevoteEquivocation in consensus/byzantine_test.go
