fix(consensus): stabilize TestByzantinePrevoteEquivocation flake#2950
fix(consensus): stabilize TestByzantinePrevoteEquivocation flake#2950rootulp wants to merge 15 commits into
Conversation
evan-forbes
left a comment
There was a problem hiding this comment.
we were going to do something like this before generally not opposed however this test is notoriously complex yet important
I had codex run locally to confirm and actually still hit a failure but I'm unsure if this is separate
Failure was at /home/evan/src/celestiaorg/worktrees/celestia-core-pr-2950/consensus/byzantine_test.go:310: evidence pool never received
DuplicateVoteEvidence at height 2. The log showed the new peer wait succeeded in the important way: peer count was 3, and all three equivocation sends
returned sent=true. So either the peer-mesh wait is insufficient, or the new PendingEvidence polling is racing a transient pending state. Either way,
this PR does not yet justify “stabilize” for this test.
|
[Claude generated] Thanks for the careful review @evan-forbes. You were right that the original three changes werent sufficient — the failure you hit locally is the same flake the PR was supposed to fix. Digging deeper revealed a fourth root cause that the peer-mesh wait alone could not address: The consensus reactors When the byzantine splits the votes (one variant to peer A, the other to peers B and C), peer A holds
|
|
Quick update: I investigated the test on upstream cometbft and the same flake exists there too (cometbft/cometbft#1917, #2353). Upstream maintainer @cason identified the same root causes I found — the HasVote gossip optimization preventing conflicting-vote propagation, and receivers committing the next height before processing the second conflicting vote. They closed #1917 as deprioritized, and #2353 ("Evidence may not work consistently") is still open. One concrete difference between upstream and the celestia-core fork: upstream uses
|
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace while-read subshell with a for-loop glob to avoid missing -r/IFS=, add a fallback message when no log matches FAIL|panic, and guard mktemp -d. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures git state, task statuses, stress-harness wall-clock estimates, and an embedded resume prompt so a fresh Claude Code session on a different machine can pick up at Task 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures per-validator evpool and blockStore handles into slices during setup, and emits grep-friendly [byz] and [val N] log lines tracking peer count at prevote time, peer.Send return values, and per-block evidence pool state. These surface in CI logs so future flakes can be diagnosed without a local repro. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MakeConnectedSwitches dials synchronously but p2p handshake and peer registration happen asynchronously. Without this wait, the byzantine validator could reach doPrevote with a partial peer set, firing conflicting prevotes at fewer than (nValidators-1) peers — which the evidence pool cannot reconstruct into DuplicateVoteEvidence. Poll each reactor's peer set with require.Eventually (10s budget, 50ms interval) until every validator sees (nValidators-1) peers. Addresses #2200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…antine test Stage 1 polls each validator's evidence pool for a DuplicateVoteEvidence at the expected height with a 30s budget. Stage 2 polls each validator's block store for any block containing evidence with a 60s budget. The split isolates evidence detection from evidence commit so that future failures point at the correct failing stage. Also removes the now-unused blocksSubs event subscriptions and the goroutine block-watchers in favor of direct polling against the pool and block-store handles captured in the previous commit. Addresses #2200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splitting the byzantine equivocation prevotes across peers (one variant to half, the other to the rest) is unreliable: the consensus reactor's HasVote gossip optimization (consensus/reactor.go:PickVoteToSend) excludes a validator's index from gossip selection once any peer reports holding any vote from that validator, regardless of which BlockID the vote is for. Once each peer's HasVote bitarray marks "byz has voted at h/r/type", no peer ever forwards its own variant to peers that hold the other variant, so no peer sees both conflicting votes and DuplicateVoteEvidence cannot form via gossip. Send both prevote1 (block) and prevote2 (nil) directly to every peer so the conflict is detectable on first receipt without relying on gossip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Even with both conflicting prevotes sent directly to every peer, the test still flakes on Linux CI because each honest peer must process both votes while still at the height they were signed for: with three honest validators, consensus already has +2/3 prevotes for the proposed block without the byzantine's vote, so a peer can commit and advance to the next height between processing the byzantine's first and second votes — at which point addVote silently drops the late vote (height mismatch) and no peer sees both conflicting votes. Have the byzantine equivocate at each of the next 5 heights and accept evidence at any of them; we only need *one* round to land cleanly. With independent attempts the residual flake rate drops sharply.
…I race" This reverts commit b9933b5.
…Equivocation Upstream cometbft uses newMockTickerFunc(true) for this exact test (internal/consensus/byzantine_test.go:45). Real timers race with proposal gossip — if TimeoutPropose fires on the byzantine before the proposal arrives, doPrevote runs without a complete proposal block, prevote1 collapses to a vote for nil identical to prevote2, and no equivocation evidence forms. Switching to the upstream mock ticker eliminates that race. The mock fires the new-height timer once at startup, then never; consensus advances purely on +2/3 thresholds. Local stress: 50/50 pass with mock ticker (was 30/30 with real ticker; CI Linux flaked once even at that rate). Upstream still has known residual flakiness for the same test (cometbft/cometbft#2353 — "Evidence may not work consistently"). The upstream maintainer attributes it to the consensus reactor's HasVote gossip optimization not propagating conflicting votes — addressed in the previous commit on this PR by sending both votes to every peer instead of splitting them.
This repo no longer maintains a .changelog/ directory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dda95aa to
7821c40
Compare
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Stabilize
TestByzantinePrevoteEquivocation— the most frequent flake in celestia-core CI (8 of the last 19 failedTestworkflow runs on main between 2026-02-09 and 2026-04-15). Five small, independently-valuable changes:Send both conflicting prevotes to every peer (
consensus/byzantine_test.go). Splitting the votes across peers (one variant to half, the other to the rest) does not work reliably: the consensus reactor'sHasVotegossip optimization (consensus/reactor.go:PickVoteToSend) excludes a validator's index from gossip selection once any peer reports holding any vote from that validator, regardless of which BlockID the vote is for. Once the byz-bit is set in each peer'sHasVotebitarray, no peer ever forwards its own variant to peers holding the other variant, so no peer sees both conflicting votes andDuplicateVoteEvidencecannot form via gossip. Sending both votes directly to every peer makes the conflict detectable on first receipt without relying on gossip. Upstream maintainer cason identified this exact mechanism in consensus: TestByzantinePrevoteEquivocation is flaky cometbft/cometbft#1917 ("a node does not send one of the conflicting Prevote messages to a peer because it realized that that peer does not need it anymore").Use the upstream mock ticker (
consensus/byzantine_test.go). Upstream's TestByzantinePrevoteEquivocation usesnewMockTickerFunc(true)(cometbft/cometbftinternal/consensus/byzantine_test.go:45) — the celestia-core fork was using a realNewTimeoutTicker(). With a real ticker,TimeoutProposecan fire on the byzantine before the proposal arrives via gossip, causingdoPrevoteto run withrs.ProposalBlock = nil; prevote1 then collapses to a vote for nil identical to prevote2 and no equivocation evidence forms. The mock ticker fires the new-height timer once at startup and then never; consensus advances purely on +2/3 thresholds, eliminating that race.Peer-mesh wait (
consensus/byzantine_test.go).MakeConnectedSwitchesdials synchronously but p2p handshake and peer registration complete asynchronously. Without a wait, the byzantine node could reach itsdoPrevoteoverride with a partial peer set and fire conflicting votes at fewer thannValidators-1peers — which the evidence pool cannot reconstruct intoDuplicateVoteEvidence. Newrequire.Eventuallypolls each reactor's peer set (10s budget, 50ms tick) until every validator seesnValidators-1peers.Two-stage polling replaces the 120s deadline. Stage 1 polls each validator's evidence pool (30s / 100ms) for a
DuplicateVoteEvidenceat the expected height. Stage 2 polls each validator's block store (60s / 200ms) for any block containing evidence. The split isolates "evidence detected" from "evidence committed" so that future failures point at the correct failing stage. Test wall time drops from up-to-120s to ~0.5s on a passing run.Structured instrumentation.
[byz]log lines capture peer count at prevote time andpeer.Sendreturn values, surfacing in CI logs so future flakes can be diagnosed without a local repro.This PR supersedes the timeout-bump approaches in #2810 and #2850 — the old block-watch-limit and bare
time.Afterare deleted entirely.Test plan
go test -tags deadlock -v -run '^TestByzantinePrevoteEquivocation$' ./consensus/— passes locally in ~0.5sCloses #2200.
Closes https://linear.app/celestia/issue/PROTOCO-580/flaky-test-testbyzantineprevoteequivocation-in-consensusbyzantine
🤖 Generated with Claude Code