fix(p2p/pex): de-flake TestPEXReactorRunning by ignoring duplicate-peer dial rejections by rootulp · Pull Request #3033 · celestiaorg/celestia-core

rootulp · 2026-05-11T16:32:05Z

Summary

TestPEXReactorRunning has been intermittently failing on celestia-core CI (see 25653712587, which failed on a clean push to main). Locally it reproduces in ~20% of runs without race and ~5% under -race.

Root cause: when two switches simultaneously dial each other, the loser's outbound DialPeerWithAddress returns ErrSwitchDuplicatePeerID or ErrRejected{isDuplicate: true, id: <peer ID>} because the inbound connection has already been added to the peer set. The PEX reactor previously treated these as ordinary dial failures: it stamped attemptsToDial[addr] = (count+1, now), which then blocks redials for minTimeBetweenDials = 30s. The test's 10s assertion window can't recover from that, even though the peer is actually connected (just via the other direction).

The fix is symmetric with the existing handling of ErrCurrentlyDialingOrExistingAddress — treat duplicate-peer-ID rejections from the address we were trying to reach as success-equivalent: don't mark the address bad, don't bump the attempt counter, don't backoff. The test timeout is also bumped from 10s to 30s for headroom under -race.

Per review feedback (#discussion_r3221861340), the check is narrowed to require that the duplicate peer ID in the rejection match addr.ID. Duplicate-IP rejections (ErrRejected with isDuplicate=true but no ID) take the normal failure path, since the colliding peer may not be the one we wanted — e.g. an unrelated peer (potentially malicious) sharing an IP shouldn't be able to suppress backoff for an honest address.

Changes

p2p/errors.go: add ErrRejected.DuplicatePeerID() (ID, bool) — returns the peer ID only when the rejection was caused by a duplicate ID, so callers can distinguish ID vs IP duplicates.
p2p/pex/pex_reactor.go: in dialPeer, skip the backoff/attempt-counter only when the duplicate rejection names the same peer ID we dialed (for both ErrRejected and ErrSwitchDuplicatePeerID).
p2p/pex/pex_reactor_test.go: bump TestPEXReactorRunning assertion timeout 10s → 30s.

Test plan

go test -race -count=1000 -run TestPEXReactorRunning ./p2p/pex/ — same flake rate as the prior commit on this branch (~1%), still a large improvement over the ~5% rate before the fix. Note: the original PR description's "0 failures in 200 runs" was statistical luck at the ~1% residual rate; the residual flake is unrelated to the duplicate-rejection handling and not introduced by this change.
go test -race ./p2p/... passes
make lint passes

When two peers simultaneously dial each other, the loser's outbound DialPeerWithAddress can return ErrSwitchDuplicatePeerID or ErrRejected{isDuplicate: true} because the inbound connection won the race into the peer set. The PEX reactor previously treated these as generic dial failures: it incremented the attempt counter and stamped lastDialed=now, locking the address out of redials for the 30s minTimeBetweenDials window even though the peer is actually connected. In TestPEXReactorRunning's small mesh (3 switches, EnsurePeersPeriod 250ms, 10s assertion timeout) the test couldn't recover within its window once this race tripped, producing the recurring flake on celestia-core CI. Treat duplicate-peer rejections like ErrCurrentlyDialingOrExistingAddress and bump the assertion timeout to 30s for extra headroom under -race. Locally, this brings the failure rate over 200 -race runs from ~5% to 0%. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

evan-forbes · 2026-05-11T20:15:46Z

+		// Treat duplicate-peer rejections as a successful dial: the peer is
+		// already connected (typically via an inbound connection raced against
+		// our outbound dial), so we shouldn't penalize this address with the
+		// 30s dial backoff or increment its attempt counter.
+		if e, ok := err.(p2p.ErrRejected); ok && e.IsDuplicate() {
+			r.attemptsToDial.Delete(addr.DialString())
+			return err
+		}
+		if _, ok := err.(p2p.ErrSwitchDuplicatePeerID); ok {
+			r.attemptsToDial.Delete(addr.DialString())
+			return err
+		}


did a little bit of digging with codex as my spidy senses were going off with the assumption,

the peer is already connected (typically via an inbound connection raced against our outbound dial)

this seems plausible to occur, however assuming this is the only cause is definitely not a safe assumption. Malicious peers can totally dial from the same ip. With this change, we'd be deleting the counter for that case as well (presumably the reason why the counter was added in the first place). Likely not a huge deal but I don't think we should merge as is.

IsDuplicate might be weird wording. it is telling us that this is a duplicate but not the reason why.

to fix this properly I think we should do

func (e ErrRejected) DuplicatePeerID() (ID, bool) { if e.isDuplicate && e.id != "" { return e.id, true } return "", false } // Then PEX can do: if e, ok := err.(p2p.ErrRejected); ok { if id, ok := e.DuplicatePeerID(); ok && id == addr.ID { r.attemptsToDial.Delete(addr.DialString()) return err } }

Good catch — agreed and applied in 8806f66.

Added ErrRejected.DuplicatePeerID() (ID, bool) per your suggestion.

dialPeer now requires the returned ID to match addr.ID before clearing attemptsToDial.

Applied the same e.ID == addr.ID gate to the ErrSwitchDuplicatePeerID branch for symmetry.

Duplicate-IP rejections (ErrRejected{isDuplicate: true} with no ID, from ConnDuplicateIPFilter / filterConn) now fall through to the normal failure path, so a noisy/malicious neighbor on a shared IP can't suppress backoff for an honest peer.

Verified go test -race -count=1000 -run TestPEXReactorRunning ./p2p/pex/ flake rate is unchanged (~1%) relative to the prior commit — confirms the narrower check still covers the racing-inbound case this PR targets.

Address review feedback: IsDuplicate() conflates duplicate-IP and duplicate-peer-ID rejections, but the racing-inbound case we want to exempt from backoff is specifically the peer-ID one. A duplicate-IP rejection could come from an unrelated peer sharing an IP, so resetting the attempt counter there could let a noisy neighbor effectively suppress backoff for an honest peer. - Add ErrRejected.DuplicatePeerID() (ID, bool) accessor that returns the peer ID only when the rejection was caused by a duplicate ID (id set). - In pex_reactor.dialPeer, require the rejected peer ID to match addr.ID before clearing attemptsToDial. - Apply the same id == addr.ID gate to ErrSwitchDuplicatePeerID. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rootulp requested a review from a team as a code owner May 11, 2026 16:32

rootulp requested review from evan-forbes and removed request for a team May 11, 2026 16:32

rootulp self-assigned this May 11, 2026

rootulp enabled auto-merge (squash) May 11, 2026 16:32

evan-forbes requested changes May 11, 2026

View reviewed changes

rootulp marked this pull request as draft May 12, 2026 04:20

auto-merge was automatically disabled May 12, 2026 04:20
Pull request was converted to draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(p2p/pex): de-flake TestPEXReactorRunning by ignoring duplicate-peer dial rejections#3033

fix(p2p/pex): de-flake TestPEXReactorRunning by ignoring duplicate-peer dial rejections#3033
rootulp wants to merge 2 commits into
mainfrom
claude/great-lichterman-29af71

rootulp commented May 11, 2026 •

edited

Loading

Uh oh!

evan-forbes May 11, 2026

Uh oh!

rootulp May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rootulp commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Uh oh!

evan-forbes May 11, 2026

Choose a reason for hiding this comment

Uh oh!

rootulp May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rootulp commented May 11, 2026 •

edited

Loading