Skip to content

fix(p2p/pex): de-flake TestPEXReactorRunning by ignoring duplicate-peer dial rejections#3033

Draft
rootulp wants to merge 2 commits into
mainfrom
claude/great-lichterman-29af71
Draft

fix(p2p/pex): de-flake TestPEXReactorRunning by ignoring duplicate-peer dial rejections#3033
rootulp wants to merge 2 commits into
mainfrom
claude/great-lichterman-29af71

Conversation

@rootulp
Copy link
Copy Markdown
Collaborator

@rootulp rootulp commented May 11, 2026

Summary

TestPEXReactorRunning has been intermittently failing on celestia-core CI (see 25653712587, which failed on a clean push to main). Locally it reproduces in ~20% of runs without race and ~5% under -race.

Root cause: when two switches simultaneously dial each other, the loser's outbound DialPeerWithAddress returns ErrSwitchDuplicatePeerID or ErrRejected{isDuplicate: true, id: <peer ID>} because the inbound connection has already been added to the peer set. The PEX reactor previously treated these as ordinary dial failures: it stamped attemptsToDial[addr] = (count+1, now), which then blocks redials for minTimeBetweenDials = 30s. The test's 10s assertion window can't recover from that, even though the peer is actually connected (just via the other direction).

The fix is symmetric with the existing handling of ErrCurrentlyDialingOrExistingAddress — treat duplicate-peer-ID rejections from the address we were trying to reach as success-equivalent: don't mark the address bad, don't bump the attempt counter, don't backoff. The test timeout is also bumped from 10s to 30s for headroom under -race.

Per review feedback (#discussion_r3221861340), the check is narrowed to require that the duplicate peer ID in the rejection match addr.ID. Duplicate-IP rejections (ErrRejected with isDuplicate=true but no ID) take the normal failure path, since the colliding peer may not be the one we wanted — e.g. an unrelated peer (potentially malicious) sharing an IP shouldn't be able to suppress backoff for an honest address.

Changes

  • p2p/errors.go: add ErrRejected.DuplicatePeerID() (ID, bool) — returns the peer ID only when the rejection was caused by a duplicate ID, so callers can distinguish ID vs IP duplicates.
  • p2p/pex/pex_reactor.go: in dialPeer, skip the backoff/attempt-counter only when the duplicate rejection names the same peer ID we dialed (for both ErrRejected and ErrSwitchDuplicatePeerID).
  • p2p/pex/pex_reactor_test.go: bump TestPEXReactorRunning assertion timeout 10s → 30s.

Test plan

  • go test -race -count=1000 -run TestPEXReactorRunning ./p2p/pex/ — same flake rate as the prior commit on this branch (~1%), still a large improvement over the ~5% rate before the fix. Note: the original PR description's "0 failures in 200 runs" was statistical luck at the ~1% residual rate; the residual flake is unrelated to the duplicate-rejection handling and not introduced by this change.
  • go test -race ./p2p/... passes
  • make lint passes

When two peers simultaneously dial each other, the loser's outbound
DialPeerWithAddress can return ErrSwitchDuplicatePeerID or
ErrRejected{isDuplicate: true} because the inbound connection won the
race into the peer set. The PEX reactor previously treated these as
generic dial failures: it incremented the attempt counter and stamped
lastDialed=now, locking the address out of redials for the 30s
minTimeBetweenDials window even though the peer is actually connected.

In TestPEXReactorRunning's small mesh (3 switches, EnsurePeersPeriod
250ms, 10s assertion timeout) the test couldn't recover within its
window once this race tripped, producing the recurring flake on
celestia-core CI. Treat duplicate-peer rejections like
ErrCurrentlyDialingOrExistingAddress and bump the assertion timeout to
30s for extra headroom under -race. Locally, this brings the failure
rate over 200 -race runs from ~5% to 0%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rootulp rootulp requested a review from a team as a code owner May 11, 2026 16:32
@rootulp rootulp requested review from evan-forbes and removed request for a team May 11, 2026 16:32
@rootulp rootulp self-assigned this May 11, 2026
@rootulp rootulp enabled auto-merge (squash) May 11, 2026 16:32
Comment thread p2p/pex/pex_reactor.go Outdated
Comment on lines +563 to +574
// Treat duplicate-peer rejections as a successful dial: the peer is
// already connected (typically via an inbound connection raced against
// our outbound dial), so we shouldn't penalize this address with the
// 30s dial backoff or increment its attempt counter.
if e, ok := err.(p2p.ErrRejected); ok && e.IsDuplicate() {
r.attemptsToDial.Delete(addr.DialString())
return err
}
if _, ok := err.(p2p.ErrSwitchDuplicatePeerID); ok {
r.attemptsToDial.Delete(addr.DialString())
return err
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did a little bit of digging with codex as my spidy senses were going off with the assumption,

the peer is already connected (typically via an inbound connection raced against our outbound dial)

this seems plausible to occur, however assuming this is the only cause is definitely not a safe assumption. Malicious peers can totally dial from the same ip. With this change, we'd be deleting the counter for that case as well (presumably the reason why the counter was added in the first place). Likely not a huge deal but I don't think we should merge as is.

IsDuplicate might be weird wording. it is telling us that this is a duplicate but not the reason why.

to fix this properly I think we should do

  func (e ErrRejected) DuplicatePeerID() (ID, bool) {
      if e.isDuplicate && e.id != "" {
          return e.id, true
      }
      return "", false
  }

  // Then PEX can do:

  if e, ok := err.(p2p.ErrRejected); ok {
      if id, ok := e.DuplicatePeerID(); ok && id == addr.ID {
          r.attemptsToDial.Delete(addr.DialString())
          return err
      }
  }

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — agreed and applied in 8806f66.

  • Added ErrRejected.DuplicatePeerID() (ID, bool) per your suggestion.
  • dialPeer now requires the returned ID to match addr.ID before clearing attemptsToDial.
  • Applied the same e.ID == addr.ID gate to the ErrSwitchDuplicatePeerID branch for symmetry.

Duplicate-IP rejections (ErrRejected{isDuplicate: true} with no ID, from ConnDuplicateIPFilter / filterConn) now fall through to the normal failure path, so a noisy/malicious neighbor on a shared IP can't suppress backoff for an honest peer.

Verified go test -race -count=1000 -run TestPEXReactorRunning ./p2p/pex/ flake rate is unchanged (~1%) relative to the prior commit — confirms the narrower check still covers the racing-inbound case this PR targets.

@rootulp rootulp marked this pull request as draft May 12, 2026 04:20
auto-merge was automatically disabled May 12, 2026 04:20

Pull request was converted to draft

Address review feedback: IsDuplicate() conflates duplicate-IP and
duplicate-peer-ID rejections, but the racing-inbound case we want to
exempt from backoff is specifically the peer-ID one. A duplicate-IP
rejection could come from an unrelated peer sharing an IP, so resetting
the attempt counter there could let a noisy neighbor effectively suppress
backoff for an honest peer.

- Add ErrRejected.DuplicatePeerID() (ID, bool) accessor that returns the
  peer ID only when the rejection was caused by a duplicate ID (id set).
- In pex_reactor.dialPeer, require the rejected peer ID to match
  addr.ID before clearing attemptsToDial.
- Apply the same id == addr.ID gate to ErrSwitchDuplicatePeerID.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants