fix(p2p/pex): de-flake TestPEXReactorRunning by ignoring duplicate-peer dial rejections#3033
fix(p2p/pex): de-flake TestPEXReactorRunning by ignoring duplicate-peer dial rejections#3033rootulp wants to merge 2 commits into
Conversation
When two peers simultaneously dial each other, the loser's outbound
DialPeerWithAddress can return ErrSwitchDuplicatePeerID or
ErrRejected{isDuplicate: true} because the inbound connection won the
race into the peer set. The PEX reactor previously treated these as
generic dial failures: it incremented the attempt counter and stamped
lastDialed=now, locking the address out of redials for the 30s
minTimeBetweenDials window even though the peer is actually connected.
In TestPEXReactorRunning's small mesh (3 switches, EnsurePeersPeriod
250ms, 10s assertion timeout) the test couldn't recover within its
window once this race tripped, producing the recurring flake on
celestia-core CI. Treat duplicate-peer rejections like
ErrCurrentlyDialingOrExistingAddress and bump the assertion timeout to
30s for extra headroom under -race. Locally, this brings the failure
rate over 200 -race runs from ~5% to 0%.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| // Treat duplicate-peer rejections as a successful dial: the peer is | ||
| // already connected (typically via an inbound connection raced against | ||
| // our outbound dial), so we shouldn't penalize this address with the | ||
| // 30s dial backoff or increment its attempt counter. | ||
| if e, ok := err.(p2p.ErrRejected); ok && e.IsDuplicate() { | ||
| r.attemptsToDial.Delete(addr.DialString()) | ||
| return err | ||
| } | ||
| if _, ok := err.(p2p.ErrSwitchDuplicatePeerID); ok { | ||
| r.attemptsToDial.Delete(addr.DialString()) | ||
| return err | ||
| } |
There was a problem hiding this comment.
did a little bit of digging with codex as my spidy senses were going off with the assumption,
the peer is already connected (typically via an inbound connection raced against our outbound dial)
this seems plausible to occur, however assuming this is the only cause is definitely not a safe assumption. Malicious peers can totally dial from the same ip. With this change, we'd be deleting the counter for that case as well (presumably the reason why the counter was added in the first place). Likely not a huge deal but I don't think we should merge as is.
IsDuplicate might be weird wording. it is telling us that this is a duplicate but not the reason why.
to fix this properly I think we should do
func (e ErrRejected) DuplicatePeerID() (ID, bool) {
if e.isDuplicate && e.id != "" {
return e.id, true
}
return "", false
}
// Then PEX can do:
if e, ok := err.(p2p.ErrRejected); ok {
if id, ok := e.DuplicatePeerID(); ok && id == addr.ID {
r.attemptsToDial.Delete(addr.DialString())
return err
}
}There was a problem hiding this comment.
Good catch — agreed and applied in 8806f66.
- Added
ErrRejected.DuplicatePeerID() (ID, bool)per your suggestion. dialPeernow requires the returned ID to matchaddr.IDbefore clearingattemptsToDial.- Applied the same
e.ID == addr.IDgate to theErrSwitchDuplicatePeerIDbranch for symmetry.
Duplicate-IP rejections (ErrRejected{isDuplicate: true} with no ID, from ConnDuplicateIPFilter / filterConn) now fall through to the normal failure path, so a noisy/malicious neighbor on a shared IP can't suppress backoff for an honest peer.
Verified go test -race -count=1000 -run TestPEXReactorRunning ./p2p/pex/ flake rate is unchanged (~1%) relative to the prior commit — confirms the narrower check still covers the racing-inbound case this PR targets.
Pull request was converted to draft
Address review feedback: IsDuplicate() conflates duplicate-IP and duplicate-peer-ID rejections, but the racing-inbound case we want to exempt from backoff is specifically the peer-ID one. A duplicate-IP rejection could come from an unrelated peer sharing an IP, so resetting the attempt counter there could let a noisy neighbor effectively suppress backoff for an honest peer. - Add ErrRejected.DuplicatePeerID() (ID, bool) accessor that returns the peer ID only when the rejection was caused by a duplicate ID (id set). - In pex_reactor.dialPeer, require the rejected peer ID to match addr.ID before clearing attemptsToDial. - Apply the same id == addr.ID gate to ErrSwitchDuplicatePeerID. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
TestPEXReactorRunninghas been intermittently failing on celestia-core CI (see 25653712587, which failed on a clean push tomain). Locally it reproduces in ~20% of runs without race and ~5% under-race.Root cause: when two switches simultaneously dial each other, the loser's outbound
DialPeerWithAddressreturnsErrSwitchDuplicatePeerIDorErrRejected{isDuplicate: true, id: <peer ID>}because the inbound connection has already been added to the peer set. The PEX reactor previously treated these as ordinary dial failures: it stampedattemptsToDial[addr] = (count+1, now), which then blocks redials forminTimeBetweenDials = 30s. The test's 10s assertion window can't recover from that, even though the peer is actually connected (just via the other direction).The fix is symmetric with the existing handling of
ErrCurrentlyDialingOrExistingAddress— treat duplicate-peer-ID rejections from the address we were trying to reach as success-equivalent: don't mark the address bad, don't bump the attempt counter, don't backoff. The test timeout is also bumped from 10s to 30s for headroom under-race.Per review feedback (#discussion_r3221861340), the check is narrowed to require that the duplicate peer ID in the rejection match
addr.ID. Duplicate-IP rejections (ErrRejectedwithisDuplicate=truebut no ID) take the normal failure path, since the colliding peer may not be the one we wanted — e.g. an unrelated peer (potentially malicious) sharing an IP shouldn't be able to suppress backoff for an honest address.Changes
p2p/errors.go: addErrRejected.DuplicatePeerID() (ID, bool)— returns the peer ID only when the rejection was caused by a duplicate ID, so callers can distinguish ID vs IP duplicates.p2p/pex/pex_reactor.go: indialPeer, skip the backoff/attempt-counter only when the duplicate rejection names the same peer ID we dialed (for bothErrRejectedandErrSwitchDuplicatePeerID).p2p/pex/pex_reactor_test.go: bumpTestPEXReactorRunningassertion timeout 10s → 30s.Test plan
go test -race -count=1000 -run TestPEXReactorRunning ./p2p/pex/— same flake rate as the prior commit on this branch (~1%), still a large improvement over the ~5% rate before the fix. Note: the original PR description's "0 failures in 200 runs" was statistical luck at the ~1% residual rate; the residual flake is unrelated to the duplicate-rejection handling and not introduced by this change.go test -race ./p2p/...passesmake lintpasses