server: fix peer add/done race between peerHandler and syncManager #2480
Aharonee wants to merge 3 commits into btcsuite:master
I experience the same issue. Wow, really need this.
server.go
Outdated
)

// peerLifecycleEvent represents a peer connection or disconnection event.
// Using a single channel for both event types guarantees FIFO ordering:
Do we have the "first-in" part? Can OnVerAck be delayed and send its part after "done" event is sent? E.g. if OnVerAck runs longer than negotiateTimeout.
Good catch, there seems to still be a potential race in that scenario.
I've pushed a commit which changes the peerDoneHandler into peerLifecycleHandler, and delegates responsibility for both add peer and done peer events to it.
That way a single goroutine will manage synchronization and correct ordering of the peer lifecycle events.
Does that make sense?
This change looks good to me!
peerDoneHandler ran as a separate goroutine per peer and independently notified both peerHandler (via donePeers channel) and the sync manager (via syncManager.DonePeer) about a peer disconnect. Because these two sends were unsynchronized, the sync manager could observe DonePeer before NewPeer when a peer connected and disconnected quickly. This caused the sync manager to log "unknown peer", then later register the already-dead peer as a sync candidate that was never cleaned up, potentially leaving it stuck with a dead sync peer.
Two structural changes eliminate the race:
1. Merge the newPeers and donePeers channels into a single peerLifecycle channel. Since OnVerAck (add) always fires before WaitForDisconnect returns (done), a single FIFO channel guarantees peerHandler always processes add before done for a given peer, removing the select-ambiguity where Go could pick done first.
2. Move the syncManager.DonePeer call and orphan eviction from peerDoneHandler into handleDonePeerMsg, which runs inside peerHandler. All sync manager peer lifecycle notifications now originate from the single peerHandler goroutine and flow into sm.msgChan in guaranteed add-before-done order.
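The failure mode described in this commit message can be illustrated with a toy model. The sketch below is not btcd code: toySyncManager, event, and replay are hypothetical names, and the bookkeeping is reduced to a candidate set and one sync peer. It only shows why the order of add and done events arriving on a single FIFO channel decides whether a dead peer is left behind.

```go
package main

import "fmt"

// event models a peer lifecycle notification. The names here are
// illustrative and are not the btcd types.
type event struct {
	add  bool
	peer string
}

// toySyncManager is a hypothetical stand-in for the sync manager's peer
// bookkeeping: a set of candidates plus the currently selected sync peer.
type toySyncManager struct {
	peers    map[string]bool
	syncPeer string
}

func newToySyncManager() *toySyncManager {
	return &toySyncManager{peers: make(map[string]bool)}
}

// handle applies a single event. A done event for an unknown peer is only
// reported; an add event registers the peer and may select it as sync peer.
func (m *toySyncManager) handle(ev event) {
	if ev.add {
		m.peers[ev.peer] = true
		if m.syncPeer == "" {
			m.syncPeer = ev.peer
		}
		return
	}
	if !m.peers[ev.peer] {
		fmt.Printf("done for unknown peer %s\n", ev.peer)
		return
	}
	delete(m.peers, ev.peer)
	if m.syncPeer == ev.peer {
		m.syncPeer = ""
	}
}

// replay feeds events through one FIFO channel and returns the final state.
func replay(events ...event) *toySyncManager {
	ch := make(chan event, len(events))
	for _, ev := range events {
		ch <- ev
	}
	close(ch)

	m := newToySyncManager()
	for ev := range ch {
		m.handle(ev)
	}
	return m
}

func main() {
	good := replay(event{true, "p1"}, event{false, "p1"})
	bad := replay(event{false, "p1"}, event{true, "p1"})
	fmt.Printf("add then done: syncPeer=%q\n", good.syncPeer)
	fmt.Printf("done then add: syncPeer=%q\n", bad.syncPeer)
}
```

In the done-then-add replay the model keeps p1 as its sync peer even though no further done event will ever arrive for it, which is the stuck state the fix is meant to prevent.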
Address review feedback on the peer add/done race fix:
- Make peerLifecycleHandler (renamed from peerDoneHandler) the sole sender of both peerAdd and peerDone events for each peer. OnVerAck now closes a signal channel (verAckCh) instead of sending directly, and peerLifecycleHandler selects on verAckCh vs peer.Done() to decide whether to send peerAdd before peerDone. This guarantees ordering by construction: a single goroutine sends both events sequentially, eliminating the negotiateTimeout race window.
- Add Done() method to peer.Peer exposing the quit channel read-only, enabling select-based disconnect detection from server code.
- Remove the now-unused AddPeer method.
- Address style feedback: 80-char line limit, empty lines between switch cases, break long function calls, use require.GreaterOrEqualf instead of if+Fatalf, bump syncRaceConcurrency to 300 for backpressure testing, add TestPreVerackDisconnect for disconnect prior to verack.
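For readers unfamiliar with the pattern, a Done() accessor of the kind mentioned above usually just returns the internal quit channel as a receive-only channel. The following is a minimal sketch under that assumption, not the PR's implementation: toyPeer, its sync.Once guard, and the disconnected helper are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// toyPeer is a hypothetical, stripped-down stand-in for peer.Peer.
type toyPeer struct {
	quit     chan struct{}
	quitOnce sync.Once // illustration only; not how btcd guards quit
}

func newToyPeer() *toyPeer {
	return &toyPeer{quit: make(chan struct{})}
}

// Disconnect closes quit exactly once.
func (p *toyPeer) Disconnect() {
	p.quitOnce.Do(func() { close(p.quit) })
}

// Done exposes quit as a receive-only channel, so callers can select on
// it but cannot close it or send on it.
func (p *toyPeer) Done() <-chan struct{} {
	return p.quit
}

// disconnected reports whether Done is closed, without blocking.
func disconnected(p *toyPeer) bool {
	select {
	case <-p.Done():
		return true
	default:
		return false
	}
}

func main() {
	p := newToyPeer()
	fmt.Println(disconnected(p))
	p.Disconnect()
	fmt.Println(disconnected(p))
}
```

Returning the channel with the type <-chan struct{} is what makes the exposure read-only: the compiler rejects close or send operations on it from outside the peer.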
// peerAdd is always enqueued before peerDone.
func (s *server) peerLifecycleHandler(sp *serverPeer) {
	// Wait for the handshake to complete or the peer to
	// disconnect, whichever comes first.
	select {
	case <-sp.verAckCh:
		s.peerLifecycle <- peerLifecycleEvent{
			action: peerAdd, sp: sp,
		}

	case <-sp.Peer.Done():
		// Disconnected before verack; no peerAdd needed.
	}
If both sp.verAckCh and sp.Peer.Done() have messages to receive, select chooses pseudorandomly among them. So peerAdd can be skipped even if VerAckReceived is true, and handleDonePeerMsg will call DonePeer for an unknown peer.
Does it make sense to prioritize receiving from sp.verAckCh or check VerAckReceived if sp.Peer.Done() fired?
I think it should be fine to skip add peer if done peer event has already occurred.
After all, the peer has disconnected so we can avoid notifying the server of a new peer just to notify it right away after to remove it.
My main concern was done peer being processed before add peer, but done peer processing for an unknown peer that has already disconnected seems harmless.
You are right. My proposal would only improve log message clarity (avoiding "unknown peer" being logged), not the correctness of the code itself. It is optional.
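The optional prioritization discussed in this thread could be done with a two-step select: block on both channels, and if the disconnect case is chosen, poll the verack channel once more without blocking. The sketch below is one common way to express channel priority in Go; verAckWins is a hypothetical helper and is not taken from the PR.

```go
package main

import "fmt"

// verAckWins reports whether a peerAdd should be sent. It blocks until one
// of the channels is ready. If done is selected, verAck is polled once
// more without blocking, so a ready verAck always takes priority.
func verAckWins(verAck, done <-chan struct{}) bool {
	select {
	case <-verAck:
		return true
	case <-done:
	}

	// done fired. verAck may also be ready, so check it once.
	select {
	case <-verAck:
		return true
	default:
		return false
	}
}

func main() {
	verAck := make(chan struct{})
	done := make(chan struct{})
	close(verAck)
	close(done)

	// Both channels are ready on every iteration.
	for i := 0; i < 1000; i++ {
		if !verAckWins(verAck, done) {
			fmt.Println("peerAdd skipped")
			return
		}
	}
	fmt.Println("peerAdd never skipped")
}
```

A plain two-case select would pick either ready case pseudorandomly; the second non-blocking select is what removes that nondeterminism when both channels are already closed.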
const (
	peerAdd peerLifecycleAction = iota
	peerAdd peerLifecycleAction = iota
this formatting change should belong to the first commit
I can squash both commits and force push if you prefer, but wouldn't it be more convenient for you to review the diff each time and only squash merge at the end?
Sure, let's keep them separate for now.
server.go
Outdated
	close(sp.verAckCh)
}
The current code version allows OnVerAck to be called only once. Should we safeguard for the future using sync.Once?
I think safeguarding this could potentially hide a bug, and if it is called twice we would prefer a loud panic.
This pattern is also used consistently in the codebase; for example, the Peer.quit channel is not safeguarded and is closed by Peer.Disconnect().
Hmm, maybe we can produce an error instead, if it is closed already?
    select {
    case <-sp.verAckCh:
        // Already closed: log an error here.
    default:
        close(sp.verAckCh)
    }
The error won't let it pass unnoticed, but at least it won't panic and crash. What do you think?
server.go
Outdated
// goroutine (peerLifecycleHandler), guaranteeing that peerAdd is
// always enqueued before peerDone.
I propose to adjust the comment to reflect that peerAdd may be skipped.
server.go
Outdated
	knownAddresses lru.Cache
	banScore       connmgr.DynamicBanScore
	quit           chan struct{}
	verAckCh       chan struct{} // closed when OnVerAck fires
Formatting:
// Closed when OnVerAck fires.
verAckCh chan struct{}
Prioritize verAckCh in peerLifecycleHandler select to avoid nondeterministic peerAdd skipping when both channels are ready. Guard OnVerAck against double-close by checking the channel before closing, logging an error instead of panicking. Adjust peerLifecycleEvent comment to reflect that peerAdd may be skipped when the peer disconnects before or concurrently with verack. Fix verAckCh field comment formatting.
Summary
Fix a race condition where the sync manager can permanently get stuck with a dead sync peer after rapid peer connect/disconnect cycles.
The Race Condition
peerDoneHandler ran as a separate goroutine per peer and independently notified two event loops about a disconnect:
- The donePeers channel (consumed by peerHandler).
- syncManager.DonePeer() directly (sends to sm.msgChan, consumed by blockHandler).
Meanwhile, peerHandler only called syncManager.NewPeer() when it processed the newPeers channel. Because these two paths were unsynchronized, blockHandler could observe DonePeer before NewPeer for the same peer.
A second vector existed even if DonePeer were moved into peerHandler: two separate buffered channels (newPeers/donePeers) let Go's select randomly pick the done case before the add case when both were ready simultaneously.
A third vector existed due to negotiateTimeout: if the 30s timeout in peer.Peer.start() fired between verAckReceived = true and the OnVerAck callback completing, peerDoneHandler could observe VerAckReceived() == true and send peerDone before the OnVerAck callback sent peerAdd.
Consequences: The sync manager receives DonePeer for an unknown peer (logged as a warning, no cleanup). Then NewPeer arrives for the already-dead peer -- the sync manager registers it as a candidate and potentially selects it as syncPeer. Since it is already disconnected, no subsequent DonePeer arrives to clear it. The node is stuck: it believes it has a sync peer, ignores new candidates, and never makes chain progress.
What Triggers It
Any scenario that produces rapid connect/disconnect cycles:
The Fix
Three structural changes eliminate all race vectors:
1. Merge newPeers and donePeers into a single peerLifecycle channel. A single FIFO channel eliminates the select-ambiguity vector where Go's select could pick done before add.
2. Move syncManager.DonePeer() and orphan eviction into handleDonePeerMsg. All sync manager notifications now flow through the peerHandler goroutine.
3. Make peerLifecycleHandler (renamed from peerDoneHandler) the sole sender of both peerAdd and peerDone for each peer. OnVerAck no longer sends to the channel directly; it closes a signal channel (verAckCh). peerLifecycleHandler selects on verAckCh vs peer.Done() (new method exposing the peer's quit channel), sends peerAdd if verack was received, then waits for disconnect and sends peerDone. Because both sends originate from the same goroutine, ordering is guaranteed by construction -- no cross-goroutine synchronization or bookkeeping needed.
Reproducing on master (without the fix)
The included integration test can demonstrate the corruption on an unpatched master branch:
    git checkout master
    git checkout bugfix/peer_race_condition -- integration/sync_race_test.go
    go test -tags=rpctest -v -run TestSyncManagerRaceCorruption ./integration/ -count=10 -timeout 900s
Test Plan
- go build ./... compiles cleanly
- go test -tags=rpctest -v -run TestSyncManagerRaceCorruption ./integration/ -count=10 -timeout 900s passes
- TestPreVerackDisconnect passes (disconnect before verack)
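The two cases in the test plan (normal handshake followed by disconnect, and disconnect before verack) can be modelled without btcd at all. The sketch below is a simplified, hypothetical model of the single-sender idea from the third change: lifecycle, lifecycleEvent, and run are illustrative names, and the channels stand in for verAckCh and the peer's quit channel. It is not the PR's test or handler code.

```go
package main

import "fmt"

type action int

const (
	add action = iota
	done
)

type lifecycleEvent struct {
	action action
	id     int
}

// lifecycle is a hypothetical model of the per-peer goroutine. It is the
// only sender of both events for its peer, so add can never follow done.
func lifecycle(id int, verAck, quit <-chan struct{}, out chan<- lifecycleEvent) {
	select {
	case <-verAck:
		out <- lifecycleEvent{add, id}
		<-quit // wait for the disconnect

	case <-quit:
		// Disconnected before verack; no add event.
	}
	out <- lifecycleEvent{done, id}
}

// run drives one peer and returns the actions in the order they arrived.
func run(handshake bool) []action {
	verAck := make(chan struct{})
	quit := make(chan struct{})
	out := make(chan lifecycleEvent)
	go lifecycle(1, verAck, quit, out)

	var got []action
	if handshake {
		close(verAck)
		got = append(got, (<-out).action)
	}
	close(quit)
	got = append(got, (<-out).action)
	return got
}

func main() {
	fmt.Println("with handshake:", run(true))
	fmt.Println("pre-verack disconnect:", run(false))
}
```

In this model the driver only closes quit after it has received the add event, so the handshake case is deterministic. Covering the case where both channels become ready at the same time would additionally need a priority rule in the select.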