fix(iroh-gossip): reap connection tasks and prune peer state on churn#146
fix(iroh-gossip): reap connection tasks and prune peer state on churn#146drlukeangel wants to merge 6 commits into
Conversation
Two connection-lifecycle fixes. (1) connection_loop ran the send and receive loops under tokio::join!, which returns only when BOTH finish; when a peer is removed locally send_fut ends but recv_fut blocks forever on accept_uni() — switched to tokio::select! so the loop returns as soon as either half finishes. (2) handle_connection_task_finished now removes the peer from self.peers on the active-connection-closed path (it previously sent PeerDisconnected but left the PeerState when the peer was not in the active_view). NOTE: select! alone does not stop the leak — see the following SendLoop fix, which is what actually lets send_fut resolve.
…eak) SendLoop::run's select! used `Some(msg) = self.send_rx.recv() => ...`. When a peer is removed from the gossip peer map its send_tx drops and recv() returns None, but the `Some(msg) =` pattern then SILENTLY DISABLES that branch instead of breaking. The only other live arm (`_ = &mut closed`) stays pending forever because the connection is still open, so SendLoop::run hangs, send_fut never resolves, connection_loop's select! never fires, and the Connection plus its background ConnectionDriver are stranded — one leaked connection per removed/rotated peer (worst on observers and multi-topic routers, which rotate peers constantly). Match the full Option and break on None. Evidence (rafka v2, 18-node 2-mesh chaos soak, kill+respawn churn): instrumented connection-task counters went from spawned=174 / finished=52 (122 stuck and climbing) to spawned approximately equals finished with live tasks bounded around 15. DHAT noq::connection::Connection 18.7 MB -> 3.85 MB; total retained heap ~50 MB -> ~25 MB. 35-min soak, observer RSS at 140 chaos events: 0.228 GB -> 0.115 GB (~66% lower).
The top-level peer_topics index (PeerId -> set of TopicId) was never pruned when a peer disconnected at the network level, so it grew with every peer ever seen under churn. Remove the peer entry on PeerDisconnected.
peer_data leaked when a peer was fully discarded from the active view (remove_active_by_index with keep_as_passive=false) and on passive-view eviction (add_passive); alive_disconnect_peers leaked on passive eviction and on pending-neighbor-request timeout. Added the matching removals so per-peer protocol metadata is reclaimed under churn.
lazy_push_queue retained entries for neighbors that went down, growing under churn. Prune the peer on on_neighbor_down.
|
related #147 |
Adopts the seven peer-state eviction prunes from n0-computer#146 (drlukeangel): two `peers`-map reclaims in net.rs (on dial failure and active send-connection close) and five proto-layer prunes (hyparview passive-view eviction, pending-neighbor timeout, and active-view discard; plumtree's lazy_push_queue on neighbor-down; state's peer_topics on PeerDisconnected). Each left one per-peer entry behind under churn with rotating node ids — byte-scale next to the connection leak fixed in the previous commit, but unbounded. Adds the regression tests n0-computer#146 lacks: five in-module tests that seed the relevant map, trigger the eviction, and assert the entry is pruned; each was verified to fail without its prune line.
|
Hi @drlukeangel 👋 Looks like we independently landed on the same fix for the same problem. I'd been chasing a memory leak I could reliably reproduce on both a local and a distributed mesh. You'd also caught several peer-state eviction gaps I'd missed. Please feel free to take anything useful from #147 and consolidate everything onto yours. Happy to close mine in favour of it. One heads-up from the same investigation: this connection leak is the dominant driver of the growth, but there's a smaller, separate one in noq too Thanks for the great work here! 🙇♂ |
|
Hi @caiogondim! 👋 Thank you so much for the incredibly detailed investigation and for those regression tests in #147! 🙇♂️ It's awesome that we independently landed on the exact same fixes. Your regression tests are fantastic and add exactly the verification this PR needed. I've gone ahead and folded all 6 of your regression tests into this PR so that we have the complete fix and test coverage in one place. I've also pushed the final updates just now. Since everything is consolidated here now, please feel free to close #147 whenever you're ready. Thanks again for your awesome work and collaboration on tracking this down! 🚀 |
4f417c3 to
790db7f
Compare
Description
While running iroh-gossip as the mesh layer of a downstream project, we hit steadily growing
memory under sustained peer churn (peers continuously joining and leaving, each with a fresh
node id). Investigation found one dominant connection-task leak plus five smaller peer-state
eviction gaps. This PR fixes all of them; they're independent, so happy to split if you'd prefer.
First — thank you for iroh-gossip. Being able to lean on HyParview + Plumtree over iroh instead of
hand-rolling gossip and membership let us delete a large amount of networking code, and the actor
design made this leak tractable to trace.
The dominant fix —
SendLoopdoesn't exit when its send channel closes (src/net/util.rs)SendLoop::run's select used:When a peer is removed from the gossip peer map, its
send_txis dropped andsend_rx.recv()returns
None. With theSome(msg) = …form, aNonesilently disables that select branch(rather than yielding the
None). The only other persistent arm,_ = &mut closed, then stayspending forever because the connection is still open — so
SendLoop::runnever returns,send_futnever resolves, and
connection_loopis stuck. TheConnectionand its backgroundConnectionDriverare never dropped: one stranded connection per removed/rotated peer.Fix: match the full
Optionand break onNone:This pairs with changing
connection_loopfromtokio::join!(send_fut, recv_fut)to atokio::select!(a necessary prerequisite — withjoin!, even a returningsend_futwould waitforever on
recv_fut'saccept_uni()because the peer is still alive), and removing the peer fromthe actor's
peersmap on dial failure and on send-connection close so the descriptor is reclaimed.Additional eviction gaps noticed while investigating
These are smaller (byte-scale) and reasoned rather than independently soak-validated; each is a
spot where peer metadata outlived the peer:
peer_topicsonPeerDisconnected(proto/state.rs) — a network-level disconnect neverremoved the peer from this top-level index. Safe to remove here: the entry is rebuilt on the next
message from the peer if it reconnects.
peer_dataon active-view discard (proto/hyparview.rs) — when a peer is removed from theactive view and not retained as passive, its
peer_dataentry was left behind. Removed only onthe not-kept branch, so peers that are kept as passive are untouched.
peer_data+alive_disconnect_peerson passive-view eviction (proto/hyparview.rs) — whenthe passive view is full and a peer is evicted at random to make room, both its
peer_dataandalive_disconnect_peersentries were left behind.peer_data+alive_disconnect_peerson pending-neighbor timeout (proto/hyparview.rs) —same lingering state when a pending neighbor request times out.
lazy_push_queueon neighbor down (proto/plumtree.rs) — when a neighbor goes down the peeris removed from the eager/lazy push peer sets, but its entry in the Plumtree
lazy_push_queuewasleft behind.
Breaking Changes
None. All changes are internal to the gossip actor and protocol state; no public API changes.
Notes & open questions
tasks under churn showed them climbing unbounded (e.g. spawned 174 / finished 52) while the
HyParview active view stayed bounded (~15) — the smoking gun for the
SendLoophang. With thefix, finished tracks spawned and live connection tasks stay bounded at the active-peer count; a
heap profile's QUIC
Connectionretention dropped accordingly. Numbers are from our soak, so I'dtreat them as supporting context — a before/after on your CI churn setup is the authoritative check.
the
SendLoophang needs a connected-endpoint harness (drop the peer'ssend_tx, assertrun()returns rather than hanging) — happy to add one in the style you prefer if useful.
SendLoop/select!fix?Related
This was one of three churn-related memory-leak fixes found together while running the iroh stack
under sustained churn; the connection-leak picture spans the gossip send loop here and a noq
Connectingdrop fix:mapped_addrseviction on actor shutdownConnectingis dropped pre-handshakeChecklist
None => breakcomment)Resolves #145