fix(iroh-gossip): reap connection tasks and prune peer state on churn #146

Open
drlukeangel wants to merge 6 commits into n0-computer:main from drlukeangel:fix/churn-memory-leak

@drlukeangel drlukeangel commented May 29, 2026

Description

While running iroh-gossip as the mesh layer of a downstream project, we hit steadily growing
memory under sustained peer churn (peers continuously joining and leaving, each with a fresh
node id). Investigation found one dominant connection-task leak plus five smaller peer-state
eviction gaps. This PR fixes all of them; they're independent, so happy to split if you'd prefer.

First, thank you for iroh-gossip. Being able to lean on HyParView + Plumtree over iroh instead of
hand-rolling gossip and membership let us delete a large amount of networking code, and the actor
design made this leak tractable to trace.

The dominant fix — SendLoop doesn't exit when its send channel closes (src/net/util.rs)

SendLoop::run's select used:

Some(msg) = self.send_rx.recv() => self.write_message(&msg).await?,

When a peer is removed from the gossip peer map, its send_tx is dropped and send_rx.recv()
returns None. With the Some(msg) = … form, a None silently disables that select branch
(rather than yielding the None). The only other persistent arm, _ = &mut closed, then stays
pending forever because the connection is still open — so SendLoop::run never returns, send_fut
never resolves, and connection_loop is stuck. The Connection and its background
ConnectionDriver are never dropped: one stranded connection per removed/rotated peer.

Fix: match the full Option and break on None:

msg = self.send_rx.recv() => match msg {
    Some(msg) => self.write_message(&msg).await?,
    None => break, // all senders dropped: peer removed, tear the send side down
},

This pairs with changing connection_loop from tokio::join!(send_fut, recv_fut) to a
tokio::select! (a necessary prerequisite — with join!, even a returning send_fut would wait
forever on recv_fut's accept_uni() because the peer is still alive), and removing the peer from
the actor's peers map on dial failure and on send-connection close so the descriptor is reclaimed.
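
To make the interaction between the two changes concrete, here is a minimal std-only model of it (not the real code, and no tokio involved). It assumes the documented tokio::select! behaviour that a branch whose pattern fails to match is disabled while the remaining branches keep being awaited. The names OnNone, Combine, model_send_loop and model_connection_loop are hypothetical and exist only for this sketch; the closed arm and the receive half are modelled as never becoming ready.

```rust
use std::collections::VecDeque;

/// What the send loop does when `recv()` yields `None` (all senders dropped).
#[derive(Clone, Copy, Debug, PartialEq)]
enum OnNone {
    /// Old behaviour: the `Some(msg) = ...` pattern fails, so the branch is disabled.
    DisableBranch,
    /// Fixed behaviour: match the full `Option` and break.
    Break,
}

/// How the connection task combines its send and receive halves.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Combine {
    /// `join!`-like: returns only when both halves have finished.
    Join,
    /// `select!`-like: returns as soon as either half finishes.
    Select,
}

/// Hypothetical model of the send half. `buffered` holds the messages queued
/// before the last sender was dropped. Returns `Some(written)` if the loop
/// returns within `max_polls`, or `None` if it is still pending.
fn model_send_loop(mut buffered: VecDeque<u32>, on_none: OnNone, max_polls: usize) -> Option<usize> {
    let mut written = 0;
    let mut recv_branch_enabled = true;
    for _ in 0..max_polls {
        if recv_branch_enabled {
            match buffered.pop_front() {
                Some(_msg) => written += 1, // stands in for write_message
                None => match on_none {
                    OnNone::DisableBranch => recv_branch_enabled = false,
                    OnNone::Break => return Some(written),
                },
            }
        }
        // The `closed` arm is modelled as never ready: the connection is still open.
    }
    None
}

/// Hypothetical model of the whole connection task. The receive half is
/// assumed to stay pending because the remote peer is still alive.
fn model_connection_loop(on_none: OnNone, combine: Combine) -> bool {
    let send_finished = model_send_loop(VecDeque::from(vec![1, 2, 3]), on_none, 1_000).is_some();
    let recv_finished = false;
    match combine {
        Combine::Join => send_finished && recv_finished,
        Combine::Select => send_finished || recv_finished,
    }
}
```

In this model only the combination OnNone::Break with Combine::Select lets the connection task return; each change on its own still leaves it pending, which matches the reasoning above.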

Additional eviction gaps noticed while investigating

These are smaller (byte-scale) and reasoned rather than independently soak-validated; each is a
spot where peer metadata outlived the peer:

  • peer_topics on PeerDisconnected (proto/state.rs) — a network-level disconnect never
    removed the peer from this top-level index. Safe to remove here: the entry is rebuilt on the next
    message from the peer if it reconnects.
  • peer_data on active-view discard (proto/hyparview.rs) — when a peer is removed from the
    active view and not retained as passive, its peer_data entry was left behind. Removed only on
    the not-kept branch, so peers that are kept as passive are untouched.
  • peer_data + alive_disconnect_peers on passive-view eviction (proto/hyparview.rs) — when
    the passive view is full and a peer is evicted at random to make room, both its peer_data and
    alive_disconnect_peers entries were left behind.
  • peer_data + alive_disconnect_peers on pending-neighbor timeout (proto/hyparview.rs): the
    same two entries were left behind when a pending neighbor request timed out.
  • lazy_push_queue on neighbor down (proto/plumtree.rs) — when a neighbor goes down the peer
    is removed from the eager/lazy push peer sets, but its entry in the Plumtree lazy_push_queue was
    left behind.
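
The common shape of these gaps can be sketched with plain std collections. This is an illustrative model only: PeerBook, on_message, on_peer_gone and churn are hypothetical names, PeerId and TopicId are stand-in integer types, and the real maps live in separate protocol structs rather than one struct. It shows why per-peer maps grow without bound when every churned peer arrives with a fresh id and nothing removes its entries.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical stand-ins for the real identifier types.
type PeerId = u64;
type TopicId = u32;

/// Illustrative bundle of per-peer state; field names follow the PR text.
#[derive(Default)]
struct PeerBook {
    peer_topics: HashMap<PeerId, HashSet<TopicId>>,
    peer_data: HashMap<PeerId, Vec<u8>>,
    lazy_push_queue: HashMap<PeerId, VecDeque<u64>>,
}

impl PeerBook {
    /// Seeing a message from a peer creates its per-peer entries.
    fn on_message(&mut self, peer: PeerId, topic: TopicId) {
        self.peer_topics.entry(peer).or_default().insert(topic);
        self.peer_data.entry(peer).or_default();
        self.lazy_push_queue.entry(peer).or_default().push_back(0);
    }

    /// The prune step: drop every per-peer entry once the peer is gone.
    fn on_peer_gone(&mut self, peer: PeerId) {
        self.peer_topics.remove(&peer);
        self.peer_data.remove(&peer);
        self.lazy_push_queue.remove(&peer);
    }

    /// Total number of per-peer entries still held across all maps.
    fn tracked_entries(&self) -> usize {
        self.peer_topics.len() + self.peer_data.len() + self.lazy_push_queue.len()
    }
}

/// Simulates `peers` short-lived peers, each with a fresh id, and reports
/// how many entries remain afterwards.
fn churn(book: &mut PeerBook, peers: u64, prune: bool) -> usize {
    for peer in 0..peers {
        book.on_message(peer, 1);
        if prune {
            book.on_peer_gone(peer);
        }
    }
    book.tracked_entries()
}
```

Without the prune step the entry count in this model scales with the number of peers ever seen; with it, the count returns to zero once every peer has left.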

Breaking Changes

None. All changes are internal to the gossip actor and protocol state; no public API changes.

Notes & open questions

  • Evidence (downstream, not reproducible here): instrumenting spawned-vs-finished connection
    tasks under churn showed them climbing unbounded (e.g. spawned 174 / finished 52) while the
    HyParView active view stayed bounded (~15), which is the smoking gun for the SendLoop hang. With the
    fix, finished tracks spawned and live connection tasks stay bounded at the active-peer count; a
    heap profile's QUIC Connection retention dropped accordingly. Numbers are from our soak, so I'd
    treat them as supporting context — a before/after on your CI churn setup is the authoritative check.
  • Tests: the five eviction fixes are straightforward to assert. A focused regression test for
    the SendLoop hang needs a connected-endpoint harness (drop the peer's send_tx, assert run()
    returns rather than hanging) — happy to add one in the style you prefer if useful.
  • Would you like the five eviction fixes split into a separate PR from the SendLoop/select! fix?
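
For anyone wanting to reproduce the spawned-vs-finished measurement, one possible way to instrument it is a pair of atomic counters plus a drop guard held by each connection task. This is a sketch under assumptions, not the instrumentation used in the soak: TaskCounters and TaskGuard are hypothetical names, and the guard would be created when a task is spawned and moved into it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical counters shared by all connection tasks.
#[derive(Default)]
struct TaskCounters {
    spawned: AtomicUsize,
    finished: AtomicUsize,
}

impl TaskCounters {
    /// Tasks that were spawned but have not finished yet.
    fn live(&self) -> usize {
        // Read `finished` first: `spawned` only grows, so the difference cannot go negative.
        let finished = self.finished.load(Ordering::SeqCst);
        let spawned = self.spawned.load(Ordering::SeqCst);
        spawned.saturating_sub(finished)
    }
}

/// Hypothetical guard: counts a spawn on creation and a finish on drop,
/// so a task that never returns never bumps `finished`.
struct TaskGuard {
    counters: Arc<TaskCounters>,
}

impl TaskGuard {
    fn new(counters: Arc<TaskCounters>) -> Self {
        counters.spawned.fetch_add(1, Ordering::SeqCst);
        TaskGuard { counters }
    }
}

impl Drop for TaskGuard {
    fn drop(&mut self) {
        self.counters.finished.fetch_add(1, Ordering::SeqCst);
    }
}
```

With counters like these, the figures quoted above (spawned 174 / finished 52) would correspond to 122 live tasks, against an active view of about 15.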

Related

This was one of three churn-related memory-leak fixes found together while running the iroh stack
under sustained churn; the connection-leak picture spans the gossip send loop here and a noq
Connecting drop fix:

Checklist

  • Self-review
  • Documented the reasoning inline where the fix is non-obvious (the None => break comment)
  • Tests — see note above; glad to add a harness-based regression test on request
  • No breaking changes

Resolves #145

Two connection-lifecycle fixes. (1) connection_loop ran the send and receive loops under tokio::join!, which returns only when BOTH finish; when a peer is removed locally send_fut ends but recv_fut blocks forever on accept_uni() — switched to tokio::select! so the loop returns as soon as either half finishes. (2) handle_connection_task_finished now removes the peer from self.peers on the active-connection-closed path (it previously sent PeerDisconnected but left the PeerState when the peer was not in the active_view). NOTE: select! alone does not stop the leak — see the following SendLoop fix, which is what actually lets send_fut resolve.
…eak)

SendLoop::run's select! used `Some(msg) = self.send_rx.recv() => ...`. When a peer is removed from the gossip peer map its send_tx drops and recv() returns None, but the `Some(msg) =` pattern then SILENTLY DISABLES that branch instead of breaking. The only other live arm (`_ = &mut closed`) stays pending forever because the connection is still open, so SendLoop::run hangs, send_fut never resolves, connection_loop's select! never fires, and the Connection plus its background ConnectionDriver are stranded — one leaked connection per removed/rotated peer (worst on observers and multi-topic routers, which rotate peers constantly). Match the full Option and break on None.

Evidence (rafka v2, 18-node 2-mesh chaos soak, kill+respawn churn): instrumented connection-task counters went from spawned=174 / finished=52 (122 stuck and climbing) to spawned approximately equals finished with live tasks bounded around 15. DHAT noq::connection::Connection 18.7 MB -> 3.85 MB; total retained heap ~50 MB -> ~25 MB. 35-min soak, observer RSS at 140 chaos events: 0.228 GB -> 0.115 GB (~50% lower).
The top-level peer_topics index (PeerId -> set of TopicId) was never pruned when a peer disconnected at the network level, so it grew with every peer ever seen under churn. Remove the peer entry on PeerDisconnected.
peer_data leaked when a peer was fully discarded from the active view (remove_active_by_index with keep_as_passive=false) and on passive-view eviction (add_passive); alive_disconnect_peers leaked on passive eviction and on pending-neighbor-request timeout. Added the matching removals so per-peer protocol metadata is reclaimed under churn.
lazy_push_queue retained entries for neighbors that went down, growing under churn. Prune the peer on on_neighbor_down.
@caiogondim

related #147

caiogondim added a commit to agent-habilis/iroh-gossip that referenced this pull request May 31, 2026
Adopts the seven peer-state eviction prunes from n0-computer#146
(drlukeangel): two `peers`-map reclaims in net.rs (on dial failure and active
send-connection close) and five proto-layer prunes (hyparview passive-view
eviction, pending-neighbor timeout, and active-view discard; plumtree's
lazy_push_queue on neighbor-down; state's peer_topics on PeerDisconnected). Each
left one per-peer entry behind under churn with rotating node ids — byte-scale
next to the connection leak fixed in the previous commit, but unbounded. Adds
the regression tests n0-computer#146 lacks: five in-module tests that seed the relevant
map, trigger the eviction, and assert the entry is pruned; each was verified to
fail without its prune line.
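
The seed, trigger, assert shape described above can be sketched against a simplified passive view. This is an assumption-laden model rather than the actual test code: PassiveView and its add_passive signature are hypothetical, PeerId is a stand-in integer, and eviction takes the oldest entry instead of a random one so the outcome is deterministic.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical stand-in for the real peer identifier type.
type PeerId = u64;

/// Simplified model of a bounded passive view plus the per-peer metadata
/// that has to be pruned alongside it.
struct PassiveView {
    capacity: usize,
    passive: VecDeque<PeerId>,
    peer_data: HashMap<PeerId, Vec<u8>>,
    alive_disconnect_peers: HashSet<PeerId>,
}

impl PassiveView {
    fn new(capacity: usize) -> Self {
        PassiveView {
            capacity,
            passive: VecDeque::new(),
            peer_data: HashMap::new(),
            alive_disconnect_peers: HashSet::new(),
        }
    }

    /// Adds a peer and, if the view is full, evicts one to make room.
    /// Returns the evicted peer, if any.
    fn add_passive(&mut self, peer: PeerId, data: Vec<u8>) -> Option<PeerId> {
        let evicted = if self.passive.len() >= self.capacity {
            self.passive.pop_front() // oldest first here; the real view evicts at random
        } else {
            None
        };
        if let Some(old) = evicted {
            // The prune lines: without them the evicted peer's metadata lingers.
            self.peer_data.remove(&old);
            self.alive_disconnect_peers.remove(&old);
        }
        self.passive.push_back(peer);
        self.peer_data.insert(peer, data);
        evicted
    }
}
```

A test in this style seeds the view to capacity, adds one more peer, and asserts that the evicted peer no longer appears in either map; deleting a prune line should make the assertion fail.
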
@caiogondim

Hi @drlukeangel 👋

Looks like we independently landed on the same fix for the same problem.

I'd been chasing a memory leak I could reliably reproduce on both a local and a distributed mesh.
It led me to the same two core bugs you fixed here, which I wrote up in #147.

You'd also caught several peer-state eviction gaps I'd missed.
I've pulled those into #147 (credited to you), since I carry iroh-gossip as a dependency in one of my own projects and wanted the complete fix downstream.
The one thing I added on top is a regression test for each.

Please feel free to take anything useful from #147 and consolidate everything onto yours. Happy to close mine in favour of it.

One heads-up from the same investigation: this connection leak is the dominant driver of the growth, but there's a smaller, separate one in noq too
n0-computer/noq#683 (an unbounded abandoned_paths set) that you might also want.

Thanks for the great work here! 🙇‍♂️

@drlukeangel

Author

Hi @caiogondim! 👋 Thank you so much for the incredibly detailed investigation and for those regression tests in #147! 🙇‍♂️ It's awesome that we independently landed on the exact same fixes. Your regression tests are fantastic and add exactly the verification this PR needed. I've gone ahead and folded all 6 of your regression tests into this PR so that we have the complete fix and test coverage in one place. I've also pushed the final updates just now. Since everything is consolidated here now, please feel free to close #147 whenever you're ready. Thanks again for your awesome work and collaboration on tracking this down! 🚀

@drlukeangel drlukeangel force-pushed the fix/churn-memory-leak branch from 4f417c3 to 790db7f on June 1, 2026 15:21
@dignifiedquire dignifiedquire moved this from 🚑 Needs Triage to 👀 In review in iroh Jun 8, 2026
Development

Successfully merging this pull request may close these issues.

Memory leak under peer churn: connection tasks and peer-state maps not reaped

3 participants