fix(iroh-gossip): reap connection tasks and prune peer state on churn by drlukeangel · Pull Request #146 · n0-computer/iroh-gossip

drlukeangel · 2026-05-29T17:24:52Z

Description

While running iroh-gossip as the mesh layer of a downstream project, we hit steadily growing
memory under sustained peer churn (peers continuously joining and leaving, each with a fresh
node id). Investigation found one dominant connection-task leak plus five smaller peer-state
eviction gaps. This PR fixes all of them; they're independent, so happy to split if you'd prefer.

First, thank you for iroh-gossip. Being able to lean on HyParView + Plumtree over iroh instead of
hand-rolling gossip and membership let us delete a large amount of networking code, and the actor
design made this leak tractable to trace.

The dominant fix — `SendLoop` doesn't exit when its send channel closes (`src/net/util.rs`)

SendLoop::run's select! used this arm:

Some(msg) = self.send_rx.recv() => self.write_message(&msg).await?,

When a peer is removed from the gossip peer map, its send_tx is dropped and send_rx.recv()
returns None. With the Some(msg) = … form, a None silently disables that select branch
(rather than yielding the None). The only other persistent arm, _ = &mut closed, then stays
pending forever because the connection is still open — so SendLoop::run never returns, send_fut
never resolves, and connection_loop is stuck. The Connection and its background
ConnectionDriver are never dropped: one stranded connection per removed/rotated peer.

Fix: match the full Option and break on None:

msg = self.send_rx.recv() => match msg {
    Some(msg) => self.write_message(&msg).await?,
    None => break, // all senders dropped: peer removed, tear the send side down
},
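To make the difference between the two arms concrete, here is a minimal synchronous model. It is a sketch under stated assumptions, not the iroh-gossip code: a bounded polling loop over `std::sync::mpsc` stands in for `tokio::select!`, a disconnected channel stands in for `recv()` returning `None`, and running out of polls stands in for hanging forever. `Outcome`, `simulate_send_loop`, `run_scenario`, and the poll budget are hypothetical names introduced only for illustration.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};

/// How a simulated send loop ended.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Returned because every sender was dropped (the fixed behaviour).
    SendersDropped,
    /// Returned because the connection itself closed.
    ConnectionClosed,
    /// Still waiting when the poll budget ran out: a stand-in for "hangs".
    StillPending,
}

/// Runs a simulated send loop for at most `max_polls` iterations and returns
/// how it ended plus the number of messages it "wrote".
///
/// `break_on_none = false` models the old `Some(msg) = recv()` arm, where a
/// closed channel merely disables the branch. `true` models the fix.
fn simulate_send_loop(
    rx: &Receiver<String>,
    connection_closed: impl Fn() -> bool,
    break_on_none: bool,
    max_polls: usize,
) -> (Outcome, usize) {
    let mut written = 0;
    let mut recv_branch_enabled = true;
    for _ in 0..max_polls {
        if recv_branch_enabled {
            match rx.try_recv() {
                Ok(_msg) => {
                    written += 1; // stands in for write_message
                    continue;
                }
                Err(TryRecvError::Empty) => {}
                Err(TryRecvError::Disconnected) if break_on_none => {
                    return (Outcome::SendersDropped, written);
                }
                // Old behaviour: the arm is switched off and never polled again.
                Err(TryRecvError::Disconnected) => recv_branch_enabled = false,
            }
        }
        if connection_closed() {
            return (Outcome::ConnectionClosed, written);
        }
    }
    (Outcome::StillPending, written)
}

/// Queues two messages, drops the only sender (the peer was removed), and
/// runs the loop against a connection that never closes.
fn run_scenario(break_on_none: bool) -> (Outcome, usize) {
    let (tx, rx) = mpsc::channel();
    tx.send("hello".to_string()).unwrap();
    tx.send("world".to_string()).unwrap();
    drop(tx);
    simulate_send_loop(&rx, || false, break_on_none, 1_000)
}
```

In this model, `run_scenario(false)` writes both queued messages and then exhausts its poll budget as `StillPending`, while `run_scenario(true)` writes the same two messages and returns `SendersDropped`.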

The None => break change pairs with two others. First, connection_loop moves from
tokio::join!(send_fut, recv_fut) to tokio::select!. This is a necessary prerequisite: with join!,
even a send_fut that returns would wait forever on recv_fut's accept_uni(), because the peer is
still alive. Second, the peer is removed from the actor's peers map on dial failure and on
send-connection close, so its PeerState entry is reclaimed.
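The join-versus-select distinction can be illustrated without an async runtime. The sketch below is an analogue built on threads and `std::sync::mpsc`, not the actual `connection_loop`: `Half`, `connection_loop_returns`, and the timeout are hypothetical, and the timeout exists only so the "wait for both" case can be observed instead of blocking forever.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Which half of the connection reported completion.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Half {
    Send,
    Recv,
}

/// Returns true if the simulated connection loop finished within `timeout`.
///
/// `wait_for_both = true` models `join!` (both halves must finish);
/// `false` models `select!` (the first half to finish ends the loop).
fn connection_loop_returns(wait_for_both: bool, timeout: Duration) -> bool {
    let (done_tx, done_rx) = mpsc::channel::<Half>();
    // Held open for the whole wait: the remote peer is still alive.
    let (peer_alive_tx, peer_alive_rx) = mpsc::channel::<()>();

    let recv_done = done_tx.clone();
    let recv_half = thread::spawn(move || {
        // Blocks like a receive loop waiting for the next incoming stream.
        let _ = peer_alive_rx.recv();
        let _ = recv_done.send(Half::Recv);
    });
    let send_half = thread::spawn(move || {
        // The peer was removed locally, so the send half ends right away.
        let _ = done_tx.send(Half::Send);
    });

    let needed = if wait_for_both { 2 } else { 1 };
    let mut finished = 0;
    while finished < needed {
        match done_rx.recv_timeout(timeout) {
            Ok(_half) => finished += 1,
            Err(_) => break, // timed out: the loop would still be stuck
        }
    }

    // Cleanup for the sketch only: release the receive half and reap threads.
    drop(peer_alive_tx);
    send_half.join().unwrap();
    recv_half.join().unwrap();
    finished == needed
}
```

With a 200 ms timeout, the select-style call returns `true` as soon as the send half reports in, whereas the join-style call returns `false` because the receive half never finishes while the peer stays alive.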

Additional eviction gaps noticed while investigating

These are smaller (byte-scale) and reasoned rather than independently soak-validated; each is a
spot where peer metadata outlived the peer:

- peer_topics on PeerDisconnected (proto/state.rs): a network-level disconnect never
  removed the peer from this top-level index. Removal is safe here because the entry is rebuilt
  on the next message from the peer if it reconnects.
- peer_data on active-view discard (proto/hyparview.rs): when a peer is removed from the
  active view and not retained as passive, its peer_data entry was left behind. The entry is
  removed only on the not-kept branch, so peers that are kept as passive are untouched.
- peer_data + alive_disconnect_peers on passive-view eviction (proto/hyparview.rs): when
  the passive view is full and a peer is evicted at random to make room, both its peer_data and
  alive_disconnect_peers entries were left behind.
- peer_data + alive_disconnect_peers on pending-neighbor timeout (proto/hyparview.rs):
  the same state lingered when a pending neighbor request timed out.
- lazy_push_queue on neighbor down (proto/plumtree.rs): when a neighbor goes down the peer
  is removed from the eager/lazy push peer sets, but its entry in the Plumtree lazy_push_queue
  was left behind.
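The common shape of these gaps is a bounded view whose side tables were not pruned along with it. The following is a minimal sketch of the passive-view case under loud assumptions: `Membership`, `churn`, the `u64` peer id, and the byte-vector payload are hypothetical, and eviction takes the oldest entry to keep the example deterministic, whereas the PR describes eviction at random. Only the field names `peer_data` and `alive_disconnect_peers` and the function name `add_passive` come from the PR text.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical stand-in for the real peer id type.
type PeerId = u64;

/// A toy membership state: one bounded view plus two per-peer side tables.
struct Membership {
    passive_capacity: usize,
    passive_view: VecDeque<PeerId>,
    peer_data: HashMap<PeerId, Vec<u8>>,
    alive_disconnect_peers: HashSet<PeerId>,
}

impl Membership {
    fn new(passive_capacity: usize) -> Self {
        Self {
            passive_capacity,
            passive_view: VecDeque::new(),
            peer_data: HashMap::new(),
            alive_disconnect_peers: HashSet::new(),
        }
    }

    /// Adds a peer to the passive view, evicting one entry if the view is
    /// full. Returns the evicted peer, if any.
    fn add_passive(&mut self, peer: PeerId, data: Vec<u8>) -> Option<PeerId> {
        let evicted = if self.passive_view.len() >= self.passive_capacity {
            // Oldest-first here for determinism; the PR describes random eviction.
            self.passive_view.pop_front()
        } else {
            None
        };
        if let Some(old) = evicted {
            // The step the eviction path was missing: prune the side tables too.
            self.peer_data.remove(&old);
            self.alive_disconnect_peers.remove(&old);
        }
        self.passive_view.push_back(peer);
        self.peer_data.insert(peer, data);
        evicted
    }
}

/// Feeds `rounds` fresh peer ids through a view of the given capacity and
/// returns the final sizes of (passive_view, peer_data, alive_disconnect_peers).
fn churn(rounds: u64, capacity: usize) -> (usize, usize, usize) {
    let mut state = Membership::new(capacity);
    for peer in 0..rounds {
        state.alive_disconnect_peers.insert(peer);
        state.add_passive(peer, vec![0u8; 8]);
    }
    (
        state.passive_view.len(),
        state.peer_data.len(),
        state.alive_disconnect_peers.len(),
    )
}
```

With the two `remove` calls in place, `churn(1_000, 30)` leaves all three collections at 30 entries. Without them, the two side tables would hold one entry per peer ever seen.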

Breaking Changes

None. All changes are internal to the gossip actor and protocol state; no public API changes.

Notes & open questions

- Evidence (downstream, not reproducible here): instrumenting spawned-vs-finished connection
  tasks under churn showed unfinished tasks climbing without bound (e.g. spawned 174 /
  finished 52) while the HyParView active view stayed bounded (~15), which is the smoking gun
  for the SendLoop hang. With the fix, finished tracks spawned and live connection tasks stay
  bounded at the active-peer count; QUIC Connection retention in a heap profile dropped
  accordingly. Numbers are from our soak, so I'd treat them as supporting context; a
  before/after on your CI churn setup is the authoritative check.
- Tests: the five eviction fixes are straightforward to assert. A focused regression test for
  the SendLoop hang needs a connected-endpoint harness (drop the peer's send_tx, assert run()
  returns rather than hanging). Happy to add one in the style you prefer if useful.
- Would you like the five eviction fixes split into a separate PR from the SendLoop/select! fix?
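For anyone who wants to reproduce the spawned-vs-finished measurement, one possible shape for the instrumentation is a pair of atomic counters plus a drop guard. This is a sketch, not the instrumentation actually used in the soak: `TaskCounters`, `TaskGuard`, and `run_tasks` are hypothetical names, and plain threads stand in for the connection tasks.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Shared spawned/finished counters for one kind of task.
#[derive(Default)]
struct TaskCounters {
    spawned: AtomicUsize,
    finished: AtomicUsize,
}

impl TaskCounters {
    /// Records a spawn and returns a guard that records the finish on drop.
    fn track(counters: &Arc<TaskCounters>) -> TaskGuard {
        counters.spawned.fetch_add(1, Ordering::Relaxed);
        TaskGuard { counters: Arc::clone(counters) }
    }

    /// Tasks that were spawned but have not finished yet.
    fn live(&self) -> usize {
        let spawned = self.spawned.load(Ordering::Relaxed);
        spawned.saturating_sub(self.finished.load(Ordering::Relaxed))
    }
}

/// Marks one task as finished when it goes out of scope.
struct TaskGuard {
    counters: Arc<TaskCounters>,
}

impl Drop for TaskGuard {
    fn drop(&mut self) {
        // Runs on normal return, early return, and panic unwinding alike.
        self.counters.finished.fetch_add(1, Ordering::Relaxed);
    }
}

/// Runs `completing` tasks to the end while `stuck` guards stay alive, then
/// returns a (spawned, finished, live) snapshot.
fn run_tasks(completing: usize, stuck: usize) -> (usize, usize, usize) {
    let counters = Arc::new(TaskCounters::default());
    let handles: Vec<_> = (0..completing)
        .map(|_| {
            let guard = TaskCounters::track(&counters);
            // The guard moves into the task and is dropped when it ends.
            thread::spawn(move || drop(guard))
        })
        .collect();
    // Guards that are still held model tasks that never return.
    let held: Vec<TaskGuard> = (0..stuck)
        .map(|_| TaskCounters::track(&counters))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let snapshot = (
        counters.spawned.load(Ordering::Relaxed),
        counters.finished.load(Ordering::Relaxed),
        counters.live(),
    );
    drop(held);
    snapshot
}
```

A healthy run such as `run_tasks(3, 0)` reports `(3, 3, 0)`. A run with stuck tasks, `run_tasks(5, 2)`, reports `(7, 5, 2)`, the same kind of gap between spawned and finished that the soak numbers show.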

Checklist

- Self-review
- Documented the reasoning inline where the fix is non-obvious (the None => break comment)
- Tests: see note above; glad to add a harness-based regression test on request
- No breaking changes

Resolves #145

Two connection-lifecycle fixes. (1) connection_loop ran the send and receive loops under tokio::join!, which returns only when BOTH finish; when a peer is removed locally send_fut ends but recv_fut blocks forever on accept_uni() — switched to tokio::select! so the loop returns as soon as either half finishes. (2) handle_connection_task_finished now removes the peer from self.peers on the active-connection-closed path (it previously sent PeerDisconnected but left the PeerState when the peer was not in the active_view). NOTE: select! alone does not stop the leak — see the following SendLoop fix, which is what actually lets send_fut resolve.

…eak) SendLoop::run's select! used `Some(msg) = self.send_rx.recv() => ...`. When a peer is removed from the gossip peer map its send_tx drops and recv() returns None, but the `Some(msg) =` pattern then SILENTLY DISABLES that branch instead of breaking. The only other live arm (`_ = &mut closed`) stays pending forever because the connection is still open, so SendLoop::run hangs, send_fut never resolves, connection_loop's select! never fires, and the Connection plus its background ConnectionDriver are stranded: one leaked connection per removed/rotated peer (worst on observers and multi-topic routers, which rotate peers constantly). Match the full Option and break on None.

Evidence (rafka v2, 18-node 2-mesh chaos soak, kill+respawn churn): instrumented connection-task counters went from spawned=174 / finished=52 (122 stuck and climbing) to spawned roughly equal to finished, with live tasks bounded around 15. DHAT noq::connection::Connection 18.7 MB -> 3.85 MB; total retained heap ~50 MB -> ~25 MB. 35-min soak, observer RSS at 140 chaos events: 0.228 GB -> 0.115 GB (~50% lower).

The top-level peer_topics index (PeerId -> set of TopicId) was never pruned when a peer disconnected at the network level, so it grew with every peer ever seen under churn. Remove the peer entry on PeerDisconnected.

peer_data leaked when a peer was fully discarded from the active view (remove_active_by_index with keep_as_passive=false) and on passive-view eviction (add_passive); alive_disconnect_peers leaked on passive eviction and on pending-neighbor-request timeout. Added the matching removals so per-peer protocol metadata is reclaimed under churn.

lazy_push_queue retained entries for neighbors that went down, growing under churn. Prune the peer on on_neighbor_down.

caiogondim · 2026-05-30T22:45:14Z

related #147

Adopts the seven peer-state eviction prunes from n0-computer#146 (drlukeangel): two `peers`-map reclaims in net.rs (on dial failure and active send-connection close) and five proto-layer prunes (hyparview passive-view eviction, pending-neighbor timeout, and active-view discard; plumtree's lazy_push_queue on neighbor-down; state's peer_topics on PeerDisconnected). Each left one per-peer entry behind under churn with rotating node ids — byte-scale next to the connection leak fixed in the previous commit, but unbounded. Adds the regression tests n0-computer#146 lacks: five in-module tests that seed the relevant map, trigger the eviction, and assert the entry is pruned; each was verified to fail without its prune line.
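The seed, trigger, assert pattern described above might look like the sketch below for the lazy_push_queue case. It is an illustration, not one of the tests from either PR: `Broadcast`, the `u64` and `u32` id types, the queue's value type, and the `eager_push_peers` / `lazy_push_peers` field names are assumptions. Only `lazy_push_queue` and `on_neighbor_down` are names taken from the commit message. In a real crate the check would carry `#[test]` and use assertions directly; it returns a `bool` here to stay self-contained.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-ins for the real id types.
type PeerId = u64;
type MessageId = u32;

/// A toy slice of broadcast state with the collections the prune touches.
#[derive(Default)]
struct Broadcast {
    eager_push_peers: HashSet<PeerId>,
    lazy_push_peers: HashSet<PeerId>,
    lazy_push_queue: HashMap<PeerId, Vec<MessageId>>,
}

impl Broadcast {
    /// Forgets a neighbor that went down, including its queued lazy pushes.
    fn on_neighbor_down(&mut self, peer: PeerId) {
        self.eager_push_peers.remove(&peer);
        self.lazy_push_peers.remove(&peer);
        // The prune under test: without this line the queue entry lingers.
        self.lazy_push_queue.remove(&peer);
    }
}

/// Seeds the queue for two peers, takes one down, and reports whether only
/// the departed peer's entry was pruned.
fn neighbor_down_prunes_lazy_push_queue() -> bool {
    let mut state = Broadcast::default();
    let gone: PeerId = 1;
    let stays: PeerId = 2;

    // Seed: both peers are lazy-push peers with queued message ids.
    state.lazy_push_peers.insert(gone);
    state.lazy_push_peers.insert(stays);
    state.lazy_push_queue.insert(gone, vec![10, 11]);
    state.lazy_push_queue.insert(stays, vec![12]);

    // Trigger: one neighbor goes down.
    state.on_neighbor_down(gone);

    // Assert: the departed peer is gone, the other is untouched.
    !state.lazy_push_queue.contains_key(&gone)
        && !state.lazy_push_peers.contains(&gone)
        && state.lazy_push_queue.get(&stays) == Some(&vec![12])
}
```

Commenting out the `lazy_push_queue.remove` line makes the check return `false`, which mirrors the "fails without its prune line" verification mentioned above.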

caiogondim · 2026-05-31T00:54:37Z

Hi @drlukeangel 👋

Looks like we independently landed on the same fix for the same problem.

I'd been chasing a memory leak I could reliably reproduce on both a local and a distributed mesh.
It led me to the same two core bugs you fixed here, which I wrote up in #147.

You'd also caught several peer-state eviction gaps I'd missed.
I've pulled those into #147 (credited to you), since I carry iroh-gossip as a dependency in one of my own projects and wanted the complete fix downstream.
The one thing I added on top is a regression test for each.

Please feel free to take anything useful from #147 and consolidate everything onto yours. Happy to close mine in favour of it.

One heads-up from the same investigation: this connection leak is the dominant driver of the growth, but there's a smaller, separate one in noq too:
n0-computer/noq#683 (an unbounded abandoned_paths set), which you might also want.

Thanks for the great work here! 🙇‍♂

drlukeangel · 2026-06-01T15:21:15Z

Hi @caiogondim! 👋 Thank you so much for the incredibly detailed investigation and for those regression tests in #147! 🙇‍♂️ It's awesome that we independently landed on the exact same fixes. Your regression tests are fantastic and add exactly the verification this PR needed. I've gone ahead and folded all 6 of your regression tests into this PR so that we have the complete fix and test coverage in one place. I've also pushed the final updates just now. Since everything is consolidated here now, please feel free to close #147 whenever you're ready. Thanks again for your awesome work and collaboration on tracking this down! 🚀

drlukeangel added 5 commits May 29, 2026 08:10

fix: prune peer_topics on PeerDisconnected

3ee4d7b


fix: prune lazy_push_queue on neighbor_down

6fc9a33


This was referenced May 29, 2026

fix(iroh): evict cached mapped addrs when a RemoteStateActor shuts down n0-computer/iroh#4294

Open

fix(noq): close the connection when a Connecting is dropped before handshake n0-computer/noq#682

Closed

drlukeangel marked this pull request as ready for review May 29, 2026 17:53

n0bot Bot added this to iroh May 29, 2026

github-project-automation Bot moved this to 🚑 Needs Triage in iroh May 29, 2026

caiogondim mentioned this pull request May 31, 2026

fix: close superseded connections and prune peer-state on churn #147

Open

test: add regression tests for connection leak and state eviction

790db7f

drlukeangel force-pushed the fix/churn-memory-leak branch from 4f417c3 to 790db7f Compare June 1, 2026 15:21

dignifiedquire moved this from 🚑 Needs Triage to 👀 In review in iroh Jun 8, 2026

drlukeangel wants to merge 6 commits into n0-computer:main from drlukeangel:fix/churn-memory-leak
