
hostagent: close stale GuestAgentClient on reconnect to stop ClientConn leak #4889

Merged
unsuman merged 1 commit into lima-vm:master from mn-ram:fix/guestagent-client-leak
Apr 27, 2026

Conversation

@mn-ram
Contributor

@mn-ram mn-ram commented Apr 26, 2026

What

  • GuestAgentClient now retains *grpc.ClientConn and exposes Close().
  • HostAgent.getOrCreateClient closes the previous a.client before overwriting it.
  • A cleanUp is registered in watchGuestAgentEvents so the live client is closed on hostagent shutdown.

Why

Fixes #4888.

Every time the guest agent restarts or the VM reboots, getOrCreateClient overwrites a.client with a fresh GuestAgentClient. The previous client's *grpc.ClientConn was never exposed to any caller, so nothing could close it: the ClientConn and its goroutines (resolver, balancer, HTTP/2 transport), plus the dialed net.Conn to the forwarded ga.sock, leak permanently. On long-running Lima instances (Colima / Rancher Desktop / Finch) the leak grows without bound as reconnects accumulate and eventually exhausts the FD limit.

Test plan

  • go build ./...
  • go vet ./pkg/guestagent/... ./pkg/hostagent/...
  • go test ./pkg/guestagent/... ./pkg/hostagent/...
  • Manual: run limactl shell default sudo systemctl restart lima-guestagent 50 times in a loop while watching the Threads count in /proc/<hostagent-pid>/status and the output of ls /proc/<hostagent-pid>/fd | wc -l. Both stay flat after this PR; both increase monotonically on master.
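
The manual check reads two numbers from procfs. A small Linux-only helper along the following lines could sample them between restarts; it is a sketch for illustration, and `parseThreads` and `countFDs` are hypothetical helper names that are not part of Lima.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseThreads extracts the "Threads:" value from the text of /proc/<pid>/status.
func parseThreads(status string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "Threads:") {
			return strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "Threads:")))
		}
	}
	return 0, fmt.Errorf("no Threads line found")
}

// countFDs counts the entries in /proc/<pid>/fd, like `ls /proc/<pid>/fd | wc -l`.
func countFDs(pid int) (int, error) {
	entries, err := os.ReadDir(fmt.Sprintf("/proc/%d/fd", pid))
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	// Inspect the PID given as the first argument, or this process by default.
	pid := os.Getpid()
	if len(os.Args) > 1 {
		p, err := strconv.Atoi(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, "invalid pid:", err)
			os.Exit(1)
		}
		pid = p
	}
	status, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		fmt.Fprintln(os.Stderr, "read status:", err)
		os.Exit(1)
	}
	threads, err := parseThreads(string(status))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fds, err := countFDs(pid)
	if err != nil {
		fmt.Fprintln(os.Stderr, "count fds:", err)
		os.Exit(1)
	}
	fmt.Printf("pid=%d threads=%d fds=%d\n", pid, threads, fds)
}
```

Running it with the hostagent PID after each guest-agent restart gives one sample per reconnect, which makes a steadily growing count easy to spot.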

@mn-ram mn-ram force-pushed the fix/guestagent-client-leak branch from e6ae279 to 3b4f6ce Compare April 26, 2026 17:18
@jandubois
Member

jandubois commented Apr 27, 2026

I re-ran the failed CI test; it was just flaky. The PR looks good, but it has a potential race condition; see the AI review at https://jandubois.github.io/lima/20260426-165750-pr-4889.html

mn-ram added a commit to mn-ram/lima that referenced this pull request Apr 27, 2026
watchGuestAgentEvents reads a.client directly (twice on the same line)
without holding clientMu, while getOrCreateClient now nils a.client
mid-transition before reassigning it. A concurrent reader could observe
the transient nil and pass it to isGuestAgentSocketAccessible, panicking
on a nil-pointer dereference.

Introduce a getClient() helper that snapshots a.client under the lock,
and use it from both the inotify-startup goroutine and the main watch
loop.

Addresses review feedback on lima-vm#4889.

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
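
The locking pattern this commit message describes can be sketched as below. It is a simplified model rather than the code in the commit: `swapClient` is a hypothetical stand-in for the part of `getOrCreateClient` that sets `a.client` to nil and then reassigns it, and the sketch assumes that whole transition happens while `clientMu` is held.

```go
package main

import (
	"fmt"
	"sync"
)

// GuestAgentClient is reduced to a label for this sketch.
type GuestAgentClient struct {
	name string
}

// HostAgent guards its client pointer with clientMu.
type HostAgent struct {
	clientMu sync.Mutex
	client   *GuestAgentClient
}

// getClient returns a snapshot of a.client taken under clientMu.
func (a *HostAgent) getClient() *GuestAgentClient {
	a.clientMu.Lock()
	defer a.clientMu.Unlock()
	return a.client
}

// swapClient (hypothetical) mimics the writer: it clears a.client and then
// reassigns it, all while holding clientMu.
func (a *HostAgent) swapClient(next *GuestAgentClient) {
	a.clientMu.Lock()
	defer a.clientMu.Unlock()
	a.client = nil // transient nil, visible only to code holding the lock
	a.client = next
}

func main() {
	a := &HostAgent{client: &GuestAgentClient{name: "gen-0"}}
	var wg sync.WaitGroup
	wg.Add(2)

	// Writer: repeatedly replaces the client, as reconnects would.
	go func() {
		defer wg.Done()
		for i := 1; i <= 1000; i++ {
			a.swapClient(&GuestAgentClient{name: fmt.Sprintf("gen-%d", i)})
		}
	}()

	// Reader: takes locked snapshots, as the watch loop would.
	nilSeen := 0
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			if a.getClient() == nil {
				nilSeen++
			}
		}
	}()

	wg.Wait()
	fmt.Println("nil snapshots observed:", nilSeen)
}
```

Because the reader only sees the pointer through `getClient()`, it cannot observe the intermediate nil. A direct read of `a.client` without the lock is a data race in Go and could see it.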
@mn-ram
Contributor Author

mn-ram commented Apr 27, 2026

Thanks @jandubois for the review!

Good catch — setting a.client = nil introduced a race with unsynchronized reads.

Fixed in fae74fb by adding getClient() (locked via clientMu) and updating call sites.

Build, vet, and tests pass locally.

Happy to squash commits if preferred.

@unsuman
Member

unsuman commented Apr 27, 2026

> Happy to squash commits if preferred.

Yes please!

@unsuman unsuman added this to the v2.2.0 milestone Apr 27, 2026
…nn leak

GuestAgentClient now retains its *grpc.ClientConn and exposes Close().
HostAgent.getOrCreateClient closes the previous client before replacing it,
and a cleanUp is registered so the live client is closed on shutdown.

Without this, each guest-agent restart or VM reboot leaks the gRPC
ClientConn (resolver, balancer, transport goroutines) and the underlying
net.Conn to the forwarded ga.sock, accumulating goroutines and file
descriptors for the lifetime of the hostagent.

Reads of a.client in watchGuestAgentEvents (the inotify-startup goroutine
and the main watch loop) are now serialized through a getClient() helper
under clientMu so callers cannot observe the brief nil window introduced
by getOrCreateClient when it closes the stale client before reassigning.

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram force-pushed the fix/guestagent-client-leak branch from fae74fb to 714becb Compare April 27, 2026 13:00
@mn-ram
Contributor Author

mn-ram commented Apr 27, 2026

> > Happy to squash commits if preferred.
>
> Yes please!

Done

@AkihiroSuda
Member

> the v2.2.0 milestone

This PR seems safe to merge in v2.1.2?

Member

@unsuman unsuman left a comment

Thanks

@unsuman unsuman modified the milestones: v2.2.0, v2.1.2 Apr 27, 2026
@unsuman unsuman merged commit 917327d into lima-vm:master Apr 27, 2026
35 checks passed

Development

Successfully merging this pull request may close these issues.

hostagent: GuestAgentClient leaks gRPC ClientConn (goroutines + FDs) on every guest-agent reconnect

4 participants