hostagent: close stale GuestAgentClient on reconnect to stop ClientConn leak#4889
Merged
unsuman merged 1 commit intolima-vm:masterfrom Apr 27, 2026
Merged
Conversation
e6ae279 to
3b4f6ce
Compare
Member
|
I did re-run the failed CI test, it was just flaky. The PR looks good, but has a potential race condition; see AI review at https://jandubois.github.io/lima/20260426-165750-pr-4889.html |
mn-ram
added a commit
to mn-ram/lima
that referenced
this pull request
Apr 27, 2026
watchGuestAgentEvents reads a.client directly (twice on the same line) without holding clientMu, while getOrCreateClient now nils a.client mid-transition before reassigning it. A concurrent reader could observe the transient nil and pass it to isGuestAgentSocketAccessible, panicking on a nil-pointer dereference. Introduce a getClient() helper that snapshots a.client under the lock, and use it from both the inotify-startup goroutine and the main watch loop. Addresses review feedback on lima-vm#4889. Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
Contributor
Author
|
Thanks @jandubois for the review! Good catch — setting Fixed in Build, vet, and tests pass locally. Happy to squash commits if preferred. |
Member
Yes please! |
…nn leak GuestAgentClient now retains its *grpc.ClientConn and exposes Close(). HostAgent.getOrCreateClient closes the previous client before replacing it, and a cleanUp is registered so the live client is closed on shutdown. Without this, each guest-agent restart or VM reboot leaks the gRPC ClientConn (resolver, balancer, transport goroutines) and the underlying net.Conn to the forwarded ga.sock, accumulating goroutines and file descriptors for the lifetime of the hostagent. Reads of a.client in watchGuestAgentEvents (the inotify-startup goroutine and the main watch loop) are now serialized through a getClient() helper under clientMu so callers cannot observe the brief nil window introduced by getOrCreateClient when it closes the stale client before reassigning. Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
fae74fb to
714becb
Compare
Contributor
Author
Done |
Member
This PR seems safe to merge in v2.1.2? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
GuestAgentClientnow retains*grpc.ClientConnand exposesClose().HostAgent.getOrCreateClientcloses the previousa.clientbefore overwriting it.cleanUpis registered inwatchGuestAgentEventsso the live client is closed on hostagent shutdown.Why
Fixes #4888.
Every time the guest agent restarts or the VM reboots,
getOrCreateClientoverwritesa.clientwith a freshGuestAgentClientwhose*grpc.ClientConnwas never reachable to any caller. The previousClientConnand its goroutines (resolver, balancer, HTTP/2 transport) plus the dialednet.Connto the forwardedga.sockleak permanently. On long-running Lima instances (Colima / Rancher Desktop / Finch) this grows unbounded with reconnect count and eventually exhausts the FD limit.Test plan
go build ./...go vet ./pkg/guestagent/... ./pkg/hostagent/...go test ./pkg/guestagent/... ./pkg/hostagent/...limactl shell default sudo systemctl restart lima-guestagent50× and watch/proc/<hostagent-pid>/statusThreads count andls /proc/<hostagent-pid>/fd | wc -l— both flat after this PR, both monotonically increasing onmaster.