hostagent: stop destroying ga.sock on guest-agent reconnect #4911

Open
mn-ram wants to merge 1 commit into lima-vm:master from mn-ram:fix/ga-sock-reconnect

@mn-ram mn-ram commented Apr 30, 2026

Summary

Fixes a long-standing bug where Lima's host↔guest gRPC connection silently and permanently breaks after any guest-agent restart or in-VM reboot, requiring limactl stop && limactl start to recover.

watchGuestAgentEvents and the inotify-startup goroutine in pkg/hostagent/hostagent.go both call forwardSSH(verbForward, …) for the guest-agent unix socket on every reconnect tick, with no synchronization between them and no -O cancel of the prior forward. forwardSSH first calls os.RemoveAll on the local socket file and only then asks the SSH ControlMaster to bind a new listener, so two things go wrong:

  1. The two goroutines race on os.RemoveAll/bind of the same path; any consumer dialing during the window observes ENOENT.
  2. The ControlMaster still has the previous forward registered, so ssh -O forward -L ga.sock:/run/lima-guestagent.sock exits non-zero with "forwarding for listen path X already exists". forwardSSH's failure branch then unlinks the socket a second time, leaving ga.sock permanently missing on disk while the mux still believes a forward is alive. getOrCreateClient cannot reconnect, dynamic port-forward announcement stops, and inotify mount invalidation goes silent.

The fix introduces a small helper, reForwardGuestAgentSock, that:

  • Serializes via a new gaSockForwardMu so the reconnect loop and the inotify goroutine cannot race.
  • Issues a best-effort verbCancel before verbForward, so the ControlMaster releases the prior registration and the new bind succeeds cleanly.

Both existing call sites are switched to the helper. No API changes, no architecture changes, ~40 lines.

This sits naturally next to the recent fixes in this area (#4889, #4895) and closes the oldest open user report of the symptom.

Closes #2227

Test plan

  • go vet ./pkg/hostagent/... ./cmd/... — clean
  • go test ./pkg/hostagent/... — pass
  • Manual repro on macOS host + Linux guest:
    limactl start default
    ls -l ~/.lima/default/ga.sock                              # exists
    limactl shell default -- sudo systemctl restart lima-guestagent.service
    sleep 12
    ls -l ~/.lima/default/ga.sock                              # still exists (was ENOENT before)
    nc -U ~/.lima/default/ga.sock                              # dial-able
  • Re-run the reproducer 5× in a loop and confirm ga.sock survives every iteration.
  • With mountInotify: true: edit a file on the host after the guest-agent restart and confirm the guest sees the change.

watchGuestAgentEvents and the inotify-startup goroutine both call
forwardSSH(verbForward) for the guest-agent unix socket on every
reconnect tick, with no synchronization between them and no -O cancel
of the prior forward. forwardSSH unlinks the local socket file before
asking the SSH ControlMaster to bind a new listener, so:

  * The two goroutines race on os.RemoveAll/bind of the same path,
    and consumers dialing during the window observe ENOENT.
  * The ControlMaster still has the previous forward registered, so
    `ssh -O forward -L localUnix:remoteUnix` exits non-zero with
    "forwarding for listen path X already exists". forwardSSH's
    failure branch then unlinks the socket a second time, leaving
    ga.sock permanently missing on disk while the mux still believes
    a forward is alive. getOrCreateClient cannot reconnect, dynamic
    port forwarding stops being announced, and inotify mount
    invalidation goes silent until the user runs `limactl stop`
    followed by `limactl start`.

Fix: introduce reForwardGuestAgentSock, which serializes via a new
gaSockForwardMu and issues a best-effort -O cancel before -O forward.
Both call sites in watchGuestAgentEvents now go through the helper.

Reproduces deterministically on master with:

    limactl start default
    limactl shell default -- sudo systemctl restart lima-guestagent.service
    sleep 12
    ls ~/.lima/default/ga.sock   # ENOENT before the fix; present after.

Closes: lima-vm#2227
Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram force-pushed the fix/ga-sock-reconnect branch from 414fab2 to 679d5e1 Compare April 30, 2026 03:39

Development

Successfully merging this pull request may close these issues.

Guest agent sometimes becomes unavailable from the host
