Add rdd-guest socket bridge for Windows Docker socket forwarding#37

Merged
jandubois merged 1 commit into
mainfrom
add-rdd-guest
May 1, 2026

Member

@Nino-K Nino-K commented Apr 23, 2026

Build rdd-guest from rancher-desktop-daemon's cmd/rdd-guest and include it in the VM image. Add a systemd unit that starts it on WSL2 instances as part of rancher-desktop.target.

Related to: rancher-sandbox/rancher-desktop-daemon#341
and the following issue: rancher-sandbox/rancher-desktop-daemon#157

@Nino-K Nino-K marked this pull request as draft April 24, 2026 17:02
@Nino-K Nino-K marked this pull request as ready for review April 24, 2026 17:51
Member

@jandubois jandubois left a comment

AI review: https://jandubois.github.io/rancher-desktop-opensuse/20260427-144006-pr-37.html

I think only (I1) about the src/rdd-guest/rdd-guest binary is a real concern. That file should probably be added to .gitignore.

Comment thread root/usr/local/lib/systemd/system/rdd-guest.service Outdated
Comment thread root/usr/local/lib/systemd/system/rdd-guest.service
Comment thread src/rdd-guest/main.go
Comment thread src/rdd-guest/main.go
Comment thread src/rdd-guest/main.go
Comment thread src/rdd-guest/main.go Outdated
Comment thread src/rdd-guest/main.go Outdated
Comment thread src/rdd-guest/main.go Outdated
Comment thread src/rdd-guest/main.go Outdated
@Nino-K Nino-K requested a review from mook-as April 29, 2026 19:24
mook-as
mook-as previously approved these changes Apr 29, 2026
Comment thread src/rdd-guest/main.go Outdated
// Package main is the rdd-guest agent that runs inside the Lima/WSL2 VM.
// It listens on a vsock port and forwards connections to the Docker socket,
// enabling the Windows host to reach /var/run/docker.sock via Hyper-V vsock.
//
Contributor

That should be Command rdd-guest is…

jandubois
jandubois previously approved these changes Apr 30, 2026
Member

@jandubois jandubois left a comment

Thanks, LGTM

There is (I1) in https://jandubois.github.io/rancher-desktop-opensuse/20260430-134246-pr-37.html that may be worthwhile addressing if there are further changes in this area.

Member Author

Nino-K commented Apr 30, 2026

> Thanks, LGTM
>
> There is (I1) in https://jandubois.github.io/rancher-desktop-opensuse/20260430-134246-pr-37.html that may be worthwhile addressing if there are further changes in this area.

The suggestion requires a sleep; wouldn't it make more sense for the backoff to go in systemd?

Member

jandubois commented Apr 30, 2026

> The suggestion requires sleep, wouldn't it make sense for the back off to go in the systemd?

I think so, because I think even a 100ms sleep is not enough if it keeps failing. But check with @mook-as first.

Contributor

mook-as commented May 1, 2026

The relevant code:

    for {
        conn, err := l.Accept()
        if err != nil {
            if ctx.Err() != nil {
                return
            }
            log.Printf("rdd-guest: accept: %v", err)
            continue
        }
        go handleConn(ctx, conn)
    }

So the assumption is that Accept returns an error, but the context is not closed (e.g. there's something transiently wrong with the network). In that case, we log.Printf() and continue, so the process stays alive and therefore systemd isn't involved.

If your question is whether it makes sense to just abort and let systemd manage restarting the process:

  • This would mean breaking any existing (and working) connections.
  • We would probably want these kinds of errors to restart our service indefinitely; systemd restarting us would get us into backoffs and eventually into a failed state (i.e. we no longer get restarted).

Ultimately, it probably depends on why we're getting an error from Accept; without that information, just some sort of delay is probably fine. Honestly, though, I don't even know what kinds of errors here might be transient…

@jandubois
Member

> Honestly, though, I don't even know what kinds of errors here might be transient…

From Claude Code:

What (*vsock.Listener).Accept() can return

The library wraps a Linux AF_VSOCK socket; Accept() is accept4(2). Go's runtime swallows EINTR and EAGAIN, so what you actually see falls into four buckets:

1. "Permanent" — programming or setup errors

EBADF, EINVAL, ENOTSOCK, EOPNOTSUPP, EFAULT. These mean the listener is structurally broken (closed fd, not listening, not a socket). If Listen() succeeded, you essentially never see these at runtime — and if you do, retrying does nothing. The current code would hot-loop forever logging the same message.

2. "Genuinely transient" — succeeds on the very next call

ECONNABORTED — peer sent SYN but disconnected before we accepted. The 4.4BSD/Linux convention is "just call accept again." A continue-without-sleep is correct here.

3. "Recoverable with delay" — needs the kernel to free a resource

EMFILE (per-process fd cap), ENFILE (system-wide fd cap), ENOBUFS, ENOMEM. The kernel literally cannot allocate the new fd or socket buffer. Retrying microseconds later returns the same error. This is the case that produces the CPU peg + journal flood — and it's the only realistic non-shutdown scenario worth defending against on a small WSL2 VM with default LimitNOFILE.

4. Listener closed during shutdown

net.ErrClosed ("use of closed network connection") from the <-ctx.Done(); l.Close() goroutine. The if ctx.Err() != nil { return } branch handles this correctly.

So: in-process backoff or let systemd handle it?

Two reasons not to defer this to systemd, both already in @mook-as's reply:

  1. Connections in flight die. Aborting kills every active proxy.
  2. StartLimitBurst is a cliff. With the default RestartSec=5s + StartLimitIntervalSec=10s + StartLimitBurst=5, an EMFILE that persists for 30 seconds parks the unit in failed state. A user-driven container build that briefly exhausts fds takes the bridge down until manual intervention. Backoff inside the process keeps it alive through the squeeze.

And one reason for it:

  1. A fixed 100ms is wrong for EMFILE. @jandubois is right that 100ms doesn't really help — fds may take seconds to free. The right shape is exponential backoff with a cap (e.g., 100ms → 5s, reset on success), or even just a flat 1s.

The cleanest fix is in-process backoff that's error-class-aware:

for {
    conn, err := l.Accept()
    if err != nil {
        if ctx.Err() != nil {
            return
        }
        log.Printf("rdd-guest: accept: %v", err)
        if errors.Is(err, syscall.ECONNABORTED) {
            continue // truly transient, retry immediately
        }
        // backoff is a time.Duration: 1s flat or capped exponential
        select {
        case <-time.After(backoff):
        case <-ctx.Done():
            return
        }
        continue
    }
    go handleConn(ctx, conn)
}

The select on ctx.Done() is the part to not skip — without it, a SIGTERM during the sleep delays shutdown by the full backoff.

If that's more code than the team wants for an inlined helper that will soon be replaced (pkg/socketbridge), a flat 1s sleep covers the EMFILE/ENFILE case without the fast path for the rare ECONNABORTED, and it's two lines instead of ten.

Build rdd-guest from rancher-desktop-daemon's cmd/rdd-guest and include
it in the VM image. Add a systemd unit that starts it on WSL2 instances
as part of rancher-desktop.target.

Signed-off-by: Nino Kodabande <nkodabande@suse.com>
@Nino-K Nino-K dismissed stale reviews from jandubois and mook-as via c57f71b May 1, 2026 17:17
@Nino-K Nino-K requested review from jandubois and mook-as May 1, 2026 17:18
@jandubois jandubois added this to the Tech Preview milestone May 1, 2026
Member

@jandubois jandubois left a comment

Thanks, LGTM

There are still some minor issues, but let's move on.

@jandubois jandubois merged commit a6608bd into main May 1, 2026
13 checks passed
@jandubois jandubois deleted the add-rdd-guest branch May 1, 2026 21:56
3 participants