Skip to content

procnet: replace iptables PREROUTING DNAT with userspace forwarder#10357

Open
jandubois wants to merge 12 commits into
rancher-sandbox:mainfrom
jandubois:procnet-userspace-forwarder
Open

procnet: replace iptables PREROUTING DNAT with userspace forwarder#10357
jandubois wants to merge 12 commits into
rancher-sandbox:mainfrom
jandubois:procnet-userspace-forwarder

Conversation

@jandubois
Copy link
Copy Markdown
Member

@jandubois jandubois commented May 26, 2026

Summary

Replace procnet's PREROUTING DNAT rules with a userspace TCP/UDP forwarder bound on the namespace's tap-interface IP. Eliminates procnet's nat-table writes. Drops the coordination surface #10262 and #10280 worked to manage with CNI portmap and Docker. Removes the route_localnet sysctl write.

Supersedes #10280.

Problem

procnet exposes --network=host 127.0.0.1 listeners to the Windows host by writing a PREROUTING DNAT rule in the nat table. The rule lives in the same chain as CNI portmap (CNI-HOSTPORT-DNAT) and Docker (DOCKER). On iptables-nft kernels (current WSL2 default) DNAT terminates the chain, so the first matching rule wins: procnet shadows
the engine for bridge-network ports.

#10262 fixed the bug with a heuristic stack. #10280 reworked the fix behind a 665-line state machine (portStateTracker) plus 1187 lines of unit tests, probing CNI-HOSTPORT-DNAT and DOCKER before each Append, handling partial failures, append-retry races, and the transient OCI-hook reservation socket.

Approach

Replace the DNAT with a userspace TCP/UDP forwarder bound on the namespace's tap-interface IP. host-switch's gvisor-tap-vsock already routes Windows-side host-port traffic to tapInterfaceIP:HostPort (see src/go/guestagent/pkg/tracker/apitracker.go:106), so a userspace listener there receives Windows-side traffic directly. The forwarder dials 127.0.0.1:` locally.

A two-scan stability gate stays to filter out the transient OCI-hook reservation socket. The gate shrinks because it no longer distinguishes engine-managed from locally-managed ports.

Functional equivalence to the iptables path

Tracing a packet for a --network=host container listening on 127.0.0.1:8009:

  1. host-switch's Expose request always sends Remote: tapInterfaceIP:HostPort. The destination IP inside the namespace is the tap IP, regardless of whether procnet observed 127.0.0.1 or 0.0.0.0 as the listener address.
  2. determineHostIP (apitracker.go:255) only chooses the Windows-side bind IP: admin install binds the observed HostIP; non-admin always binds 127.0.0.1. In-namespace routing does not depend on this.
  3. Existing DNAT path: packet arrives on eth0 with dst=tapInterfaceIP:8009; iptables rewrites to 127.0.0.1:8009; the route_localnet=1 sysctl lets the kernel route the rewritten loopback dst across eth0 → lo; the 127.0.0.1 listener accepts.
  4. This PR's path: packet arrives on eth0 with dst=tapInterfaceIP:8009; the userspace listener bound on tapInterfaceIP:8009 accepts; the proxy dials 127.0.0.1:8009 from a fresh socket; the 127.0.0.1 listener accepts.

Same Windows-side bind IP, same in-namespace destination, same delivery to the 127.0.0.1 listener. The route_localnet sysctl is referenced only in scanner_linux.go on main, so dropping it is safe.

What gets deleted versus #10280

  • port_state_tracker_linux.go (665 lines) — engine-chain probing, partial-failure handling, append-retry.
  • port_state_tracker_linux_test.go (1187 lines).
  • iptables helpers: applyLoopbackIPtablesRule, engineChainManagesPort, dnatChainContainsPort, preroutingHasLoopbackRule, isIptablesRuleAbsent, iptablesCommand, wrapExitError, engineDNATChains, dnatRuleRe.
  • enableLocalnetRouting and the route_localnet sysctl write.

What's added

  • loopback_forwarder_linux.go — TCP/UDP userspace proxy. TCP is a straight accept-and-pipe with a 30-second half-close drain deadline. UDP delegates to gvisor-tap-vsock's forwarder.UDPProxy, which keys flows by source address with a 90-second per-flow idle timeout (UDPConnTrackTimeout).
  • scanner_linux.go is restructured around a Tick(scanned) method with the two-scan gate, making the reconciliation logic unit-testable.
  • test:unit:guestagent yarn target so the procnet tests actually run in CI (matches a CI gap procnet: don't shadow CNI portmap with PREROUTING DNAT #10280 also flagged).

jandubois added 9 commits May 26, 2026 16:51
procnet exposes --network=host 127.0.0.1 listeners to the Windows host
by writing PREROUTING DNAT rules in the nat table. Those rules share a
chain with CNI portmap and Docker, so on iptables-nft kernels (current
WSL2 default) the first matching rule wins -- procnet shadows the
engine for bridge-network ports.

Replace the DNAT with a userspace TCP/UDP forwarder bound on the
namespace's tap-interface IP. host-switch's gvisor-tap-vsock already
routes Windows-side host-port traffic to tapInterfaceIP:HostPort (see
apitracker.go's Expose request), so a userspace listener there
receives the same packets the DNAT rule would have rewritten. The
forwarder dials 127.0.0.1:<port> locally.

A two-scan stability gate stays to filter out nerdctl's OCI
createRuntime hook reservation socket. The gate shrinks because it no
longer distinguishes engine-managed from locally-managed ports.

The route_localnet sysctl write is unnecessary; userspace dial does
not ask the kernel to route packets to a loopback address.

Signed-off-by: Jan Dubois <jan.dubois@suse.com>
The guestagent module had no test:unit target, so its tests never ran
in CI. Add the target and chain it into test:unit alongside
nerdctl-stub, wsl-helper, and rdctl.

Cover the procnet scanner with unit tests for the two-scan stability
gate (transient port suppressed, persistent port promoted on second
sighting, removal effective immediately, wildcard skips forwarder,
mixed loopback/wildcard) and the userspace forwarder with end-to-end
TCP smoke tests (Add/Remove/idempotent Add).

Signed-off-by: Jan Dubois <jan.dubois@suse.com>
relayUDP initially used a single goroutine that serialized all clients
through one outbound socket and blocked the main loop on each upstream
reply. That regressed UDP host-network parity with the iptables
PREROUTING DNAT path, which supports concurrent flows via kernel
conntrack.

Replace it with a per-client outbound socket keyed by source address.
Each flow runs its own reply-pump goroutine. A 30-second
SetReadDeadline closes idle flows; the per-flow map cleans up on flow
close and on relayUDP return.

Add UDP smoke tests covering single-flow round-trip and four-client
concurrent send/receive.

Signed-off-by: Jan Dubois <jan.dubois@suse.com>
The userspace forwarder branch deletes the routeLocalnet sysctl write
and the execLoopbackIPtablesRule log message; drop the now-orphaned
`localnet` and `portbinding` entries from expect.txt. Add `portfwd`
for the Lima package name referenced in a comment.

Rename the ad-hoc short identifiers `derr`, `rbuf`, and `rerr` to
spelled-out names (`dialErr`, `replyBuf`, plain `err` in the test) so
check-spelling does not flag them.

Signed-off-by: Jan Dubois <jan.dubois@suse.com>
The noctx linter blocks raw `net.Listen`, `net.DialTimeout`, and
`net.ListenPacket`. Route those through a `dial` helper that uses
`(*net.Dialer).DialContext` with a one-shot Timeout, plus
`(*net.ListenConfig).Listen` / `ListenPacket` for the upstreams.

`loopbackPortMap` always received port 8009, which triggered unparam.
Switch the mixed-binding test to port 8011 (and update its assertion)
so the helper sees more than one input.

Signed-off-by: Jan Dubois <jan.dubois@suse.com>
Signed-off-by: Jan Dubois <jan.dubois@suse.com>
The scanner observed its own bindIP listeners in /proc/net, which
kept the proto/port key alive after the upstream loopback listener
exited and prevented unpublish. Filter those entries out of the
snapshot before reconciliation; the forwarder owns the authoritative
bindIP set.

Promote pending ports using the current scan's bindings, not the
ones captured on first sighting. publish now returns an error and
rolls back the tracker entry on partial failure, so Tick leaves the
port in pending for next-tick retry instead of recording a stuck
publication.

Replace the hand-rolled UDP demuxer with gvisor-tap-vsock's
forwarder.UDPProxy (the same code Lima uses in pkg/portfwd/client.go).
Retry TCP Accept with exponential backoff on transient errors
instead of exiting the goroutine and leaving a leaked listener.
Half-close per direction on the TCP pipe so a client signaling
end-of-request via CloseWrite still receives the upstream response;
a 30s drain deadline caps wedged peers.

Add eight scanner tests covering UDP, mixed-protocol same-port,
publish retry and rollback, gate restart after a one-tick absence,
binding-shape changes between ticks, and the forwarder-IP filter.

Signed-off-by: Jan Dubois <jan.dubois@suse.com>
Tick now compares scanned bindings against published bindings. A
container restart that switches from a wildcard to a loopback bind
(or vice versa) leaves the port present in /proc/net but with new
addresses; the prior key-only check skipped both unpublish and
re-publish, so a fresh loopback listener stayed unreachable until
the upstream truly vanished. A bindingsEqual check forces a full
unpublish/re-publish cycle on shape mismatch.

publish now routes Error logs through a per-port throttle. The
first failure for a port logs at Error; subsequent failures log at
Debug until publish either succeeds or the port leaves both
pending and published (swept at the end of Tick). Without this,
every tick (~3 s) emitted two Error lines per stuck port when
wsl-proxy or host-switch was down. The throttle holds across both
the tracker.Add failure path and the forwarder.Add failure path;
the success Info log fires only on full publish success.

unpublish's "removed port" Info now fires only when tracker.Remove
succeeded; previously it fired unconditionally, so an operator saw
"removed port" even when wsl-proxy notification failed.

The TCP pipeline's drain deadline now bounds writes as well as
reads, so a copy stuck in Write under TCP backpressure exits at
the 30 s window.

Documentation and contract comments cover: the tap-IP filter's
trade-off (legitimate tap-IP listeners are filtered out), the
EADDRINUSE retry behaviour, the UDP forwarder dial's lifetime ctx
contract, the IPv6 limitation (tcp6/udp6 entries are not scanned),
and godoc on ProcNetScanner, NewProcNetScanner, and
loopbackForwarder.Close.

New tests cover: binding-shape change after publish, throttle
state across tracker.Add and forwarder.Add failure recoveries,
throttle sweep when a transient port vanishes, end-to-end TCP
through the scanner-to-forwarder seam, and end-to-end UDP plus
removal through the same seam.

Signed-off-by: Jan Dubois <jan.dubois@suse.com>
When a host-network container holds both 127.0.0.1:port and
0.0.0.0:port, the wildcard listener inside the engine namespace
already accepts bindIP:port directly. publish previously called
forwarder.Add for the loopback binding anyway; the bind collided
with the wildcard listener's claim, returned EADDRINUSE, and
rolled back tracker.Add -- unexposing the port from host-switch
every tick. Skip forwarder.Add when bindings include a wildcard
entry; the tracker entry alone keeps the port reachable from
Windows.

acceptTCP routes Accept errors through a per-listener log-once
flag matching the publish-failure throttle. Sustained FD pressure
no longer floods the log with one Error line per backoff tick.

pending now stores a key set rather than a PortMap; the promote
loop reads scanned[port] for bindings, so the bindings the
rebuild copied into pending went unread.

The IPv6 docstring on the scanner package describes the common
invisible case (bind([::]:port) with IPV6_V6ONLY=0, the Go and
Python default) instead of the rare both-127.0.0.1-and-[::1]
case. NewProcNetScanner's docstring covers the wildcard-skip path
alongside the loopback-forwarder path.

A regression test pins the mixed-binding short-circuit: a port
carrying both 127.0.0.1 and 0.0.0.0 publishes via tracker without
firing forwarder.Add, with forwarder.Add wired to return an error
that would force a rollback if it ever ran.

Signed-off-by: Jan Dubois <jan.dubois@suse.com>
@jandubois jandubois force-pushed the procnet-userspace-forwarder branch from 09d133b to ba1c841 Compare May 26, 2026 23:53
Signed-off-by: Jan Dubois <jan.dubois@suse.com>
@jandubois jandubois force-pushed the procnet-userspace-forwarder branch from 6d2c369 to cbed3ed Compare May 27, 2026 01:04
The package, type, and scanListeners doc comments said the scanner
reads /proc/net/{tcp,udp} only. In fact procnettcp.ParseFiles reads
all four files; the IPv6 entries are dropped one layer deeper, in
addValidProtoEntryToPortMap's Kind switch. Update the three
comments to match the code path and name the actual filter site.

Move the tap-interface IP validation from inside the procnet
goroutine to the top of runAgent. The same flag drives both
NewAPITracker and NewProcNetScanner; failing fast lets both
consumers surface the malformed value at the same point instead
of running through tracker setup before the lazy procnet check
fires.

Document the rollback invariant at the publish loop: every binding
under a nat.Port carries the same HostPort (addEntryToPortMap
derives both from entry.Port), so forwarder.Add sees one key and
the rollback unwinds at most one listener.

Add EADDRINUSE coverage to the loopback forwarder by binding
bindIP:port from outside before calling forwarder.Add, for both
TCP and UDP.

Signed-off-by: Jan Dubois <jan.dubois@suse.com>
@jandubois jandubois force-pushed the procnet-userspace-forwarder branch from cbed3ed to bcb8ada Compare May 27, 2026 01:23
// a matching forwarder; mirror the forwarder.Add rollback
// above.
log.Errorf("/proc/net scanner: bad port %q: %s", b.HostPort, err)
continue
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, the code below does not "mirror the forwarder.Add rollback above" because of the continue; it mentions this could leak an entry, since tracker.Add already ran successfully before this loop. And then, after continue, the port is captured in p.published, but no forwarder listener is opened.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to replace the continue with something like

     if removeErr := p.tracker.Remove(id); removeErr != nil {
         p.logAddFailure(port, fmt.Sprintf("rollback after bad port: %s", removeErr))
     }
     return fmt.Errorf("/proc/net scanner: bad port %q: %s", b.HostPort, err)

p.addErrorLogged[port] = true
}

func (p *ProcNetScanner) unpublish(port nat.Port, bindings []nat.PortBinding) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the same check hasWildcardBinding as thepublish to make sure we are not removing published ports with both bindings 127.0.0.1 and 0.0.0.0.

The publish loop's defensive ParseUint guard hit continue after
logging, but tracker.Add had already succeeded. The caller
recorded the port as published with no forwarder, despite the
comment claiming to mirror the forwarder.Add rollback. Roll the
tracker entry back and return the error so the next Tick re-pends
the port.

The unpublish path called forwarder.Remove for every loopback
binding, even when publish had skipped opening the forwarder
because the bindings included a wildcard. The real forwarder
returns nil for unknown keys, so the call stayed a no-op; the
asymmetry against publish read as an oversight. Add the matching
hasWildcardBinding short-circuit.

Extend TestMixedBindingsSkipForwarderAndKeepTracker with a
vanish-Tick that asserts forwarder.Remove stays empty during
mixed-binding unpublish.

Signed-off-by: Jan Dubois <jan.dubois@suse.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants