procnet: replace iptables PREROUTING DNAT with userspace forwarder#10357
Open
jandubois wants to merge 12 commits into
Open
procnet: replace iptables PREROUTING DNAT with userspace forwarder#10357jandubois wants to merge 12 commits into
jandubois wants to merge 12 commits into
Conversation
This was referenced May 26, 2026
procnet exposes --network=host 127.0.0.1 listeners to the Windows host by writing PREROUTING DNAT rules in the nat table. Those rules share a chain with CNI portmap and Docker, so on iptables-nft kernels (current WSL2 default) the first matching rule wins -- procnet shadows the engine for bridge-network ports. Replace the DNAT with a userspace TCP/UDP forwarder bound on the namespace's tap-interface IP. host-switch's gvisor-tap-vsock already routes Windows-side host-port traffic to tapInterfaceIP:HostPort (see apitracker.go's Expose request), so a userspace listener there receives the same packets the DNAT rule would have rewritten. The forwarder dials 127.0.0.1:<port> locally. A two-scan stability gate stays to filter out nerdctl's OCI createRuntime hook reservation socket. The gate shrinks because it no longer distinguishes engine-managed from locally-managed ports. The route_localnet sysctl write is unnecessary; userspace dial does not ask the kernel to route packets to a loopback address. Signed-off-by: Jan Dubois <jan.dubois@suse.com>
The guestagent module had no test:unit target, so its tests never ran in CI. Add the target and chain it into test:unit alongside nerdctl-stub, wsl-helper, and rdctl. Cover the procnet scanner with unit tests for the two-scan stability gate (transient port suppressed, persistent port promoted on second sighting, removal effective immediately, wildcard skips forwarder, mixed loopback/wildcard) and the userspace forwarder with end-to-end TCP smoke tests (Add/Remove/idempotent Add). Signed-off-by: Jan Dubois <jan.dubois@suse.com>
relayUDP initially used a single goroutine that serialized all clients through one outbound socket and blocked the main loop on each upstream reply. That regressed UDP host-network parity with the iptables PREROUTING DNAT path, which supports concurrent flows via kernel conntrack. Replace it with a per-client outbound socket keyed by source address. Each flow runs its own reply-pump goroutine. A 30-second SetReadDeadline closes idle flows; the per-flow map cleans up on flow close and on relayUDP return. Add UDP smoke tests covering single-flow round-trip and four-client concurrent send/receive. Signed-off-by: Jan Dubois <jan.dubois@suse.com>
The userspace forwarder branch deletes the routeLocalnet sysctl write and the execLoopbackIPtablesRule log message; drop the now-orphaned `localnet` and `portbinding` entries from expect.txt. Add `portfwd` for the Lima package name referenced in a comment. Rename the ad-hoc short identifiers `derr`, `rbuf`, and `rerr` to spelled-out names (`dialErr`, `replyBuf`, plain `err` in the test) so check-spelling does not flag them. Signed-off-by: Jan Dubois <jan.dubois@suse.com>
The noctx linter blocks raw `net.Listen`, `net.DialTimeout`, and `net.ListenPacket`. Route those through a `dial` helper that uses `(*net.Dialer).DialContext` with a one-shot Timeout, plus `(*net.ListenConfig).Listen` / `ListenPacket` for the upstreams. `loopbackPortMap` always received port 8009, which triggered unparam. Switch the mixed-binding test to port 8011 (and update its assertion) so the helper sees more than one input. Signed-off-by: Jan Dubois <jan.dubois@suse.com>
Signed-off-by: Jan Dubois <jan.dubois@suse.com>
The scanner observed its own bindIP listeners in /proc/net, which kept the proto/port key alive after the upstream loopback listener exited and prevented unpublish. Filter those entries out of the snapshot before reconciliation; the forwarder owns the authoritative bindIP set. Promote pending ports using the current scan's bindings, not the ones captured on first sighting. publish now returns an error and rolls back the tracker entry on partial failure, so Tick leaves the port in pending for next-tick retry instead of recording a stuck publication. Replace the hand-rolled UDP demuxer with gvisor-tap-vsock's forwarder.UDPProxy (the same code Lima uses in pkg/portfwd/client.go). Retry TCP Accept with exponential backoff on transient errors instead of exiting the goroutine and leaving a leaked listener. Half-close per direction on the TCP pipe so a client signaling end-of-request via CloseWrite still receives the upstream response; a 30s drain deadline caps wedged peers. Add eight scanner tests covering UDP, mixed-protocol same-port, publish retry and rollback, gate restart after a one-tick absence, binding-shape changes between ticks, and the forwarder-IP filter. Signed-off-by: Jan Dubois <jan.dubois@suse.com>
Tick now compares scanned bindings against published bindings. A container restart that switches from a wildcard to a loopback bind (or vice versa) leaves the port present in /proc/net but with new addresses; the prior key-only check skipped both unpublish and re-publish, so a fresh loopback listener stayed unreachable until the upstream truly vanished. A bindingsEqual check forces a full unpublish/re-publish cycle on shape mismatch. publish now routes Error logs through a per-port throttle. The first failure for a port logs at Error; subsequent failures log at Debug until publish either succeeds or the port leaves both pending and published (swept at the end of Tick). Without this, every tick (~3 s) emitted two Error lines per stuck port when wsl-proxy or host-switch was down. The throttle holds across both the tracker.Add failure path and the forwarder.Add failure path; the success Info log fires only on full publish success. unpublish's "removed port" Info now fires only when tracker.Remove succeeded; previously it fired unconditionally, so an operator saw "removed port" even when wsl-proxy notification failed. The TCP pipeline's drain deadline now bounds writes as well as reads, so a copy stuck in Write under TCP backpressure exits at the 30 s window. Documentation and contract comments cover: the tap-IP filter's trade-off (legitimate tap-IP listeners are filtered out), the EADDRINUSE retry behaviour, the UDP forwarder dial's lifetime ctx contract, the IPv6 limitation (tcp6/udp6 entries are not scanned), and godoc on ProcNetScanner, NewProcNetScanner, and loopbackForwarder.Close. New tests cover: binding-shape change after publish, throttle state across tracker.Add and forwarder.Add failure recoveries, throttle sweep when a transient port vanishes, end-to-end TCP through the scanner-to-forwarder seam, and end-to-end UDP plus removal through the same seam. Signed-off-by: Jan Dubois <jan.dubois@suse.com>
When a host-network container holds both 127.0.0.1:port and 0.0.0.0:port, the wildcard listener inside the engine namespace already accepts bindIP:port directly. publish previously called forwarder.Add for the loopback binding anyway; the bind collided with the wildcard listener's claim, returned EADDRINUSE, and rolled back tracker.Add -- unexposing the port from host-switch every tick. Skip forwarder.Add when bindings include a wildcard entry; the tracker entry alone keeps the port reachable from Windows. acceptTCP routes Accept errors through a per-listener log-once flag matching the publish-failure throttle. Sustained FD pressure no longer floods the log with one Error line per backoff tick. pending now stores a key set rather than a PortMap; the promote loop reads scanned[port] for bindings, so the bindings the rebuild copied into pending went unread. The IPv6 docstring on the scanner package describes the common invisible case (bind([::]:port) with IPV6_V6ONLY=0, the Go and Python default) instead of the rare both-127.0.0.1-and-[::1] case. NewProcNetScanner's docstring covers the wildcard-skip path alongside the loopback-forwarder path. A regression test pins the mixed-binding short-circuit: a port carrying both 127.0.0.1 and 0.0.0.0 publishes via tracker without firing forwarder.Add, with forwarder.Add wired to return an error that would force a rollback if it ever ran. Signed-off-by: Jan Dubois <jan.dubois@suse.com>
09d133b to
ba1c841
Compare
Signed-off-by: Jan Dubois <jan.dubois@suse.com>
6d2c369 to
cbed3ed
Compare
The package, type, and scanListeners doc comments said the scanner
reads /proc/net/{tcp,udp} only. In fact procnettcp.ParseFiles reads
all four files; the IPv6 entries are dropped one layer deeper, in
addValidProtoEntryToPortMap's Kind switch. Update the three
comments to match the code path and name the actual filter site.
Move the tap-interface IP validation from inside the procnet
goroutine to the top of runAgent. The same flag drives both
NewAPITracker and NewProcNetScanner; failing fast lets both
consumers surface the malformed value at the same point instead
of running through tracker setup before the lazy procnet check
fires.
Document the rollback invariant at the publish loop: every binding
under a nat.Port carries the same HostPort (addEntryToPortMap
derives both from entry.Port), so forwarder.Add sees one key and
the rollback unwinds at most one listener.
Add EADDRINUSE coverage to the loopback forwarder by binding
bindIP:port from outside before calling forwarder.Add, for both
TCP and UDP.
Signed-off-by: Jan Dubois <jan.dubois@suse.com>
cbed3ed to
bcb8ada
Compare
Nino-K
reviewed
May 27, 2026
| // a matching forwarder; mirror the forwarder.Add rollback | ||
| // above. | ||
| log.Errorf("/proc/net scanner: bad port %q: %s", b.HostPort, err) | ||
| continue |
Member
There was a problem hiding this comment.
If I'm not mistaken, the code below does not "mirror the forwarder.Add rollback above" because of the continue; it mentions this could leak an entry, since tracker.Add already ran successfully before this loop. And then, after continue, the port is captured in p.published, but no forwarder listener is opened.
Member
There was a problem hiding this comment.
Would it make sense to replace the continue with something like
if removeErr := p.tracker.Remove(id); removeErr != nil {
p.logAddFailure(port, fmt.Sprintf("rollback after bad port: %s", removeErr))
}
return fmt.Errorf("/proc/net scanner: bad port %q: %s", b.HostPort, err)
| p.addErrorLogged[port] = true | ||
| } | ||
|
|
||
| func (p *ProcNetScanner) unpublish(port nat.Port, bindings []nat.PortBinding) { |
Member
There was a problem hiding this comment.
Do we need the same check hasWildcardBinding as thepublish to make sure we are not removing published ports with both bindings 127.0.0.1 and 0.0.0.0.
The publish loop's defensive ParseUint guard hit continue after logging, but tracker.Add had already succeeded. The caller recorded the port as published with no forwarder, despite the comment claiming to mirror the forwarder.Add rollback. Roll the tracker entry back and return the error so the next Tick re-pends the port. The unpublish path called forwarder.Remove for every loopback binding, even when publish had skipped opening the forwarder because the bindings included a wildcard. The real forwarder returns nil for unknown keys, so the call stayed a no-op; the asymmetry against publish read as an oversight. Add the matching hasWildcardBinding short-circuit. Extend TestMixedBindingsSkipForwarderAndKeepTracker with a vanish-Tick that asserts forwarder.Remove stays empty during mixed-binding unpublish. Signed-off-by: Jan Dubois <jan.dubois@suse.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace procnet's PREROUTING DNAT rules with a userspace TCP/UDP forwarder bound on the namespace's tap-interface IP. Eliminates procnet's nat-table writes. Drops the coordination surface #10262 and #10280 worked to manage with CNI portmap and Docker. Removes the
route_localnetsysctl write.Supersedes #10280.
Problem
procnet exposes
--network=host127.0.0.1 listeners to the Windows host by writing a PREROUTING DNAT rule in the nat table. The rule lives in the same chain as CNI portmap (CNI-HOSTPORT-DNAT) and Docker (DOCKER). On iptables-nft kernels (current WSL2 default) DNAT terminates the chain, so the first matching rule wins: procnet shadowsthe engine for bridge-network ports.
#10262 fixed the bug with a heuristic stack. #10280 reworked the fix behind a 665-line state machine (
portStateTracker) plus 1187 lines of unit tests, probingCNI-HOSTPORT-DNATandDOCKERbefore each Append, handling partial failures, append-retry races, and the transient OCI-hook reservation socket.Approach
Replace the DNAT with a userspace TCP/UDP forwarder bound on the namespace's tap-interface IP. host-switch's gvisor-tap-vsock already routes Windows-side host-port traffic to
tapInterfaceIP:HostPort(see src/go/guestagent/pkg/tracker/apitracker.go:106), so a userspace listener there receives Windows-side traffic directly. The forwarder dials127.0.0.1:` locally.A two-scan stability gate stays to filter out the transient OCI-hook reservation socket. The gate shrinks because it no longer distinguishes engine-managed from locally-managed ports.
Functional equivalence to the iptables path
Tracing a packet for a
--network=hostcontainer listening on127.0.0.1:8009:Remote: tapInterfaceIP:HostPort. The destination IP inside the namespace is the tap IP, regardless of whether procnet observed127.0.0.1or0.0.0.0as the listener address.determineHostIP(apitracker.go:255) only chooses the Windows-side bind IP: admin install binds the observed HostIP; non-admin always binds 127.0.0.1. In-namespace routing does not depend on this.dst=tapInterfaceIP:8009; iptables rewrites to127.0.0.1:8009; theroute_localnet=1sysctl lets the kernel route the rewritten loopback dst across eth0 → lo; the 127.0.0.1 listener accepts.dst=tapInterfaceIP:8009; the userspace listener bound ontapInterfaceIP:8009accepts; the proxy dials127.0.0.1:8009from a fresh socket; the 127.0.0.1 listener accepts.Same Windows-side bind IP, same in-namespace destination, same delivery to the 127.0.0.1 listener. The
route_localnetsysctl is referenced only inscanner_linux.goon main, so dropping it is safe.What gets deleted versus #10280
port_state_tracker_linux.go(665 lines) — engine-chain probing, partial-failure handling, append-retry.port_state_tracker_linux_test.go(1187 lines).applyLoopbackIPtablesRule,engineChainManagesPort,dnatChainContainsPort,preroutingHasLoopbackRule,isIptablesRuleAbsent,iptablesCommand,wrapExitError,engineDNATChains,dnatRuleRe.enableLocalnetRoutingand theroute_localnetsysctl write.What's added
loopback_forwarder_linux.go— TCP/UDP userspace proxy. TCP is a straight accept-and-pipe with a 30-second half-close drain deadline. UDP delegates to gvisor-tap-vsock'sforwarder.UDPProxy, which keys flows by source address with a 90-second per-flow idle timeout (UDPConnTrackTimeout).scanner_linux.gois restructured around aTick(scanned)method with the two-scan gate, making the reconciliation logic unit-testable.test:unit:guestagentyarn target so the procnet tests actually run in CI (matches a CI gap procnet: don't shadow CNI portmap with PREROUTING DNAT #10280 also flagged).