Skip to content

net/transport: fix yamux dial fd-reuse race (H1) and outgoing sub-stream leak (H2)#707

Merged
requilence merged 1 commit into
mainfrom
worktree-net-transport-audit
Jun 11, 2026
Merged

net/transport: fix yamux dial fd-reuse race (H1) and outgoing sub-stream leak (H2)#707
requilence merged 1 commit into
mainfrom
worktree-net-transport-audit

Conversation

@requilence

@requilence requilence commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Audit of the net transport stack (net/transport/{yamux,quic}, net/peer, net/pool, net/peerservice, net/secureservice + handshake, net/connutil), cross-checked against hashicorp/yamux 0.1.2, quic-go 0.60.0, libp2p tls 0.48.0, storj.io/drpc, and app/ocache.

Fixes in this PR

H1 — yamux dial closed the raw socket fd (fd-reuse / double-close)

Dial installed a net.Dialer.Control that, on ctx cancellation, ran unix.Shutdown(fd); unix.Close(fd) (and a Windows Closesocket variant) on a descriptor the Go netpoller owns. DialContext(ctx) already aborts the connect and closes its own fd, so the two raced to close the same fd number — the classic fd-reuse hazard, where a concurrently-opened fd can be torn down. Removed the Control/abortOnCancel machinery and the abort_{unix,windows,other}.go files; Dial now relies on net.Dialer.Timeout + DialContext(ctx), which handle timeout and cancellation safely. TestDialContextCancellation still aborts in ~100ms.

H2 — outgoing drpc sub-stream leaked on proto-handshake errors

peer.openDrpcConn opened a yamux/quic sub-stream then ran OutgoingProtoHandshake, returning the error without closing the stream. The handshake closes the conn on I/O errors and ctx cancellation, but not on three protocol-level paths (ErrRemoteIncompatibleProto, non-null ack, ErrUnexpectedPayload), so a version-skewed/misbehaving peer caused unbounded stream accumulation. openDrpcConn now closes the sub-stream on any handshake error, matching the incoming side's defer conn.Close().

Verification

  • go build ./net/... and go vet ./net/transport/yamux/... ./net/peer/... — clean
  • go test ./net/transport/yamux/... ./net/peer/... ./net/pool/... ./net/peerservice/... — pass
  • go test -race -run 'TestDialContextCancellation|TestYamuxTransport_Dial' ./net/transport/yamux/... — pass, no races

Refs GO-7313

@requilence requilence requested a review from cheggaaa June 11, 2026 06:11
@github-actions

Copy link
Copy Markdown

New Coverage 58.0% of statements
Patch Coverage 80.0% of changed statements (8/10)

Coverage provided by https://github.com/seriousben/go-patch-cover-action

H1: yamux Dial installed a net.Dialer.Control that, on ctx cancellation,
called Shutdown+Close on the raw socket fd. DialContext(ctx) already aborts
the connect and closes its own fd, so the two raced to close the same fd
number (classic fd-reuse hazard). Drop the Control/abortOnCancel machinery
and the abort_{unix,windows,other}.go files; rely on net.Dialer.Timeout +
DialContext(ctx), which handle timeout and cancellation safely.

H2: peer.openDrpcConn opened a sub-stream then ran OutgoingProtoHandshake,
returning the error without closing the stream. The handshake does not close
the conn on three protocol-level error paths (incompatible/declined/
unexpected proto), so a version-skewed peer leaked streams unboundedly. Close
the sub-stream on any handshake error, matching the incoming serve() path.

Refs GO-7313
@requilence requilence force-pushed the worktree-net-transport-audit branch from 039c05d to b47706c Compare June 11, 2026 13:58
@requilence requilence merged commit ce0d233 into main Jun 11, 2026
4 checks passed
@requilence requilence deleted the worktree-net-transport-audit branch June 11, 2026 14:05
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 11, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants