Merged
6 changes: 6 additions & 0 deletions test/main_test.go
@@ -96,6 +96,12 @@ func TestMain(m *testing.M) {
cancel()
}()

+ defer func() {
+ if err := testEnv.Close(); err != nil {
+ fmt.Fprintln(os.Stderr, "Error closing test environment:", err)
+ }
+ }()
Comment on lines +99 to +103
Copilot AI Mar 3, 2026

This defer testEnv.Close() won’t run on the timeout path in the interrupt watcher goroutine, because that goroutine calls os.Exit(...) (which skips defers). If the suite hits that 30s timeout, dial-stdio can still be orphaned. Consider explicitly calling testEnv.Close() (and possibly tp.Shutdown) just before the os.Exit in the timeout path.
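
The point about `os.Exit` skipping deferred calls can be sketched as follows. This is a minimal illustration, not the PR's code: `fakeEnv` and `exitWithCleanup` are hypothetical names, and the exit function is passed in as a parameter only so the ordering can be checked without terminating the process.

```go
package main

import (
	"fmt"
	"os"
)

// fakeEnv is a hypothetical stand-in for the test environment.
type fakeEnv struct{ closed bool }

// Close records that cleanup ran.
func (e *fakeEnv) Close() error {
	e.closed = true
	return nil
}

// exitWithCleanup closes env explicitly and only then calls exit.
// In production code exit would be os.Exit.
func exitWithCleanup(env *fakeEnv, code int, exit func(int)) {
	if err := env.Close(); err != nil {
		fmt.Fprintln(os.Stderr, "Error closing test environment:", err)
	}
	exit(code)
}

func main() {
	env := &fakeEnv{}
	// A deferred env.Close() placed here would be skipped, because
	// os.Exit terminates the process without running deferred functions.
	exitWithCleanup(env, 0, os.Exit)
}
```

Calling the cleanup directly before the exit call is the simplest option on a path that cannot return normally; restructuring so that only `main` calls `os.Exit` after all defers have run is another.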


if err := testEnv.Load(ctx, phonyRef, fixtures.PhonyFrontend); err != nil {
panic(err)
}
88 changes: 37 additions & 51 deletions test/testenv/buildx.go
@@ -11,10 +11,8 @@ import (
"net"
"os"
"os/exec"
- "runtime"
"strings"
"sync"
- "syscall"
"testing"
"time"

@@ -56,6 +54,20 @@ func (b *BuildxEnv) WithBuilder(builder string) *BuildxEnv {
return b
}

+ // Close closes the underlying buildkit client connection, which triggers
+ // cleanup of the dial-stdio process.
+ func (b *BuildxEnv) Close() error {
+ b.mu.Lock()
+ defer b.mu.Unlock()
+
+ if b.client != nil {
+ err := b.client.Close()
+ b.client = nil
+ return err
+ }
+ return nil
+ }
+
// Load loads the output of the specified [gwclient.BuildFunc] into the buildkit instance.
func (b *BuildxEnv) Load(ctx context.Context, id string, f gwclient.BuildFunc) error {
if b.refs == nil {
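
The new `Close` method above clears the client field after closing it, so a second call does nothing. A minimal sketch of that pattern, using hypothetical `env` and `countingCloser` types rather than the real `BuildxEnv` and buildkit client:

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// countingCloser is a hypothetical client that counts Close calls.
type countingCloser struct{ calls int }

func (c *countingCloser) Close() error {
	c.calls++
	return nil
}

// env is a hypothetical owner of a client connection.
type env struct {
	mu     sync.Mutex
	client io.Closer
}

// Close closes the client at most once; later calls return nil.
func (e *env) Close() error {
	e.mu.Lock()
	defer e.mu.Unlock()

	if e.client == nil {
		return nil
	}
	err := e.client.Close()
	e.client = nil // make repeated Close calls a no-op
	return err
}

func main() {
	c := &countingCloser{}
	e := &env{client: c}
	fmt.Println(e.Close(), e.Close(), c.calls)
}
```

Resetting the field under the mutex means a deferred `Close` and an explicit `Close` on another path can coexist without closing the underlying connection twice.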
@@ -87,10 +99,7 @@ func (c *connCloseWrapper) Close() error {
if c.close != nil {
c.close()
}
- if err := c.Conn.Close(); err != nil {
- return err
- }
- return nil
+ return c.Conn.Close()
}

func (b *BuildxEnv) dialStdio(ctx context.Context) error {
@@ -106,19 +115,10 @@ func (b *BuildxEnv) dialStdio(ctx context.Context) error {
// the buildx dial-stdio process from cleaning up its resources properly.
cmd := exec.Command("docker", args...)
cmd.Env = os.Environ()
- cmd.SysProcAttr = &syscall.SysProcAttr{
- // Put the child in its own process group so we can kill the entire
- // group (docker + docker-buildx plugin) during cleanup.
- Setpgid: true,
- // Send SIGTERM to the child process when the parent (test process) dies.
- // This prevents dial-stdio processes from being orphaned when the test
- // suite is interrupted or crashes.
- Pdeathsig: syscall.SIGTERM,
- }

- c1, c2 := net.Pipe()
- cmd.Stdin = c1
- cmd.Stdout = c1
+ dialStdioConn, clientConn := net.Pipe()
+ cmd.Stdin = dialStdioConn
+ cmd.Stdout = dialStdioConn

// Use a pipe to check when the connection is actually complete
// Also write all of stderr to an error buffer so we can have more details
@@ -128,27 +128,17 @@ func (b *BuildxEnv) dialStdio(ctx context.Context) error {
ww := io.MultiWriter(w, errBuf)
cmd.Stderr = ww

- // processDone is closed when cmd.Wait() returns, signaling the cleanup
+ if err := cmd.Start(); err != nil {
Copilot AI Mar 3, 2026

If cmd.Start() fails, this returns immediately without closing the pipe/pipe ends created above (dialStdioConn/clientConn and the io.Pipe reader/writer). That can leave resources lingering on this error path; it’s safer to close those conns and the pipe writer/reader before returning the error (or set up defers before calling Start).

Suggested change
- if err := cmd.Start(); err != nil {
+ if err := cmd.Start(); err != nil {
+ // Clean up all pipes if the command fails to start.
+ _ = dialStdioConn.Close()
+ _ = clientConn.Close()
+ _ = w.Close()
+ _ = r.Close()

Contributor Author

@copilot those are in-memory pipes and should be deallocated once nothing references them, so the explicit cleanup shouldn't be necessary?

Collaborator

I think we should do the cleanup here since this is not in func main()

Contributor Author

Those are all in-memory and will be deallocated on the error path, so there is no need for explicit cleanup. It would be different if we used, e.g., os.Pipe().

+ return nil, err
+ }
Comment on lines +131 to +133
Copilot AI Mar 3, 2026

After cmd.Start() succeeds, the dialer no longer has any context-based cancellation path (since it intentionally avoids exec.CommandContext). If ctx is cancelled while waiting for the dial-stdio connection handshake (later in this function), the docker/buildx process can keep running and the dial can hang. Consider wiring ctx.Done() into the startup/handshake flow and ensuring the process is signaled/killed and cmd.Wait() is unblocked on cancellation.

Contributor Author

I'm investigating that.


+ // chWait is closed when cmd.Wait() returns, signaling the cleanup
// function that the process has exited.
- processDone := make(chan struct{})
+ chWait := make(chan struct{})
go func() {
- // Lock this goroutine to its OS thread for the lifetime of the child process.
- // Pdeathsig delivers the signal when the *thread* that forked the child exits,
- // not when the process exits. Without locking, the Go runtime may destroy the
- // thread that called cmd.Start(), prematurely delivering SIGTERM to the child.
- runtime.LockOSThread()
- defer runtime.UnlockOSThread()
-
- if err := cmd.Start(); err != nil {
- // Propagate the start error through the stderr pipe so the
- // scanner below will surface it via scanner.Err().
- w.CloseWithError(err)
- return
- }
-
err := cmd.Wait()
- close(processDone)
- c1.Close()
+ close(chWait)
+ dialStdioConn.Close()
// pkgerrors.Wrap will return nil if err is nil, otherwise it will give
// us a wrapped error with the buffered stderr from the command.
w.CloseWithError(pkgerrors.Wrapf(err, "%s", errBuf))
Comment on lines +140 to 144
Copilot AI Mar 3, 2026

The stderr-drain logic can be accidentally defeated if the io.PipeReader (r) is closed as soon as the dialer returns (e.g. via a defer r.Close()), because that makes the exec stderr copy goroutine stop early and can let the child block once its stderr pipe buffer fills. Consider keeping r open for the lifetime of the process and only closing it after cmd.Wait() completes (e.g., close r from this cmd.Wait() goroutine after w.CloseWithError).

@@ -173,28 +163,26 @@ func (b *BuildxEnv) dialStdio(ctx context.Context) error {
}

out := &connCloseWrapper{
- Conn: c2,
+ Conn: clientConn,
close: sync.OnceFunc(func() {
- // Send 2 interrupt signals to the process to ensure it exits gracefully
- // This is how buildx/docker plugins handle termination
-
- cmd.Process.Signal(os.Interrupt) //nolint:errcheck // We don't care about this error, we are going to send another one anyway
- if err := cmd.Process.Signal(os.Interrupt); err != nil {
- cmd.Process.Kill() //nolint:errcheck // Force kill if interrupt fails
- }
+ // Close the stdin/stdout pipe to the process.
+ // This causes stdin EOF in buildx's dial-stdio, which triggers
+ // closeWrite(conn) on the buildkit connection and should start
+ // the chain reaction for docker CLI process to exit.
+ dialStdioConn.Close()

select {
- case <-processDone:
+ case <-chWait:
case <-time.After(10 * time.Second):
- // If it still doesn't exit, force kill
- cmd.Process.Kill() //nolint:errcheck // Force kill if it doesn't exit after interrupt
+ // Safety net: force kill if still running.
+ cmd.Process.Kill() //nolint:errcheck
+ <-chWait
}
}),
}

return out, nil
}))

if err != nil {
return err
}
@@ -355,9 +343,7 @@ func (b *BuildxEnv) RunTest(ctx context.Context, t *testing.T, f TestFunc, opts
t.Fatalf("%+v", err)
}

- var (
- ch chan *client.SolveStatus
- )
+ var ch chan *client.SolveStatus

if cfg.SolveStatusFn != nil {
ch = make(chan *client.SolveStatus, 1)