
test/testenv: let Docker CLI shut down dial-stdio gracefully#982

Merged
cpuguy83 merged 2 commits into project-dalec:main from invidian:fix-leaking-dial-stdio
Mar 3, 2026

Contributor

@invidian invidian commented Mar 2, 2026

What this PR does / why we need it:

After looking into the buildx and Docker CLI plugin runtime implementations,
I eventually arrived at this code, which seems to properly clean up
the dial-stdio process when the test suite is interrupted or panics,
without using process groups or thread locking, both of which are OS-specific.

Integration tests now explicitly close the client connection, which makes
the dial-stdio process exit gracefully on its own, even without being sent
an interrupt signal. I think this is an elegant solution.

Which issue(s) this PR fixes (optional, using fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when the PR gets merged):
Closes #974

Special notes for your reviewer:

Tested with the following patch:

diff --git test/linux_target_test.go test/linux_target_test.go
index 83f856f..8a898a0 100644
--- test/linux_target_test.go
+++ test/linux_target_test.go
@@ -918,6 +918,11 @@ index 0000000..5260cb1
                                        withIgnoreCache(targets.IgnoreCacheKeyContainer),
                                )

+                               go func() {
+                                       <-time.After(5 * time.Second)
+                                       panic("hehe")
+                               }()
+
                                res := solveT(ctx, t, gwc, sr)

                                ops, err := test.LLBOpsFromState(ctx, resultToState(t, res))

The following loop was running alongside to count the running dial-stdio processes:

while true; do ps faux | grep 'docker-buildx buildx dial-stdio' | grep -v grep | wc -l; sleep 1; done

@invidian invidian requested a review from cpuguy83 March 2, 2026 20:04
This reverts commit 5fa6484.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
@invidian invidian force-pushed the fix-leaking-dial-stdio branch 2 times, most recently from 6e2a539 to 8a8ae24 Compare March 3, 2026 07:11
Closes project-dalec#974

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
@invidian invidian force-pushed the fix-leaking-dial-stdio branch from 8a8ae24 to 19aa2aa Compare March 3, 2026 07:11
@invidian invidian marked this pull request as ready for review March 3, 2026 07:13
Copilot AI review requested due to automatic review settings March 3, 2026 07:13

Copilot AI left a comment


Pull request overview

This PR updates the integration-test Buildx/BuildKit wiring to stop leaking docker buildx dial-stdio processes by letting the Docker CLI/plugin exit naturally when the client connection is closed, rather than relying on OS-specific process-group/thread handling.

Changes:

  • Add (*testenv.BuildxEnv).Close() to explicitly close the underlying BuildKit client and trigger dial-stdio cleanup.
  • Rework dialStdio lifecycle management to prefer pipe-closure driven shutdown, with a kill timeout as a safety net.
  • Ensure the test suite always closes the shared test environment via a defer in TestMain.
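The close-driven shutdown with a kill timeout described in this list can be sketched roughly as follows. This is a minimal illustration rather than the code from test/testenv/buildx.go: the stdioProc type, startStdioProc, and demo are hypothetical names, and cat stands in for docker buildx dial-stdio, assuming a Unix-like system where cat is on PATH.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os/exec"
	"time"
)

// stdioProc is a hypothetical wrapper around a child process whose
// stdin and stdout are bridged to an in-memory connection.
type stdioProc struct {
	cmd    *exec.Cmd
	client net.Conn      // the end handed to the caller
	done   chan struct{} // closed once cmd.Wait() has returned
	err    error         // result of cmd.Wait(), valid after done is closed
}

// startStdioProc starts the command with its stdin and stdout attached to one
// end of a net.Pipe and keeps the other end for the caller.
func startStdioProc(name string, args ...string) (*stdioProc, error) {
	client, child := net.Pipe()

	cmd := exec.Command(name, args...)
	cmd.Stdin = child
	cmd.Stdout = child

	if err := cmd.Start(); err != nil {
		_ = client.Close()
		_ = child.Close()
		return nil, err
	}

	p := &stdioProc{cmd: cmd, client: client, done: make(chan struct{})}
	go func() {
		p.err = cmd.Wait()
		_ = child.Close()
		close(p.done)
	}()
	return p, nil
}

// Close closes the client end, which the child observes as EOF on stdin, and
// waits for it to exit. If it is still running after timeout, it is killed.
func (p *stdioProc) Close(timeout time.Duration) (killed bool, err error) {
	_ = p.client.Close()

	select {
	case <-p.done:
		return false, p.err
	case <-time.After(timeout):
		_ = p.cmd.Process.Kill()
		<-p.done
		return true, errors.New("process ignored stdin EOF and was killed")
	}
}

// demo starts the stand-in process and shuts it down by closing the pipe only.
func demo() (killed bool, err error) {
	p, err := startStdioProc("cat")
	if err != nil {
		return false, err
	}
	return p.Close(5 * time.Second)
}

func main() {
	killed, err := demo()
	fmt.Println("killed:", killed, "err:", err)
}
```

No signal is sent on the normal path, so the sketch avoids process groups and other OS-specific handling; the kill is only reached if the child does not react to its stdin being closed.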

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| test/testenv/buildx.go | Adds BuildxEnv.Close() and changes dial-stdio teardown to close the stdio pipe instead of sending interrupts / using Pdeathsig. |
| test/main_test.go | Ensures testEnv.Close() is deferred so the dial-stdio process is cleaned up when the test run ends. |

Comment thread test/testenv/buildx.go
cmd.Stderr = ww

// processDone is closed when cmd.Wait() returns, signaling the cleanup
if err := cmd.Start(); err != nil {

Copilot AI Mar 3, 2026


If cmd.Start() fails, this returns immediately without closing the pipe/pipe ends created above (dialStdioConn/clientConn and the io.Pipe reader/writer). That can leave resources lingering on this error path; it’s safer to close those conns and the pipe writer/reader before returning the error (or set up defers before calling Start).

Suggested change
if err := cmd.Start(); err != nil {
if err := cmd.Start(); err != nil {
	// Clean up all pipes if the command fails to start.
	_ = dialStdioConn.Close()
	_ = clientConn.Close()
	_ = w.Close()
	_ = r.Close()
	return nil, err
}

Contributor Author


@copilot Those are in-memory pipes and should be deallocated once nothing references them, so explicit cleanup shouldn't be necessary, right?

Collaborator


I think we should do the cleanup here, since this is not in func main().

Contributor Author


Those are all in-memory and will be garbage-collected once the function returns the error, so there is no need for explicit cleanup. It would be different if we used, e.g., os.Pipe(), which allocates file descriptors.
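The trade-off in this thread can be illustrated with a small sketch. It assumes nothing about the real dialer beyond what is quoted above: failedStart and demo are hypothetical helpers, and the binary path is deliberately one that does not exist. net.Pipe ends hold no file descriptors, so leaving them unclosed leaks no OS resource; closing them only matters if some goroutine already holds the other end, because that reader then sees EOF instead of blocking.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"os"
	"os/exec"
	"time"
)

// failedStart simulates the error path: the command cannot be started, and
// the child-side pipe end is closed only when cleanup is true. It returns the
// start error and what a reader on the client end observes afterwards.
func failedStart(cleanup bool) (startErr, readErr error) {
	clientConn, dialStdioConn := net.Pipe() // in-memory, no file descriptors

	// Hypothetical path that is assumed not to exist.
	cmd := exec.Command("/nonexistent/hypothetical-binary")
	cmd.Stdin = dialStdioConn
	cmd.Stdout = dialStdioConn

	startErr = cmd.Start()
	if startErr != nil && cleanup {
		_ = dialStdioConn.Close()
	}

	// Bound the read so the demo terminates even without cleanup.
	_ = clientConn.SetReadDeadline(time.Now().Add(200 * time.Millisecond))
	_, readErr = clientConn.Read(make([]byte, 1))
	_ = clientConn.Close()
	return startErr, readErr
}

// demo reports whether the reader saw EOF with cleanup and a timeout without.
func demo() (eofWithCleanup, timeoutWithoutCleanup bool) {
	_, withCleanup := failedStart(true)
	_, withoutCleanup := failedStart(false)
	return errors.Is(withCleanup, io.EOF),
		errors.Is(withoutCleanup, os.ErrDeadlineExceeded)
}

func main() {
	eof, timedOut := demo()
	fmt.Println("with cleanup, reader sees EOF:", eof)
	fmt.Println("without cleanup, reader times out:", timedOut)
}
```

If the function returns nil and the error before anyone else can obtain clientConn, as in the snippet under review, neither outcome is observable and the garbage collector reclaims both ends.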

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread test/main_test.go
Comment on lines +99 to +103
defer func() {
	if err := testEnv.Close(); err != nil {
		fmt.Fprintln(os.Stderr, "Error closing test environment:", err)
	}
}()

Copilot AI Mar 3, 2026


This defer testEnv.Close() won’t run on the timeout path in the interrupt watcher goroutine, because that goroutine calls os.Exit(...) (which skips defers). If the suite hits that 30s timeout, dial-stdio can still be orphaned. Consider explicitly calling testEnv.Close() (and possibly tp.Shutdown) just before the os.Exit in the timeout path.
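One way to make every exit path run the same cleanup is sketched below. It is an assumption about how this could be wired, not the suite's actual code: the exiter type and demo are hypothetical, and the exit function is injected so the sketch can run without terminating the process (a real caller would pass os.Exit, and the cleanup would call testEnv.Close()).

```go
package main

import (
	"fmt"
	"sync"
)

// exiter funnels every exit path through the same cleanup. The type and its
// fields are hypothetical; exit is injectable so the sketch can be exercised.
type exiter struct {
	once    sync.Once
	cleanup func()
	exit    func(code int)
}

// Cleanup runs the cleanup function at most once, however many paths call it.
func (e *exiter) Cleanup() { e.once.Do(e.cleanup) }

// Exit runs the cleanup and then terminates. os.Exit skips deferred
// functions, so the cleanup has to happen explicitly before it.
func (e *exiter) Exit(code int) {
	e.Cleanup()
	e.exit(code)
}

// demo exercises both paths and reports how often the cleanup ran and which
// exit code was requested.
func demo() (cleanups int, code int) {
	e := &exiter{
		cleanup: func() { cleanups++ },
		exit:    func(c int) { code = c },
	}

	// Normal path: the deferred cleanup in the main test function.
	func() {
		defer e.Cleanup()
	}()

	// Timeout path: the watcher goroutine exits explicitly.
	e.Exit(2)
	return cleanups, code
}

func main() {
	cleanups, code := demo()
	fmt.Println("cleanups:", cleanups, "exit code:", code)
}
```

The sync.Once guard keeps the cleanup idempotent, so the deferred call and the timeout path can both invoke it safely.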

Comment thread test/testenv/buildx.go
Comment on lines +140 to 144
close(chWait)
dialStdioConn.Close()
// pkgerrors.Wrap will return nil if err is nil, otherwise it will give
// us a wrapped error with the buffered stderr from the command.
w.CloseWithError(pkgerrors.Wrapf(err, "%s", errBuf))

Copilot AI Mar 3, 2026


The stderr-drain logic can be accidentally defeated if the io.PipeReader (r) is closed as soon as the dialer returns (e.g. via a defer r.Close()), because that makes the exec stderr copy goroutine stop early and can let the child block once its stderr pipe buffer fills. Consider keeping r open for the lifetime of the process and only closing it after cmd.Wait() completes (e.g., close r from this cmd.Wait() goroutine after w.CloseWithError).
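The ordering suggested here (keep the read end open and drained until the producer has finished, and only then close it) can be sketched without a real child process. In this hypothetical helper, produce stands in for the child writing to stderr followed by cmd.Wait() returning; drainUntilDone and demo are illustrative names only.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// drainUntilDone copies everything written to the pipe into a buffer and
// keeps the read end open until produce has finished. It returns the number
// of bytes drained and the producer's error.
func drainUntilDone(produce func(w io.Writer) error) (int, error) {
	r, w := io.Pipe()

	var buf bytes.Buffer
	drained := make(chan struct{})
	go func() {
		defer close(drained)
		_, _ = io.Copy(&buf, r) // returns once the write end is closed
	}()

	err := produce(w)

	// Close the write end only after the producer is done, then wait for the
	// drain goroutine before closing the read end.
	_ = w.CloseWithError(err)
	<-drained
	_ = r.Close()
	return buf.Len(), err
}

// demo writes 1 MiB in 4 KiB chunks and reports how much was drained.
func demo() int {
	n, _ := drainUntilDone(func(w io.Writer) error {
		chunk := make([]byte, 4096)
		for i := 0; i < 256; i++ {
			if _, err := w.Write(chunk); err != nil {
				return err
			}
		}
		return nil
	})
	return n
}

func main() {
	fmt.Println("bytes drained:", demo())
}
```

Had the read end been closed before the producer finished, later writes to the pipe would fail with io.ErrClosedPipe, which is the situation the comment warns about.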

Comment thread test/testenv/buildx.go
Comment on lines +131 to +133
if err := cmd.Start(); err != nil {
	return nil, err
}

Copilot AI Mar 3, 2026


After cmd.Start() succeeds, the dialer no longer has any context-based cancellation path (since it intentionally avoids exec.CommandContext). If ctx is cancelled while waiting for the dial-stdio connection handshake (later in this function), the docker/buildx process can keep running and the dial can hang. Consider wiring ctx.Done() into the startup/handshake flow and ensuring the process is signaled/killed and cmd.Wait() is unblocked on cancellation.

Contributor Author


I'm investigating that.
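One possible shape for the cancellation path raised in this thread is sketched below, under the assumption that cancellation should only matter while the handshake is pending. killOnCancel and demo are hypothetical; in a real dialer, kill would close the connection and call cmd.Process.Kill() so that cmd.Wait() can return.

```go
package main

import (
	"context"
	"fmt"
)

// killOnCancel calls kill if ctx is cancelled before handshakeDone is closed.
// The returned channel reports whether kill was invoked.
func killOnCancel(ctx context.Context, handshakeDone <-chan struct{}, kill func()) <-chan bool {
	result := make(chan bool, 1)
	go func() {
		select {
		case <-ctx.Done():
			kill()
			result <- true
		case <-handshakeDone:
			// The connection is established; from here on the process is
			// shut down by closing the connection instead.
			result <- false
		}
	}()
	return result
}

// demo covers both orderings of cancellation and handshake completion.
func demo() (killedOnCancel, killedAfterHandshake bool) {
	// Case 1: the context is cancelled while the handshake is still pending.
	ctx1, cancel1 := context.WithCancel(context.Background())
	pending := make(chan struct{})
	res1 := killOnCancel(ctx1, pending, func() {})
	cancel1()
	killedOnCancel = <-res1

	// Case 2: the handshake completes first; a later cancellation is ignored.
	ctx2, cancel2 := context.WithCancel(context.Background())
	done := make(chan struct{})
	res2 := killOnCancel(ctx2, done, func() {})
	close(done)
	killedAfterHandshake = <-res2
	cancel2()

	return killedOnCancel, killedAfterHandshake
}

func main() {
	onCancel, afterHandshake := demo()
	fmt.Println("killed on cancel:", onCancel)
	fmt.Println("killed after handshake:", afterHandshake)
}
```

Unlike exec.CommandContext, this ties the process to the context only for the startup window, which keeps the pipe-closure shutdown as the normal teardown path.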

@cpuguy83 cpuguy83 merged commit 9a57e74 into project-dalec:main Mar 3, 2026
100 of 109 checks passed
@invidian invidian deleted the fix-leaking-dial-stdio branch March 4, 2026 06:33

Development

Successfully merging this pull request may close these issues.

[BUG] docker-buildx buildx dial-stdio --progress=plain processes leaks from integration tests when suite is interrupted

3 participants