test/testenv: let Docker CLI shut down dial-stdio gracefully#982
test/testenv: let Docker CLI shut down dial-stdio gracefully#982cpuguy83 merged 2 commits intoproject-dalec:mainfrom
Conversation
This reverts commit 5fa6484. Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
6e2a539 to
8a8ae24
Compare
After looking into buildx and Docker CLI plugin runtime implementation, eventually I got to this code, which seems to be properly cleaning up the dial-stdio process when test suite gets interrupted or panics, without using process group and threads locking, which are OS specific. Integration tests now explicitly close client connection which makes dial-stdio process to exit on its own gracefully, even without passing the interrupt signal, which I think is an elegant solution. Closes project-dalec#974 Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
8a8ae24 to
19aa2aa
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates the integration-test Buildx/BuildKit wiring to stop leaking docker buildx dial-stdio processes by letting the Docker CLI/plugin exit naturally when the client connection is closed, rather than relying on OS-specific process-group/thread handling.
Changes:
- Add
(*testenv.BuildxEnv).Close()to explicitly close the underlying BuildKit client and triggerdial-stdiocleanup. - Rework
dialStdiolifecycle management to prefer pipe-closure driven shutdown, with a kill timeout as a safety net. - Ensure the test suite always closes the shared test environment via a
deferinTestMain.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
test/testenv/buildx.go |
Adds BuildxEnv.Close() and changes dial-stdio teardown to close the stdio pipe instead of sending interrupts / using Pdeathsig. |
test/main_test.go |
Ensures testEnv.Close() is deferred so the dial-stdio process is cleaned up when the test run ends. |
| cmd.Stderr = ww | ||
|
|
||
| // processDone is closed when cmd.Wait() returns, signaling the cleanup | ||
| if err := cmd.Start(); err != nil { |
There was a problem hiding this comment.
If cmd.Start() fails, this returns immediately without closing the pipe/pipe ends created above (dialStdioConn/clientConn and the io.Pipe reader/writer). That can leave resources lingering on this error path; it’s safer to close those conns and the pipe writer/reader before returning the error (or set up defers before calling Start).
| if err := cmd.Start(); err != nil { | |
| if err := cmd.Start(); err != nil { | |
| // Clean up all pipes if the command fails to start. | |
| _ = dialStdioConn.Close() | |
| _ = clientConn.Close() | |
| _ = w.Close() | |
| _ = r.Close() |
There was a problem hiding this comment.
@copilot code review[agent] those are in-memory pipes and should get deallocated once there is no reference to them, so the explicit cleanup shouldn't be necessary?
There was a problem hiding this comment.
I think we should do the cleanup here since this is not in func main()
There was a problem hiding this comment.
Those are all in-memory and will be deallocated on error, there is no need for explicit cleanup. It would be different if we would use e.g. os.Pipe().
| defer func() { | ||
| if err := testEnv.Close(); err != nil { | ||
| fmt.Fprintln(os.Stderr, "Error closing test environment:", err) | ||
| } | ||
| }() |
There was a problem hiding this comment.
This defer testEnv.Close() won’t run on the timeout path in the interrupt watcher goroutine, because that goroutine calls os.Exit(...) (which skips defers). If the suite hits that 30s timeout, dial-stdio can still be orphaned. Consider explicitly calling testEnv.Close() (and possibly tp.Shutdown) just before the os.Exit in the timeout path.
| close(chWait) | ||
| dialStdioConn.Close() | ||
| // pkgerrors.Wrap will return nil if err is nil, otherwise it will give | ||
| // us a wrapped error with the buffered stderr from the command. | ||
| w.CloseWithError(pkgerrors.Wrapf(err, "%s", errBuf)) |
There was a problem hiding this comment.
The stderr-drain logic can be accidentally defeated if the io.PipeReader (r) is closed as soon as the dialer returns (e.g. via a defer r.Close()), because that makes the exec stderr copy goroutine stop early and can let the child block once its stderr pipe buffer fills. Consider keeping r open for the lifetime of the process and only closing it after cmd.Wait() completes (e.g., close r from this cmd.Wait() goroutine after w.CloseWithError).
| if err := cmd.Start(); err != nil { | ||
| return nil, err | ||
| } |
There was a problem hiding this comment.
After cmd.Start() succeeds, the dialer no longer has any context-based cancellation path (since it intentionally avoids exec.CommandContext). If ctx is cancelled while waiting for the dial-stdio connection handshake (later in this function), the docker/buildx process can keep running and the dial can hang. Consider wiring ctx.Done() into the startup/handshake flow and ensuring the process is signaled/killed and cmd.Wait() is unblocked on cancellation.
There was a problem hiding this comment.
I'm investigating that.
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, using
fixes #<issue number>(, fixes #<issue_number>, ...)format, will close the issue(s) when the PR gets merged):Closes #974
Special notes for your reviewer:
Tested with following patch:
and loop running
while true; do ps faux | grep 'docker-buildx buildx dial-stdio' | grep -v grep | wc -l; sleep 1; doneto count the processes running.