test/testenv: run dial-stdio with Pdeathsig and process group #980
cpuguy83 merged 1 commit into project-dalec:main
That should ensure we don't leave orphaned dial-stdio processes if the test suite is interrupted or crashes, and that all processes are cleaned up properly when the test finishes. This approach is easier than explicit cleanup, since we delegate the cleanup to the OS.

Closes project-dalec#974

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
Pull request overview
This PR updates the integration-test Buildx “dial-stdio” launcher to rely more on OS-level process cleanup semantics (process groups / parent-death signal) so interrupted or crashing test runs don’t leave orphaned docker buildx dial-stdio processes behind.
Changes:
- Configure the spawned `docker buildx dial-stdio` process with a new process group and `Pdeathsig`.
- Restructure process start/wait handling to signal completion via a channel and propagate start errors through the stderr pipe.
- Add `runtime.LockOSThread()` around the child process lifetime (intended to support `Pdeathsig` behavior).
    cmd.SysProcAttr = &syscall.SysProcAttr{
        // Put the child in its own process group so we can kill the entire
        // group (docker + docker-buildx plugin) during cleanup.
        Setpgid: true,
        // Send SIGTERM to the child process when the parent (test process) dies.
`syscall.SysProcAttr{Setpgid: ..., Pdeathsig: ...}` is Linux/Unix-specific and this file has no build tags, so `go test ./...` will fail to compile on platforms where `SysProcAttr` lacks these fields (e.g., darwin/windows). Consider moving the `SysProcAttr` setup behind per-OS build-tagged helpers (no-op on unsupported OSes) or gating the entire implementation to Linux if these tests are Linux-only.
    select {
    case <-chWait:
    case <-processDone:
    case <-time.After(10 * time.Second):
        // If it still doesn't exit, force kill
        cmd.Process.Kill() //nolint:errcheck // Force kill if it doesn't exit after interrupt
The code sets `Setpgid: true` (and the comment says cleanup will kill the whole process group), but cleanup only signals `cmd.Process` and never targets the process group. If the intent is to avoid orphaned buildx plugin subprocesses, consider signaling/killing the process group (e.g., via a negative PID on platforms that support it) or dropping `Setpgid` if it's not used.
        // Send SIGTERM to the child process when the parent (test process) dies.
        // This prevents dial-stdio processes from being orphaned when the test
        // suite is interrupted or crashes.
        Pdeathsig: syscall.SIGTERM,
I realized this makes it so we can't invoke tests from non-Linux. We should at least gate this. But also definitely not a full solution.
What this PR does / why we need it:
That should ensure we don't leave orphaned dial-stdio processes if the test suite is interrupted or crashes, and that all processes are cleaned up properly when the test finishes.
This approach is easier than explicit cleanup, since we delegate the cleanup to the OS.
Which issue(s) this PR fixes (optional, using `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when the PR gets merged):
Closes #974
Special notes for your reviewer:
AI generated patch with manual testing. I am not familiar with nuances of `Pdeathsig` and `runtime.LockOSThread()`, but it seems to be addressing my issue.