-
-
Notifications
You must be signed in to change notification settings - Fork 905
do not sleep to kill a process that has already cleaned up and exited by itself #672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
do not sleep to kill a process that has already cleaned up and exited by itself #672
Conversation
if the process to be killed honours the SIGINT and closes down, we should not sleep before trying to kill a process that is no longer there
the test paniced if it ever got a nil err since t.Failf does not imply that test is stopped selecting a random pid to kill might have interesting side effects when run and its result can never be guaranteed to be the same, meaning that the test had intermittent failures no code depends on the specific return code of killCmd, so there does not seem to be a reason to have a unit test for it
|
@xiantang What is the status/probability of accepting this PR? |
|
could u rebase to master? |
|
Because I'm doing #809, I think there may be conflicts. |
|
I need to merge #809 to master first, because once I remove the pty, Linux behavior will be broken, and it cannot clean child processes correctly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the killCmd function in Unix and Linux implementations to use goroutines and channels for asynchronous process termination, replacing a blocking approach with concurrent handling of process waiting and killing with configurable delays.
Key changes:
- Introduces goroutine-based concurrent handling of process wait and kill operations
- Adds configurable kill delay support when
SendInterruptis enabled - Removes
Test_killCmd_no_processtest case
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| runner/util_unix.go | Refactored killCmd to use goroutines for concurrent wait/kill with delay support |
| runner/util_linux.go | Identical refactoring of killCmd for Linux platform |
| runner/util_test.go | Removed Test_killCmd_no_process test and unused imports |
Comments suppressed due to low confidence (2)
runner/util_unix.go:1
- The kill goroutine can leak when
killDelayis 0 and the process exits quickly. WhenkillDelayis 0,time.After(0)fires immediately, but if the process has already exited andwaitResultis read first in the main loop (line 64), the context is canceled via defer at line 40. However, the kill goroutine may have already sent tokillResult(line 48) before readingctx.Done(), causing it to block forever if nothing reads fromkillResult. Consider closing the killResult channel withdefer close(killResult)or making it buffered:killResult := make(chan error, 1).
//go:build unix && !linux
runner/util_linux.go:1
- The kill goroutine can leak when
killDelayis 0 and the process exits quickly. WhenkillDelayis 0,time.After(0)fires immediately, but if the process has already exited andwaitResultis read first in the main loop (line 63), the context is canceled via defer at line 39. However, the kill goroutine may have already sent tokillResult(line 47) before readingctx.Done(), causing it to block forever if nothing reads fromkillResult. Consider closing the killResult channel withdefer close(killResult)or making it buffered:killResult := make(chan error, 1).
package runner
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
runner/util_unix.go
Outdated
| for { | ||
| // collect the responses from the kill and wait goroutines | ||
| select { | ||
| case err = <-killResult: | ||
| results = append(results, err) | ||
| case err = <-waitResult: | ||
| results = append(results, err) | ||
| // if we have a kill delay, we ignore the kill result | ||
| if killDelay > 0 && len(results) == 1 { | ||
| results = append(results, nil) | ||
| } | ||
| } | ||
|
|
||
| if len(results) == 2 { | ||
| err = errors.Join(results...) | ||
| return | ||
| } | ||
| } |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When killDelay is 0, the kill goroutine sends immediately to killResult and the select statement may non-deterministically choose between reading from killResult or waitResult. If killResult is read first (line 62), the code appends the error and continues looping. Then when waitResult is read (line 64), the logic at lines 67-69 will not execute because killDelay is 0, leaving len(results) at 1 and causing an infinite loop. The condition at line 67 should be if killDelay == 0 && len(results) == 1 to handle the case when there is no kill delay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same incorrect assumption about an error here as in util_linux and the proposed "fix" does not work.
runner/util_linux.go
Outdated
| for { | ||
| // collect the responses from the kill and wait goroutines | ||
| select { | ||
| case err = <-killResult: | ||
| results = append(results, err) | ||
| case err = <-waitResult: | ||
| results = append(results, err) | ||
| // if we have a kill delay, we ignore the kill result | ||
| if killDelay > 0 && len(results) == 1 { | ||
| results = append(results, nil) | ||
| } | ||
| } | ||
|
|
||
| if len(results) == 2 { | ||
| err = errors.Join(results...) | ||
| return | ||
| } | ||
| } |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When killDelay is 0, the kill goroutine sends immediately to killResult and the select statement may non-deterministically choose between reading from killResult or waitResult. If killResult is read first (line 61), the code appends the error and continues looping. Then when waitResult is read (line 63), the logic at lines 66-68 will not execute because killDelay is 0, leaving len(results) at 1 and causing an infinite loop. The condition at line 66 should be if killDelay == 0 && len(results) == 1 to handle the case when there is no kill delay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When
killDelayis 0, the kill goroutine sends immediately tokillResultand the select statement may non-deterministically choose between reading fromkillResultorwaitResult. IfkillResultis read first (line 61), the code appends the error and continues looping. Then whenwaitResultis read (line 63), the logic at lines 66-68 will not execute becausekillDelayis 0, leavinglen(results)at 1 and causing an infinite loop. The condition at line 66 should beif killDelay == 0 && len(results) == 1to handle the case when there is no kill delay.
@istyf did u test this case?
this pr totally ok, but for this code block i dont think the impl is neat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you basing that on Copilots incorrect assumption that there is a bug here?
"Then when waitResult is read (line 63), the logic at lines 66-68 will not execute because killDelay is 0, leaving len(results) at 1 and causing an infinite loop." is wrong. What does it think happens at line 64? We add the wait result (which is always nil by design) to the results slice, causing an exit on line 73.
I do not understand. Why would you merge something that breaks Linux behaviour? |
|
Oh, I see, you just changed the killcmd code, it should be okay. because for linux is still using pty to start a new process |
Codecov Report❌ Patch coverage is
... and 6 files with indirect coverage changes 🚀 New features to boost your workflow:
|
istyf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
|
Hello @istyf, could you help resolve the conflict? |
|
Upstream merged into PR and conflict resolved. However this has now not been tested on my side. |
|
Okay, please test this feature again as well, thanks. |
…rly exit optimization) Add three comprehensive regression tests to verify the fix in commit 4d26204: 1. Test_killCmd_SendInterrupt_FastGracefulExit - Verifies processes that exit quickly on SIGINT return immediately - Saves ~2s when process exits in <1ms vs 2s kill_delay 2. Test_killCmd_SendInterrupt_IgnoresSIGINT - Verifies processes ignoring SIGINT still get SIGKILL after kill_delay - Ensures optimization doesn't break fallback behavior 3. Test_killCmd_SendInterrupt_SlowGracefulExit - Verifies processes that take time to cleanup still benefit - Saves ~700ms when process exits in 300ms vs 1s kill_delay These tests ensure the goroutine-based optimization continues to work correctly and prevent future regressions. Related: air-verse#671
5885492 to
0da8425
Compare
| return | ||
| } | ||
| time.Sleep(e.config.killDelay()) | ||
| // Start a goroutine to wait for the process to exit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@istyf hello, i think this implement is more neat.
This PR fixes #671 by replacing the time.Sleep in killCmd (for linux and unix/mac) with a cancelable context that can be used to stop the kill timer and return to the caller as soon as the interrupted process has closed down.
While developing this PR I took the liberty of deleting the unit test for testing killing of "non existing" processes for reasons outlined in 52428bf.
This PR has been tested on wsl2 linux and an Apple M2 MBP.