Skip to content

Conversation

@istyf
Copy link
Contributor

@istyf istyf commented Nov 1, 2024

This PR fixes #671 by replacing the time.Sleep in killCmd (for linux and unix/mac) with a cancelable context that can be used to stop the kill timer and return to the caller as soon as the interrupted process has closed down.

While developing this PR I took the liberty of deleting the unit test for testing killing of "non existing" processes for reasons outlined in 52428bf.

This PR has been tested on wsl2 linux and an Apple M2 MBP.

istyf added 5 commits November 1, 2024 16:39
if the process to be killed honours the SIGINT and closes down,
we should not sleep before trying to kill a process that is
no longer there
the test paniced if it ever got a nil err since t.Failf does not
imply that test is stopped

selecting a random pid to kill might have interesting side effects
when run and its result can never be guaranteed to be the same, meaning
that the test had intermittent failures

no code depends on the specific return code of killCmd, so there
does not seem to be a reason to have a unit test for it
@istyf
Copy link
Contributor Author

istyf commented Nov 10, 2025

@xiantang What is the status/probability of accepting this PR?

@xiantang
Copy link
Collaborator

could u rebase to master?

@xiantang
Copy link
Collaborator

Because I'm doing #809, I think there may be conflicts.

@xiantang
Copy link
Collaborator

I need to merge #809 to master first, because once I remove the pty, Linux behavior will be broken, and it cannot clean child processes correctly.

@xiantang xiantang requested a review from Copilot November 10, 2025 15:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the killCmd function in Unix and Linux implementations to use goroutines and channels for asynchronous process termination, replacing a blocking approach with concurrent handling of process waiting and killing with configurable delays.

Key changes:

  • Introduces goroutine-based concurrent handling of process wait and kill operations
  • Adds configurable kill delay support when SendInterrupt is enabled
  • Removes Test_killCmd_no_process test case

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
runner/util_unix.go Refactored killCmd to use goroutines for concurrent wait/kill with delay support
runner/util_linux.go Identical refactoring of killCmd for Linux platform
runner/util_test.go Removed Test_killCmd_no_process test and unused imports
Comments suppressed due to low confidence (2)

runner/util_unix.go:1

  • The kill goroutine can leak when killDelay is 0 and the process exits quickly. When killDelay is 0, time.After(0) fires immediately, but if the process has already exited and waitResult is read first in the main loop (line 64), the context is canceled via defer at line 40. However, the kill goroutine may have already sent to killResult (line 48) before reading ctx.Done(), causing it to block forever if nothing reads from killResult. Consider closing the killResult channel with defer close(killResult) or making it buffered: killResult := make(chan error, 1).
//go:build unix && !linux

runner/util_linux.go:1

  • The kill goroutine can leak when killDelay is 0 and the process exits quickly. When killDelay is 0, time.After(0) fires immediately, but if the process has already exited and waitResult is read first in the main loop (line 63), the context is canceled via defer at line 39. However, the kill goroutine may have already sent to killResult (line 47) before reading ctx.Done(), causing it to block forever if nothing reads from killResult. Consider closing the killResult channel with defer close(killResult) or making it buffered: killResult := make(chan error, 1).
package runner

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 59 to 76
for {
// collect the responses from the kill and wait goroutines
select {
case err = <-killResult:
results = append(results, err)
case err = <-waitResult:
results = append(results, err)
// if we have a kill delay, we ignore the kill result
if killDelay > 0 && len(results) == 1 {
results = append(results, nil)
}
}

if len(results) == 2 {
err = errors.Join(results...)
return
}
}
Copy link

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When killDelay is 0, the kill goroutine sends immediately to killResult and the select statement may non-deterministically choose between reading from killResult or waitResult. If killResult is read first (line 62), the code appends the error and continues looping. Then when waitResult is read (line 64), the logic at lines 67-69 will not execute because killDelay is 0, leaving len(results) at 1 and causing an infinite loop. The condition at line 67 should be if killDelay == 0 && len(results) == 1 to handle the case when there is no kill delay.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same incorrect assumption about an error here as in util_linux and the proposed "fix" does not work.

Comment on lines 58 to 75
for {
// collect the responses from the kill and wait goroutines
select {
case err = <-killResult:
results = append(results, err)
case err = <-waitResult:
results = append(results, err)
// if we have a kill delay, we ignore the kill result
if killDelay > 0 && len(results) == 1 {
results = append(results, nil)
}
}

if len(results) == 2 {
err = errors.Join(results...)
return
}
}
Copy link

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When killDelay is 0, the kill goroutine sends immediately to killResult and the select statement may non-deterministically choose between reading from killResult or waitResult. If killResult is read first (line 61), the code appends the error and continues looping. Then when waitResult is read (line 63), the logic at lines 66-68 will not execute because killDelay is 0, leaving len(results) at 1 and causing an infinite loop. The condition at line 66 should be if killDelay == 0 && len(results) == 1 to handle the case when there is no kill delay.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When killDelay is 0, the kill goroutine sends immediately to killResult and the select statement may non-deterministically choose between reading from killResult or waitResult. If killResult is read first (line 61), the code appends the error and continues looping. Then when waitResult is read (line 63), the logic at lines 66-68 will not execute because killDelay is 0, leaving len(results) at 1 and causing an infinite loop. The condition at line 66 should be if killDelay == 0 && len(results) == 1 to handle the case when there is no kill delay.

@istyf did u test this case?
this pr totally ok, but for this code block i dont think the impl is neat

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you basing that on Copilots incorrect assumption that there is a bug here?

"Then when waitResult is read (line 63), the logic at lines 66-68 will not execute because killDelay is 0, leaving len(results) at 1 and causing an infinite loop." is wrong. What does it think happens at line 64? We add the wait result (which is always nil by design) to the results slice, causing an exit on line 73.

@istyf
Copy link
Contributor Author

istyf commented Nov 10, 2025

I need to merge #809 to master first, because once I remove the pty, Linux behavior will be broken, and it cannot clean child processes correctly.

I do not understand. Why would you merge something that breaks Linux behaviour?

@xiantang
Copy link
Collaborator

Oh, I see, you just changed the killcmd code, it should be okay. because for linux is still using pty to start a new process

@codecov
Copy link

codecov bot commented Nov 10, 2025

Codecov Report

❌ Patch coverage is 91.48936% with 4 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
runner/util_linux.go 91.30% 1 Missing and 1 partial ⚠️
runner/util_unix.go 91.66% 1 Missing and 1 partial ⚠️
Files with missing lines Coverage Δ
runner/util_linux.go 67.00% <91.30%> (-1.00%) ⬇️
runner/util_unix.go 80.95% <91.66%> (+12.95%) ⬆️

... and 6 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor Author

@istyf istyf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@xiantang
Copy link
Collaborator

Hello @istyf, could you help resolve the conflict?

@istyf
Copy link
Contributor Author

istyf commented Nov 12, 2025

Upstream merged into PR and conflict resolved. However this has now not been tested on my side.

@xiantang
Copy link
Collaborator

Okay, please test this feature again as well, thanks.

…rly exit optimization)

Add three comprehensive regression tests to verify the fix in commit 4d26204:

1. Test_killCmd_SendInterrupt_FastGracefulExit
   - Verifies processes that exit quickly on SIGINT return immediately
   - Saves ~2s when process exits in <1ms vs 2s kill_delay

2. Test_killCmd_SendInterrupt_IgnoresSIGINT
   - Verifies processes ignoring SIGINT still get SIGKILL after kill_delay
   - Ensures optimization doesn't break fallback behavior

3. Test_killCmd_SendInterrupt_SlowGracefulExit
   - Verifies processes that take time to cleanup still benefit
   - Saves ~700ms when process exits in 300ms vs 1s kill_delay

These tests ensure the goroutine-based optimization continues to work
correctly and prevent future regressions.

Related: air-verse#671
@xiantang xiantang force-pushed the fix/only-kill-after-interrupt-when-required branch from 5885492 to 0da8425 Compare January 5, 2026 15:09
return
}
time.Sleep(e.config.killDelay())
// Start a goroutine to wait for the process to exit
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@istyf hello, i think this implement is more neat.

@xiantang xiantang merged commit 6f43af7 into air-verse:master Jan 5, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Do not sleep the full kill_delay when using send_interrupt and the current application shuts down by itself

2 participants