(torchx/local_scheduler) Use os.kill instead of os.killpg when sending SIGTERM to the replica pid. Add runner.wait() for torchx.runner.test.api_test#test_empty_session_id to gracefully wait for the replicas to finish running #1062

kiukchung · 2025-05-05T20:05:09Z

Summary:
Fixes macOS unittest failures: https://github.com/pytorch/torchx/actions/runs/14844253481/job/41674243877

When looking into the test failure I noticed two things:

1. `local_scheduler` was trying to SIGTERM the process group by passing the replica's pid: `os.killpg(replica.pid, signal.SIGTERM)`. Changed it to call `os.kill`. (Note that `os.killpg` is not available on iOS, which is why the test was failing.)
2. The `torchx.runner.test.api_test.test_empty_session_id()` test case doesn't wait for the `echo` test command to finish. This created a race condition: in certain cases the runner's `__exit__()` SIGTERMs the replica pids, and because `local_scheduler` was (wrongfully) using `os.killpg` rather than `os.kill`, it threw an uncaught error on iOS.
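For reference, the two points above can be sketched with the standard library alone. This is a minimal illustration, not the torchx implementation: `spawn_sleeper` and `terminate` are hypothetical helpers, and the sketch assumes a POSIX platform where both `os.kill` and `os.killpg` exist. It shows that `os.killpg(pid, ...)` only addresses the intended target when the child is its own process group leader, and that a process which has already exited should be waited on rather than signalled.

```python
import os
import signal
import subprocess
import sys


def spawn_sleeper(new_session: bool) -> subprocess.Popen:
    """Hypothetical helper: start a long-running child process.

    With start_new_session=True the child calls setsid(), so it becomes
    the leader of a new process group whose id equals its own pid.
    """
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=new_session,
    )


def terminate(proc: subprocess.Popen, whole_group: bool) -> int:
    """Hypothetical helper: SIGTERM a child and return its exit code.

    whole_group=True uses os.killpg, which is only meaningful when the
    child is a process group leader (pgid == pid).
    """
    if proc.poll() is not None:
        # Already finished: nothing to signal, just report the exit code.
        return proc.returncode
    try:
        if whole_group:
            os.killpg(proc.pid, signal.SIGTERM)
        else:
            os.kill(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        # The child exited between poll() and the signal call.
        pass
    return proc.wait(timeout=10)


if __name__ == "__main__":
    leader = spawn_sleeper(new_session=True)
    print("pgid == pid:", os.getpgid(leader.pid) == leader.pid)
    print("killpg exit code:", terminate(leader, whole_group=True))

    plain = spawn_sleeper(new_session=False)
    print("kill exit code:", terminate(plain, whole_group=False))
```

Waiting for the child before the cleanup path runs, as the added `runner.wait()` does in the test, means the signalling branch is never reached for a command that has already finished.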

Differential Revision: D74197282


facebook-github-bot · 2025-05-05T20:05:20Z

This pull request was exported from Phabricator. Differential Revision: D74197282

highker

I just want to try again

Summary: Reverting the os.killpg -> os.kill part of #1062. Reviewed By: d4l3k Differential Revision: D74205288


facebook-github-bot added the CLA Signed label May 5, 2025

facebook-github-bot added the fb-exported label May 5, 2025

highker approved these changes May 5, 2025


kiukchung mentioned this pull request May 5, 2025

(torchx/local_scheduler) Use os.kill instead of os.killpg when sending SIGTERM to the replica pid. Add runner.wait() for torchx.runner.test.api_test#test_empty_session_id to gracefully wait for the replicas to finish running #1063

Closed

highker approved these changes May 5, 2025


facebook-github-bot merged commit 8216ac4 into pytorch:main May 5, 2025
24 checks passed

facebook-github-bot pushed a commit that referenced this pull request May 5, 2025

(torchx/local_scheduler) go back to using os.killpg in local_scheduler

10cc6f4

Summary: Reverting the os.killpg -> os.kill part of #1062. Reviewed By: d4l3k Differential Revision: D74205288

kiukchung mentioned this pull request May 5, 2025

(torchx/local_scheduler) go back to using os.killpg in local_scheduler #1064

Merged

facebook-github-bot pushed a commit that referenced this pull request May 6, 2025

(torchx/local_scheduler) go back to using os.killpg in local_scheduler (#1064)

363dfc5

Summary: Reverting the os.killpg -> os.kill part of #1062. Reviewed By: d4l3k Differential Revision: D74205288
