[fray] Use graceful actor termination to avoid Ray task_manager assertion#4106
Conversation
ray.kill() races with task completion callbacks in Ray's C++ task_manager, triggering a fatal assertion (ray-project/ray#54260) that crashes the process. Switch to __ray_terminate__, which queues behind pending tasks and escalates to force-kill after 30s. Fixes flaky CI failures in the integration test where the zephyr html-to-md step succeeds but the process crashes during actor cleanup:

    task_manager.cc:983: Check failed: it != submissible_tasks_.end()
    Tried to complete task that was not pending
Claude finished @rjpower's task in 43s — Review: PR #4106 — Graceful actor termination
Switches RayActorGroup.shutdown() from ray.kill() to __ray_terminate__. One potential bug: if actors were created via name-based handles that haven't been resolved yet, handle._actor_ref may still be None; calling handle._resolve().__ray_terminate__.remote() would be safer. This is a minor robustness concern, not a blocker. Verdict: LGTM. The fix is correct for the stated problem and matches the recommended Ray pattern for graceful shutdown.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3ddaca385d
      for handle in self._handles:
          try:
    -         ray.kill(handle._actor_ref)
    +         handle._actor_ref.__ray_terminate__.remote()
Add force-kill fallback after graceful terminate
__ray_terminate__.remote() only enqueues a termination task, so if an actor is currently running a long or blocked method, the terminate task may never execute and shutdown() returns while the actor is still alive. In this path, the group can keep consuming resources indefinitely and actor names remain occupied, which is a regression from the previous immediate ray.kill(...) behavior. Please wait on the terminate refs and fall back to ray.kill on timeout (as Ray’s own worker-group shutdown does) so shutdown reliably terminates actors.
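The pattern Codex describes (graceful stop, bounded wait, force-kill fallback) can be sketched in plain Python with no Ray dependency. `GracefulWorker`, `stop()`, and `force_kill()` are hypothetical stand-ins for the actor handle, `__ray_terminate__.remote()`, and `ray.kill()`:

```python
import threading
import time

class GracefulWorker:
    """Stand-in for a Ray actor: stop() is graceful, force_kill() is immediate."""
    def __init__(self, hung: bool):
        self._hung = hung              # a hung worker never finishes a graceful stop
        self._done = threading.Event()

    def stop(self) -> threading.Event:
        # Graceful path, analogous to actor.__ray_terminate__.remote():
        # returns something the caller can wait on.
        if not self._hung:
            self._done.set()
        return self._done

    def force_kill(self) -> None:
        # Escalation path, analogous to ray.kill(actor).
        self._done.set()

def shutdown(workers, timeout_s: float) -> int:
    """Ask every worker to stop, then force-kill any still alive at the
    deadline. Returns how many workers needed the force kill."""
    pending = [(w, w.stop()) for w in workers]
    deadline = time.monotonic() + timeout_s
    force_killed = 0
    for worker, done in pending:
        remaining = max(0.0, deadline - time.monotonic())
        if not done.wait(remaining):   # analogous to ray.wait(refs, timeout=...)
            worker.force_kill()
            force_killed += 1
    return force_killed

workers = [GracefulWorker(hung=False), GracefulWorker(hung=True)]
print(shutdown(workers, timeout_s=0.2))  # prints 1: only the hung worker escalates
```

The key property is the shared deadline: the wait budget is spent once across all workers, so shutdown completes within roughly `timeout_s` regardless of how many workers are stuck.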
@claude should we use https://docs.ray.io/en/latest/ray-core/actors/terminating-actors.html#actor-cleanup-with-ray-shutdown ray_shutdown instead, which auto-switches to kill after 60 seconds? Review the other PR feedback and fix/respond as well.
Claude finished @rjpower's task in 5m 40s — Investigating
| Method | Behavior |
|---|---|
| `ray.kill(actor)` | Immediate force kill (causes the task_manager race) |
| `actor.__ray_terminate__.remote()` | Enqueues a graceful termination task |
So `__ray_terminate__` is the right call for initiating graceful shutdown. If you want cleanup logic in the actor, you'd add a `__ray_shutdown__` method to `_RayActorHostBase` — but that's orthogonal to this fix.
Changes pushed (addresses both Codex and prior Claude review):

- Force-kill fallback (Codex feedback): now collects the `__ray_terminate__` object refs, waits on them with `ray.wait(..., timeout=30s)`, and force-kills via `ray.kill()` any actors that didn't terminate in time. This guarantees actors are cleaned up even if they're blocked.
- `_resolve()` safety (prior Claude review): switched from `handle._actor_ref` to `handle._resolve()` so name-based handles that haven't been resolved yet work correctly instead of silently failing on `None`.
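Putting the two review points together, the final `shutdown()` plausibly looks like the sketch below. `StubActor` and `StubRay` replace the real Ray API so the snippet runs standalone; `_handles`, `_resolve()`, and the 30s timeout come from this thread, everything else is assumed:

```python
from types import SimpleNamespace

class StubActor:
    """Stand-in for a Ray actor; `responsive` actors finish __ray_terminate__."""
    def __init__(self, responsive: bool):
        self.responsive = responsive
        # Mimic actor.__ray_terminate__.remote() returning an "object ref".
        self.__ray_terminate__ = SimpleNamespace(
            remote=lambda: {"actor": self, "done": responsive}
        )

class StubRay:
    """Stand-in for the two Ray calls used: ray.wait and ray.kill."""
    def __init__(self):
        self.killed = []
    def wait(self, refs, num_returns, timeout):
        # Real ray.wait blocks up to `timeout`; the stub just partitions refs.
        return ([r for r in refs if r["done"]],
                [r for r in refs if not r["done"]])
    def kill(self, actor):
        self.killed.append(actor)

ray = StubRay()

class RayActorGroup:
    TERMINATE_TIMEOUT_S = 30.0

    def __init__(self, handles):
        self._handles = handles  # each handle._resolve() -> actor

    def shutdown(self):
        # 1. Graceful: enqueue __ray_terminate__ on every resolved actor.
        #    _resolve() covers name-based handles whose _actor_ref is still None.
        refs = [h._resolve().__ray_terminate__.remote() for h in self._handles]
        # 2. Wait up to the timeout for all terminate tasks to complete.
        _, not_ready = ray.wait(refs, num_returns=len(refs),
                                timeout=self.TERMINATE_TIMEOUT_S)
        # 3. Escalate: force-kill any actor whose terminate task never ran.
        for ref in not_ready:
            ray.kill(ref["actor"])

good = StubActor(responsive=True)
stuck = StubActor(responsive=False)
handles = [SimpleNamespace(_resolve=lambda a=good: a),
           SimpleNamespace(_resolve=lambda a=stuck: a)]
RayActorGroup(handles).shutdown()
print(len(ray.killed))  # prints 1: only the stuck actor is force-killed
```

This is the control flow only; the real implementation would `import ray` and hold genuine actor handles.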
Testing: Pre-commit passes. `test_v2_actor.py` (8/8 passed, including `test_actor_group_shutdown`) and `test_v2_client.py` (15/15 passed) all green.
…tdown

- Use handle._resolve() instead of handle._actor_ref to handle name-based handles that haven't been resolved yet
- Wait on __ray_terminate__ refs with a 30s timeout
- Force-kill via ray.kill() any actors that don't terminate in time
- Addresses review feedback from both Claude and Codex reviews

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion (#4106)

Switch RayActorGroup.shutdown() from ray.kill() to __ray_terminate__. ray.kill() races with task completion callbacks in Ray's C++ task_manager, triggering a fatal assertion (ray-project/ray#54260) that crashes the process. __ray_terminate__ queues behind pending tasks and escalates to force-kill after 30s, avoiding the race.

Fixes flaky CI integration test crash:

    task_manager.cc:983: Check failed: it != submissible_tasks_.end()
    Tried to complete task that was not pending