
[fray] Use graceful actor termination to avoid Ray task_manager assertion #4106

Merged
rjpower merged 2 commits into main from fix/ray-actor-graceful-terminate
Mar 25, 2026

Conversation

@rjpower
Collaborator

@rjpower rjpower commented Mar 24, 2026

Switch RayActorGroup.shutdown() from ray.kill() to __ray_terminate__.
ray.kill() races with task completion callbacks in Ray's C++ task_manager,
triggering a fatal assertion (ray-project/ray#54260) that crashes the
process. __ray_terminate__ queues behind pending tasks and escalates to
force-kill after 30s, avoiding the race.
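
To make the ordering argument concrete, here is a toy, Ray-free model of an actor's FIFO task queue. `ToyActor`, `submit`, `terminate`, `kill`, and `run` are hypothetical names used only for illustration; they do not mirror Ray's internals and do not reproduce the C++ race itself. The sketch only shows that a queued terminate request lets earlier tasks finish, while an immediate kill abandons them.

```python
from collections import deque
from typing import Deque, List

_TERMINATE = "__terminate__"  # sentinel standing in for a queued terminate task


class ToyActor:
    """Toy single-threaded actor with a FIFO mailbox (not Ray's implementation)."""

    def __init__(self) -> None:
        self.mailbox: Deque[str] = deque()
        self.completed: List[str] = []
        self.abandoned: List[str] = []
        self.alive = True

    def submit(self, task: str) -> None:
        self.mailbox.append(task)

    def terminate(self) -> None:
        # Graceful: the exit request waits its turn behind pending tasks.
        self.mailbox.append(_TERMINATE)

    def kill(self) -> None:
        # Forced: everything still queued is dropped immediately.
        self.abandoned.extend(self.mailbox)
        self.mailbox.clear()
        self.alive = False

    def run(self) -> None:
        # Drain the mailbox in order until it is empty or the actor exits.
        while self.alive and self.mailbox:
            task = self.mailbox.popleft()
            if task == _TERMINATE:
                self.alive = False
            else:
                self.completed.append(task)


graceful = ToyActor()
graceful.submit("a")
graceful.submit("b")
graceful.terminate()
graceful.run()

forced = ToyActor()
forced.submit("a")
forced.submit("b")
forced.kill()
forced.run()
```

In this model the graceful actor completes both tasks before exiting, whereas the killed actor completes none, which is the window in which a completion callback can arrive for a task the owner no longer considers pending.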

Fixes flaky CI integration test crash:
task_manager.cc:983: Check failed: it != submissible_tasks_.end()
Tried to complete task that was not pending

ray.kill() races with task completion callbacks in Ray's C++ task_manager,
triggering a fatal assertion (ray-project/ray#54260) that crashes the
process. Switch to __ray_terminate__ which queues behind pending tasks
and escalates to force-kill after 30s.

Fixes flaky CI failures in the integration test where the zephyr
html-to-md step succeeds but the process crashes during actor cleanup:
  task_manager.cc:983: Check failed: it != submissible_tasks_.end()
  Tried to complete task that was not pending
@rjpower rjpower added the agent-generated Created by automation/agent label Mar 24, 2026
@claude
Contributor

claude Bot commented Mar 24, 2026

Claude finished @rjpower's task in 43s.


Review: PR #4106 — Graceful actor termination

  • Gather context and read changed files
  • Review for correctness and regressions
  • Post review

Switches RayActorGroup.shutdown() from ray.kill() to __ray_terminate__.remote() to avoid the Ray task_manager assertion race (ray-project/ray#54260). The change is small, well-motivated, and the docstring clearly explains the rationale.

One potential bug:

handle._actor_ref can be None when the RayActorHandle was constructed with a name string (lazy resolution). In that case, handle._actor_ref.__ray_terminate__.remote() will raise AttributeError: 'NoneType' object has no attribute '__ray_terminate__', which gets swallowed by the broad except Exception. The old code had the same latent issue with ray.kill(None), so this isn't a regression — but it means shutdown silently no-ops for name-based handles that were never resolved.

If actors created via create_actor_group always store a direct ref (line 413: RayActorHandle(actor_ref) where actor_ref is a Ray actor handle), this is fine in practice. But if shutdown could ever be called on a group whose handles were serialized/deserialized (name-based), it would silently skip termination. Consider using handle._resolve() instead of handle._actor_ref to be safe:

handle._resolve().__ray_terminate__.remote()

This is a minor robustness concern, not a blocker.
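
A minimal sketch of the lazy-resolution concern, assuming a handle that stores either a direct ref or just a name. `LazyActorHandle`, `FakeActor`, and the injected `lookup` callable are hypothetical stand-ins, not the real RayActorHandle; in real code the lookup would presumably be something like `ray.get_actor`.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional


class LazyActorHandle:
    """Hypothetical handle built from either an actor ref or an actor name."""

    def __init__(
        self,
        actor_ref: Optional[Any] = None,
        name: Optional[str] = None,
        lookup: Optional[Callable[[str], Any]] = None,
    ) -> None:
        if actor_ref is None and name is None:
            raise ValueError("need an actor ref or a name")
        self._actor_ref = actor_ref
        self._name = name
        # Injected so the sketch runs without a Ray cluster (assumption).
        self._lookup = lookup

    def _resolve(self) -> Any:
        # Resolve the name on first use and cache the result.
        if self._actor_ref is None:
            if self._lookup is None:
                raise RuntimeError("no lookup function for a name-based handle")
            self._actor_ref = self._lookup(self._name)
        return self._actor_ref


class FakeActor:
    """Fake actor exposing `__ray_terminate__.remote()` for the demo."""

    def __init__(self) -> None:
        self.terminated = False
        self.__ray_terminate__ = SimpleNamespace(remote=self._terminate)

    def _terminate(self) -> str:
        self.terminated = True
        return "terminate-ref"


registry = {"worker-0": FakeActor()}
handle = LazyActorHandle(name="worker-0", lookup=registry.__getitem__)

# Direct attribute access fails while the handle is still unresolved.
try:
    handle._actor_ref.__ray_terminate__.remote()
    direct_access_error = None
except AttributeError as exc:
    direct_access_error = exc

# Going through _resolve() looks the actor up first, then terminates it.
ref = handle._resolve().__ray_terminate__.remote()
```

Inside a broad `except Exception`, the first call would be swallowed and the actor left running, which is the silent no-op described above.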

Verdict: LGTM. The fix is correct for the stated problem and matches the recommended Ray pattern for graceful shutdown.

@rjpower rjpower requested a review from ravwojdyla March 24, 2026 23:36

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3ddaca385d


  for handle in self._handles:
      try:
-         ray.kill(handle._actor_ref)
+         handle._actor_ref.__ray_terminate__.remote()

P1: Add force-kill fallback after graceful terminate

__ray_terminate__.remote() only enqueues a termination task, so if an actor is currently running a long or blocked method, the terminate task may never execute and shutdown() returns while the actor is still alive. In this path, the group can keep consuming resources indefinitely and actor names remain occupied, which is a regression from the previous immediate ray.kill(...) behavior. Please wait on the terminate refs and fall back to ray.kill on timeout (as Ray’s own worker-group shutdown does) so shutdown reliably terminates actors.


@rjpower rjpower enabled auto-merge (squash) March 24, 2026 23:41
@rjpower
Collaborator Author

rjpower commented Mar 24, 2026

@claude should we use __ray_shutdown__ (https://docs.ray.io/en/latest/ray-core/actors/terminating-actors.html#actor-cleanup-with-ray-shutdown) instead, which automatically switches to kill after 60 seconds? Review the other PR feedback and fix/respond as well.

@claude
Contributor

claude Bot commented Mar 25, 2026

Claude finished @rjpower's task in 5m 40s.


Investigating ray.shutdown vs __ray_terminate__ and addressing feedback

  • Read current code and PR diff
  • Research Ray actor termination options
  • Address Codex feedback (force-kill fallback after graceful terminate)
  • Address previous review feedback (_actor_ref None safety)
  • Implement changes
  • Run pre-commit and tests
  • Push changes

Summary

Re: __ray_shutdown__. The docs link refers to __ray_shutdown__, which is a cleanup hook you define on the actor class (Ray calls it during graceful termination). It is a callback, not a termination mechanism. The actual termination methods are:

| Method | Behavior |
| --- | --- |
| ray.kill(actor) | Immediate force kill (causes the task_manager race) |
| actor.__ray_terminate__.remote() | Enqueues graceful termination task |

So __ray_terminate__ is the right call for initiating graceful shutdown. If you want cleanup logic in the actor, you'd add a __ray_shutdown__ method to _RayActorHostBase — but that's orthogonal to this fix.
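
For reference, a cleanup hook of that shape might look roughly like the sketch below. `ResourceHoldingActor` and its resource names are hypothetical, and the class is left undecorated so the snippet runs without a cluster; in real use it would carry `@ray.remote`, and (per the linked docs, on Ray versions that support the hook) Ray would invoke `__ray_shutdown__` during graceful termination. Here the hook is called by hand to simulate that.

```python
from typing import List


class ResourceHoldingActor:
    """Plain class for illustration; would be an @ray.remote actor in real use."""

    def __init__(self) -> None:
        # Hypothetical resources the actor holds while alive.
        self.open_resources: List[str] = ["db-connection", "temp-file"]
        self.closed: List[str] = []

    def __ray_shutdown__(self) -> None:
        # Cleanup hook: release whatever the actor still holds, newest first.
        while self.open_resources:
            self.closed.append(self.open_resources.pop())


actor = ResourceHoldingActor()
actor.__ray_shutdown__()  # simulated; Ray is expected to call this itself
```

The hook only describes what to clean up. Something else, such as `__ray_terminate__`, still has to initiate the shutdown.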

Changes pushed (addresses both Codex and prior Claude review):

  1. Force-kill fallback (Codex feedback): Now collects the __ray_terminate__ object refs, waits on them with ray.wait(..., timeout=30s), and force-kills via ray.kill() any actors that didn't terminate in time. This guarantees actors are cleaned up even if they're blocked.

  2. _resolve() safety (prior Claude review): Switched from handle._actor_ref to handle._resolve() so name-based handles that haven't been resolved yet work correctly instead of silently failing on None.
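
The two changes above can be sketched roughly as follows. This is not the merged code: `shutdown_actors` is a hypothetical helper, and `wait` and `kill` are injected so the example runs against fakes without a cluster. It assumes terminate refs are hashable (usable as dict keys) and that `wait` accepts `num_returns` and `timeout` keyword arguments the way `ray.wait` does.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, Iterable, List, Tuple


def shutdown_actors(
    actors: Iterable[Any],
    wait: Callable[..., Tuple[List[Any], List[Any]]],
    kill: Callable[[Any], None],
    timeout_s: float = 30.0,
) -> List[Any]:
    """Request graceful exit, then force-kill actors that miss the deadline.

    Returns the actors that were force-killed.
    """
    pending: Dict[Any, Any] = {}  # terminate ref -> actor
    force: List[Any] = []
    for actor in actors:
        try:
            pending[actor.__ray_terminate__.remote()] = actor
        except Exception:
            # Could not even enqueue the terminate task; kill directly.
            force.append(actor)

    if pending:
        refs = list(pending)
        _, not_ready = wait(refs, num_returns=len(refs), timeout=timeout_s)
        force.extend(pending[ref] for ref in not_ready)

    for actor in force:
        kill(actor)
    return force


def make_fake_actor(name: str) -> SimpleNamespace:
    # Fake actor whose terminate "ref" is a string derived from its name.
    terminate = SimpleNamespace(remote=lambda: f"ref-{name}")
    return SimpleNamespace(name=name, __ray_terminate__=terminate)


def fake_wait(refs, *, num_returns, timeout):
    # Pretend only "ref-fast" finishes before the timeout.
    ready = [r for r in refs if r == "ref-fast"]
    return ready, [r for r in refs if r != "ref-fast"]


killed_names: List[str] = []
fast, stuck = make_fake_actor("fast"), make_fake_actor("stuck")
forced = shutdown_actors(
    [fast, stuck],
    wait=fake_wait,
    kill=lambda a: killed_names.append(a.name),
    timeout_s=0.1,
)
```

Against a live cluster the call would presumably be `shutdown_actors(resolved_actors, wait=ray.wait, kill=ray.kill)`, with each actor obtained via `handle._resolve()` first.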

Testing: Pre-commit passes. test_v2_actor.py (8/8 passed, including test_actor_group_shutdown) and test_v2_client.py (15/15 passed) all green.



…tdown

- Use handle._resolve() instead of handle._actor_ref to handle
  name-based handles that haven't been resolved yet
- Wait on __ray_terminate__ refs with a 30s timeout
- Force-kill via ray.kill() any actors that don't terminate in time
- Addresses review feedback from both Claude and Codex reviews

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@ravwojdyla ravwojdyla left a comment


:shipit:

@rjpower rjpower merged commit 484a73d into main Mar 25, 2026
39 checks passed
@rjpower rjpower deleted the fix/ray-actor-graceful-terminate branch March 25, 2026 00:30
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
…tion (#4106)

Switch RayActorGroup.shutdown() from ray.kill() to __ray_terminate__.
ray.kill() races with task completion callbacks in Ray's C++ task_manager,
triggering a fatal assertion (ray-project/ray#54260) that crashes the
process. __ray_terminate__ queues behind pending tasks and escalates to
force-kill after 30s, avoiding the race.

Fixes flaky CI integration test crash:
task_manager.cc:983: Check failed: it != submissible_tasks_.end()
Tried to complete task that was not pending
