[fray] Use graceful actor termination to avoid Ray task_manager assertion by rjpower · Pull Request #4106 · marin-community/marin

rjpower · 2026-03-24T23:36:05Z

Switch RayActorGroup.shutdown() from ray.kill() to __ray_terminate__.
ray.kill() races with task completion callbacks in Ray's C++ task_manager,
triggering a fatal assertion (ray-project/ray#54260) that crashes the
process. __ray_terminate__ queues behind pending tasks and escalates to
force-kill after 30s, avoiding the race.

Fixes flaky CI integration test crash:
task_manager.cc:983: Check failed: it != submissible_tasks_.end()
Tried to complete task that was not pending

ray.kill() races with task completion callbacks in Ray's C++ task_manager, triggering a fatal assertion (ray-project/ray#54260) that crashes the process. Switch to __ray_terminate__ which queues behind pending tasks and escalates to force-kill after 30s.

Fixes flaky CI failures in the integration test where the zephyr html-to-md step succeeds but the process crashes during actor cleanup:
task_manager.cc:983: Check failed: it != submissible_tasks_.end()
Tried to complete task that was not pending

claude · 2026-03-24T23:36:19Z

Claude finished @rjpower's task in 43s

Review: PR #4106 — Graceful actor termination

Gather context and read changed files
Review for correctness and regressions
Post review

Switches RayActorGroup.shutdown() from ray.kill() to __ray_terminate__.remote() to avoid the Ray task_manager assertion race (ray-project/ray#54260). The change is small, well-motivated, and the docstring clearly explains the rationale.

One potential bug:

handle._actor_ref can be None when the RayActorHandle was constructed with a name string (lazy resolution). In that case, handle._actor_ref.__ray_terminate__.remote() will raise AttributeError: 'NoneType' object has no attribute '__ray_terminate__', which gets swallowed by the broad except Exception. The old code had the same latent issue with ray.kill(None), so this isn't a regression — but it means shutdown silently no-ops for name-based handles that were never resolved.

If actors created via create_actor_group always store a direct ref (line 413: RayActorHandle(actor_ref) where actor_ref is a Ray actor handle), this is fine in practice. But if shutdown could ever be called on a group whose handles were serialized/deserialized (name-based), it would silently skip termination. Consider using handle._resolve() instead of handle._actor_ref to be safe:

handle._resolve().__ray_terminate__.remote()

This is a minor robustness concern, not a blocker.
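To illustrate why `handle._actor_ref` and `handle._resolve()` can differ, here is a minimal sketch of a lazily resolved handle. `LazyActorHandle` and its `lookup` parameter are hypothetical stand-ins, not the actual `RayActorHandle` implementation; with Ray, the lookup would plausibly be `ray.get_actor`.

```python
from typing import Any, Callable, Optional


class LazyActorHandle:
    """Hypothetical stand-in for a handle that may hold only an actor name."""

    def __init__(
        self,
        actor_ref: Optional[Any] = None,
        name: Optional[str] = None,
        lookup: Optional[Callable[[str], Any]] = None,
    ) -> None:
        if actor_ref is None and name is None:
            raise ValueError("need either an actor ref or a name")
        self._actor_ref = actor_ref  # None until resolved for name-based handles
        self._name = name
        self._lookup = lookup  # assumption: with Ray this could be ray.get_actor

    def _resolve(self) -> Any:
        """Return the actor ref, looking it up by name on first use."""
        if self._actor_ref is None:
            if self._lookup is None:
                raise RuntimeError("no lookup function for a name-based handle")
            self._actor_ref = self._lookup(self._name)
        return self._actor_ref
```

Under these assumptions, reading `_actor_ref` directly on a freshly deserialized handle yields `None`, while `_resolve()` always returns a usable reference or raises a visible error.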

Verdict: LGTM. The fix is correct for the stated problem and matches the recommended Ray pattern for graceful shutdown.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3ddaca385d


chatgpt-codex-connector · 2026-03-24T23:41:10Z

        for handle in self._handles:
            try:
-                ray.kill(handle._actor_ref)
+                handle._actor_ref.__ray_terminate__.remote()


Add force-kill fallback after graceful terminate

__ray_terminate__.remote() only enqueues a termination task, so if an actor is currently running a long or blocked method, the terminate task may never execute and shutdown() returns while the actor is still alive. In this path, the group can keep consuming resources indefinitely and actor names remain occupied, which is a regression from the previous immediate ray.kill(...) behavior. Please wait on the terminate refs and fall back to ray.kill on timeout (as Ray’s own worker-group shutdown does) so shutdown reliably terminates actors.


rjpower · 2026-03-24T23:59:35Z

@claude should we use https://docs.ray.io/en/latest/ray-core/actors/terminating-actors.html#actor-cleanup-with-ray-shutdown ray_shutdown instead, which auto switches to kill after 60 seconds? review the other PR feedback and fix/respond as well

claude · 2026-03-25T00:00:07Z

Claude finished @rjpower's task in 5m 40s

Investigating `ray.shutdown` vs `__ray_terminate__` and addressing feedback

Read current code and PR diff
Research Ray actor termination options
Address Codex feedback (force-kill fallback after graceful terminate)
Address previous review feedback (_actor_ref None safety)
Implement changes
Run pre-commit and tests
Push changes

Summary

Re: ray_shutdown — The docs link refers to __ray_shutdown__, which is a cleanup hook you define on the actor class (Ray calls it during graceful termination). It's not a termination mechanism — it's a callback. The actual termination methods are:

Method	Behavior
`ray.kill(actor)`	Immediate force kill (causes the task_manager race)
`actor.__ray_terminate__.remote()`	Enqueues graceful termination task

So __ray_terminate__ is the right call for initiating graceful shutdown. If you want cleanup logic in the actor, you'd add a __ray_shutdown__ method to _RayActorHostBase — but that's orthogonal to this fix.
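As a rough sketch of what such a cleanup hook might look like, the class below defines `__ray_shutdown__` on a plain Python class. `ActorHostWithCleanup` and its attributes are hypothetical; in real use the class would be wrapped with `ray.remote`, and the hook would be invoked by Ray during graceful termination as described in the linked docs, not called by hand.

```python
class ActorHostWithCleanup:
    """Hypothetical actor class; real code would wrap it with ray.remote."""

    def __init__(self) -> None:
        # Placeholder resources standing in for connections, files, etc.
        self.open_resources = ["connection", "tempfile"]
        self.cleaned_up = False

    def __ray_shutdown__(self) -> None:
        # Cleanup callback only: it releases resources but does not
        # itself stop the actor. Termination is still initiated elsewhere.
        self.open_resources.clear()
        self.cleaned_up = True
```

The hook only describes what to do while shutting down; something else (for example `__ray_terminate__`) still has to start the shutdown.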

Changes pushed (addresses both Codex and prior Claude review):

Force-kill fallback (Codex feedback): Now collects the __ray_terminate__ object refs, waits on them with ray.wait(..., timeout=30s), and force-kills via ray.kill() any actors that didn't terminate in time. This guarantees actors are cleaned up even if they're blocked.
_resolve() safety (prior Claude review): Switched from handle._actor_ref to handle._resolve() so name-based handles that haven't been resolved yet work correctly instead of silently failing on None.
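
The two changes above can be sketched roughly as follows. This is an illustration of the described approach, not the code that was pushed: `shutdown_actors` and the `ray_api` injection parameter are hypothetical, and handles are assumed to expose `_resolve()`. It relies only on `ray.wait(refs, num_returns=..., timeout=...)` and `ray.kill(actor)`.

```python
from typing import Any, Optional, Sequence


def shutdown_actors(
    handles: Sequence[Any],
    timeout_s: float = 30.0,
    ray_api: Optional[Any] = None,
) -> int:
    """Gracefully terminate actors, then force-kill any stragglers.

    Returns the number of actors that had to be force-killed.
    """
    if ray_api is None:
        import ray as ray_api  # deferred so the sketch imports without Ray

    pending = {}  # terminate ref -> actor ref
    for handle in handles:
        try:
            actor = handle._resolve()
            # Enqueue graceful termination behind any pending tasks.
            pending[actor.__ray_terminate__.remote()] = actor
        except Exception:
            # Best effort: the actor may already be gone.
            continue

    if not pending:
        return 0

    refs = list(pending)
    # Wait for all terminate tasks, up to the timeout.
    _, not_done = ray_api.wait(refs, num_returns=len(refs), timeout=timeout_s)

    # Anything still outstanding is assumed blocked: fall back to force kill.
    for ref in not_done:
        try:
            ray_api.kill(pending[ref])
        except Exception:
            pass
    return len(not_done)
```

Passing `ray_api` explicitly is only a convenience for exercising the logic with a fake; normal callers would omit it and let the function import `ray`.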

Testing: Pre-commit passes. test_v2_actor.py (8/8 passed, including test_actor_group_shutdown) and test_v2_client.py (15/15 passed) all green.


…tdown

- Use handle._resolve() instead of handle._actor_ref to handle name-based handles that haven't been resolved yet
- Wait on __ray_terminate__ refs with a 30s timeout
- Force-kill via ray.kill() any actors that don't terminate in time
- Addresses review feedback from both Claude and Codex reviews

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ravwojdyla

…tion (#4106)

Switch RayActorGroup.shutdown() from ray.kill() to __ray_terminate__. ray.kill() races with task completion callbacks in Ray's C++ task_manager, triggering a fatal assertion (ray-project/ray#54260) that crashes the process. __ray_terminate__ queues behind pending tasks and escalates to force-kill after 30s, avoiding the race.

Fixes flaky CI integration test crash:
task_manager.cc:983: Check failed: it != submissible_tasks_.end()
Tried to complete task that was not pending

rjpower added the agent-generated Created by automation/agent label Mar 24, 2026

rjpower requested a review from ravwojdyla March 24, 2026 23:36

chatgpt-codex-connector Bot reviewed Mar 24, 2026


rjpower enabled auto-merge (squash) March 24, 2026 23:41

ravwojdyla approved these changes Mar 25, 2026


rjpower merged commit 484a73d into main Mar 25, 2026
39 checks passed

rjpower deleted the fix/ray-actor-graceful-terminate branch March 25, 2026 00:30

rjpower merged 2 commits into main from fix/ray-actor-graceful-terminate
