
[core] Avoid fork preexec for Ray Client specific servers #63408

Open
Jerry2003826 wants to merge 3 commits into ray-project:master from Jerry2003826:codex/fix-ray-client-specific-server-fork-race

Conversation

@Jerry2003826

@Jerry2003826 Jerry2003826 commented May 17, 2026

Fixes #63202

Summary

  • Add an opt-in use_posix_spawn path to start_ray_process() for POSIX subprocesses that do not require kernel fate sharing.
  • Use that path for Ray Client specific-server subprocesses, which are spawned by the gRPC proxier while other gRPC poller threads may be active.
  • Disable kernel fate sharing only for specific-server launches; the proxy server path and other Ray processes keep their existing startup behavior.
  • Add regression coverage that the proxy path does not opt in, while the specific-server path does.

Root Cause

Ray Client's proxier is a multi-threaded gRPC server. Starting a per-client specific server currently goes through subprocess.Popen(..., preexec_fn=...), which forces a fork+preexec path on POSIX. When other proxier threads are inside gRPC at the same time, gRPC can skip fork handlers and the child can inherit an unsafe poller state before it opens its own channel.

Specific servers already self-terminate after being idle and are also cleaned up by the proxier, so this path can trade Linux kernel fate sharing for CPython's posix_spawn-eligible subprocess path.
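As a rough illustration of the eligibility condition described above, the sketch below launches a child with no preexec_fn. It is not the PR's code: run_spawn_eligible is a hypothetical helper, and whether CPython actually selects posix_spawn is an internal decision that also depends on the platform, the C library, and the Python version.

```python
import subprocess
import sys


def run_spawn_eligible(args):
    """Run a child without preexec_fn so CPython may pick posix_spawn."""
    proc = subprocess.Popen(
        args,                    # args[0] should include a directory component
        stdout=subprocess.PIPE,
        preexec_fn=None,         # any callable here rules out posix_spawn
        close_fds=False,         # assumption: older CPython needs this to stay eligible
        text=True,
    )
    out, _ = proc.communicate()
    return proc.returncode, out


# sys.executable is an absolute path, so it satisfies the directory requirement.
code, out = run_spawn_eligible([sys.executable, "-c", "print('child ok')"])
```

Passing any callable as preexec_fn keeps subprocess on the fork path, which is why the specific-server launch has to drop it rather than work around it.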

Validation

  • python -m compileall -q python/ray/_private/services.py python/ray/tests/test_debug_tools.py
  • ruff check python/ray/_private/services.py python/ray/tests/test_debug_tools.py

I could not run the focused pytest locally in this checkout because Ray's native _raylet extension is not built:

ModuleNotFoundError: No module named 'ray._raylet'

Attempted command:

PYTHONPATH=python python -m pytest python/ray/tests/test_debug_tools.py::test_start_ray_client_specific_server_uses_fork_safe_spawn -q

@Jerry2003826 Jerry2003826 force-pushed the codex/fix-ray-client-specific-server-fork-race branch from 08d6a5a to 26caf95 Compare May 17, 2026 12:20
@Jerry2003826 Jerry2003826 marked this pull request as ready for review May 17, 2026 12:21
@Jerry2003826 Jerry2003826 requested a review from a team as a code owner May 17, 2026 12:21

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a use_posix_spawn parameter to start_ray_process to enable a fork-safe spawning path for Ray Client specific servers, which helps avoid crashes in multi-threaded gRPC environments. Feedback was provided regarding the use of close_fds=False, which may cause file descriptor leaks; the reviewer suggested setting it to True as modern CPython versions can still utilize the posix_spawn fast path while closing descriptors.

Comment thread python/ray/_private/services.py Outdated
@@ -1048,7 +1055,8 @@ def preexec_fn():
 stdout=stdout_file,
 stderr=stderr_file,
 stdin=subprocess.PIPE if pipe_stdin else None,
-preexec_fn=preexec_fn if sys.platform != "win32" else None,
+preexec_fn=(None if sys.platform == "win32" or use_posix_spawn else preexec_fn),
+close_fds=not use_posix_spawn,

Contributor

medium

Setting close_fds=False when use_posix_spawn is enabled causes the child process to inherit all open file descriptors from the parent (except those explicitly marked with O_CLOEXEC). In a multi-threaded environment like the Ray Client proxier, this can lead to significant file descriptor leaks. For example, the child process might inherit listening sockets or active client connections, which can prevent the parent from restarting or lead to resource exhaustion.

CPython 3.9+ is capable of using the posix_spawn fast path even with close_fds=True on most modern POSIX systems (using posix_spawn_file_actions_addclose). Unless you have confirmed that close_fds=True forces a fallback to fork() on your target platforms and that this fallback is unacceptable, it is safer to maintain close_fds=True.

Suggested change
-close_fds=not use_posix_spawn,
+close_fds=True,

@Jerry2003826 Jerry2003826 Jun 8, 2026

Author

Addressed in the current version. start_ray_process() now keeps close_fds=True when the runtime exposes os.POSIX_SPAWN_CLOSEFROM, and only falls back to False for older runtimes where closing fds would force the unsafe fork path. Added coverage for both branches in test_start_ray_process_posix_spawn_close_fds_when_supported and test_start_ray_process_posix_spawn_leaves_fds_open_for_older_runtime.

Author

Handled in the current branch. start_ray_process() now keeps close_fds=True when the runtime exposes os.POSIX_SPAWN_CLOSEFROM, and only falls back to close_fds=False on older runtimes where that is needed to avoid the fork path. Added tests cover both branches.
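The close_fds decision described in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: choose_popen_kwargs is a hypothetical helper, not the function in services.py, and the only runtime probe it uses is the os.POSIX_SPAWN_CLOSEFROM attribute named in the reply.

```python
import os
import sys


def choose_popen_kwargs(use_posix_spawn, preexec_fn=None, platform=sys.platform):
    """Pick preexec_fn/close_fds values for a subprocess.Popen call."""
    if platform == "win32":
        # preexec_fn is not supported on Windows.
        return {"preexec_fn": None, "close_fds": True}
    if not use_posix_spawn:
        # Existing behavior: keep the hook and close inherited descriptors.
        return {"preexec_fn": preexec_fn, "close_fds": True}
    # Spawn-friendly path: no hook; close fds only when the runtime can do
    # so without falling back to fork (assumed to be signaled by this attribute).
    closefrom_supported = hasattr(os, "POSIX_SPAWN_CLOSEFROM")
    return {"preexec_fn": None, "close_fds": closefrom_supported}
```

Keeping the probe in one helper makes both branches easy to cover in tests by patching the attribute in or out.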

Comment thread python/ray/tests/test_debug_tools.py
Comment thread python/ray/_private/services.py Outdated
@@ -1048,7 +1055,8 @@ def preexec_fn():
 stdout=stdout_file,
 stderr=stderr_file,
 stdin=subprocess.PIPE if pipe_stdin else None,
-preexec_fn=preexec_fn if sys.platform != "win32" else None,
+preexec_fn=(None if sys.platform == "win32" or use_posix_spawn else preexec_fn),

Specific-server child loses SIGINT blocking from preexec_fn

Low Severity

When use_posix_spawn is True, preexec_fn is set to None, which silently drops the signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT}) call that every other Ray child process receives. A SIGINT delivered to the process group (e.g., Ctrl+C) will now reach the specific-server as an unhandled KeyboardInterrupt, potentially causing an unclean mid-operation crash rather than an orderly proxier-driven teardown.

Reviewed by Cursor Bugbot for commit 26caf95cbba474957220a5564bd2ab13b642159e.

@Jerry2003826 Jerry2003826 Jun 8, 2026

Author

Fixed in 6d809e568b. The specific-server process now blocks SIGINT in its own entrypoint via _block_sigint_for_specific_server(), so the fork-safe posix_spawn path still avoids preexec_fn while preserving the previous child-side SIGINT mask behavior. Added test_ray_client_specific_server_blocks_sigint to cover this.

Author

Handled in the current branch. Specific-server startup now calls _block_sigint_for_specific_server() inside server.py, preserving the SIGINT mask even when the process is launched through the posix_spawn path without preexec_fn.
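A child-side SIGINT mask of the kind described here might look like the sketch below. The helper name block_sigint_if_supported is hypothetical (the PR's function is _block_sigint_for_specific_server(), whose body is not shown in this thread), and the no-op guard for runtimes without POSIX signal masks is an assumption about how such a helper would stay portable.

```python
import signal


def block_sigint_if_supported() -> bool:
    """Block SIGINT in the calling thread; return False when unsupported."""
    pthread_sigmask = getattr(signal, "pthread_sigmask", None)
    sig_block = getattr(signal, "SIG_BLOCK", None)
    if pthread_sigmask is None or sig_block is None:
        # For example on Windows: nothing to do.
        return False
    pthread_sigmask(sig_block, {signal.SIGINT})
    return True
```

Calling this at the top of the entrypoint masks SIGINT only after the interpreter has started, so it does not cover the window before the entrypoint runs.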

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 26caf95cbb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

assert captured["process_type"] == ray_constants.PROCESS_TYPE_RAY_CLIENT_SERVER
assert "--mode=specific-server" in captured["command"]
assert captured["kwargs"]["fate_share"] is False
assert captured["kwargs"]["use_posix_spawn"] is True

P2 Badge Make spawn assertion platform-aware in specific-server test

start_ray_client_server() only enables use_posix_spawn on non-Windows (server_type == "specific-server" and sys.platform != "win32" in python/ray/_private/services.py), but this test unconditionally asserts it is True. On Windows runners, the code correctly passes False and this assertion fails, creating a deterministic cross-platform test regression rather than validating the intended behavior.

@Jerry2003826 Jerry2003826 Jun 8, 2026

Author

Addressed by making the test explicitly platform-scoped: it now monkeypatches services.sys.platform to linux before asserting the POSIX-specific use_posix_spawn=True behavior. The duplicate Cursor thread is already marked resolved.

Author

Handled in the current branch. The test now patches services.sys.platform to linux before invoking start_ray_client_server(), so it only asserts the POSIX fork-safe path.
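The platform scoping can be illustrated with a small standalone check. This sketch uses unittest.mock from the standard library instead of pytest's monkeypatch, and wants_posix_spawn is a hypothetical stand-in for the condition quoted in the review (server_type == "specific-server" and sys.platform != "win32"); the "proxy" value is used only as an example of a non-specific server type.

```python
import sys
from unittest import mock


def wants_posix_spawn(server_type: str) -> bool:
    """Stand-in for the opt-in condition quoted in the review comment."""
    return server_type == "specific-server" and sys.platform != "win32"


def check_platform_scoped_behavior():
    # Pin the platform so the assertion holds on any runner.
    with mock.patch.object(sys, "platform", "linux"):
        assert wants_posix_spawn("specific-server") is True
        assert wants_posix_spawn("proxy") is False
    # The Windows branch is asserted explicitly rather than left implicit.
    with mock.patch.object(sys, "platform", "win32"):
        assert wants_posix_spawn("specific-server") is False
```

Pinning the platform inside the test keeps the POSIX-specific assertion from failing on Windows runners while still exercising the opt-in logic.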

Signed-off-by: JerryLee <223425819+Jerry2003826@users.noreply.github.com>
@Jerry2003826 Jerry2003826 force-pushed the codex/fix-ray-client-specific-server-fork-race branch from 26caf95 to 07fe244 Compare May 17, 2026 12:42
@ray-gardener ray-gardener Bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels May 17, 2026
@Yicheng-Lu-llll Yicheng-Lu-llll self-assigned this May 28, 2026
@prince8273

Contributor

The Cursor bot flagged that dropping preexec_fn removes SIGINT blocking for specific-server subprocesses.

Looking at start_ray_process(), there is already an os.setpgrp() call that moves every child Ray process into its own process group. That means a user Ctrl+C never reaches specific-server subprocesses regardless of pthread_sigmask, so the signal blocking was redundant on this path.

The fate_share concern is worth a closer look though. detect_fate_sharing_support() returns True on most Linux kernels, so set_kill_on_parent_death_linux() would have been active before this change. Specific servers are also cleaned up by the proxier directly, but it may be worth confirming they are still reliably terminated if the proxier dies unexpectedly.

@Jerry2003826

Author

Addressed the fate-sharing cleanup concern in 7c5431fce7.

What changed:

  • Specific-server launches still use the fork-safe posix_spawn path and do not restore preexec_fn.
  • For that path, the proxier now passes --parent-pid=<proxier pid> to the specific server.
  • The specific server checks that parent PID in its existing 1s lifecycle loop and shuts down if the proxier process disappears.
  • Proxy/legacy server launches are unchanged and still use the existing fate-sharing behavior.

On the SIGINT point: start_ray_process() already starts subprocesses in their own process group via ConsolePopen, so Ctrl+C from the user's terminal is not delivered to the specific-server subprocess through the proxier. The missing piece was the abnormal-proxier-death path, which the parent PID monitor now covers without going back through fork+preexec in the gRPC proxier.

Validation:

  • python -m compileall -q python/ray/_private/services.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • ruff check python/ray/_private/services.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • git diff --check

I also retried the focused pytest, but this local checkout still cannot import Ray's native extension:

ModuleNotFoundError: No module named 'ray._raylet'

Attempted command:

PYTHONPATH=python python -m pytest python/ray/tests/test_debug_tools.py::test_start_ray_client_specific_server_uses_fork_safe_spawn python/ray/tests/test_debug_tools.py::test_ray_client_specific_server_parent_liveness python/ray/tests/test_debug_tools.py::test_ray_client_specific_server_parent_liveness_missing -q

Comment thread python/ray/util/client/server/server.py Outdated
@Jerry2003826 Jerry2003826 force-pushed the codex/fix-ray-client-specific-server-fork-race branch 2 times, most recently from 80d7465 to 4396f99 Compare June 4, 2026 17:59
@Jerry2003826

Author

Follow-up pushed in 4396f99f6e for the zombie-parent case from Cursor Bugbot.

The parent monitor now first verifies that the specific server is still directly parented by the recorded proxier PID via os.getppid(). If the proxier exits, the server is reparented, so the monitor returns false before relying on os.kill(pid, 0). This covers the case where the old proxier PID may still exist as a zombie.

I also added a focused reparented-parent unit test and kept the previous missing-PID test.
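The ordering of the two checks can be sketched as below. Only the signature comes from the diff excerpt in this PR; the body, including the treatment of a missing PID as "alive", is an assumption. The sketch is POSIX-only: on Windows, os.kill with signal 0 does not act as a harmless probe.

```python
import os
from typing import Optional


def parent_process_is_alive(parent_pid: Optional[int]) -> bool:
    """Return False once the recorded parent can no longer be our parent."""
    if parent_pid is None:
        # Assumption: no recorded parent means monitoring is disabled.
        return True
    if os.getppid() != parent_pid:
        # We were reparented, so the original parent exited, even if its
        # PID still shows up (for example as a zombie).
        return False
    try:
        os.kill(parent_pid, 0)  # signal 0 probes existence without signaling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Checking getppid() first means the zombie case never reaches the kill probe, which would otherwise report the dead parent as present.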

Validation:

  • python -m compileall -q python/ray/_private/services.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • ruff check python/ray/_private/services.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • git diff --check

DCO is now passing. The focused pytest remains blocked in this local checkout by the missing native extension:

ModuleNotFoundError: No module named 'ray._raylet'

@Jerry2003826

Author

I checked the failing buildkite/microcheck run. The failed job is :database: data: doc tests [g7_s21], specifically doc/source/data/loading-data.rst::loading-data.rst.

The failure appears to be unrelated to this PR: the doctest timed out 3/3 attempts while downloading CIFAR10 (30% ... 51.2M/170M, then SystemExit: 15). This PR only changes:

  • python/ray/_private/services.py
  • python/ray/util/client/server/server.py
  • python/ray/tests/test_debug_tools.py

Could someone retry the failed microcheck job?

@Jerry2003826 Jerry2003826 force-pushed the codex/fix-ray-client-specific-server-fork-race branch from 4396f99 to 6d809e5 Compare June 8, 2026 08:24
@Jerry2003826

Author

Follow-up pushed in 6d809e568b for the remaining SIGINT review concern.

What changed:

  • Specific-server processes now block SIGINT in their own entrypoint via _block_sigint_for_specific_server().
  • The posix_spawn path still avoids preexec_fn, so it remains fork-safe for the gRPC proxier path.
  • Added test_ray_client_specific_server_blocks_sigint.
  • Replied to the remaining review threads for close_fds, platform-scoped assertions, and SIGINT masking.

Validation:

  • python -m compileall -q python/ray/_private/services.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • ruff check python/ray/_private/services.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • git diff --check

Focused pytest is still blocked in this local checkout by the missing native extension:

ModuleNotFoundError: No module named 'ray._raylet'

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 6d809e5.

Comment thread python/ray/util/client/server/server.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d809e568b

Comment thread python/ray/tests/test_debug_tools.py Outdated
calls.append((how, signals))

with monkeypatch.context() as m:
m.setattr(ray_client_server.signal, "pthread_sigmask", fake_pthread_sigmask)

P2 Badge Make SIGINT-mask test platform-aware

When this test runs on Windows or any Python build without signal.pthread_sigmask, monkeypatch.setattr(..., raising=True) fails before the helper can exercise its hasattr guard. Since _block_sigint_for_specific_server() is intentionally a no-op on those platforms, this creates a deterministic cross-platform test failure rather than validating the intended behavior; skip/branch the test or patch with raising=False and assert the no-op case.

Author

Fixed in 52992f3e06. _block_sigint_for_specific_server() now requires both pthread_sigmask and SIG_BLOCK; the tests patch POSIX signal support with raising=False and add no-op coverage when SIG_BLOCK is unavailable.

Author

Handled in the current branch. The SIGINT-mask test now patches signal.pthread_sigmask and signal.SIG_BLOCK with raising=False, and there is a separate no-op test for runtimes without the POSIX signal-mask support.

@Jerry2003826 Jerry2003826 force-pushed the codex/fix-ray-client-specific-server-fork-race branch from 6d809e5 to 52992f3 Compare June 8, 2026 08:41
@Jerry2003826

Author

Updated in 52992f3e06.

Addressed the latest automated feedback:

  • specific-server now checks parent liveness before starting serve() and again at the beginning of each health-loop iteration, in addition to the post-sleep idle-accounting check.
  • _block_sigint_for_specific_server() now no-ops unless both pthread_sigmask and SIG_BLOCK are available.
  • Added regression coverage for dead-parent-before-serve and no-POSIX-signal-support behavior.

Validation:

  • python -m compileall -q python/ray/_private/services.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • ruff check python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • ruff format --check python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • git diff --check

Targeted pytest remains blocked in this local checkout because Ray native extension ray._raylet is not built/importable here:

  • PYTHONPATH=python pytest -q python/ray/tests/test_debug_tools.py -k specific_server
  • fails during import ray with ModuleNotFoundError: No module named 'ray._raylet' before collecting the changed tests.

@Yicheng-Lu-llll Yicheng-Lu-llll left a comment

Member

Thanks for the contribution! Using spawn makes sense to me. I left one comment for the first round of review.

Comment thread python/ray/util/client/server/server.py Outdated
@@ -851,6 +852,36 @@ def ray_connect_handler(job_config: JobConfig = None, **ray_init_kwargs):
return ray_connect_handler


def _parent_process_is_alive(parent_pid: Optional[int]) -> bool:

Member

Repeating the lookup might not be ideal. You could use a pipe, see here:

"""
This is a lightweight "reaper" process used to ensure that ray processes are
cleaned up properly when the main ray process dies unexpectedly (e.g.,
segfaults or gets SIGKILLed). Note that processes may not be cleaned up
properly if this process is SIGTERMed or SIGKILLed.
It detects that its parent has died by reading from stdin, which must be
inherited from the parent process so that the OS will deliver an EOF if the
parent dies. When this happens, the reaper process kills the rest of its
process group (first attempting graceful shutdown with SIGTERM, then escalating
to SIGKILL).

Author

Implemented the pipe-based parent monitor in the current branch.

Specific-server startup now passes --monitor-parent-pipe and pipe_stdin=True when using the fork-safe posix_spawn path. The server starts a daemon thread that reads stdin and marks the parent as exited on EOF, matching the ray_process_reaper.py pattern instead of polling getppid() / kill(pid, 0).

Coverage added/updated:

  • specific-server startup asserts --monitor-parent-pipe, pipe_stdin=True, use_posix_spawn=True, and fate_share=False
  • parent-pipe liveness checks use a threading.Event
  • the monitor helper sets the event on stdin EOF
  • startup exits before serving when the monitored parent pipe is already closed
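The stdin-EOF pattern described above can be sketched as follows. This is a minimal illustration, not the PR's helper: start_parent_pipe_monitor and its return shape are hypothetical, and the stream is captured before the thread starts so a later change to sys.stdin cannot affect what the thread reads.

```python
import io
import sys
import threading


def start_parent_pipe_monitor(enabled, stdin=None):
    """Start a daemon thread that sets an Event when stdin reaches EOF."""
    parent_exited = threading.Event()
    if not enabled:
        return parent_exited, None
    # Capture the stream now instead of resolving sys.stdin inside the thread.
    stream = stdin if stdin is not None else sys.stdin

    def _watch():
        try:
            while stream.read(1):  # returns an empty value at EOF
                pass
        except (OSError, ValueError):
            pass  # treat a broken or closed stream like EOF
        parent_exited.set()

    thread = threading.Thread(target=_watch, daemon=True)
    thread.start()
    return parent_exited, thread


# An already-exhausted stream stands in for a pipe whose writer has exited.
event, thread = start_parent_pipe_monitor(True, io.StringIO(""))
thread.join(timeout=1)
```

A lifecycle loop can then poll event.is_set() instead of repeating PID lookups on every iteration.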

Validated:

  • python -m compileall -q python/ray/_private/services.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • ruff check python/ray/_private/services.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • ruff format --check python/ray/_private/services.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • git diff --check

I also tried PYTHONPATH=python python -m pytest python/ray/tests/test_debug_tools.py -k "specific_server" -q; this local checkout cannot import ray._raylet, so the pytest run stops in python/ray/tests/conftest.py before collecting the target tests.

@Jerry2003826 Jerry2003826 force-pushed the codex/fix-ray-client-specific-server-fork-race branch from 52992f3 to caa1088 Compare June 9, 2026 08:06

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: caa1088224

Comment thread python/ray/_private/services.py Outdated
 stderr=stderr_file,
 stdin=subprocess.PIPE if pipe_stdin else None,
-preexec_fn=preexec_fn if sys.platform != "win32" else None,
+preexec_fn=(None if sys.platform == "win32" or use_posix_spawn else preexec_fn),

P2 Badge Block SIGINT before spawning the specific server

For POSIX specific-server launches, this drops the preexec_fn that used to block SIGINT before exec, while the replacement mask is only applied after setup_worker execs and server.main() reaches _block_sigint_for_specific_server(). If a user sends Ctrl-C while the runtime-env setup/import path is still running, the child can receive SIGINT and exit before installing its own mask, regressing Ray's intended isolation of subprocesses from driver interrupts. Consider having the spawning thread temporarily block SIGINT around the posix_spawn call so the child inherits the mask without using fork/preexec.

Author

Handled in the current branch. start_ray_process() now temporarily blocks SIGINT in the spawning thread around the posix_spawn/ConsolePopen call and restores the previous mask afterward. The child inherits the blocked mask before setup_worker starts, without reintroducing preexec_fn.

Added coverage in test_start_ray_process_posix_spawn_blocks_sigint_for_child.
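The parent-side masking can be sketched with a context manager. This is an illustration under assumptions, not the code in start_ray_process(): the helper names are hypothetical, the module-level demonstration is POSIX-only, and it relies on a child inheriting the signal mask of the thread that creates it.

```python
import contextlib
import signal
import subprocess
import sys


@contextlib.contextmanager
def sigint_blocked_in_this_thread():
    """Block SIGINT for the calling thread, then restore the previous mask."""
    supported = hasattr(signal, "pthread_sigmask") and hasattr(signal, "SIG_BLOCK")
    if not supported:
        yield
        return
    previous = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        yield
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, previous)


# The child prints whether SIGINT is blocked in its own mask.
CHILD_REPORTS_MASK = (
    "import signal;"
    "print(signal.SIGINT in signal.pthread_sigmask(signal.SIG_BLOCK, set()))"
)


def spawn_with_sigint_blocked(args):
    # Only the process creation is wrapped; the mask is restored right after.
    with sigint_blocked_in_this_thread():
        proc = subprocess.Popen(args, stdout=subprocess.PIPE, text=True)
    out, _ = proc.communicate()
    return out.strip()


mask_before = signal.pthread_sigmask(signal.SIG_BLOCK, set())
child_saw_blocked = spawn_with_sigint_blocked([sys.executable, "-c", CHILD_REPORTS_MASK])
mask_after = signal.pthread_sigmask(signal.SIG_BLOCK, set())
```

Because the mask is set before the child exists, there is no window in which the child runs with SIGINT unblocked, and no preexec_fn is needed.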

Validated:

  • python -m compileall -q python/ray/_private/services.py python/ray/_private/workers/setup_worker.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • ruff check python/ray/_private/services.py python/ray/_private/workers/setup_worker.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • ruff format --check python/ray/_private/services.py python/ray/_private/workers/setup_worker.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • git diff --check

Targeted pytest still cannot collect locally because this checkout lacks the compiled ray._raylet extension.

Comment thread python/ray/tests/test_debug_tools.py Outdated

with monkeypatch.context() as m:
m.setattr(ray_client_server.sys, "stdin", FakeStdin())
parent_pipe_closed = ray_client_server._start_parent_pipe_monitor(True)

P2 Badge Keep fake stdin patched until the monitor thread reads it

When the daemon monitor thread is scheduled only after this monkeypatch.context() exits, _start_parent_pipe_monitor() resolves sys.stdin in the thread and can read the restored real pytest/direct-run stdin instead of FakeStdin; with -s or direct invocation that read can block, making parent_pipe_closed.wait(timeout=1) fail. Keep the wait inside the monkeypatch context or capture the stdin object before starting the thread.

Author

Handled in the current branch. Both parent-pipe monitor helpers now capture the stdin object before starting the daemon thread, so the test cannot race with monkeypatch.context() restoration.

The test now joins the monitor thread while the fake stdin is still patched.

Validated with compileall, Ruff check, Ruff format check, and git diff --check. Targeted pytest still cannot collect locally because this checkout lacks ray._raylet.

Comment on lines +2499 to +2501
process_fate_share = False if use_posix_spawn else fate_share
if use_posix_spawn:
command.append("--monitor-parent-pipe")

P2 Badge Preserve parent-death cleanup during server bootstrap

When a POSIX specific server is launched, this disables kernel fate sharing immediately, but the replacement stdin monitor is only started later after setup_worker has finished and ray.util.client.server.main() begins. If the proxier process dies during that bootstrap/import/runtime-env command path, the child no longer gets killed by fate sharing and has not yet started monitoring stdin EOF, so it can continue running orphaned until it eventually reaches server startup. Start the parent-pipe monitor in the setup-worker phase or keep a startup-time fate-sharing mechanism until the monitor is active.

Author

Handled in the current branch. The parent-pipe monitor now starts in setup_worker.py before RuntimeEnvContext.deserialize() / exec_worker(), so parent death during runtime-env/bootstrap exits the child early. Because exec_worker() replaces the process with os.execvp, setup_worker also forwards --monitor-parent-pipe to the final Ray Client server, which starts its own post-exec stdin EOF monitor.

This preserves parent-death cleanup both before and after the exec boundary while keeping the fork-safe spawn path.

Validated with compileall, Ruff check, Ruff format check, and git diff --check. Targeted pytest still cannot collect locally because this checkout lacks ray._raylet.

@Jerry2003826 Jerry2003826 force-pushed the codex/fix-ray-client-specific-server-fork-race branch from caa1088 to e90156c Compare June 9, 2026 08:23

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e90156cff0

if __name__ == "__main__":
setup_logger(LOGGER_LEVEL, LOGGER_FORMAT)
args, remaining_args = parser.parse_known_args()
_start_parent_pipe_monitor(args.monitor_parent_pipe)

P2 Badge Keep parent monitor alive through runtime-env exec

When --monitor-parent-pipe is used with a non-empty runtime-env command_prefix, this monitor thread only exists until runtime_env_context.exec_worker(...) replaces setup_worker via os.execvp("bash", ...) in python/ray/_private/runtime_env/context.py; the daemon thread is then gone while the shell runs the prefix and before ray.util.client.server starts its own monitor. In that interval, if the proxier dies, stdin EOF is not consumed and fate_share has already been disabled for the specific server, so the child can still be orphaned during bootstrap. The monitor needs to survive the exec path or the parent-death mechanism needs to remain active until the final server monitor is running.

Author

Handled in the current branch. The parent-pipe monitor is no longer a daemon thread inside setup_worker; _start_parent_pipe_monitor() now starts a small subprocess that inherits the same stdin pipe and targets the setup_worker PID. That process survives RuntimeEnvContext.exec_worker() / os.execvp(...), so it continues covering the runtime-env command-prefix window and the final Ray Client server process. It also exits when the target PID is gone to avoid leaking after normal server shutdown.

The final server no longer starts a second stdin reader, so there is only one monitor consuming the inherited pipe.
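A separate watcher process of this kind could look roughly like the sketch below. It is a guess at the shape of the mechanism, not the PR's implementation: the watcher script, start_parent_death_watcher, the one-second poll interval, and the use of SIGTERM are all assumptions. The sketch is POSIX-only. It works across exec because a PID does not change when a process replaces its image.

```python
import os
import signal
import subprocess
import sys

# Hypothetical watcher script: wait for EOF on stdin, then stop the target.
_WATCHER_SOURCE = r'''
import os, select, signal, sys
target_pid = int(sys.argv[1])
fd = sys.stdin.fileno()
while True:
    readable, _, _ = select.select([fd], [], [], 1.0)
    if readable:
        if os.read(fd, 4096) == b"":      # EOF: every write end is closed
            try:
                os.kill(target_pid, signal.SIGTERM)
            except ProcessLookupError:
                pass
            break
    else:
        try:
            os.kill(target_pid, 0)        # existence probe only
        except ProcessLookupError:
            break                         # target already gone; do not linger
'''


def start_parent_death_watcher(target_pid, stdin_fd):
    """Launch a watcher that reads stdin_fd as its stdin and targets a PID."""
    return subprocess.Popen(
        [sys.executable, "-c", _WATCHER_SOURCE, str(target_pid)],
        stdin=stdin_fd,
    )
```

A single reader on the inherited pipe avoids two monitors competing for the same EOF, which matches the note above about removing the second stdin reader.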

Validated:

  • python -m compileall -q python/ray/_private/services.py python/ray/_private/workers/setup_worker.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • ruff check python/ray/_private/services.py python/ray/_private/workers/setup_worker.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • ruff format --check python/ray/_private/services.py python/ray/_private/workers/setup_worker.py python/ray/util/client/server/server.py python/ray/tests/test_debug_tools.py
  • git diff --check

I also retried the targeted pytest command, but this local checkout still cannot collect Ray tests because it lacks the compiled ray._raylet extension.

Comment thread python/ray/tests/test_debug_tools.py Outdated
with monkeypatch.context() as m:
m.setattr(services.sys, "platform", "linux")
m.setattr(services.os, "POSIX_SPAWN_CLOSEFROM", object(), raising=False)
m.setattr(services.signal, "pthread_sigmask", fake_pthread_sigmask)

P2 Badge Make pthread_sigmask patch portable

The current test still patches services.signal.pthread_sigmask with the default raising=True; on Windows/Python builds where signal.pthread_sigmask is absent, this fails before the test reaches the code under test even though the test monkeypatches services.sys.platform to linux. I checked the current test after the earlier platform-awareness fixes, and this remaining patch should also use raising=False or be skipped/branched for runtimes without POSIX signal-mask support.

Author

Handled in the current branch. The pthread_sigmask monkeypatch now uses raising=False, matching the existing platform guard in the code under test and keeping the test portable on runtimes where signal.pthread_sigmask is not present.

Validated with compileall, Ruff check, Ruff format check, and git diff --check. Targeted pytest still cannot collect locally because this checkout lacks ray._raylet.

Signed-off-by: JerryLee <223425819+Jerry2003826@users.noreply.github.com>
@Jerry2003826 Jerry2003826 force-pushed the codex/fix-ray-client-specific-server-fork-race branch from e90156c to 6915826 Compare June 9, 2026 23:47

Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core

Development

Successfully merging this pull request may close these issues.

[Core][Ray Client] specific_server intermittently dies during fork() — gRPC poller race in proxier

4 participants