feat: detect Ray measurement tasks that never start, and add remote runtime setup options #998
Open
michael-johnston wants to merge 20 commits into
If you expect many jobs to start simultaneously with autoscaling, it can be useful to change these from the Ray defaults.
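The PR description names two such options, setupTimeoutSeconds and eagerInstall. As a minimal sketch, assuming they map onto the setup_timeout_seconds and eager_install fields of Ray's runtime_env "config" entry, the override could be assembled like this. The helper name build_runtime_env and the example values are hypothetical and not taken from this PR.

```python
from typing import Any, Optional


def build_runtime_env(
    setup_timeout_seconds: Optional[int] = None,
    eager_install: Optional[bool] = None,
) -> dict[str, Any]:
    """Build a runtime_env dict, leaving unset options at Ray's defaults."""
    config: dict[str, Any] = {}
    if setup_timeout_seconds is not None:
        config["setup_timeout_seconds"] = setup_timeout_seconds
    if eager_install is not None:
        config["eager_install"] = eager_install
    # Only include the "config" key when something was actually overridden.
    return {"config": config} if config else {}


# Example values only: allow 30 minutes for environment setup and defer
# installation until the first task needs the environment.
runtime_env = build_runtime_env(setup_timeout_seconds=1800, eager_install=False)
```

The resulting dict could then be passed as ray.init(runtime_env=runtime_env) or in a job submission.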
The most common reason for operation hangs with distributed Ray is Ray tasks never starting. Current actuators are "fire-and-forget", so no code checks whether a task ever started. LaunchSupervisor provides a generic way to monitor whether launched Ray tasks start and to handle the case where they do not.
Notify the supervisor immediately after queue.put, and ensure that only tasks that never reached RUNNING are timed out.
When the manager catches an exception while monitoring the measurement queue, it informs subscribers. Prior to this change it passed the exception object itself; however, this may contain components that Ray cannot serialize, causing a second exception that crashes the manager. This change fixes the problem.
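To illustrate the two commits above, here is a minimal, single-threaded sketch of the idea, not the PR's implementation. The class LaunchSupervisorSketch, its method names other than mark_completed, and the injected get_state and clock callables are hypothetical. In the real code the state would come from Ray's State API rather than a dictionary.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PendingLaunch:
    launched_at: float
    seen_running: bool = False


class LaunchSupervisorSketch:
    """Hypothetical stand-in for the supervisor described above."""

    def __init__(
        self,
        get_state: Callable[[str], Optional[str]],
        on_timeout: Callable[[str], None],
        launch_timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._get_state = get_state
        self._on_timeout = on_timeout
        self._launch_timeout_s = launch_timeout_s
        self._clock = clock
        self._pending: dict[str, PendingLaunch] = {}
        self._completed: set[str] = set()

    def register(self, request_id: str) -> None:
        """Record a task at launch time."""
        self._pending[request_id] = PendingLaunch(launched_at=self._clock())

    def mark_completed(self, request_id: str) -> None:
        """Called by the task right after it has put its result on the queue."""
        self._completed.add(request_id)
        self._pending.pop(request_id, None)

    def check_pending(self) -> None:
        """Time out only the tasks that were never observed in RUNNING."""
        now = self._clock()
        for request_id, launch in list(self._pending.items()):
            if self._get_state(request_id) == "RUNNING":
                launch.seen_running = True
            if launch.seen_running:
                continue  # the task started, so the launch timeout no longer applies
            if now - launch.launched_at < self._launch_timeout_s:
                continue
            # In a threaded supervisor the task could complete between the
            # state lookup and this point, so re-check before reporting.
            if request_id in self._completed:
                continue
            del self._pending[request_id]
            self._on_timeout(request_id)


# Usage with a fake clock and fake task states.
now = [0.0]
states = {"a": "PENDING_NODE_ASSIGNMENT", "b": "RUNNING", "c": "PENDING_NODE_ASSIGNMENT"}
timed_out: list[str] = []
supervisor = LaunchSupervisorSketch(
    states.get, timed_out.append, launch_timeout_s=10.0, clock=lambda: now[0]
)
for request_id in ("a", "b", "c"):
    supervisor.register(request_id)
supervisor.check_pending()      # "b" is observed RUNNING
supervisor.mark_completed("c")  # "c" posted its result
now[0] = 11.0
supervisor.check_pending()      # only "a" never started
```

Injecting the clock and the state lookup keeps the timeout logic testable without a Ray cluster.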
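The commit message does not say how the exception is now passed on. One common approach, shown here purely as an assumption, is to send a plain, serializable description of the exception instead of the object itself. The helper describe_exception and the QueueMonitorError class are hypothetical, and the standard pickle module stands in for Ray's serializer.

```python
import pickle
import threading
import traceback


def describe_exception(exc: BaseException) -> dict[str, str]:
    """Reduce an exception to plain strings that any serializer can handle."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


class QueueMonitorError(Exception):
    """Hypothetical exception that carries a non-serializable attribute."""

    def __init__(self, message: str, lock: threading.Lock) -> None:
        super().__init__(message)
        self.lock = lock  # locks cannot be pickled


try:
    raise QueueMonitorError("queue monitor failed", threading.Lock())
except QueueMonitorError as error:
    payload = describe_exception(error)

# The payload survives a serialization round trip; the original object would not.
roundtrip = pickle.loads(pickle.dumps(payload))
```

Subscribers lose the live exception object but still receive its type, message, and traceback text.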
Member
@michael-johnston tests are failing
…es and duplicate failures.
With pytest-xdist, each pytest worker running the Ray supervision tests calls ray.init() in a session fixture, so parallel workers each start a separate local Ray cluster on the same machine. The tests then query Ray's State API to observe task scheduling state; under parallel cluster startup that API is often slow or temporarily unavailable (ConnectionError, empty results). The supervisor under test then sees OTHER instead of real states, causing false timeouts and missed pending-resource failures, so the tests fail because the required behaviour is not observed. To avoid this, this commit creates an xdist_group to keep these tests on one worker (one Ray cluster, serial execution). tox must then use --dist loadgroup, because worksteal ignores xdist_group and the tests would remain parallel.
The following problems were observed in tests in CI:
* Ray State API becomes unreliable after many tests on the same xdist worker: list_tasks returns empty results or ServerUnavailable, so tasks look like OTHER instead of their real state.
* list_tasks retries were too short (3 quick attempts) for CI-level API lag, so lookups failed before the State API recovered.
* FAILED state fell through to launch timeout when the grace period hadn't elapsed yet, causing false "did not start within Xs" failures on healthy or completed tasks.
* Resource timeout was blocked by seen_running: a brief false RUNNING report could prevent pending-resource failures from ever firing.
* Launch timeout didn't re-check mark_completed in _check_pending, leaving a race where a duplicate failure could still be emitted.
This commit solves them as follows:
* Unreliable Ray State API and too-short list_tasks retries: more list_tasks retries (8), longer backoff, and extra delay when the error is ServerUnavailable.
* FAILED falling through to launch timeout: always return after FAILED (only emit the failure once the grace period has elapsed).
* Resource timeout blocked by seen_running: remove the "not pending.seen_running" condition from the resource-wait and OTHER resource-timeout checks.
* Launch timeout race with mark_completed: re-check completed_request_ids in _check_pending before emitting the launch timeout.
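The retry fix can be sketched as below. Only the retry count (8) and the idea of a longer backoff with an extra delay on ServerUnavailable come from the commit message. The function name lookup_with_retries, the delay values, and the local ServerUnavailable class (a stand-in for the error Ray raises) are assumptions for illustration.

```python
import time
from typing import Callable, Optional, Sequence


class ServerUnavailable(Exception):
    """Stand-in for the 'server unavailable' error mentioned above."""


def lookup_with_retries(
    lookup: Callable[[], Sequence[str]],
    attempts: int = 8,
    base_delay_s: float = 0.25,
    unavailable_extra_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[list[str]]:
    """Retry a state lookup with exponential backoff.

    Returns None when every attempt failed or returned nothing, so the
    caller can treat the state as unknown instead of guessing.
    """
    for attempt in range(attempts):
        extra = 0.0
        try:
            result = lookup()
            if result:
                return list(result)
        except ServerUnavailable:
            extra = unavailable_extra_s  # give the State API more time to recover
        except ConnectionError:
            pass
        if attempt < attempts - 1:
            sleep(base_delay_s * 2**attempt + extra)
    return None


# Usage with a fake lookup that fails, returns nothing, then succeeds.
outcomes = iter([ServerUnavailable(), [], ["RUNNING"]])


def flaky_lookup() -> Sequence[str]:
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


delays: list[float] = []
states = lookup_with_retries(flaky_lookup, sleep=delays.append)
```

An empty result is retried in the same way as an error, because under load the State API may return nothing for a task that does exist.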
Issues:
* Ray State API becomes unreliable after many tests on the same xdist worker: list_tasks returns empty results or ServerUnavailable, so tasks look like OTHER instead of their real state.
* The state lookup integration test was too brittle: it hard-failed after 5s if PENDING_NODE_ASSIGNMENT never appeared, even when the API was just temporarily unavailable under parallel load.
Fixes:
* Unreliable Ray State API (state-lookup test): poll for up to 15s, stop only on PENDING_NODE_ASSIGNMENT, and call pytest.skip when the API stays unavailable under parallel load.
* Brittle state lookup integration test: same changes as above (longer window, and skip instead of hard fail).
Modules that call configure_logging() at module level can be lazily imported inside a typer CliRunner.invoke() call, at which point sys.stderr is the runner's capture buffer. The resulting root logger handler then points to a closed stream after the invoke returns, causing "ValueError: I/O operation on closed file" to appear in subsequent invocations' result.output. A _logging_configured flag ensures the full handler setup runs only once; subsequent calls merely update the log level. Ray workers are unaffected as each starts with a fresh interpreter. Exposed by the --dist worksteal → --dist loadgroup change in tox.ini.
Member
Author
Everything passing now @AlessandroPomponio
In ado, distributed Ray operations hang when Ray experiment workers do not post to the update queue or when posts to the queue are "lost". This can happen:
This PR adds a generic ExperimentExecutorSupervisor, which monitors the Ray tasks that execute experiments (via their Ray state) and can handle many of these issues. The CustomExperiments actuator is updated to use it. Other actuators can optionally use it.
The PR also exposes some Ray runtime env options that can affect whether these errors occur.
Executor supervisor
- Success: the supervisor does nothing, as this means the FAILED state was reported after the task exited
- Exception: an uncaught exception. The supervisor sends an InvalidMeasurementResult
taskRunningTimeoutSeconds
- Default is 15 minutes
- Once a task is seen running, this timeout no longer applies. Subsequent issues are handled by the FAILED path
- Turn this on if an autoscaling cluster is expected to handle the load
Edge Cases Handled
- In this case, without additional logic, it would be possible for two results to be recorded for the entity
- To prevent this, the supervisor provides a callback: tasks can notify it that they have sent the result, so it does not put a duplicate if it later sees the task as FAILED and cannot retrieve its ref
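The duplicate-result guard above could look roughly like the following sketch. The class ResultDeduplicator and its method names are hypothetical, and a plain queue.Queue stands in for the update queue. In a real deployment the notification would presumably arrive as a remote call from the Ray task.

```python
import queue
import threading


class ResultDeduplicator:
    """Hypothetical guard ensuring at most one result per request."""

    def __init__(self, results: "queue.Queue[tuple[str, str]]") -> None:
        self._results = results
        self._sent: set[str] = set()
        self._lock = threading.Lock()

    def notify_result_sent(self, request_id: str) -> None:
        """Callback for tasks: the result is already on the queue."""
        with self._lock:
            self._sent.add(request_id)

    def handle_failed_task(self, request_id: str) -> bool:
        """Put a failure result unless one was already recorded.

        Returns True when a failure result was put on the queue.
        """
        with self._lock:
            if request_id in self._sent:
                return False
            self._sent.add(request_id)
        self._results.put((request_id, "InvalidMeasurementResult"))
        return True


results: "queue.Queue[tuple[str, str]]" = queue.Queue()
dedup = ResultDeduplicator(results)

# The task for req-1 posted its result and notified the supervisor before
# Ray reported it as FAILED, so no second result is recorded.
results.put(("req-1", "measured"))
dedup.notify_result_sent("req-1")
first = dedup.handle_failed_task("req-1")
# The task for req-2 never posted, so exactly one failure result is recorded.
second = dedup.handle_failed_task("req-2")
third = dedup.handle_failed_task("req-2")
```

Checking and updating the set under one lock is what prevents the supervisor and a late notification from both recording a result.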
Edge Cases Not Handled
Summary of Ray States Handling
Tasks are considered dead after taskRunningTimeoutSeconds is exceeded for these states
If taskPendingResourceTimeoutSeconds is set, tasks are considered dead if they are still in either of these states at the timeout. This is for use with an autoscaling cluster, which is expected to scale to the required load.
Tasks are considered dead when taskFailedGracePeriod is exceeded
Remote execution
- setupTimeoutSeconds: how long Ray may spend setting up the job environment
- eagerInstall: whether to install the environment when the job starts or wait until the first task runs
Tests