almost fully using new sdk for job stuff #60825
aslonnie wants to merge 12 commits into ray-project:master from
Conversation
…ale.job.status()

Replace the old SDK's `self._sdk.get_production_job()` with the new `anyscale.job.status()` API in AnyscaleJobManager. The new API uses `JobState` enums (SUCCEEDED/FAILED/RUNNING) instead of `HaJobStates`. The critical distinction between FAILED-before-running (infra error, retcode -4) and FAILED-after-running (test failure, retcode -1) is preserved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Handle STARTING state as job-running (switches to command timeout)
- Treat UNKNOWN as terminal with retcode -2 to fail loudly
- Remove job_error_message() (no equivalent in new API)
- Remove stale error string references from _handle_command_output
- Rename job_status_code to job_state
- Update error messages to reflect new JobState enum names

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
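Taken together, the two commits above imply a status-polling step roughly like the following sketch. Only `anyscale.job.status()` and the `JobState` values are taken from the PR; the function shape and retcode plumbing are simplified stand-ins for AnyscaleJobManager's real logic.

```python
import anyscale
from anyscale.job.models import JobState


def poll_job_retcode(job_name: str, job_running: bool):
    """Map a JobState from anyscale.job.status() to the retcodes used here.

    A simplified sketch, not the actual AnyscaleJobManager code.
    """
    state = anyscale.job.status(name=job_name).state
    if state in (JobState.STARTING, JobState.RUNNING):
        return None  # not terminal; STARTING now also counts as job-running
    if state == JobState.SUCCEEDED:
        return 0
    if state == JobState.UNKNOWN:
        return -2  # treated as terminal so the failure is loud
    # JobState.FAILED: soft infra error if the job never ran, else test failure
    return -4 if not job_running else -1
```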
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename method and parameter to better reflect the new JobStatus type. Add a warning log when the returned job ID doesn't match the expected one. Update FakeJobResult to FakeJobStatus in tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
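The mismatch warning is presumably along these lines (the `status.id` field and the function shape are assumptions for illustration):

```python
import logging

logger = logging.getLogger(__name__)


def warn_on_job_id_mismatch(status, expected_id: str) -> None:
    # Warn (but don't fail) if the job returned by the status lookup is not
    # the one that was submitted.
    if status.id != expected_id:
        logger.warning(
            "Job status lookup returned job %s, expected %s",
            status.id,
            expected_id,
        )
```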
Remove the FAILED-before-running (-4) special case from _handle_command_output. Move the FAILED (-1) check to the top of the function alongside the UNKNOWN (-2) check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The new Anyscale SDK no longer supports env_vars in cluster environment registration. This refactors the release test infrastructure to bake runtime_env variables into custom BYOD Docker images instead.

- Make require_custom_byod_image() trigger on runtime_env
- Include runtime_env in image tag hash for cache correctness
- Add --env CLI option to custom_byod_build script
- Pass --env flags in Buildkite build YAML generation
- Remove env_vars from create_byod_cluster_environment() SDK call
- Remove __env__ hash suffix from cluster env name (now in image tag)
- Add hello_world_custom_runtime_env test variants for validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
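A minimal sketch of what "baking env vars into the image" can look like, assuming the build script generates a derived Dockerfile and shells out to `docker build`; the function name and tag scheme here are invented, not the actual custom_byod_build code:

```python
import hashlib
import subprocess


def build_custom_byod_image(base_image: str, envs: list) -> str:
    """Bake "KEY=VALUE" runtime_env entries into a derived image."""
    # runtime_env participates in the tag hash so that changing an env var
    # invalidates the cached image.
    digest = hashlib.sha256(
        (base_image + "|" + "|".join(sorted(envs))).encode()
    ).hexdigest()[:12]
    tag = f"custom-byod:{digest}"

    dockerfile = f"FROM {base_image}\n"
    for env in envs:  # values arrive via repeated --env CLI flags
        key, _, value = env.partition("=")
        dockerfile += f'ENV {key}="{value}"\n'

    # Build from an in-memory Dockerfile ("-" reads it from stdin).
    subprocess.run(
        ["docker", "build", "-t", tag, "-"],
        input=dockerfile.encode(),
        check=True,
    )
    return tag
```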
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace old SDK's CreateProductionJob/CreateProductionJobConfig with new SDK's JobConfig + anyscale.job.submit(). This completes the migration of AnyscaleJobManager to the new anyscale.job.* API (submit, status, terminate, get_logs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
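The new submit path boils down to something like this sketch; the field values are placeholders, and treating `image_uri` and `max_retries` as JobConfig fields is an assumption rather than confirmed by this PR:

```python
import anyscale
from anyscale.job.models import JobConfig

config = JobConfig(
    name="release-test-hello-world",
    entrypoint="python workload.py",
    image_uri="<custom-byod-image>",  # the BYOD image built above
    max_retries=0,
)
job_id = anyscale.job.submit(config)
status = anyscale.job.status(name="release-test-hello-world")
print(job_id, status.state)
```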
Code Review
This pull request refactors the Anyscale job submission logic to use the new Anyscale SDK. It also adds support for including runtime environment variables in custom BYOD images. The changes are extensive and touch multiple files, including job runners, managers, and test helpers.
My main concerns are around error handling in the new implementation. It appears that detailed error messages from failed jobs are no longer being propagated into exceptions, which will make debugging more difficult. Additionally, there's a critical bug where a specific job failure state is not handled, potentially leading to silent failures.
I've added two review comments in release/ray_release/command_runner/anyscale_job_runner.py detailing these issues. Other parts of the refactoring and the new runtime environment feature look good.
```python
        if job_state == -1:
            raise JobOutOfRetriesError(
                "Job returned non-success state: 'FAILED' "
                "(command has not been ran or no logs could have been obtained)."
            )
```
The _handle_command_output function is missing a check for job_state == -4. This state is returned by AnyscaleJobManager for jobs that fail before starting (a "soft infra error"). Without handling this case, a real failure could be missed. The previous implementation correctly raised a JobTerminatedBeforeStartError.
Please add a check for job_state == -4. You will also need to re-import JobTerminatedBeforeStartError at the top of the file.
```python
if job_state == -4:
    raise JobTerminatedBeforeStartError(
        "Job entered 'TERMINATED' state before it started "
        "(most likely due to inability to provision required nodes; "
        "otherwise it was terminated manually or Ray was stopped)."
    )
if job_state == -1:
    raise JobOutOfRetriesError(
        "Job returned non-success state: 'FAILED' "
        "(command has not been ran or no logs could have been obtained)."
    )
```

```python
                "Job entered 'TERMINATED' state (it was terminated "
                "manually or Ray was stopped):"
                f"\n{error}\n"

    def _handle_command_output(self, job_state: int, raise_on_timeout: bool = True):
```
The refactoring to use the new SDK has removed the detailed error messages from the exceptions raised on job failure. For example, JobBrokenError, PrepareCommandError, and TestCommandError are now raised with generic messages. Previously, they included a specific error message from the job, which is very helpful for debugging.
Please consider re-introducing these error details. The new Anyscale SDK's JobStatus object likely contains an error message (e.g., in a message attribute). This could be exposed via the AnyscaleJobManager and passed to this function to be included in the exceptions.
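For illustration, a sketch of the reviewer's suggestion; `message` as a `JobStatus` attribute is an unverified assumption, hence the defensive `getattr()`:

```python
import anyscale


def last_job_error(job_name: str) -> str:
    # Pull a human-readable error detail off the job status, if the new SDK
    # exposes one; fall back to an empty string otherwise.
    status = anyscale.job.status(name=job_name)
    return getattr(status, "message", "") or ""

# In _handle_command_output, the detail could then be appended, e.g.:
#   raise JobBrokenError(f"Job state is 'UNKNOWN'.\n{last_job_error(name)}\n")
```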
| f"\n{error}\n" | ||
| ) | ||
| if job_state == -2: | ||
| raise JobBrokenError("Job state is 'UNKNOWN'.") |
Return code -4 unhandled in command output handler
High Severity
_wait_job can return retcode -4 when JobState.FAILED occurs before the job starts running, but _handle_command_output only handles -1 and -2. The old code raised JobTerminatedBeforeStartError (with ExitCode.CLUSTER_STARTUP_TIMEOUT, a retryable infra error) for -4. Now the code falls through to output parsing, which will fail since the job never ran, ultimately raising JobNoLogsError with ExitCode.ANYSCALE_ERROR — a different, non-retryable error category.
```python
    def _handle_command_output(
        self, job_status_code: int, error: str, raise_on_timeout: bool = True
    ):
        if job_status_code == -2:
```
Early FAILED check skips output parsing for running jobs
Medium Severity
When job_state == -1 (a FAILED job that was already running), _handle_command_output immediately raises JobOutOfRetriesError without attempting to parse output. The old code checked this condition last, after parsing output, so it could raise more specific errors like TestCommandError or PrepareCommandError. The error message also incorrectly says "command has not been ran" even though the command did run (since -1 is only returned when job_running was True).
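For comparison, the old ordering looked roughly like this sketch; the exception classes are the ones named in this review, but the module path and the `output` fields are assumptions:

```python
# Module path is assumed, not taken from this diff.
from ray_release.exception import (
    JobOutOfRetriesError,
    PrepareCommandError,
    TestCommandError,
)


def handle_output_old_order(output, job_state: int) -> None:
    """Rough shape of the old check ordering: parse output first, so the
    more specific errors win over the generic out-of-retries one."""
    if output.prepare_return_code not in (None, 0):
        raise PrepareCommandError("Prepare command failed.")
    if output.return_code != 0:
        raise TestCommandError("Test command returned a non-zero exit code.")
    if job_state == -1:
        # Only reached when output parsing surfaced nothing more specific.
        raise JobOutOfRetriesError("Job ran out of retries.")
```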
Add _convert_compute_config() to convert old-style compute config dicts to new ComputeConfig objects, inlined in the job submit call instead of registering separately via cluster_compute_id. Add test validating all release test compute configs convert successfully.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
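The conversion presumably looks something like this sketch; the old-dict keys and the new-SDK model fields are assumptions for illustration, not the exact release-test schema:

```python
from anyscale.compute_config.models import (
    ComputeConfig,
    HeadNodeConfig,
    WorkerNodeGroupConfig,
)


def convert_compute_config(old: dict) -> ComputeConfig:
    """Rough shape of the dict-to-ComputeConfig conversion described above."""
    return ComputeConfig(
        cloud=old.get("cloud"),
        head_node=HeadNodeConfig(
            instance_type=old["head_node_type"]["instance_type"],
        ),
        worker_nodes=[
            WorkerNodeGroupConfig(
                instance_type=node["instance_type"],
                min_nodes=node.get("min_workers", 0),
                max_nodes=node.get("max_workers", 0),
            )
            for node in old.get("worker_node_types", [])
        ],
    )
```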
```diff
         status = self._last_job_status()
         assert status in terminal_state
-        if status == HaJobStates.TERMINATED and not job_running:
+        if status == JobState.FAILED and not job_running:
```
Return code -4 produced but never handled
High Severity
_wait_job generates retcode = -4 when status == JobState.FAILED and not job_running, but _handle_command_output only handles -1 and -2. The old code had an explicit handler for -4 that raised JobTerminatedBeforeStartError. Now -4 falls through to the S3/log-parsing path, which will fail (since the job never ran) and raise a misleading JobNoLogsError instead of a proper infrastructure error.


No description provided.