[release test] use new sdk to get job state by aslonnie · Pull Request #60821 · ray-project/ray

aslonnie · 2026-02-07T03:33:58Z

one fewer piece that is using legacy sdk.

gemini-code-assist

Code Review

This pull request refactors the Anyscale job runner and manager to use the new Anyscale SDK for fetching job states. The changes are logical and correctly adapt to the new SDK's API.

I've added a couple of suggestions to improve debuggability. With the removal of job_error_message, some exceptions no longer include detailed error information. My suggestions aim to mitigate this by including job logs in the exception messages for certain failure cases.

Overall, this is a good step towards modernizing the codebase.

gemini-code-assist · 2026-02-07T03:35:16Z

release/ray_release/command_runner/anyscale_job_runner.py

+        if job_state == -2:
+            raise JobBrokenError("Job state is 'UNKNOWN'.")


The migration to the new SDK seems to have removed the detailed error message that was previously included in this exception. To maintain debuggability, could you please fetch and include the job logs in the exception message? This will be very helpful for diagnosing failures.

Suggested change

if job_state == -2:

raise JobBrokenError("Job state is 'UNKNOWN'.")

if job_state == -2:

logs = self.get_last_logs() or "Could not retrieve logs."

raise JobBrokenError(f"Job state is 'UNKNOWN'.\n--- Logs ---\n{logs}")

release/ray_release/command_runner/anyscale_job_runner.py

cursor · 2026-02-07T03:48:22Z

release/ray_release/command_runner/anyscale_job_runner.py

-                f"\n{error}\n"
-            )
+        if job_state == -2:
+            raise JobBrokenError("Job state is 'UNKNOWN'.")


Unhandled return code -4 for soft infra errors

High Severity

The _handle_command_output function only handles return codes -1 and -2, but run_and_wait can still return -4 for soft infrastructure errors (when JobState.FAILED occurs before the job starts running). The old handling for -4 that raised JobTerminatedBeforeStartError was removed, but the code path that returns -4 still exists. This causes jobs that fail during cluster provisioning to fall through to output fetching, eventually raising an incorrect JobNoLogsError instead of the appropriate infrastructure failure exception.

Additional Locations (1)

release/ray_release/job_manager/anyscale_job_manager.py#L226-L229

cursor · 2026-02-07T03:48:22Z

release/ray_release/command_runner/anyscale_job_runner.py

-    def _handle_command_output(
-        self, job_status_code: int, error: str, raise_on_timeout: bool = True
-    ):
-        if job_status_code == -2:


Early -1 check skips output analysis and has wrong message

Medium Severity

The check for job_state == -1 was moved from the end of _handle_command_output to the beginning, changing the error-handling behavior. In the original code, even when the job returned -1 (OUT_OF_RETRIES), the function would still fetch output.json and check for specific errors like PrepareCommandError or TestCommandError, only raising JobOutOfRetriesError as a fallback. Now it raises immediately without analyzing output. Additionally, the error message says "command has not been ran" but -1 is only returned when job_running == True (the job did start), making the message inaccurate.

…ale.job.status() Replace the old SDK's `self._sdk.get_production_job()` with the new `anyscale.job.status()` API in AnyscaleJobManager. The new API uses `JobState` enums (SUCCEEDED/FAILED/RUNNING) instead of `HaJobStates`. The critical distinction between FAILED-before-running (infra error, retcode -4) and FAILED-after-running (test failure, retcode -1) is preserved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-02-07T04:53:29Z

release/ray_release/job_manager/anyscale_job_manager.py

+        self.save_last_job_status(result)
        status = self._last_job_status()
        assert status in terminal_state
-        if status == HaJobStates.TERMINATED and not job_running:


Return code -4 generated but never handled

High Severity

_wait_job produces retcode = -4 when JobState.FAILED occurs before the job starts running, but _handle_command_output only handles -1 and -2. The old code raised JobTerminatedBeforeStartError (with retryable CLUSTER_STARTUP_TIMEOUT exit code) for -4. Now the unhandled -4 falls through to output/log collection, which will likely fail for a job that never started, resulting in a JobNoLogsError with a non-retryable exit code — misclassifying a retryable infrastructure failure.

Additional Locations (1)

release/ray_release/command_runner/anyscale_job_runner.py#L158-L167

aslonnie requested a review from a team as a code owner February 7, 2026 03:33

gemini-code-assist bot reviewed Feb 7, 2026

View reviewed changes

cursor bot reviewed Feb 7, 2026

View reviewed changes

aslonnie force-pushed the lonnie-260206-jobstate branch from c58c37c to 026c7b4 Compare February 7, 2026 04:47

cursor bot reviewed Feb 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[release test] use new sdk to get job state#60821

[release test] use new sdk to get job state#60821
aslonnie wants to merge 1 commit intoray-project:masterfrom
anyscale:lonnie-260206-jobstate

aslonnie commented Feb 7, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Feb 7, 2026

Uh oh!

Uh oh!

cursor bot Feb 7, 2026

Uh oh!

cursor bot Feb 7, 2026

Uh oh!

cursor bot left a comment

Uh oh!

cursor bot Feb 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		if job_state == -2:
		raise JobBrokenError("Job state is 'UNKNOWN'.")

Conversation

aslonnie commented Feb 7, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor bot Feb 7, 2026

Choose a reason for hiding this comment

Unhandled return code -4 for soft infra errors

Uh oh!

cursor bot Feb 7, 2026

Choose a reason for hiding this comment

Early -1 check skips output analysis and has wrong message

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Feb 7, 2026

Choose a reason for hiding this comment

Return code -4 generated but never handled

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant