[release test] use new sdk to get job state#60821
[release test] use new sdk to get job state#60821aslonnie wants to merge 1 commit intoray-project:masterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the Anyscale job runner and manager to use the new Anyscale SDK for fetching job states. The changes are logical and correctly adapt to the new SDK's API.
I've added a couple of suggestions to improve debuggability. With the removal of job_error_message, some exceptions no longer include detailed error information. My suggestions aim to mitigate this by including job logs in the exception messages for certain failure cases.
Overall, this is a good step towards modernizing the codebase.
| if job_state == -2: | ||
| raise JobBrokenError("Job state is 'UNKNOWN'.") |
There was a problem hiding this comment.
The migration to the new SDK seems to have removed the detailed error message that was previously included in this exception. To maintain debuggability, could you please fetch and include the job logs in the exception message? This will be very helpful for diagnosing failures.
| if job_state == -2: | |
| raise JobBrokenError("Job state is 'UNKNOWN'.") | |
| if job_state == -2: | |
| logs = self.get_last_logs() or "Could not retrieve logs." | |
| raise JobBrokenError(f"Job state is 'UNKNOWN'.\n--- Logs ---\n{logs}") |
| f"\n{error}\n" | ||
| ) | ||
| if job_state == -2: | ||
| raise JobBrokenError("Job state is 'UNKNOWN'.") |
There was a problem hiding this comment.
Unhandled return code -4 for soft infra errors
High Severity
The _handle_command_output function only handles return codes -1 and -2, but run_and_wait can still return -4 for soft infrastructure errors (when JobState.FAILED occurs before the job starts running). The old handling for -4 that raised JobTerminatedBeforeStartError was removed, but the code path that returns -4 still exists. This causes jobs that fail during cluster provisioning to fall through to output fetching, eventually raising an incorrect JobNoLogsError instead of the appropriate infrastructure failure exception.
Additional Locations (1)
| def _handle_command_output( | ||
| self, job_status_code: int, error: str, raise_on_timeout: bool = True | ||
| ): | ||
| if job_status_code == -2: |
There was a problem hiding this comment.
Early -1 check skips output analysis and has wrong message
Medium Severity
The check for job_state == -1 was moved from the end of _handle_command_output to the beginning, changing the error-handling behavior. In the original code, even when the job returned -1 (OUT_OF_RETRIES), the function would still fetch output.json and check for specific errors like PrepareCommandError or TestCommandError, only raising JobOutOfRetriesError as a fallback. Now it raises immediately without analyzing output. Additionally, the error message says "command has not been ran" but -1 is only returned when job_running == True (the job did start), making the message inaccurate.
…ale.job.status() Replace the old SDK's `self._sdk.get_production_job()` with the new `anyscale.job.status()` API in AnyscaleJobManager. The new API uses `JobState` enums (SUCCEEDED/FAILED/RUNNING) instead of `HaJobStates`. The critical distinction between FAILED-before-running (infra error, retcode -4) and FAILED-after-running (test failure, retcode -1) is preserved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
c58c37c to
026c7b4
Compare
| self.save_last_job_status(result) | ||
| status = self._last_job_status() | ||
| assert status in terminal_state | ||
| if status == HaJobStates.TERMINATED and not job_running: |
There was a problem hiding this comment.
Return code -4 generated but never handled
High Severity
_wait_job produces retcode = -4 when JobState.FAILED occurs before the job starts running, but _handle_command_output only handles -1 and -2. The old code raised JobTerminatedBeforeStartError (with retryable CLUSTER_STARTUP_TIMEOUT exit code) for -4. Now the unhandled -4 falls through to output/log collection, which will likely fail for a job that never started, resulting in a JobNoLogsError with a non-retryable exit code — misclassifying a retryable infrastructure failure.


one fewer piece that is using legacy sdk.