Skip to content

[release test] use new sdk to get job state#60821

Open
aslonnie wants to merge 1 commit intoray-project:masterfrom
anyscale:lonnie-260206-jobstate
Open

[release test] use new sdk to get job state#60821
aslonnie wants to merge 1 commit intoray-project:masterfrom
anyscale:lonnie-260206-jobstate

Conversation

@aslonnie
Copy link
Collaborator

@aslonnie aslonnie commented Feb 7, 2026

one fewer piece that is using legacy sdk.

@aslonnie aslonnie requested a review from a team as a code owner February 7, 2026 03:33
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Anyscale job runner and manager to use the new Anyscale SDK for fetching job states. The changes are logical and correctly adapt to the new SDK's API.

I've added a couple of suggestions to improve debuggability. With the removal of job_error_message, some exceptions no longer include detailed error information. My suggestions aim to mitigate this by including job logs in the exception messages for certain failure cases.

Overall, this is a good step towards modernizing the codebase.

Comment on lines +161 to +167
if job_state == -2:
raise JobBrokenError("Job state is 'UNKNOWN'.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The migration to the new SDK seems to have removed the detailed error message that was previously included in this exception. To maintain debuggability, could you please fetch and include the job logs in the exception message? This will be very helpful for diagnosing failures.

Suggested change
if job_state == -2:
raise JobBrokenError("Job state is 'UNKNOWN'.")
if job_state == -2:
logs = self.get_last_logs() or "Could not retrieve logs."
raise JobBrokenError(f"Job state is 'UNKNOWN'.\n--- Logs ---\n{logs}")

f"\n{error}\n"
)
if job_state == -2:
raise JobBrokenError("Job state is 'UNKNOWN'.")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unhandled return code -4 for soft infra errors

High Severity

The _handle_command_output function only handles return codes -1 and -2, but run_and_wait can still return -4 for soft infrastructure errors (when JobState.FAILED occurs before the job starts running). The old handling for -4 that raised JobTerminatedBeforeStartError was removed, but the code path that returns -4 still exists. This causes jobs that fail during cluster provisioning to fall through to output fetching, eventually raising an incorrect JobNoLogsError instead of the appropriate infrastructure failure exception.

Additional Locations (1)

Fix in Cursor Fix in Web

def _handle_command_output(
self, job_status_code: int, error: str, raise_on_timeout: bool = True
):
if job_status_code == -2:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early -1 check skips output analysis and has wrong message

Medium Severity

The check for job_state == -1 was moved from the end of _handle_command_output to the beginning, changing the error-handling behavior. In the original code, even when the job returned -1 (OUT_OF_RETRIES), the function would still fetch output.json and check for specific errors like PrepareCommandError or TestCommandError, only raising JobOutOfRetriesError as a fallback. Now it raises immediately without analyzing output. Additionally, the error message says "command has not been ran" but -1 is only returned when job_running == True (the job did start), making the message inaccurate.

Fix in Cursor Fix in Web

…ale.job.status()

Replace the old SDK's `self._sdk.get_production_job()` with the new
`anyscale.job.status()` API in AnyscaleJobManager. The new API uses
`JobState` enums (SUCCEEDED/FAILED/RUNNING) instead of `HaJobStates`.
The critical distinction between FAILED-before-running (infra error,
retcode -4) and FAILED-after-running (test failure, retcode -1) is
preserved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
@aslonnie aslonnie force-pushed the lonnie-260206-jobstate branch from c58c37c to 026c7b4 Compare February 7, 2026 04:47
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

self.save_last_job_status(result)
status = self._last_job_status()
assert status in terminal_state
if status == HaJobStates.TERMINATED and not job_running:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return code -4 generated but never handled

High Severity

_wait_job produces retcode = -4 when JobState.FAILED occurs before the job starts running, but _handle_command_output only handles -1 and -2. The old code raised JobTerminatedBeforeStartError (with retryable CLUSTER_STARTUP_TIMEOUT exit code) for -4. Now the unhandled -4 falls through to output/log collection, which will likely fail for a job that never started, resulting in a JobNoLogsError with a non-retryable exit code — misclassifying a retryable infrastructure failure.

Additional Locations (1)

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant