
almost fully using new sdk for job stuff#60825

Open
aslonnie wants to merge 12 commits into ray-project:master from anyscale:lonnie-260206-newsdk

Conversation

@aslonnie
Collaborator

@aslonnie aslonnie commented Feb 7, 2026

No description provided.

aslonnie and others added 11 commits February 7, 2026 03:14
…ale.job.status()

Replace the old SDK's `self._sdk.get_production_job()` with the new
`anyscale.job.status()` API in AnyscaleJobManager. The new API uses
`JobState` enums (SUCCEEDED/FAILED/RUNNING) instead of `HaJobStates`.
The critical distinction between FAILED-before-running (infra error,
retcode -4) and FAILED-after-running (test failure, retcode -1) is
preserved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Handle STARTING state as job-running (switches to command timeout)
- Treat UNKNOWN as terminal with retcode -2 to fail loudly
- Remove job_error_message() (no equivalent in new API)
- Remove stale error string references from _handle_command_output
- Rename job_status_code to job_state
- Update error messages to reflect new JobState enum names

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
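The state handling described in these commits can be sketched as a small mapping. This is illustrative only: a local stand-in enum replaces the SDK's `JobState`, and `state_to_retcode` is a hypothetical helper, not the actual `AnyscaleJobManager` code.

```python
from enum import Enum


class JobState(str, Enum):
    """Stand-in for the new Anyscale SDK's JobState enum (illustrative)."""

    STARTING = "STARTING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"


def state_to_retcode(state: JobState, job_running: bool) -> int:
    """Map a terminal JobState to the retcode convention in this PR.

    -4: FAILED before the command ever ran (soft infra error, retryable)
    -1: FAILED after running (test failure)
    -2: UNKNOWN, treated as terminal so the run fails loudly
     0: SUCCEEDED
    """
    if state == JobState.SUCCEEDED:
        return 0
    if state == JobState.UNKNOWN:
        return -2
    if state == JobState.FAILED:
        return -1 if job_running else -4
    # STARTING and RUNNING are not terminal; callers keep polling.
    raise ValueError(f"{state} is not a terminal state")
```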
Rename method and parameter to better reflect the new JobStatus type.
Add a warning log when the returned job ID doesn't match the expected
one. Update FakeJobResult to FakeJobStatus in tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
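The ID-mismatch warning added here might look roughly like this (`check_job_id` is a hypothetical name; the real check lives inside `AnyscaleJobManager`):

```python
import logging

logger = logging.getLogger(__name__)


def check_job_id(returned_id: str, expected_id: str) -> bool:
    """Warn when the job ID reported by the SDK differs from the one we
    submitted, instead of failing outright (sketch of the commit's idea)."""
    if returned_id != expected_id:
        logger.warning(
            "Job ID mismatch: expected %s, got %s", expected_id, returned_id
        )
        return False
    return True
```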
Remove the FAILED-before-running (-4) special case from
_handle_command_output. Move the FAILED (-1) check to the top of
the function alongside the UNKNOWN (-2) check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
The new Anyscale SDK no longer supports env_vars in cluster environment
registration. This refactors the release test infrastructure to bake
runtime_env variables into custom BYOD Docker images instead.

- Make require_custom_byod_image() trigger on runtime_env
- Include runtime_env in image tag hash for cache correctness
- Add --env CLI option to custom_byod_build script
- Pass --env flags in Buildkite build YAML generation
- Remove env_vars from create_byod_cluster_environment() SDK call
- Remove __env__ hash suffix from cluster env name (now in image tag)
- Add hello_world_custom_runtime_env test variants for validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
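Folding runtime_env into the image tag hash, as the bullet list above describes, can be sketched as follows (`image_tag_hash` is a hypothetical helper; the real implementation may hash additional inputs):

```python
import hashlib
import json


def image_tag_hash(base_image: str, runtime_env: dict) -> str:
    """Derive a deterministic tag suffix that changes whenever the baked-in
    runtime_env changes, so a cached image with stale env vars is never
    reused for a test that expects different ones."""
    # sort_keys makes the hash independent of dict insertion order.
    payload = json.dumps({"base": base_image, "env": runtime_env}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```

Because the env vars are part of the tag, the `__env__` suffix on the cluster environment name becomes redundant, which matches the commit's removal of it.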
Replace old SDK's CreateProductionJob/CreateProductionJobConfig with
new SDK's JobConfig + anyscale.job.submit(). This completes the
migration of AnyscaleJobManager to the new anyscale.job.* API
(submit, status, terminate, get_logs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
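A hedged sketch of the new submission path: `build_job_submit_kwargs` is a hypothetical pure helper shown here so the shape is testable without the SDK installed; per the commit message the real call is `anyscale.job.submit(JobConfig(...))`, and the exact `JobConfig` fields depend on the SDK version.

```python
def build_job_submit_kwargs(
    name: str,
    entrypoint: str,
    image_uri: str,
    compute_config: dict,
    max_retries: int = 0,
) -> dict:
    """Assemble keyword arguments for the new SDK's JobConfig (sketch).

    The actual submission would then be roughly:
        anyscale.job.submit(JobConfig(**kwargs))
    with later lifecycle calls going through anyscale.job.status(),
    anyscale.job.terminate(), and anyscale.job.get_logs().
    """
    return {
        "name": name,
        "entrypoint": entrypoint,
        "image_uri": image_uri,
        "compute_config": compute_config,
        "max_retries": max_retries,
    }
```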
@aslonnie aslonnie requested a review from a team as a code owner February 7, 2026 04:50
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the Anyscale job submission logic to use the new Anyscale SDK. It also adds support for including runtime environment variables in custom BYOD images. The changes are extensive and touch multiple files, including job runners, managers, and test helpers.

My main concerns are around error handling in the new implementation. It appears that detailed error messages from failed jobs are no longer being propagated into exceptions, which will make debugging more difficult. Additionally, there's a critical bug where a specific job failure state is not handled, potentially leading to silent failures.

I've added two review comments in release/ray_release/command_runner/anyscale_job_runner.py detailing these issues. Other parts of the refactoring and the new runtime environment feature look good.

Comment on lines +160 to 164
if job_state == -1:
raise JobOutOfRetriesError(
"Job returned non-success state: 'FAILED' "
"(command has not been ran or no logs could have been obtained)."
)

critical

The _handle_command_output function is missing a check for job_state == -4. This state is returned by AnyscaleJobManager for jobs that fail before starting (a "soft infra error"). Without handling this case, a real failure could be missed. The previous implementation correctly raised a JobTerminatedBeforeStartError.

Please add a check for job_state == -4. You will also need to re-import JobTerminatedBeforeStartError at the top of the file.

        if job_state == -4:
            raise JobTerminatedBeforeStartError(
                "Job entered 'TERMINATED' state before it started "
                "(most likely due to inability to provision required nodes; "
                "otherwise it was terminated manually or Ray was stopped)."
            )

        if job_state == -1:
            raise JobOutOfRetriesError(
                "Job returned non-success state: 'FAILED' "
                "(command has not been ran or no logs could have been obtained)."
            )

"Job entered 'TERMINATED' state (it was terminated "
"manually or Ray was stopped):"
f"\n{error}\n"
def _handle_command_output(self, job_state: int, raise_on_timeout: bool = True):

high

The refactoring to use the new SDK has removed the detailed error messages from the exceptions raised on job failure. For example, JobBrokenError, PrepareCommandError, and TestCommandError are now raised with generic messages. Previously, they included a specific error message from the job, which is very helpful for debugging.

Please consider re-introducing these error details. The new Anyscale SDK's JobStatus object likely contains an error message (e.g., in a message attribute). This could be exposed via the AnyscaleJobManager and passed to this function to be included in the exceptions.

f"\n{error}\n"
)
if job_state == -2:
raise JobBrokenError("Job state is 'UNKNOWN'.")
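The reviewer's suggestion could be sketched like this (`raise_for_unknown_state` is a hypothetical helper illustrating the idea of threading an SDK-reported error message into the exception, rather than the actual fix):

```python
from typing import Optional


class JobBrokenError(RuntimeError):
    """Raised when the job ends in an unexpected/unknown state."""


def raise_for_unknown_state(job_state: int, error: Optional[str]) -> None:
    """Include the error detail reported by the job status, if any, so the
    exception is actionable instead of a generic one-liner."""
    if job_state == -2:
        detail = f":\n{error}" if error else "."
        raise JobBrokenError(f"Job state is 'UNKNOWN'{detail}")
```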

Return code -4 unhandled in command output handler

High Severity

_wait_job can return retcode -4 when JobState.FAILED occurs before the job starts running, but _handle_command_output only handles -1 and -2. The old code raised JobTerminatedBeforeStartError (with ExitCode.CLUSTER_STARTUP_TIMEOUT, a retryable infra error) for -4. Now the code falls through to output parsing, which will fail since the job never ran, ultimately raising JobNoLogsError with ExitCode.ANYSCALE_ERROR — a different, non-retryable error category.


def _handle_command_output(
self, job_status_code: int, error: str, raise_on_timeout: bool = True
):
if job_status_code == -2:

Early FAILED check skips output parsing for running jobs

Medium Severity

When job_state == -1 (a FAILED job that was already running), _handle_command_output immediately raises JobOutOfRetriesError without attempting to parse output. The old code checked this condition last, after parsing output, so it could raise more specific errors like TestCommandError or PrepareCommandError. The error message also incorrectly says "command has not been ran" even though the command did run (since -1 is only returned when job_running was True).


Add _convert_compute_config() to convert old-style compute config dicts
to new ComputeConfig objects, inlined in the job submit call instead of
registering separately via cluster_compute_id. Add test validating all
release test compute configs convert successfully.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
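A conversion along the lines this commit describes might look like the sketch below. The field names on the old-style dict (`head_node_type`, `worker_node_types`, `min_workers`, `max_workers`) are assumptions for illustration, not the actual release-test schema, and the returned dict stands in for the new SDK's `ComputeConfig` kwargs.

```python
def convert_compute_config(old: dict) -> dict:
    """Sketch of _convert_compute_config(): map an old-style compute config
    dict onto keyword arguments for the new SDK's ComputeConfig, so it can
    be inlined in the submit call instead of registered separately."""
    worker_nodes = [
        {
            "instance_type": w["instance_type"],
            "min_nodes": w.get("min_workers", 0),
            "max_nodes": w.get("max_workers", 0),
        }
        for w in old.get("worker_node_types", [])
    ]
    return {
        "cloud": old.get("cloud_id"),
        "head_node": {"instance_type": old["head_node_type"]["instance_type"]},
        "worker_nodes": worker_nodes,
    }
```

A test like the one the commit mentions would simply run every release-test compute config through this conversion and assert it succeeds.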

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

  status = self._last_job_status()
  assert status in terminal_state
- if status == HaJobStates.TERMINATED and not job_running:
+ if status == JobState.FAILED and not job_running:

Return code -4 produced but never handled

High Severity

_wait_job generates retcode = -4 when status == JobState.FAILED and not job_running, but _handle_command_output only handles -1 and -2. The old code had an explicit handler for -4 that raised JobTerminatedBeforeStartError. Now -4 falls through to the S3/log-parsing path, which will fail (since the job never ran) and raise a misleading JobNoLogsError instead of a proper infrastructure error.

