almost fully using new sdk for job stuff #60825
aslonnie wants to merge 12 commits into ray-project:master from
Conversation
…ale.job.status()

Replace the old SDK's `self._sdk.get_production_job()` with the new `anyscale.job.status()` API in AnyscaleJobManager. The new API uses `JobState` enums (SUCCEEDED/FAILED/RUNNING) instead of `HaJobStates`. The critical distinction between FAILED-before-running (infra error, retcode -4) and FAILED-after-running (test failure, retcode -1) is preserved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Handle STARTING state as job-running (switches to command timeout)
- Treat UNKNOWN as terminal with retcode -2 to fail loudly
- Remove job_error_message() (no equivalent in new API)
- Remove stale error string references from _handle_command_output
- Rename job_status_code to job_state
- Update error messages to reflect new JobState enum names

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
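Taken together, the two commits above imply a status-polling step roughly like the following sketch. Only `anyscale.job.status()` and the `JobState` values are taken from the PR; the function shape and retcode plumbing are simplified stand-ins for AnyscaleJobManager's real logic.

```python
import anyscale
from anyscale.job.models import JobState


def poll_job_retcode(job_name: str, job_running: bool):
    """Map a JobState from anyscale.job.status() to the retcodes used here.

    A simplified sketch, not the actual AnyscaleJobManager code.
    """
    state = anyscale.job.status(name=job_name).state
    if state in (JobState.STARTING, JobState.RUNNING):
        return None  # not terminal; STARTING now also counts as job-running
    if state == JobState.SUCCEEDED:
        return 0
    if state == JobState.UNKNOWN:
        return -2  # treated as terminal so the failure is loud
    # JobState.FAILED: soft infra error if the job never ran, else test failure
    return -4 if not job_running else -1
```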
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename method and parameter to better reflect the new JobStatus type. Add a warning log when the returned job ID doesn't match the expected one. Update FakeJobResult to FakeJobStatus in tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
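The mismatch warning is presumably along these lines (the `status.id` field and the function shape are assumptions for illustration):

```python
import logging

logger = logging.getLogger(__name__)


def warn_on_job_id_mismatch(status, expected_id: str) -> None:
    # Warn (but don't fail) if the job returned by the status lookup is not
    # the one that was submitted.
    if status.id != expected_id:
        logger.warning(
            "Job status lookup returned job %s, expected %s",
            status.id,
            expected_id,
        )
```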
Remove the FAILED-before-running (-4) special case from _handle_command_output. Move the FAILED (-1) check to the top of the function alongside the UNKNOWN (-2) check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The new Anyscale SDK no longer supports env_vars in cluster environment registration. This refactors the release test infrastructure to bake runtime_env variables into custom BYOD Docker images instead.

- Make require_custom_byod_image() trigger on runtime_env
- Include runtime_env in image tag hash for cache correctness
- Add --env CLI option to custom_byod_build script
- Pass --env flags in Buildkite build YAML generation
- Remove env_vars from create_byod_cluster_environment() SDK call
- Remove __env__ hash suffix from cluster env name (now in image tag)
- Add hello_world_custom_runtime_env test variants for validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
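A minimal sketch of what "baking env vars into the image" can look like, assuming the build script generates a derived Dockerfile and shells out to `docker build`; the function name and tag scheme here are invented, not the actual custom_byod_build code:

```python
import hashlib
import subprocess


def build_custom_byod_image(base_image: str, envs: list) -> str:
    """Bake "KEY=VALUE" runtime_env entries into a derived image."""
    # runtime_env participates in the tag hash so that changing an env var
    # invalidates the cached image.
    digest = hashlib.sha256(
        (base_image + "|" + "|".join(sorted(envs))).encode()
    ).hexdigest()[:12]
    tag = f"custom-byod:{digest}"

    dockerfile = f"FROM {base_image}\n"
    for env in envs:  # values arrive via repeated --env CLI flags
        key, _, value = env.partition("=")
        dockerfile += f'ENV {key}="{value}"\n'

    # Build from an in-memory Dockerfile ("-" reads it from stdin).
    subprocess.run(
        ["docker", "build", "-t", tag, "-"],
        input=dockerfile.encode(),
        check=True,
    )
    return tag
```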
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace old SDK's CreateProductionJob/CreateProductionJobConfig with new SDK's JobConfig + anyscale.job.submit(). This completes the migration of AnyscaleJobManager to the new anyscale.job.* API (submit, status, terminate, get_logs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
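The new submit path boils down to something like this sketch; the field values are placeholders, and treating `image_uri` and `max_retries` as JobConfig fields is an assumption rather than confirmed by this PR:

```python
import anyscale
from anyscale.job.models import JobConfig

config = JobConfig(
    name="release-test-hello-world",
    entrypoint="python workload.py",
    image_uri="<custom-byod-image>",  # the BYOD image built above
    max_retries=0,
)
job_id = anyscale.job.submit(config)
status = anyscale.job.status(name="release-test-hello-world")
print(job_id, status.state)
```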
Code Review
This pull request refactors the Anyscale job submission logic to use the new Anyscale SDK. It also adds support for including runtime environment variables in custom BYOD images. The changes are extensive and touch multiple files, including job runners, managers, and test helpers.
My main concerns are around error handling in the new implementation. It appears that detailed error messages from failed jobs are no longer being propagated into exceptions, which will make debugging more difficult. Additionally, there's a critical bug where a specific job failure state is not handled, potentially leading to silent failures.
I've added two review comments in release/ray_release/command_runner/anyscale_job_runner.py detailing these issues. Other parts of the refactoring and the new runtime environment feature look good.
```python
        if job_state == -1:
            raise JobOutOfRetriesError(
                "Job returned non-success state: 'FAILED' "
                "(command has not been ran or no logs could have been obtained)."
            )
```
The _handle_command_output function is missing a check for job_state == -4. This state is returned by AnyscaleJobManager for jobs that fail before starting (a "soft infra error"). Without handling this case, a real failure could be missed. The previous implementation correctly raised a JobTerminatedBeforeStartError.
Please add a check for job_state == -4. You will also need to re-import JobTerminatedBeforeStartError at the top of the file.
```python
if job_state == -4:
    raise JobTerminatedBeforeStartError(
        "Job entered 'TERMINATED' state before it started "
        "(most likely due to inability to provision required nodes; "
        "otherwise it was terminated manually or Ray was stopped)."
    )
if job_state == -1:
    raise JobOutOfRetriesError(
        "Job returned non-success state: 'FAILED' "
        "(command has not been ran or no logs could have been obtained)."
    )
```

```python
                "Job entered 'TERMINATED' state (it was terminated "
                "manually or Ray was stopped):"
                f"\n{error}\n"

    def _handle_command_output(self, job_state: int, raise_on_timeout: bool = True):
```
The refactoring to use the new SDK has removed the detailed error messages from the exceptions raised on job failure. For example, JobBrokenError, PrepareCommandError, and TestCommandError are now raised with generic messages. Previously, they included a specific error message from the job, which is very helpful for debugging.
Please consider re-introducing these error details. The new Anyscale SDK's JobStatus object likely contains an error message (e.g., in a message attribute). This could be exposed via the AnyscaleJobManager and passed to this function to be included in the exceptions.
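For illustration, a sketch of the reviewer's suggestion; `message` as a `JobStatus` attribute is an unverified assumption, hence the defensive `getattr()`:

```python
import anyscale


def last_job_error(job_name: str) -> str:
    # Pull a human-readable error detail off the job status, if the new SDK
    # exposes one; fall back to an empty string otherwise.
    status = anyscale.job.status(name=job_name)
    return getattr(status, "message", "") or ""

# In _handle_command_output, the detail could then be appended, e.g.:
#   raise JobBrokenError(f"Job state is 'UNKNOWN'.\n{last_job_error(name)}\n")
```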
| f"\n{error}\n" | ||
| ) | ||
| if job_state == -2: | ||
| raise JobBrokenError("Job state is 'UNKNOWN'.") |
Return code -4 unhandled in command output handler
High Severity
_wait_job can return retcode -4 when JobState.FAILED occurs before the job starts running, but _handle_command_output only handles -1 and -2. The old code raised JobTerminatedBeforeStartError (with ExitCode.CLUSTER_STARTUP_TIMEOUT, a retryable infra error) for -4. Now the code falls through to output parsing, which will fail since the job never ran, ultimately raising JobNoLogsError with ExitCode.ANYSCALE_ERROR — a different, non-retryable error category.
```python
    def _handle_command_output(
        self, job_status_code: int, error: str, raise_on_timeout: bool = True
    ):
        if job_status_code == -2:
```
Early FAILED check skips output parsing for running jobs
Medium Severity
When job_state == -1 (a FAILED job that was already running), _handle_command_output immediately raises JobOutOfRetriesError without attempting to parse output. The old code checked this condition last, after parsing output, so it could raise more specific errors like TestCommandError or PrepareCommandError. The error message also incorrectly says "command has not been ran" even though the command did run (since -1 is only returned when job_running was True).
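For comparison, the old ordering looked roughly like this sketch; the exception classes are the ones named in this review, but the module path and the `output` fields are assumptions:

```python
# Module path is assumed, not taken from this diff.
from ray_release.exception import (
    JobOutOfRetriesError,
    PrepareCommandError,
    TestCommandError,
)


def handle_output_old_order(output, job_state: int) -> None:
    """Rough shape of the old check ordering: parse output first, so the
    more specific errors win over the generic out-of-retries one."""
    if output.prepare_return_code not in (None, 0):
        raise PrepareCommandError("Prepare command failed.")
    if output.return_code != 0:
        raise TestCommandError("Test command returned a non-zero exit code.")
    if job_state == -1:
        # Only reached when output parsing surfaced nothing more specific.
        raise JobOutOfRetriesError("Job ran out of retries.")
```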
Add _convert_compute_config() to convert old-style compute config dicts to new ComputeConfig objects, inlined in the job submit call instead of registering separately via cluster_compute_id. Add test validating all release test compute configs convert successfully.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
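The conversion presumably looks something like this sketch; the old-dict keys and the new-SDK model fields are assumptions for illustration, not the exact release-test schema:

```python
from anyscale.compute_config.models import (
    ComputeConfig,
    HeadNodeConfig,
    WorkerNodeGroupConfig,
)


def convert_compute_config(old: dict) -> ComputeConfig:
    """Rough shape of the dict-to-ComputeConfig conversion described above."""
    return ComputeConfig(
        cloud=old.get("cloud"),
        head_node=HeadNodeConfig(
            instance_type=old["head_node_type"]["instance_type"],
        ),
        worker_nodes=[
            WorkerNodeGroupConfig(
                instance_type=node["instance_type"],
                min_nodes=node.get("min_workers", 0),
                max_nodes=node.get("max_workers", 0),
            )
            for node in old.get("worker_node_types", [])
        ],
    )
```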
```diff
         status = self._last_job_status()
         assert status in terminal_state
-        if status == HaJobStates.TERMINATED and not job_running:
+        if status == JobState.FAILED and not job_running:
```
Return code -4 produced but never handled
High Severity
_wait_job generates retcode = -4 when status == JobState.FAILED and not job_running, but _handle_command_output only handles -1 and -2. The old code had an explicit handler for -4 that raised JobTerminatedBeforeStartError. Now -4 falls through to the S3/log-parsing path, which will fail (since the job never ran) and raise a misleading JobNoLogsError instead of a proper infrastructure error.


No description provided.