[Test] Fail fast in smoke wait-loops on terminal job failure #9996
Draft
kevinmingtarja wants to merge 1 commit into
The job wait helpers poll `sky (jobs) queue` until the target status appears or the timeout elapses. If a job reaches a terminal failure status (e.g. FAILED, FAILED_SETUP, CANCELLED) it can never reach a target like SUCCEEDED, yet the loop keeps polling for the full timeout, wasting minutes per failing test.

Add an early exit: when the matched job is in a terminal failure status that is not one of the statuses being waited for, print the status and the queue and exit non-zero immediately. Tests that legitimately wait for a failure status are unaffected, since the target is excluded from the fail-fast set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
db8c490 to
5880424
This was referenced Jun 30, 2026
Summary
The job wait helpers (`get_cmd_wait_until_*_job_status_contains_*`) poll `sky (jobs) queue` until the target status appears or the timeout elapses. If a job reaches a terminal failure status (`FAILED`, `FAILED_SETUP`, `FAILED_PRECHECKS`, `FAILED_NO_RESOURCE`, `FAILED_CONTROLLER`, `CANCELLED`) it can never reach a target like `SUCCEEDED`, yet the loop keeps polling for the full timeout (often 10 min), wasting minutes per failing test.

Add an early exit: when the matched job is in a terminal failure status that is not one of the statuses being waited for, print the status and the queue and exit non-zero immediately. Tests that legitimately wait for a failure status are unaffected, because the target status is excluded from the fail-fast set.
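The early-exit rule described above can be illustrated with a small Python sketch. This is not the code from this PR: the real helpers build a command that polls the queue, whereas the names here (`fail_fast_statuses`, `wait_for_status`, the `get_status` callback) are hypothetical and only model the decision logic. The status names are the ones listed in the summary.

```python
import time
from typing import Callable, FrozenSet, Iterable

# Terminal failure statuses, as listed in the summary above.
TERMINAL_FAILURE_STATUSES = frozenset({
    'FAILED', 'FAILED_SETUP', 'FAILED_PRECHECKS', 'FAILED_NO_RESOURCE',
    'FAILED_CONTROLLER', 'CANCELLED',
})


def fail_fast_statuses(targets: Iterable[str]) -> FrozenSet[str]:
    """Terminal failure statuses that are NOT being waited for."""
    return TERMINAL_FAILURE_STATUSES - frozenset(targets)


def wait_for_status(get_status: Callable[[], str],
                    targets: Iterable[str],
                    timeout: float,
                    interval: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Hypothetical wait loop: 0 if a target is reached, 1 otherwise."""
    target_set = frozenset(targets)
    fail_fast = fail_fast_statuses(target_set)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in target_set:
            print(f'Target job status {status} reached.')
            return 0
        if status in fail_fast:
            # The job can never reach a target from here, so stop polling.
            print(f'Job is in terminal failure status {status}; '
                  'exiting early.')
            return 1
        sleep(interval)
    print('Timeout: target status not reached.')
    return 1
```

Because the target set is subtracted before the loop starts, a caller waiting for `FAILED` still returns 0 when the job fails, while a caller waiting for `SUCCEEDED` returns 1 as soon as any terminal failure status shows up.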
Test plan
- [...] `SUCCEEDED` and ran it against captured `sky jobs queue` output:
  - `SUCCEEDED` → reports target reached, exits 0.
  - [...] `FAILED`) excludes that status from the fail-fast set, so it still waits as before.
- `get_cmd_wait_until_*` variants format without error; `yapf`/`isort` clean.

Part of a 3-PR series cleaning up the smoke-test failure path: