[Test] Fail fast in smoke wait-loops on terminal job failure #9996

Draft
kevinmingtarja wants to merge 1 commit into master from smoke/fail-fast-on-terminal-job-status

@kevinmingtarja (Collaborator) commented Jun 30, 2026

Summary

The job wait helpers (get_cmd_wait_until_*_job_status_contains_*) poll sky (jobs) queue until the target status appears or the timeout elapses. If a job reaches a terminal failure status (FAILED, FAILED_SETUP, FAILED_PRECHECKS, FAILED_NO_RESOURCE, FAILED_CONTROLLER, CANCELLED) it can never reach a target like SUCCEEDED, yet the loop keeps polling for the full timeout (often 10 min), wasting minutes per failing test.

Add an early exit: when the matched job is in a terminal failure status that is not one of the statuses being waited for, print the status + queue and exit non-zero immediately. Tests that legitimately wait for a failure status are unaffected — the target status is excluded from the fail-fast set.
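The rule above can be sketched in Python as follows. This is a minimal illustration and not the PR's actual change: the real helpers generate shell commands that match against queue output, while this sketch works on a single status string with exact matching. The names `TERMINAL_FAILURE_STATUSES`, `fail_fast_statuses`, and `classify` are hypothetical; only the status names come from the description.

```python
from typing import Iterable, Set

# Terminal failure statuses as listed in the PR description.
TERMINAL_FAILURE_STATUSES = frozenset({
    'FAILED',
    'FAILED_SETUP',
    'FAILED_PRECHECKS',
    'FAILED_NO_RESOURCE',
    'FAILED_CONTROLLER',
    'CANCELLED',
})


def fail_fast_statuses(target_statuses: Iterable[str]) -> Set[str]:
    """Terminal failures that should abort the wait (hypothetical helper).

    Any status the caller is explicitly waiting for is excluded, so a test
    that waits for FAILED is not aborted by FAILED.
    """
    return set(TERMINAL_FAILURE_STATUSES) - set(target_statuses)


def classify(current_status: str, target_statuses: Iterable[str]) -> str:
    """Decide what one poll iteration should do (hypothetical helper)."""
    targets = set(target_statuses)
    if current_status in targets:
        return 'reached'  # target status appeared: success
    if current_status in fail_fast_statuses(targets):
        return 'fail_fast'  # target can no longer be reached: stop polling
    return 'keep_polling'  # non-terminal status: wait and poll again
```

Checking the target before the fail-fast set keeps the two branches independent, so a target that is itself a failure status always counts as success.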

Test plan

  • Generated the wait command for a job waiting on SUCCEEDED and ran it against captured sky jobs queue output:
    • Job in a terminal failure status → exits non-zero immediately instead of polling until timeout.
    • Job in SUCCEEDED → reports target reached, exits 0.
    • Job still running → keeps polling (unchanged).
  • Verified a helper waiting for a failure status (e.g. FAILED) excludes that status from the fail-fast set, so it still waits as before.
  • All four get_cmd_wait_until_* variants format without error; yapf/isort clean.
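The cases in the test plan can be reproduced in miniature with a polling loop whose status source is injected, so canned status sequences stand in for captured queue output. This is a rough sketch under assumptions: `wait_for_status`, its parameters, and the exit code 1 for both the fail-fast and timeout paths are hypothetical, and the actual helpers are shell commands rather than Python functions.

```python
import time
from typing import Callable, Iterable

# Terminal failure statuses as listed in the PR description.
TERMINAL_FAILURES = frozenset({
    'FAILED', 'FAILED_SETUP', 'FAILED_PRECHECKS',
    'FAILED_NO_RESOURCE', 'FAILED_CONTROLLER', 'CANCELLED',
})


def wait_for_status(get_status: Callable[[], str],
                    targets: Iterable[str],
                    timeout_s: float,
                    interval_s: float = 1.0) -> int:
    """Poll get_status() until a target appears; return an exit code.

    Returns 0 when a target status is seen, and 1 (an assumed code) on a
    terminal failure that is not a target, or on timeout.
    """
    targets = set(targets)
    fail_fast = TERMINAL_FAILURES - targets  # never abort on a target status
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status in targets:
            print(f'Target status {status} reached.')
            return 0
        if status in fail_fast:
            print(f'Job is in terminal status {status}; not waiting further.')
            return 1
        time.sleep(interval_s)
    print('Timed out waiting for target status.')
    return 1


if __name__ == '__main__':
    # A canned sequence standing in for successive queue snapshots.
    snapshots = iter(['RUNNING', 'FAILED_SETUP'])
    code = wait_for_status(lambda: next(snapshots), ['SUCCEEDED'],
                           timeout_s=5, interval_s=0)
    print('exit code:', code)
```

Injecting the status source keeps the loop testable without a live cluster, which mirrors running the generated command against captured output.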

Part of a 3-PR series cleaning up the smoke-test failure path:

@kevinmingtarja kevinmingtarja changed the title [Test] Fail fast in smoke wait-loops on terminal job failure [Test] Speed up smoke-test failure path: fail fast + bounded log fetch Jun 30, 2026
@gemini-code-assist (Contributor)

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

The job wait helpers poll `sky (jobs) queue` until the target status
appears or the timeout elapses. If a job reaches a terminal failure status
(e.g. FAILED, FAILED_SETUP, CANCELLED) it can never reach a target like
SUCCEEDED, yet the loop keeps polling for the full timeout, wasting minutes
per failing test.

Add an early exit: when the matched job is in a terminal failure status
that is not one of the statuses being waited for, print the status and the
queue and exit non-zero immediately. Tests that legitimately wait for a
failure status are unaffected, since the target is excluded from the
fail-fast set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kevinmingtarja kevinmingtarja force-pushed the smoke/fail-fast-on-terminal-job-status branch from db8c490 to 5880424 Compare June 30, 2026 22:28
@kevinmingtarja kevinmingtarja changed the title [Test] Speed up smoke-test failure path: fail fast + bounded log fetch [Test] Fail fast in smoke wait-loops on terminal job failure Jun 30, 2026