[Test] Bound failed-job log fetch on failure teardown by kevinmingtarja · Pull Request #10000 · skypilot-org/skypilot

kevinmingtarja · 2026-06-30T22:28:47Z

Summary

On smoke-test failure, `run_one_test` runs `fetch_failed_job_logs.sh`, which calls `sky jobs queue -a -u` (all jobs, all users) and then serially runs `sky jobs logs --controller <id>` for every `FAILED*` job in the queue, with no per-fetch timeout. On a long-lived or shared API server the queue accumulates many old failed jobs from unrelated runs, so the teardown can spend minutes fetching logs for jobs that have nothing to do with the failing test.

The queue is ordered newest-first, so this PR fetches logs only for the most recent failed jobs (the ones the failing test just created) and bounds each fetch with a timeout. Both limits are env-configurable: `SKYPILOT_FETCH_FAILED_JOB_LOGS_LIMIT` (default 5) and `SKYPILOT_FETCH_FAILED_JOB_LOGS_TIMEOUT` (default 60s).
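
For readers who have not opened the script, the bounded loop described above could look roughly like the sketch below. This is not the actual content of `fetch_failed_job_logs.sh`: the env var names, their defaults, and `PER_LOG_TIMEOUT` come from this PR, while `MAX_JOBS` and `fetch_controller_logs` are hypothetical names used only for illustration. It assumes GNU `timeout` is on `PATH`.

```shell
#!/usr/bin/env bash
# Sketch only; not the actual fetch_failed_job_logs.sh.
# Env var names and defaults come from the PR description.
# MAX_JOBS and fetch_controller_logs are hypothetical names.
MAX_JOBS="${SKYPILOT_FETCH_FAILED_JOB_LOGS_LIMIT:-5}"
PER_LOG_TIMEOUT="${SKYPILOT_FETCH_FAILED_JOB_LOGS_TIMEOUT:-60}"  # seconds

# Fetch controller logs for at most MAX_JOBS of the given job IDs
# (expected newest first), bounding each fetch with PER_LOG_TIMEOUT.
fetch_controller_logs() {
    local job_id count=0
    for job_id in "$@"; do
        [ "$count" -ge "$MAX_JOBS" ] && break
        count=$((count + 1))
        # Assumes GNU timeout is available on PATH.
        timeout "$PER_LOG_TIMEOUT" sky jobs logs --controller "$job_id" || \
            echo "(controller log fetch for job $job_id timed out or failed)"
    done
}
```

Called with the failed job IDs in queue order, the function stops after `MAX_JOBS` fetches, so the worst-case teardown time is roughly the limit multiplied by the per-fetch timeout (5 x 60s with the defaults).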

Motivation

In one run against a shared server, this teardown took roughly 5 minutes on its own — almost entirely spent pulling controller logs one job at a time for every failed job in the queue, most of which were unrelated to the test that failed. Beyond the wasted time, dumping logs for other jobs and users mixes their output into the failing test's log, so an error from an unrelated job's controller log can look like it belongs to the test under investigation and send debugging down the wrong path. Scoping the fetch to the test's own recent failures keeps the output relevant and fast.

Test plan

`bash -n` on the script is clean.
Verified the `awk` + `head` selection against a queue mixing recent and old failed jobs: only the newest (capped) `FAILED*` jobs are selected, `SUCCEEDED` jobs are skipped, and the just-failed job (newest) is always included.
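
A check along these lines can be reproduced without a live API server. The sketch below is an assumption about the shape of the selection, not the script's actual `awk` program: it treats any row that starts with a numeric ID and has a later column beginning with `FAILED` as a failed job, and the queue fixture is made up and has far fewer columns than real `sky jobs queue` output.

```shell
#!/usr/bin/env bash
# Sketch only: the real awk program in fetch_failed_job_logs.sh may differ.
# Assumption: a job row starts with a numeric ID and has some later column
# whose value begins with FAILED.
select_recent_failed_job_ids() {
    local limit="$1"
    awk '$1 ~ /^[0-9]+$/ {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^FAILED/) { print $1; break }
         }' | head -n "$limit"
}

# Simplified, made-up queue output (newest first); real output has more columns.
sample_queue() {
    cat <<'EOF'
ID  NAME       STATUS
12  smoke-new  FAILED
11  other-a    SUCCEEDED
10  other-b    FAILED_SETUP
9   old-a      FAILED
8   old-b      FAILED_CONTROLLER
EOF
}

sample_queue | select_recent_failed_job_ids 2   # prints 12 and 10, one per line
```

Because the match is by column prefix, a job whose name itself starts with `FAILED` would also be selected in this sketch; pinning the match to the status column would avoid that if the column position is stable.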

Part of a 3-PR series cleaning up the smoke-test failure path:

#9999 [Test] Match job by name in any column in smoke wait-loops
#9996 [Test] Fail fast in smoke wait-loops on terminal job failure
#10000 [Test] Bound failed-job log fetch on failure teardown (this PR)

When a smoke test fails, the teardown fetches controller logs for failed jobs via `sky jobs queue -a -u` + `sky jobs logs --controller`. On a long-lived or shared API server the queue accumulates many old failed jobs from unrelated runs, so this serially fetched logs for every failed job on the server with no per-fetch timeout, taking several minutes. The queue is ordered newest-first, so only fetch logs for the most recent failed jobs (the ones the failing test just created) and bound each fetch with a timeout. Both limits are configurable via env vars and default to 5 jobs / 60s each.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request updates the fetch_failed_job_logs.sh script to limit the number of recently-failed jobs fetched and introduces a timeout for each log fetch operation. Feedback was provided regarding the compatibility of the timeout command on macOS, suggesting a fallback check to ensure logs are still fetched when the command is unavailable.


gemini-code-assist · 2026-06-30T22:29:48Z

+        timeout "$PER_LOG_TIMEOUT" sky jobs logs --controller "$job_id" || \
+            echo "(controller log fetch for job $job_id timed out or failed)"


The timeout command is part of GNU coreutils and is not available by default on macOS. If timeout is missing, the command will fail with command not found (exit code 127), which triggers the || fallback block and completely skips fetching the logs.

To ensure compatibility for developers running smoke tests on macOS, we should check if timeout is available before using it, and fall back to running the command directly without a timeout if it is not present.

Suggested change

-        timeout "$PER_LOG_TIMEOUT" sky jobs logs --controller "$job_id" || \
-            echo "(controller log fetch for job $job_id timed out or failed)"
+        if command -v timeout >/dev/null 2>&1; then
+            timeout "$PER_LOG_TIMEOUT" sky jobs logs --controller "$job_id"
+        else
+            sky jobs logs --controller "$job_id"
+        fi || echo "(controller log fetch for job $job_id timed out or failed)"
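
One possible extension of this suggestion, sketched here and not part of the PR: wrap the availability check in a helper that also tries `gtimeout` (the name Homebrew's coreutils uses for GNU `timeout` on macOS), and use GNU `timeout`'s exit status 124 to tell a timeout apart from an ordinary failure. Both `run_with_optional_timeout` and `describe_fetch_status` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only; both helpers are hypothetical and not part of the PR.

# Run a command with a time limit when a timeout tool exists,
# otherwise run it unbounded rather than skipping it.
run_with_optional_timeout() {
    local secs="$1"
    shift
    if command -v timeout >/dev/null 2>&1; then
        timeout "$secs" "$@"
    elif command -v gtimeout >/dev/null 2>&1; then
        # Homebrew's coreutils installs GNU timeout as gtimeout.
        gtimeout "$secs" "$@"
    else
        "$@"
    fi
}

# Print a message for a non-zero exit code. GNU timeout exits with 124
# when the command was killed for exceeding the limit.
describe_fetch_status() {
    local job_id="$1" rc="$2"
    case "$rc" in
        0) ;;
        124) echo "(controller log fetch for job $job_id timed out)" ;;
        *) echo "(controller log fetch for job $job_id failed with exit code $rc)" ;;
    esac
}

# Intended use inside the fetch loop:
#   run_with_optional_timeout "$PER_LOG_TIMEOUT" sky jobs logs --controller "$job_id"
#   describe_fetch_status "$job_id" "$?"
```

The trade-off is the same as in the suggested change: on a machine with neither tool the fetch is unbounded again, which favors getting logs over a guaranteed teardown time.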

This was referenced Jun 30, 2026

[Test] Fail fast in smoke wait-loops on terminal job failure #9996

Draft

[Test] Match job by name in any column in smoke wait-loops #9999

Draft

gemini-code-assist Bot reviewed Jun 30, 2026


kevinmingtarja marked this pull request as ready for review June 30, 2026 22:33

kevinmingtarja requested a review from zpoint June 30, 2026 22:36

zpoint approved these changes Jul 1, 2026


zpoint merged commit 0aab90c into master Jul 1, 2026
21 checks passed

zpoint deleted the smoke/bound-failed-job-log-fetch branch July 1, 2026 02:23
