[Test] Bound failed-job log fetch on failure teardown#10000
Conversation
When a smoke test fails, the teardown fetches controller logs for failed jobs via `sky jobs queue -a -u` + `sky jobs logs --controller`. On a long-lived or shared API server the queue accumulates many old failed jobs from unrelated runs, so this serially fetched logs for every failed job on the server with no per-fetch timeout, taking several minutes. The queue is ordered newest-first, so only fetch logs for the most recent failed jobs (the ones the failing test just created) and bound each fetch with a timeout. Both limits are configurable via env vars and default to 5 jobs / 60s each. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request updates the fetch_failed_job_logs.sh script to limit the number of recently-failed jobs fetched and introduces a timeout for each log fetch operation. Feedback was provided regarding the compatibility of the timeout command on macOS, suggesting a fallback check to ensure logs are still fetched when the command is unavailable.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| timeout "$PER_LOG_TIMEOUT" sky jobs logs --controller "$job_id" || \ | ||
| echo "(controller log fetch for job $job_id timed out or failed)" |
There was a problem hiding this comment.
The timeout command is part of GNU coreutils and is not available by default on macOS. If timeout is missing, the command will fail with command not found (exit code 127), which triggers the || fallback block and completely skips fetching the logs.
To ensure compatibility for developers running smoke tests on macOS, we should check if timeout is available before using it, and fall back to running the command directly without a timeout if it is not present.
| timeout "$PER_LOG_TIMEOUT" sky jobs logs --controller "$job_id" || \ | |
| echo "(controller log fetch for job $job_id timed out or failed)" | |
| if command -v timeout >/dev/null 2>&1; then | |
| timeout "$PER_LOG_TIMEOUT" sky jobs logs --controller "$job_id" | |
| else | |
| sky jobs logs --controller "$job_id" | |
| fi || echo "(controller log fetch for job $job_id timed out or failed)" |
Summary
On smoke-test failure,
run_one_testrunsfetch_failed_job_logs.sh, which doessky jobs queue -a -u(all jobs, all users) and then serially runssky jobs logs --controller <id>for everyFAILED*job in the queue, with no per-fetch timeout. On a long-lived/shared API server the queue accumulates many old failed jobs from unrelated runs, so this can take minutes fetching logs for jobs that have nothing to do with the failing test.The queue is ordered newest-first, so only fetch logs for the most recent failed jobs (the ones the failing test just created) and bound each fetch with a timeout. Both limits are env-configurable:
SKYPILOT_FETCH_FAILED_JOB_LOGS_LIMIT(default 5) andSKYPILOT_FETCH_FAILED_JOB_LOGS_TIMEOUT(default 60s).Motivation
In one run against a shared server, this teardown took roughly 5 minutes on its own — almost entirely spent pulling controller logs one job at a time for every failed job in the queue, most of which were unrelated to the test that failed. Beyond the wasted time, dumping logs for other jobs and users mixes their output into the failing test's log, so an error from an unrelated job's controller log can look like it belongs to the test under investigation and send debugging down the wrong path. Scoping the fetch to the test's own recent failures keeps the output relevant and fast.
Test plan
bash -nclean.headselection against a queue mixing recent and old failed jobs: only the newest (capped)FAILED*jobs are selected,SUCCEEDEDskipped, and the just-failed job (newest) is always included.Part of a 3-PR series cleaning up the smoke-test failure path: