Skip to content

[Test] Bound failed-job log fetch on failure teardown#10000

Merged
zpoint merged 1 commit into
masterfrom
smoke/bound-failed-job-log-fetch
Jul 1, 2026
Merged

[Test] Bound failed-job log fetch on failure teardown#10000
zpoint merged 1 commit into
masterfrom
smoke/bound-failed-job-log-fetch

Conversation

@kevinmingtarja

@kevinmingtarja kevinmingtarja commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Summary

On smoke-test failure, run_one_test runs fetch_failed_job_logs.sh, which does sky jobs queue -a -u (all jobs, all users) and then serially runs sky jobs logs --controller <id> for every FAILED* job in the queue, with no per-fetch timeout. On a long-lived/shared API server the queue accumulates many old failed jobs from unrelated runs, so this can take minutes fetching logs for jobs that have nothing to do with the failing test.

The queue is ordered newest-first, so only fetch logs for the most recent failed jobs (the ones the failing test just created) and bound each fetch with a timeout. Both limits are env-configurable: SKYPILOT_FETCH_FAILED_JOB_LOGS_LIMIT (default 5) and SKYPILOT_FETCH_FAILED_JOB_LOGS_TIMEOUT (default 60s).

Motivation

In one run against a shared server, this teardown took roughly 5 minutes on its own — almost entirely spent pulling controller logs one job at a time for every failed job in the queue, most of which were unrelated to the test that failed. Beyond the wasted time, dumping logs for other jobs and users mixes their output into the failing test's log, so an error from an unrelated job's controller log can look like it belongs to the test under investigation and send debugging down the wrong path. Scoping the fetch to the test's own recent failures keeps the output relevant and fast.

Test plan

  • bash -n clean.
  • Verified the awk + head selection against a queue mixing recent and old failed jobs: only the newest (capped) FAILED* jobs are selected, SUCCEEDED skipped, and the just-failed job (newest) is always included.

Part of a 3-PR series cleaning up the smoke-test failure path:

When a smoke test fails, the teardown fetches controller logs for failed
jobs via `sky jobs queue -a -u` + `sky jobs logs --controller`. On a
long-lived or shared API server the queue accumulates many old failed jobs
from unrelated runs, so this serially fetched logs for every failed job on
the server with no per-fetch timeout, taking several minutes.

The queue is ordered newest-first, so only fetch logs for the most recent
failed jobs (the ones the failing test just created) and bound each fetch
with a timeout. Both limits are configurable via env vars and default to 5
jobs / 60s each.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the fetch_failed_job_logs.sh script to limit the number of recently-failed jobs fetched and introduces a timeout for each log fetch operation. Feedback was provided regarding the compatibility of the timeout command on macOS, suggesting a fallback check to ensure logs are still fetched when the command is unavailable.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +53 to +54
timeout "$PER_LOG_TIMEOUT" sky jobs logs --controller "$job_id" || \
echo "(controller log fetch for job $job_id timed out or failed)"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The timeout command is part of GNU coreutils and is not available by default on macOS. If timeout is missing, the command will fail with command not found (exit code 127), which triggers the || fallback block and completely skips fetching the logs.

To ensure compatibility for developers running smoke tests on macOS, we should check if timeout is available before using it, and fall back to running the command directly without a timeout if it is not present.

Suggested change
timeout "$PER_LOG_TIMEOUT" sky jobs logs --controller "$job_id" || \
echo "(controller log fetch for job $job_id timed out or failed)"
if command -v timeout >/dev/null 2>&1; then
timeout "$PER_LOG_TIMEOUT" sky jobs logs --controller "$job_id"
else
sky jobs logs --controller "$job_id"
fi || echo "(controller log fetch for job $job_id timed out or failed)"

@kevinmingtarja kevinmingtarja marked this pull request as ready for review June 30, 2026 22:33
@kevinmingtarja kevinmingtarja requested a review from zpoint June 30, 2026 22:36
@zpoint zpoint merged commit 0aab90c into master Jul 1, 2026
21 checks passed
@zpoint zpoint deleted the smoke/bound-failed-job-log-fetch branch July 1, 2026 02:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants