Map executor setup failures to exit_status -1 #3769

Open
kainanpeace666 wants to merge 1 commit into buildkite:main from kainanpeace666:fix/setup-failure-exit-code

kainanpeace666 commented Mar 18, 2026

When the job executor fails during setUp() (e.g. due to DNS errors fetching secrets, shell creation failures, or Job API initialization errors), the bootstrap subprocess currently exits with code 1. This makes infrastructure failures indistinguishable from user command failures, so users cannot catch them with automatic_retry on exit_status -1.

Error example:

🚨 Error: Error setting up job executor: failed to fetch secrets for job:
  secret "TEST_ENGINE_ANALYTICS_TOKEN": Get
  "https://agent.buildkite.com/v3/jobs/.../secrets?key=...":
    dial tcp: lookup agent.buildkite.com on [2401:db00:eef0:b53::]:53:
    read tcp [...]: connection reset by peer

The fix has two parts:

  1. Executor (subprocess): return a new ExitCodeSetupFailure (125) for pre-command setup errors instead of falling through to 1 via shell.ExitCode(). When setUp() fails due to a hook returning a specific exit code, that code is preserved; only plain Go errors now return 125.

  2. Job runner (parent): detect exit code 125 from the subprocess and map it to exit_status -1 with SignalReason "process_run_error", consistent with other agent-level "command never ran" failures.

Users who already retry on exit_status -1 will now automatically catch transient infrastructure failures like DNS blips during secret fetching.
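
The parent-side mapping in part 2 can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: `jobResult` and `mapSubprocessExit` are hypothetical names, and only the value 125, the -1 exit status, and the "process_run_error" signal reason come from the description above.

```go
package main

import (
	"fmt"
	"strconv"
)

// ExitCodeSetupFailure mirrors the value named in this PR (125).
const ExitCodeSetupFailure = 125

// jobResult is a hypothetical stand-in for what the job runner reports
// back for a finished job.
type jobResult struct {
	ExitStatus   string
	SignalReason string
}

// mapSubprocessExit is a hypothetical helper: it translates the executor
// subprocess exit code into the reported job result. Only the setup-failure
// code is rewritten; every other code passes through unchanged.
func mapSubprocessExit(code int) jobResult {
	if code == ExitCodeSetupFailure {
		return jobResult{ExitStatus: "-1", SignalReason: "process_run_error"}
	}
	return jobResult{ExitStatus: strconv.Itoa(code)}
}

func main() {
	for _, code := range []int{0, 1, ExitCodeSetupFailure} {
		fmt.Printf("%d -> %+v\n", code, mapSubprocessExit(code))
	}
}
```

In this sketch a user command that exits with 1 is still reported as 1, so an automatic_retry rule on exit_status -1 would match only the setup-failure case.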

Description

Context

Changes

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go tool gofumpt -extra -w .)

Disclosures / Credits

kainanpeace666 requested review from a team as code owners March 18, 2026 21:18
kainanpeace666 force-pushed the fix/setup-failure-exit-code branch from 700b4ac to dbacd6d March 18, 2026 21:35

When setUp() fails due to infrastructure errors like DNS failures
during secret fetching, the executor subprocess exits with code 1
(the fallthrough in shell.ExitCode for plain Go errors). This makes
infra failures indistinguishable from user command failures,
preventing users from catching them with automatic_retry on
exit_status -1.

The fix is surgical and only changes the setUp() error path:

1. Executor: when setUp() fails with a typed ExitError (from a hook),
   preserve the hook's exit code (unchanged behavior). When it fails
   with a plain Go error (secret fetch DNS failure, env init error),
   return ExitCodeSetupFailure (125) instead of 1.

2. Job runner: detect exit code 125 from the subprocess and map it to
   exit_status -1 with SignalReason "process_run_error", consistent
   with other agent-level "command never ran" failures.

All other exit paths (shell creation, Job API, git credential helper)
are left unchanged to minimize behavioral impact.

kainanpeace666 force-pushed the fix/setup-failure-exit-code branch from dbacd6d to a27340e March 18, 2026 21:48