Map executor setup failures to exit_status -1#3769
Open
kainanpeace666 wants to merge 1 commit intobuildkite:mainfrom
Open
Map executor setup failures to exit_status -1#3769kainanpeace666 wants to merge 1 commit intobuildkite:mainfrom
kainanpeace666 wants to merge 1 commit intobuildkite:mainfrom
Conversation
700b4ac to
dbacd6d
Compare
When setUp() fails due to infrastructure errors like DNS failures during secret fetching, the executor subprocess exits with code 1 (the fallthrough in shell.ExitCode for plain Go errors). This makes infra failures indistinguishable from user command failures, preventing users from catching them with automatic_retry on exit_status -1. The fix is surgical and only changes the setUp() error path: 1. Executor: when setUp() fails with a typed ExitError (from a hook), preserve the hook's exit code (unchanged behavior). When it fails with a plain Go error (secret fetch DNS failure, env init error), return ExitCodeSetupFailure (125) instead of 1. 2. Job runner: detect exit code 125 from the subprocess and map it to exit_status -1 with SignalReason "process_run_error", consistent with other agent-level "command never ran" failures. All other exit paths (shell creation, Job API, git credential helper) are left unchanged to minimize behavioral impact.
dbacd6d to
a27340e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When the job executor fails during setUp — e.g. due to DNS errors fetching secrets, shell creation failures, or Job API initialization errors — the bootstrap subprocess currently exits with code 1. This makes infrastructure failures indistinguishable from user command failures, preventing users from catching them with automatic_retry on exit_status -1.
Error example:
The fix has two parts:
Executor (subprocess): return a new ExitCodeSetupFailure (125) for pre-command setup errors instead of falling through to 1 via shell.ExitCode(). When setUp() fails due to a hook returning a specific exit code, that code is preserved; only plain Go errors now return 125.
Job runner (parent): detect exit code 125 from the subprocess and map it to exit_status -1 with SignalReason "process_run_error", consistent with other agent-level "command never ran" failures.
Users who already retry on exit_status -1 will now automatically catch transient infrastructure failures like DNS blips during secret fetching.
Description
Context
Changes
Testing
go test ./...). Buildkite employees may check this if the pipeline has run automatically.go tool gofumpt -extra -w .)Disclosures / Credits