Add --pause-after-job flag to pause agent between jobs instead of disconnecting #3753

Draft
gregmagolan wants to merge 1 commit into buildkite:main from gregmagolan:pause-after-job

Summary

Adds a new --pause-after-job configuration option that pauses the agent after it completes a job instead of disconnecting it. The agent stays connected and keeps pinging Buildkite, which preserves its identity in the UI (job history affinity), and it can be resumed via the API when it is ready for the next job.

This is useful for orchestrators that need to run health checks or maintenance between jobs without losing the agent's connection and history. With --disconnect-after-job, each reconnection creates a new agent instance in the Buildkite UI, losing the association between jobs run on the same machine.

Motivation

When using --disconnect-after-job, the agent disconnects and reconnects after every job. Each reconnection registers as a new agent in the Buildkite UI, which means:

  • Job history for a given runner is spread across multiple agent entries
  • There's no way to see all jobs that ran on a specific machine in one place
  • The overhead of disconnect/reconnect is unnecessary when the agent just needs a pause between jobs

With --pause-after-job, the agent stays connected and pauses itself via POST /pause after finishing a job. An external orchestrator (or the Buildkite API) can then resume the agent when ready. On resume, ranJob is reset so that the agent can accept the next job, all under the same agent identity.
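
The lifecycle described above can be sketched as a small state machine. This is an illustrative model only, not the code in this PR: the `worker` type, its fields, and the `finishJob`, `resume`, and `canAcceptJob` methods are hypothetical names, and the real agent talks to the Buildkite API rather than flipping booleans.

```go
package main

import "fmt"

// worker is a hypothetical, simplified stand-in for the agent worker.
// Field names echo the terms used in this description, not the real struct.
type worker struct {
	pauseAfterJob      bool // models --pause-after-job
	disconnectAfterJob bool // models --disconnect-after-job
	connected          bool
	paused             bool
	ranJob             bool
}

// finishJob models what happens once a job completes.
func (w *worker) finishJob() {
	w.ranJob = true
	switch {
	case w.pauseAfterJob:
		// Stands in for the self-pause via POST /pause: the agent stays connected.
		w.paused = true
	case w.disconnectAfterJob:
		// Stands in for disconnecting; a reconnect would register a new agent.
		w.connected = false
	}
}

// resume models an external orchestrator resuming the agent.
// ranJob is reset only when pauseAfterJob is set.
func (w *worker) resume() {
	w.paused = false
	if w.pauseAfterJob {
		w.ranJob = false
	}
}

// canAcceptJob reports whether this model would take another job.
// In this sketch ranJob only blocks new work in the two single-job modes.
func (w *worker) canAcceptJob() bool {
	singleJob := w.pauseAfterJob || w.disconnectAfterJob
	return w.connected && !w.paused && !(singleJob && w.ranJob)
}

func main() {
	w := &worker{pauseAfterJob: true, connected: true}

	w.finishJob()
	fmt.Println("after job:   ", w.connected, w.paused, w.canAcceptJob())

	w.resume()
	fmt.Println("after resume:", w.connected, w.paused, w.canAcceptJob())
}
```

The point of the sketch is that the pause path never touches `connected`, which is what would keep the agent identity stable between jobs.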

Changes

  • agent/agent_configuration.go: Added PauseAfterJob bool field to AgentConfiguration
  • clicommand/agent_start.go: Added --pause-after-job CLI flag with BUILDKITE_AGENT_PAUSE_AFTER_JOB env var, feature reporting, mutual exclusion validation with --disconnect-after-job, and config transfer
  • agent/agent_worker_action.go:
    • Set ignoreAgentInDispatches when PauseAfterJob is enabled (same as DisconnectAfterJob)
    • After job completion, self-pause via POST /pause API instead of disconnecting
    • On resume, reset ranJob only for PauseAfterJob agents so they can accept the next job
  • agent/agent_worker_test.go: Added TestAgentWorker_PauseAfterJob integration test
  • agent/fake_api_server_test.go: Added POST /pause handler and PauseCalls counter to FakeAgent

Test plan

  • New TestAgentWorker_PauseAfterJob test passes — verifies job runs, agent self-pauses via API, resumes correctly, and stays connected
  • Existing TestAgentWorker_DisconnectAfterJob_Start_Pause_Unpause still passes — ranJob reset is gated on PauseAfterJob only
  • Existing TestAgentWorker_Streaming_DisconnectAfterJob_Start_Pause_Unpause still passes
  • Existing TestAgentWorker_DisconnectAfterUptime still passes
  • Manual testing with a real Buildkite pipeline
