Description
Is your feature request related to a problem? Please describe.
Presently, when an agent is failing builds, the only way to fix it is to stop the agent (which terminates the instance) or terminate the instance directly.
In order to perform diagnosis on instances, it would be useful to be able to "cordon" an instance while stopping the agent from accepting any more jobs.
Describe the solution you'd like
Simply not dispatching to a given agent from buildkite.com would not be sufficient. Cordoning only at the agent level would leave the instance counted towards the pool, which prevents a replacement instance from being booted to maintain pool capacity.
Instead, infrastructure level cordoning would take the instance out of service in the Auto Scaling group. Using autoscaling:EnterStandby, rather than detaching the instance, would keep the ASG's reference to the instance. Provided ShouldDecrementDesiredCapacity is false, the desired count would be maintained so that a replacement instance is booted.
The way I would expose this infrastructure level functionality to the buildkite.com API and UI would be to include an agent lifecycle hook called cordon. If the hook is present when the agent registers with the API, set a flag that indicates the agent has a cordon hook that can be invoked.
In the Elastic CI Stack’s cordon hook I would either invoke the AWS CLI directly, or use an AWS SSM Automation to stop the agent systemd job and set the instance to standby.
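As a rough illustration of the "invoke the AWS CLI directly" option, here is a minimal sketch of what such a hook could run. It is not an existing hook: the function names `build_cordon_commands` and `cordon` are hypothetical, the systemd unit name `buildkite-agent` is an assumption, and the instance ID and Auto Scaling group name are taken as parameters rather than discovered. The order (stop the agent, then enter standby) follows the description above. Whether stopping the agent still terminates the instance depends on how agent and instance lifetimes end up being decoupled.

```python
import subprocess
from typing import Callable, List, Sequence

Command = List[str]


def build_cordon_commands(
    instance_id: str,
    asg_name: str,
    agent_unit: str = "buildkite-agent",  # assumed systemd unit name
) -> List[Command]:
    """Return the commands a hypothetical cordon hook would run, in order."""
    return [
        # Stop the agent service so it accepts no further jobs.
        ["systemctl", "stop", agent_unit],
        # Move the instance to Standby. Not decrementing the desired
        # capacity lets the Auto Scaling group boot a replacement.
        [
            "aws", "autoscaling", "enter-standby",
            "--instance-ids", instance_id,
            "--auto-scaling-group-name", asg_name,
            "--no-should-decrement-desired-capacity",
        ],
    ]


def cordon(
    instance_id: str,
    asg_name: str,
    run: Callable[[Sequence[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
) -> None:
    """Run each cordon command; `run` is injectable so the sketch can be dry-run."""
    for command in build_cordon_commands(instance_id, asg_name):
        run(command)
```

Passing `run=print` gives a dry run that only prints the commands. The default runner raises on the first failing command, so the instance is not moved to standby if stopping the agent fails.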
Decoupling the agent and instance lifetimes may depend on the work started in #964. The solution may also need to take instances that set disconnect-after-job into consideration.
Describe alternatives you've considered
As above, keeping the agent alive but not dispatching to it is an inferior solution.