Add support for instance cordoning

**Is your feature request related to a problem? Please describe.**

Presently, when an agent is failing builds, the only way to fix it is to stop the agent (which terminates the instance) or terminate the instance directly.

To diagnose a failing instance, it would be useful to be able to "cordon" it: keep the instance running while stopping its agent from accepting any more jobs.

**Describe the solution you'd like**

Simply not dispatching to a given agent from `buildkite.com` would not be sufficient. Cordoning only at the agent level would leave the instance counted towards the pool, which would prevent a replacement instance from being booted to maintain pool capacity.

Instead, infrastructure-level cordoning would take the instance out of service in the Auto Scaling group. Using [autoscaling:EnterStandby](https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_EnterStandby.html), rather than detaching the instance, would keep the ASG's reference to the instance. Provided the call does not decrement the desired capacity, the desired count would be maintained and a replacement instance would be booted.
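As a rough sketch of the `EnterStandby` call described above, assuming Python with boto3: the function name `cordon_instance` and the injected `client` argument are hypothetical, chosen only so the call can be exercised without AWS credentials. `ShouldDecrementDesiredCapacity=False` is what asks the ASG to keep its desired count and boot a replacement.

```python
from typing import Any, Protocol


class AutoScalingClient(Protocol):
    """Minimal shape of the client this sketch needs (boto3's autoscaling client fits)."""

    def enter_standby(self, **kwargs: Any) -> dict: ...


def cordon_instance(client: AutoScalingClient, instance_id: str, asg_name: str) -> dict:
    """Move one instance to Standby without shrinking the group (hypothetical helper).

    The instance stays associated with the ASG. Because the desired capacity
    is not decremented, the ASG launches a replacement instance.
    """
    return client.enter_standby(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ShouldDecrementDesiredCapacity=False,
    )
```

With boto3 installed and credentials configured, this could be called as `cordon_instance(boto3.client("autoscaling"), instance_id, asg_name)`, where both arguments are placeholders for real values.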

The way I would expose this infrastructure-level functionality to the `buildkite.com` API and UI would be to add an [agent lifecycle hook](https://buildkite.com/docs/agent/v3/hooks#agent-lifecycle-hooks) called `cordon`. If the hook is present when the agent registers with the API, the agent would set a flag indicating that it has a cordon hook that can be invoked.

In the Elastic CI Stack’s cordon hook, I would either invoke the AWS CLI directly or use an AWS SSM Automation to stop the agent's systemd service and set the instance to standby.
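A minimal sketch of the direct AWS CLI variant of such a hook, written in Python for illustration. The unit name `buildkite-agent`, the helper names `cordon_commands` and `run_cordon`, and the injectable `run` callable are all assumptions, not part of the agent or the stack. It also assumes the agent and instance lifetimes are already decoupled, so that stopping the agent does not terminate the instance.

```python
import subprocess
from typing import Callable, List, Optional

AGENT_UNIT = "buildkite-agent"  # assumed systemd unit name


def cordon_commands(instance_id: str, asg_name: str,
                    agent_unit: str = AGENT_UNIT) -> List[List[str]]:
    """Return the commands a cordon hook might run, in order (hypothetical)."""
    return [
        # 1. Stop the agent so it accepts no further jobs.
        ["systemctl", "stop", agent_unit],
        # 2. Put the instance in Standby, keeping the desired capacity
        #    unchanged so the ASG boots a replacement.
        ["aws", "autoscaling", "enter-standby",
         "--instance-ids", instance_id,
         "--auto-scaling-group-name", asg_name,
         "--no-should-decrement-desired-capacity"],
    ]


def run_cordon(instance_id: str, asg_name: str,
               run: Optional[Callable[[List[str]], object]] = None) -> None:
    """Execute the cordon commands; pass `run` to dry-run or log instead."""
    if run is None:
        run = lambda cmd: subprocess.run(cmd, check=True)
    for cmd in cordon_commands(instance_id, asg_name):
        run(cmd)
```

Passing `run=print` gives a dry run that only prints the two commands, which may be handy when checking the hook on an instance before wiring it up.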

Decoupling the agent and instance lifetimes may depend on the work started in #964. The solution may also need to take instances that set `disconnect-after-job` into consideration.

**Describe alternatives you've considered**

As above, keeping the agent alive but not dispatching *to* it is an inferior solution.


Add support for instance cordoning #972
