README.md (31 additions & 10 deletions)
@@ -23,7 +23,28 @@ wanted more granular control over:
 The lambda (or cli version) polls the Buildkite Metrics API every 10 seconds, and based on the
 results sets the `DesiredCount` to exactly what is needed. This allows much faster scale up.

+## Configuration
+
+### Availability-based scaling
+
+The scaler monitors agent availability to handle situations where EC2 instances are healthy but Buildkite agents aren't connecting. This can happen due to network issues, agent configuration problems, or instance startup delays.
+
+**`AVAILABILITY_THRESHOLD`** (default: `0.5`)
+
+When jobs are queued, the scaler checks if the percentage of connected agents meets this threshold. For example, with 4 agents per instance and 2 instances running (8 expected agents), if only 3 agents are online, that's 37.5% availability.
+
+When availability drops below the threshold and the ASG has converged (actual instances match desired), the scaler adds one instance to help recover availability.
+
+Set `AVAILABILITY_THRESHOLD=0` to disable availability-based scaling. The scaler will then scale based only on job count.
+* **Higher threshold (e.g., 0.8)**: Aggressive scaling to maintain high availability when agents are expected to connect quickly
+* **Disabled (0)**: Job-based scaling only, suitable when agents connect reliably
+
 ## Gracefully scaling in
+
 :construction: For [Elastic CI Stack][], there's now available a dedicated and experimental mode configured with `ELASTIC_CI_MODE` variable. You can read more about it [in here](./docs/elastic_ci_mode.md). :construction:
 ___

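For review purposes, the availability rule documented in this hunk can be sketched in Go as below. This is only an illustration of the README wording, not the scaler's actual code: the function names `availability` and `shouldAddInstance`, their parameters, and the choice to treat an empty ASG as fully available are all assumptions.

```go
package main

import "fmt"

// availability returns the fraction of expected agents that are online,
// where expected = instances * agentsPerInstance.
// Assumption: with nothing expected, report full availability (1.0).
func availability(onlineAgents, instances, agentsPerInstance int) float64 {
	expected := instances * agentsPerInstance
	if expected <= 0 {
		return 1.0
	}
	return float64(onlineAgents) / float64(expected)
}

// shouldAddInstance mirrors the README text: add one instance only when
// jobs are queued, the check is enabled (threshold > 0), the ASG has
// converged (actual == desired), and availability is below the threshold.
func shouldAddInstance(queuedJobs, onlineAgents, actualInstances, desiredInstances, agentsPerInstance int, threshold float64) bool {
	if threshold <= 0 || queuedJobs == 0 {
		return false // disabled, or nothing waiting to run
	}
	if actualInstances != desiredInstances {
		return false // ASG still converging; wait for it to settle
	}
	return availability(onlineAgents, actualInstances, agentsPerInstance) < threshold
}

func main() {
	// README example: 2 instances x 4 agents = 8 expected, 3 online.
	a := availability(3, 2, 4)
	fmt.Printf("availability: %.1f%%\n", a*100) // prints "availability: 37.5%"
	fmt.Println(shouldAddInstance(5, 3, 2, 2, 4, 0.5)) // prints "true"
}
```

With the default threshold of `0.5`, the README's 37.5% example falls below the threshold, so the sketch reports that one instance should be added once the ASG has converged.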
@@ -55,17 +76,17 @@ of the metrics that the [buildkite-agent-metrics][] binary collects:
 An AWS Lambda bundle is created and published as part of the build process. The lambda will require
 the following IAM permissions:

-- `cloudwatch:PutMetricData`
-- `autoscaling:DescribeAutoScalingGroups`
-- `autoscaling:DescribeScalingActivities`
-- `autoscaling:SetDesiredCapacity`
+* `cloudwatch:PutMetricData`
+* `autoscaling:DescribeAutoScalingGroups`
+* `autoscaling:DescribeScalingActivities`
+* `autoscaling:SetDesiredCapacity`

 Its handler is `bootstrap`, it uses a `provided.al2` runtime and requires the following env vars:

-- `BUILDKITE_AGENT_TOKEN` or `BUILDKITE_AGENT_TOKEN_SSM_KEY`
-- `BUILDKITE_QUEUE`
-- `AGENTS_PER_INSTANCE`
-- `ASG_NAME`
+* `BUILDKITE_AGENT_TOKEN` or `BUILDKITE_AGENT_TOKEN_SSM_KEY`
+* `BUILDKITE_QUEUE`
+* `AGENTS_PER_INSTANCE`
+* `ASG_NAME`

 If `BUILDKITE_AGENT_TOKEN_SSM_KEY` is set, the token will be read from
 [AWS Systems Manager Parameter Store GetParameter](https://docs.aws.amazon.com/systems-manager/latest/APIReference/API_GetParameter.html)
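The required environment variables listed in this hunk could be validated up front with something like the Go sketch below. It is a hypothetical helper, not code from the repository: `missingEnv` and `fromMap` are invented names, and the values in `main` are placeholders. The lookup parameter has the same shape as `os.LookupEnv`, so the real environment can be passed in directly.

```go
package main

import (
	"fmt"
	"strings"
)

// fromMap adapts a map to the os.LookupEnv signature so the check can be
// exercised without touching the real process environment.
func fromMap(m map[string]string) func(string) (string, bool) {
	return func(k string) (string, bool) {
		v, ok := m[k]
		return v, ok
	}
}

// missingEnv reports which required settings are absent or empty.
// Either BUILDKITE_AGENT_TOKEN or BUILDKITE_AGENT_TOKEN_SSM_KEY satisfies
// the token requirement, as the README states.
func missingEnv(lookup func(string) (string, bool)) []string {
	isSet := func(k string) bool {
		v, ok := lookup(k)
		return ok && v != ""
	}
	var missing []string
	if !isSet("BUILDKITE_AGENT_TOKEN") && !isSet("BUILDKITE_AGENT_TOKEN_SSM_KEY") {
		missing = append(missing, "BUILDKITE_AGENT_TOKEN or BUILDKITE_AGENT_TOKEN_SSM_KEY")
	}
	for _, k := range []string{"BUILDKITE_QUEUE", "AGENTS_PER_INSTANCE", "ASG_NAME"} {
		if !isSet(k) {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// Placeholder values for illustration only.
	env := fromMap(map[string]string{
		"BUILDKITE_AGENT_TOKEN_SSM_KEY": "/example/agent-token",
		"BUILDKITE_QUEUE":               "default",
	})
	if missing := missingEnv(env); len(missing) > 0 {
		fmt.Println("missing configuration:", strings.Join(missing, ", "))
	}
}
```

Passing `os.LookupEnv` instead of `fromMap(...)` would run the same check against the Lambda's actual environment.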
@@ -82,8 +103,9 @@ aws lambda create-function \
 ```

 ## Development
+
 This project uses [mise](https://mise.jdx.dev/) to manage development tooling ensuring all the tooling needed is installed with one step, and in expected versions.
-To install mise, execute [./bin/mise](./bin/mise) bootstrap script or follow [mise documentation](https://mise.jdx.dev/installing-mise.html).
+To install mise, execute [./bin/mise](./bin/mise) bootstrap script or follow [mise documentation](https://mise.jdx.dev/installing-mise.html).
 Run `mise install` to install all the required tooling defined in [mise.toml](./mise.toml).

 ### Running agent-scaler locally
@@ -103,7 +125,6 @@ The scaler is set up automatically by the [Elastic CI Stack][]'s CloudFormation
 reference the agent token and a queue name. A Lambda function running the scaler is then generated
 using these references (e.g., `BUILDKITE_AGENT_TOKEN_SSM_KEY` and `BUILDKITE_QUEUE`).

-
 ## Copyright

 Copyright (c) 2014-2019 Buildkite Pty Ltd. See [LICENSE](./LICENSE.txt) for details.
docs/elastic_ci_mode.md (8 additions & 3 deletions)
@@ -91,9 +91,14 @@ Job dispatch delays, other issues related to job processing.
 ```

 ## Configuration Parameters
+
+### Availability Monitoring (applies to all modes)
+- `AVAILABILITY_THRESHOLD`: Minimum agent availability percentage before triggering scale-out (default 50%, i.e. with 4 agents per instance, we scale out when fewer than 2 agents are online). Set to `0` to disable availability-based scaling.
+
+### Elastic CI Mode Specific
+**Note:** The following settings apply **only when `ELASTIC_CI_MODE=true`**
+
 - `ELASTIC_CI_MODE`: Enable enhanced safety features, only for [Elastic CI Stack](https://github.com/buildkite/elastic-ci-stack-for-aws)! (boolean)
-- `MIN_AGENTS_PERCENTAGE`: Minimum acceptable percentage of expected agents — ratio of desired agents number to actual (default 50%, i.e. we tolerate 2 agent instances running on a single EC2 out of desired 4)
 - `DANGLING_CHECK_MINIMUM_INSTANCE_UPTIME`: Minimum instance uptime before checking for dangling instances (default 1h)
 - `MAX_DANGLING_INSTANCES_TO_CHECK`: Maximum number of instances to scan for dangling detection (default 5)
-- `SCALE_IN_COOLDOWN_PERIOD`: Time to wait between scale-in operations (default 1h)
+- `SCALE_IN_COOLDOWN_PERIOD`: Time to wait between scale-in operations (default 1h for Elastic CI Mode, 0 otherwise)
 maxDanglingInstancesToCheck := EnvInt("MAX_DANGLING_INSTANCES_TO_CHECK", 5) // Maximum number of instances to check for dangling instances (only used for dangling instance scanning, not for normal scale-in)
-// Above settings only applicable when elasticCIMode is enabled!