Commit 9a1d023

Fix availability-based scaling and simplify threshold logic (#257)
1 parent ec743c0 commit 9a1d023

7 files changed

Lines changed: 229 additions & 68 deletions

README.md

Lines changed: 31 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,28 @@ wanted more granular control over:
2323
The lambda (or cli version) polls the Buildkite Metrics API every 10 seconds, and based on the
2424
results sets the `DesiredCount` to exactly what is needed. This allows much faster scale up.
2525

26+
## Configuration
27+
28+
### Availability-based scaling
29+
30+
The scaler monitors agent availability to handle situations where EC2 instances are healthy but Buildkite agents aren't connecting. This can happen due to network issues, agent configuration problems, or instance startup delays.
31+
32+
**`AVAILABILITY_THRESHOLD`** (default: `0.5`)
33+
34+
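When jobs are queued, the scaler checks whether the fraction of expected agents that are actually connected meets this threshold. Expected agents are the number of running instances multiplied by `AGENTS_PER_INSTANCE`. For example, with 4 agents per instance and 2 instances running (8 expected agents), if only 3 agents are online, availability is 3/8 = 37.5%, which is below the default threshold of `0.5` (50%).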
35+
36+
When availability drops below the threshold and the ASG has converged (actual instances match desired), the scaler adds one instance to help recover availability.
37+
38+
Set `AVAILABILITY_THRESHOLD=0` to disable availability-based scaling. The scaler will then scale based only on job count.
39+
40+
**Threshold tuning:**
41+
42+
* **Lower threshold (e.g., 0.3)**: Tolerates slower agent connection times, reduces instance churn
43+
* **Higher threshold (e.g., 0.8)**: Aggressive scaling to maintain high availability when agents are expected to connect quickly
44+
* **Disabled (0)**: Job-based scaling only, suitable when agents connect reliably
45+
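The availability check described above can be sketched in Go as follows. This is a minimal illustration, not the scaler's actual code: the function names `availability` and `belowThreshold` are hypothetical, and the sketch assumes that expected agents are simply instances multiplied by agents per instance.

```go
package main

import "fmt"

// availability returns the fraction of expected agents that are online.
// Hypothetical helper: expected agents = instances * agentsPerInstance.
func availability(onlineAgents, instances, agentsPerInstance int64) float64 {
	expected := instances * agentsPerInstance
	if expected <= 0 {
		return 1 // nothing is expected, so nothing is missing
	}
	return float64(onlineAgents) / float64(expected)
}

// belowThreshold reports whether availability is under the threshold.
// A threshold of 0 (or less) disables the check, mirroring AVAILABILITY_THRESHOLD=0.
func belowThreshold(onlineAgents, instances, agentsPerInstance int64, threshold float64) bool {
	if threshold <= 0 {
		return false
	}
	return availability(onlineAgents, instances, agentsPerInstance) < threshold
}

func main() {
	// The README example: 2 instances, 4 agents each, 3 agents online.
	fmt.Printf("availability: %.1f%%\n", 100*availability(3, 2, 4))
	fmt.Println("below 0.5 threshold:", belowThreshold(3, 2, 4, 0.5))
}
```

With the README's numbers, 3 of 8 expected agents are online, so the check reports availability below the default threshold.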
2646
## Gracefully scaling in
47+
2748
:construction: For [Elastic CI Stack][], a dedicated, experimental mode is now available, configured with the `ELASTIC_CI_MODE` variable. You can read more about it [here](./docs/elastic_ci_mode.md). :construction:
2849
___
2950

@@ -55,17 +76,17 @@ of the metrics that the [buildkite-agent-metrics][] binary collects:
5576
An AWS Lambda bundle is created and published as part of the build process. The lambda will require
5677
the following IAM permissions:
5778

58-
- `cloudwatch:PutMetricData`
59-
- `autoscaling:DescribeAutoScalingGroups`
60-
- `autoscaling:DescribeScalingActivities`
61-
- `autoscaling:SetDesiredCapacity`
79+
* `cloudwatch:PutMetricData`
80+
* `autoscaling:DescribeAutoScalingGroups`
81+
* `autoscaling:DescribeScalingActivities`
82+
* `autoscaling:SetDesiredCapacity`
6283

6384
Its handler is `bootstrap`; it uses the `provided.al2` runtime and requires the following env vars:
6485

65-
- `BUILDKITE_AGENT_TOKEN` or `BUILDKITE_AGENT_TOKEN_SSM_KEY`
66-
- `BUILDKITE_QUEUE`
67-
- `AGENTS_PER_INSTANCE`
68-
- `ASG_NAME`
86+
* `BUILDKITE_AGENT_TOKEN` or `BUILDKITE_AGENT_TOKEN_SSM_KEY`
87+
* `BUILDKITE_QUEUE`
88+
* `AGENTS_PER_INSTANCE`
89+
* `ASG_NAME`
6990

7091
If `BUILDKITE_AGENT_TOKEN_SSM_KEY` is set, the token will be read from
7192
[AWS Systems Manager Parameter Store GetParameter](https://docs.aws.amazon.com/systems-manager/latest/APIReference/API_GetParameter.html)
@@ -82,8 +103,9 @@ aws lambda create-function \
82103
```
83104

84105
## Development
106+
85107
This project uses [mise](https://mise.jdx.dev/) to manage development tooling, so that everything needed is installed in one step and at the expected versions.
86-
To install mise, execute [./bin/mise](./bin/mise) bootstrap script or follow [mise documentation](https://mise.jdx.dev/installing-mise.html).
108+
To install mise, execute [./bin/mise](./bin/mise) bootstrap script or follow [mise documentation](https://mise.jdx.dev/installing-mise.html).
87109
Run `mise install` to install all the required tooling defined in [mise.toml](./mise.toml).
88110

89111
### Running agent-scaler locally
@@ -103,7 +125,6 @@ The scaler is set up automatically by the [Elastic CI Stack][]'s CloudFormation
103125
reference the agent token and a queue name. A Lambda function running the scaler is then generated
104126
using these references (e.g., `BUILDKITE_AGENT_TOKEN_SSM_KEY` and `BUILDKITE_QUEUE`).
105127

106-
107128
## Copyright
108129

109130
Copyright (c) 2014-2019 Buildkite Pty Ltd. See [LICENSE](./LICENSE.txt) for details.

docs/elastic_ci_mode.md

Lines changed: 8 additions & 3 deletions
@@ -91,9 +91,14 @@ Job dispatch delays, other issues related to job processing.
9191
```
9292

9393
## Configuration Parameters
94+
95+
### Availability Monitoring (applies to all modes)
96+
- `AVAILABILITY_THRESHOLD`: Minimum fraction of expected agents that must be online; availability below this value triggers a scale-out (default `0.5`, i.e. 50%: with 4 agents per instance on a single instance, we scale out when fewer than 2 agents are online). Set to `0` to disable availability-based scaling.
97+
98+
### Elastic CI Mode Specific
99+
**Note:** The following settings apply **only when `ELASTIC_CI_MODE=true`**
100+
94101
- `ELASTIC_CI_MODE`: Enable enhanced safety features, only for [Elastic CI Stack](https://github.com/buildkite/elastic-ci-stack-for-aws)! (boolean)
95-
- `AVAILABILITY_THRESHOLD`: Minimum agent availability percentage (default 90%)
96-
- `MIN_AGENTS_PERCENTAGE`: Minimum acceptable percentage of expected agents — ratio of desired agents number to actual (default 50%, i.e. we tolerate 2 agent instances running on a single EC2 out of desired 4)
97102
- `DANGLING_CHECK_MINIMUM_INSTANCE_UPTIME`: Minimum instance uptime before checking for dangling instances (default 1h)
98103
- `MAX_DANGLING_INSTANCES_TO_CHECK`: Maximum number of instances to scan for dangling detection (default 5)
99-
- `SCALE_IN_COOLDOWN_PERIOD`: Time to wait between scale-in operations (default 1h)
104+
- `SCALE_IN_COOLDOWN_PERIOD`: Time to wait between scale-in operations (default 1h for Elastic CI Mode, 0 otherwise)

lambda/main.go

Lines changed: 2 additions & 5 deletions
@@ -74,12 +74,10 @@ func Handler(ctx context.Context, evt json.RawMessage) (string, error) {
7474
includeWaiting := EnvBool("INCLUDE_WAITING")
7575
instanceBuffer := EnvInt("INSTANCE_BUFFER", 0)
7676
maxDescribeScalingActivitiesPages := EnvInt("MAX_DESCRIBE_SCALING_ACTIVITIES_PAGES", -1)
77-
// Below settings only applicable when elasticCIMode is enabled!
78-
availabilityThreshold := EnvFloat("AVAILABILITY_THRESHOLD") // Default to 90% in scaling calculator
79-
minAgentsPercentage := EnvFloat("MIN_AGENTS_PERCENTAGE", 0.5) // Default to 50% in scaling calculator
77+
availabilityThreshold := EnvFloat("AVAILABILITY_THRESHOLD", 0.5) // Default to 50%
78+
// Below settings only applicable when elasticCIMode is enabled
8079
minimumInstanceUptime := EnvDuration("DANGLING_CHECK_MINIMUM_INSTANCE_UPTIME", 1*time.Hour)
8180
maxDanglingInstancesToCheck := EnvInt("MAX_DANGLING_INSTANCES_TO_CHECK", 5) // Maximum number of instances to check for dangling instances (only used for dangling instance scanning, not for normal scale-in)
82-
// Above settings only applicable when elasticCIMode is enabled!
8381

8482
publishCloudWatchMetrics := EnvBool("CLOUDWATCH_METRICS")
8583
if publishCloudWatchMetrics {
@@ -189,7 +187,6 @@ func Handler(ctx context.Context, evt json.RawMessage) (string, error) {
189187
ScaleOnlyAfterAllEvent: scaleOnlyAfterAllEvent,
190188
PublishCloudWatchMetrics: publishCloudWatchMetrics,
191189
AvailabilityThreshold: availabilityThreshold,
192-
MinAgentsPercentage: minAgentsPercentage,
193190
ElasticCIMode: elasticCIMode,
194191
MinimumInstanceUptime: minimumInstanceUptime,
195192
MaxDanglingInstancesToCheck: maxDanglingInstancesToCheck,
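The change above gives `AVAILABILITY_THRESHOLD` a default of `0.5` at the point where it is read. A minimal sketch of that pattern follows. The helpers `parseFloatOr` and `envFloat` are hypothetical stand-ins and are not the repository's `EnvFloat`, whose implementation is not shown in this diff; falling back to the default on a parse error is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseFloatOr parses raw as a float64 and returns def when parsing fails.
// Hypothetical helper; the fallback-on-error behaviour is an assumption.
func parseFloatOr(raw string, def float64) float64 {
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return def
	}
	return v
}

// envFloat reads a float from the environment, using def when the variable is unset.
func envFloat(name string, def float64) float64 {
	raw, ok := os.LookupEnv(name)
	if !ok {
		return def
	}
	return parseFloatOr(raw, def)
}

func main() {
	// Unset: the default threshold of 0.5 applies.
	os.Unsetenv("AVAILABILITY_THRESHOLD")
	fmt.Println("unset:", envFloat("AVAILABILITY_THRESHOLD", 0.5))

	// Explicit 0: parsed as-is, which disables availability-based scaling.
	os.Setenv("AVAILABILITY_THRESHOLD", "0")
	fmt.Println("zero:", envFloat("AVAILABILITY_THRESHOLD", 0.5))
}
```

Distinguishing "unset" from "set to 0" matters here because `0` is a meaningful value that turns the feature off.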

scaler/scaler.go

Lines changed: 4 additions & 4 deletions
@@ -33,8 +33,7 @@ type Params struct {
3333
ScaleOutParams ScaleParams
3434
InstanceBuffer int
3535
ScaleOnlyAfterAllEvent bool
36-
AvailabilityThreshold float64 // Threshold for agent availability
37-
MinAgentsPercentage float64 // Minimum acceptable percentage of expected agents
36+
AvailabilityThreshold float64 // Threshold for agent availability (default 50%, all modes)
3837
ASGActivityCooldown time.Duration // How long to wait after an ASG activity before scaling again
3938
ElasticCIMode bool // Special mode for Elastic CI Stack with additional safety checks
4039
MinimumInstanceUptime time.Duration // How long instance should be online before being eligible for dangling instance check
@@ -88,7 +87,6 @@ func NewScaler(client *buildkite.Client, cfg aws.Config, params Params) (*Scaler
8887
includeWaiting: params.IncludeWaiting,
8988
agentsPerInstance: params.AgentsPerInstance,
9089
availabilityThreshold: params.AvailabilityThreshold,
91-
minAgentsPercentage: params.MinAgentsPercentage,
9290
elasticCIMode: params.ElasticCIMode,
9391
}
9492

@@ -201,7 +199,9 @@ func (s *Scaler) Run(ctx context.Context) (time.Duration, error) {
201199
proportionalBuffer = int64(s.instanceBuffer)
202200
}
203201

204-
log.Printf("↳ 🧮 Adding proportional instance buffer: %d (based on %d total jobs)", proportionalBuffer, totalJobs)
202+
if proportionalBuffer > 0 {
203+
log.Printf("↳ 🧮 Adding proportional instance buffer: %d (based on %d total jobs)", proportionalBuffer, totalJobs)
204+
}
205205
desired += proportionalBuffer
206206
}
207207

scaler/scaler_test.go

Lines changed: 156 additions & 2 deletions
@@ -268,7 +268,7 @@ func TestScalingOutWithoutError(t *testing.T) {
268268
scaling: ScalingCalculator{
269269
includeWaiting: tc.params.IncludeWaiting,
270270
agentsPerInstance: tc.params.AgentsPerInstance,
271-
availabilityThreshold: 0.0, // Disable availability threshold for tests
271+
availabilityThreshold: 0, // Disable availability threshold for tests
272272
},
273273
}
274274

@@ -417,7 +417,7 @@ func TestScalingInWithoutError(t *testing.T) {
417417
scaling: ScalingCalculator{
418418
includeWaiting: tc.params.IncludeWaiting,
419419
agentsPerInstance: tc.params.AgentsPerInstance,
420-
availabilityThreshold: 0.0, // Disable availability threshold for tests
420+
availabilityThreshold: 0, // Disable availability threshold for tests
421421
},
422422
scaleInParams: tc.params.ScaleInParams,
423423
scaleOutParams: tc.params.ScaleOutParams,
@@ -451,6 +451,7 @@ func (d *buildkiteTestDriver) GetAgentMetrics(ctx context.Context) (buildkite.Ag
451451
type asgTestDriver struct {
452452
err error
453453
desiredCapacity int64
454+
actualCapacity int64 // If 0, will default to desiredCapacity
454455
sigTermsSent []string
455456
elasticCIMode bool
456457
danglingInstancesFound int
@@ -463,8 +464,14 @@ func (d *asgTestDriver) Describe(ctx context.Context) (AutoscaleGroupDetails, er
463464
instanceIDs[i] = fmt.Sprintf("i-%012d", i)
464465
}
465466

467+
actualCount := d.actualCapacity
468+
if actualCount == 0 {
469+
actualCount = d.desiredCapacity
470+
}
471+
466472
return AutoscaleGroupDetails{
467473
DesiredCount: d.desiredCapacity,
474+
ActualCount: actualCount,
468475
MinSize: 0,
469476
MaxSize: 100,
470477
InstanceIDs: instanceIDs,
@@ -488,3 +495,150 @@ func (d *asgTestDriver) CleanupDanglingInstances(ctx context.Context, minimumIns
488495
d.danglingInstancesFound++
489496
return d.err
490497
}
498+
499+
func TestAvailabilityBasedScaling(t *testing.T) {
500+
testCases := []struct {
501+
name string
502+
metrics buildkite.AgentMetrics
503+
asgDesired int64
504+
asgActual int64
505+
agentsPerInstance int
506+
availabilityThreshold float64
507+
expectedDesiredCapacity int64
508+
}{
509+
// With 2 instances @ 4 agents each = 8 expected, but only 3 online (37.5%).
510+
// Should scale from 2 to 3 instances when ASG has converged.
511+
{
512+
name: "Low availability triggers scale-out when ASG converged",
513+
metrics: buildkite.AgentMetrics{
514+
ScheduledJobs: 5,
515+
RunningJobs: 2,
516+
TotalAgents: 3,
517+
},
518+
asgDesired: 2,
519+
asgActual: 2,
520+
agentsPerInstance: 4,
521+
availabilityThreshold: 0.5,
522+
expectedDesiredCapacity: 3,
523+
},
524+
// ASG not converged (actual 1 != desired 2), should wait for convergence
525+
// before applying availability-based scaling.
526+
{
527+
name: "Low availability does not trigger when ASG still converging",
528+
metrics: buildkite.AgentMetrics{
529+
ScheduledJobs: 5,
530+
RunningJobs: 2,
531+
TotalAgents: 3,
532+
},
533+
asgDesired: 2,
534+
asgActual: 1,
535+
agentsPerInstance: 4,
536+
availabilityThreshold: 0.5,
537+
expectedDesiredCapacity: 2,
538+
},
539+
// 7 out of 8 expected agents (87.5% availability) is above 50% threshold.
540+
// No scale-out needed.
541+
{
542+
name: "Good availability does not trigger scale-out",
543+
metrics: buildkite.AgentMetrics{
544+
ScheduledJobs: 5,
545+
RunningJobs: 2,
546+
TotalAgents: 7,
547+
},
548+
asgDesired: 2,
549+
asgActual: 2,
550+
agentsPerInstance: 4,
551+
availabilityThreshold: 0.5,
552+
expectedDesiredCapacity: 2,
553+
},
554+
// Threshold set to 0 disables availability-based scaling.
555+
// No scale-out despite only 2 out of 8 agents online (25%).
556+
{
557+
name: "Availability threshold disabled (0) does not trigger",
558+
metrics: buildkite.AgentMetrics{
559+
ScheduledJobs: 5,
560+
RunningJobs: 2,
561+
TotalAgents: 2,
562+
},
563+
asgDesired: 2,
564+
asgActual: 2,
565+
agentsPerInstance: 4,
566+
availabilityThreshold: 0,
567+
expectedDesiredCapacity: 2,
568+
},
569+
// With 0 instances, job-based scaling takes over.
570+
// Need 2 instances for 5 jobs (at 4 agents per instance).
571+
{
572+
name: "Zero instances falls back to job-based scaling",
573+
metrics: buildkite.AgentMetrics{
574+
ScheduledJobs: 5,
575+
RunningJobs: 0,
576+
TotalAgents: 0,
577+
},
578+
asgDesired: 0,
579+
asgActual: 0,
580+
agentsPerInstance: 4,
581+
availabilityThreshold: 0.5,
582+
expectedDesiredCapacity: 2,
583+
},
584+
// Only 2 out of 12 expected agents online (16.7% availability).
585+
// Availability-based boost from 3 to 4 overrides lower job-based need (1).
586+
{
587+
name: "Availability boost when job-based need is lower",
588+
metrics: buildkite.AgentMetrics{
589+
ScheduledJobs: 2,
590+
RunningJobs: 0,
591+
TotalAgents: 2,
592+
},
593+
asgDesired: 3,
594+
asgActual: 3,
595+
agentsPerInstance: 4,
596+
availabilityThreshold: 0.5,
597+
expectedDesiredCapacity: 4,
598+
},
599+
// Need 5 instances for 20 jobs. Job-based scaling (5) dominates
600+
// over availability boost (3), despite low availability (25%).
601+
{
602+
name: "No boost when job-based need is higher",
603+
metrics: buildkite.AgentMetrics{
604+
ScheduledJobs: 20,
605+
RunningJobs: 0,
606+
TotalAgents: 2,
607+
},
608+
asgDesired: 2,
609+
asgActual: 2,
610+
agentsPerInstance: 4,
611+
availabilityThreshold: 0.5,
612+
expectedDesiredCapacity: 5,
613+
},
614+
}
615+
616+
for _, tc := range testCases {
617+
t.Run(tc.name, func(t *testing.T) {
618+
asg := &asgTestDriver{
619+
desiredCapacity: tc.asgDesired,
620+
actualCapacity: tc.asgActual,
621+
}
622+
623+
s := Scaler{
624+
autoscaling: asg,
625+
bk: &buildkiteTestDriver{metrics: tc.metrics},
626+
scaling: ScalingCalculator{
627+
includeWaiting: false,
628+
agentsPerInstance: tc.agentsPerInstance,
629+
availabilityThreshold: tc.availabilityThreshold,
630+
},
631+
}
632+
633+
_, err := s.Run(context.Background())
634+
if err != nil {
635+
t.Fatalf("Unexpected error: %v", err)
636+
}
637+
638+
if asg.desiredCapacity != tc.expectedDesiredCapacity {
639+
t.Errorf("Expected desired capacity: %d, got: %d",
640+
tc.expectedDesiredCapacity, asg.desiredCapacity)
641+
}
642+
})
643+
}
644+
}
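The expected capacities in the test table above are consistent with a simple rule: take the job-based need, then raise it to "current desired + 1" when availability is below the threshold and the ASG has converged. The sketch below reproduces the seven expected values under that reading. It is not the scaler's implementation: `targetCapacity` and the `metrics` type are hypothetical, and counting scheduled plus running jobs toward the job-based need is an assumption.

```go
package main

import "fmt"

// metrics mirrors the three fields the test table sets (illustrative type).
type metrics struct {
	ScheduledJobs, RunningJobs, TotalAgents int64
}

// targetCapacity is a hypothetical reduction of the decision the tests exercise.
// It assumes agentsPerInstance > 0.
func targetCapacity(m metrics, desired, actual, agentsPerInstance int64, threshold float64) int64 {
	// Job-based need: ceiling of (scheduled + running jobs) / agents per instance.
	jobs := m.ScheduledJobs + m.RunningJobs
	target := (jobs + agentsPerInstance - 1) / agentsPerInstance

	// Skip the availability check when it is disabled, when there are no
	// instances to measure, or while the ASG is still converging.
	if threshold <= 0 || desired == 0 || desired != actual {
		return target
	}

	// Below threshold: add one instance, unless job-based need is already higher.
	expected := desired * agentsPerInstance
	if float64(m.TotalAgents)/float64(expected) < threshold && desired+1 > target {
		target = desired + 1
	}
	return target
}

func main() {
	// First table case: 3 of 8 agents online, ASG converged at 2 instances.
	low := metrics{ScheduledJobs: 5, RunningJobs: 2, TotalAgents: 3}
	fmt.Println("converged: ", targetCapacity(low, 2, 2, 4, 0.5))
	// Same metrics while the ASG is still converging (actual 1, desired 2).
	fmt.Println("converging:", targetCapacity(low, 2, 1, 4, 0.5))
}
```

Taking the larger of the two values means the availability boost never reduces a capacity that job-based scaling already requires, which matches the last two table cases.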
