Add secondary node group failover and continuous execution mode #12

Open
HongboDu-at wants to merge 1 commit into buildbarn:main from HongboDu-at:add-node-group-secondary-failover-and-continuous-mode
Conversation

@HongboDu-at HongboDu-at commented Jan 13, 2026

Key Changes

1. New Continuous Execution Mode

  • Added execution_interval_seconds configuration option
  • When set to a positive value, the autoscaler runs indefinitely at the specified interval
  • When not set or zero, it runs once and exits (original cron job behavior)
  • Includes error tolerance: exits after 5 consecutive failures

2. Secondary Node Group Failover Support

  • Added secondary_node_group_name field to EKSManagedNodeGroupConfiguration
  • Enables maximizing spot instance usage while automatically failing over to on-demand instances when spot capacity is unavailable
  • When the primary EKS node group (spot instances) has health issues (degraded status or capacity shortage), the secondary node group (on-demand instances) automatically scales up to fill the capacity gap
  • When primary is healthy, secondary scales down to its minimum size to minimize costs
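The scaling rule in the bullets above can be sketched as a pure function. All names here are illustrative, assumed for the example rather than taken from the PR:

```go
package main

// desiredSecondarySize sketches the failover rule: when the primary (spot)
// node group is unhealthy, the secondary (on-demand) group grows to cover
// the capacity gap; when the primary is healthy, the secondary shrinks
// back to its minimum size to minimize costs.
func desiredSecondarySize(primaryHealthy bool, desiredCapacity, primaryCapacity, secondaryMin int) int {
	if primaryHealthy {
		return secondaryMin // primary covers demand; keep costs minimal
	}
	gap := desiredCapacity - primaryCapacity
	if gap < secondaryMin {
		return secondaryMin // never scale below the group's minimum size
	}
	return gap
}
```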

3. Code Refactoring

  • Extracted autoscaling logic from main.go into a new autoscaler.go file
  • Created an Autoscaler struct that holds reusable clients (Prometheus, AWS, Kubernetes)
  • NewAutoscaler() constructor initializes all clients once
  • RunOnce() method executes a single autoscaling cycle
  • Prevents resource leaks from recreating HTTP clients in continuous mode
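A rough shape of the refactoring, assuming illustrative field types (the real struct holds Prometheus, AWS, and Kubernetes clients rather than a bare `http.Client`):

```go
package main

import "net/http"

// Autoscaler holds clients that are expensive to create. Building them
// once in the constructor and reusing them on every cycle is what keeps
// continuous mode from leaking HTTP clients.
type Autoscaler struct {
	httpClient *http.Client // stand-in for the real API clients
	cycles     int          // completed cycles, for illustration only
}

// NewAutoscaler initializes all clients a single time.
func NewAutoscaler() *Autoscaler {
	return &Autoscaler{httpClient: &http.Client{}}
}

// RunOnce executes one autoscaling cycle using the stored clients.
func (a *Autoscaler) RunOnce() error {
	a.cycles++ // a real cycle would query metrics and scale node groups
	return nil
}
```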

Files Changed

  • cmd/bb_autoscaler/autoscaler.go: New file with the Autoscaler struct. Holds reusable clients (Prometheus, AWS, Kubernetes) and prevents resource leaks from recreating HTTP clients in continuous mode.
  • cmd/bb_autoscaler/main.go: Simplified to use the Autoscaler.
  • pkg/proto/.../bb_autoscaler.proto: Added the new configuration fields.
  • cmd/bb_autoscaler/BUILD.bazel: Added the new source file and dependency.

aspect-workflows bot commented Jan 13, 2026

Test

⚠️ Buildkite build #8 is currently failing.

@@gazelle++go_deps+org_golang_google_grpc_cmd_protoc_gen_go_grpc//:protoc-gen-go-grpc failed to build

no such package '@@gazelle++go_deps+org_golang_google_grpc_cmd_protoc_gen_go_grpc//': no such package
'@@gazelle++non_module_deps+bazel_gazelle_go_repository_tools//': failed to build tools:

💡 To reproduce the build failures, run

bazel build @@gazelle++go_deps+org_golang_google_grpc_cmd_protoc_gen_go_grpc//:protoc-gen-go-grpc

// the autoscaler runs once and exits (suitable for running as a cron job).
//
// Example: 300 for 5 minutes
int32 execution_interval_seconds = 6;
Member

I'm not really convinced we should add a feature like this. What platform are you trying to run this on, which doesn't support cron jobs? Plain UNIX has crontab, Kubernetes has cron jobs, etc.

Author

We are on Kubernetes. I used to run it as a cron job every minute, but I found I want it to run every 10 seconds so that it can bring up workers faster. We have no builds at night, so we let the workers scale down to 0, and our build volumes fluctuate quite a lot during the day.

Totally okay and understandable if this feature is not accepted; we will just keep a fork. Let me know what you think.


// Optional: Secondary node group that scales up to fill capacity gaps
// when the primary has health issues (e.g., spot capacity shortage).
string secondary_node_group_name = 3;
Member

I'm always a big fan of the "zero one infinity" rule: https://en.wikipedia.org/wiki/Zero_one_infinity_rule

Instead of going down this route, can't we turn node_group_name into something like this:

// Names of the managed node groups in the EKS cluster, specified in the
// order in which resource allocation should be attempted.
repeated string node_group_names = 2;
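A minimal sketch of how the repeated field could generalize the two-group failover: walk the groups in order, giving each as much of the desired capacity as its maximum size allows before spilling into the next. The function and parameter names are assumptions for illustration.

```go
package main

// scaleAcrossGroups distributes the desired capacity across node groups
// in priority order, capping each group at its maximum size. With two
// groups this reduces to the primary/secondary failover behavior.
func scaleAcrossGroups(desired int, maxSizes []int) []int {
	sizes := make([]int, len(maxSizes))
	remaining := desired
	for i, max := range maxSizes {
		n := remaining
		if n > max {
			n = max // group is full; spill the rest to the next one
		}
		sizes[i] = n
		remaining -= n
	}
	return sizes
}
```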

Author

This is a good idea.
