Add secondary node group failover and continuous execution mode #12

Open
HongboDu-at wants to merge 1 commit into buildbarn:main from HongboDu-at:add-node-group-secondary-failover-and-continuous-mode
Conversation

@HongboDu-at HongboDu-at commented Jan 13, 2026

Key Changes

1. New Continuous Execution Mode

  • Added execution_interval_seconds configuration option
  • When set to a positive value, the autoscaler runs indefinitely at the specified interval
  • When not set or zero, it runs once and exits (original cron job behavior)
  • Includes error tolerance: exits after 5 consecutive failures

2. Secondary Node Group Failover Support

  • Added secondary_node_group_name field to EKSManagedNodeGroupConfiguration
  • Enables maximizing spot instance usage while automatically failing over to on-demand instances when spot capacity is unavailable
  • When the primary EKS node group (spot instances) has health issues (degraded status or capacity shortage), the secondary node group (on-demand instances) automatically scales up to fill the capacity gap
  • When primary is healthy, secondary scales down to its minimum size to minimize costs
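The scaling rule in the bullets above can be sketched as a pure function. All names here are illustrative, assumed for the example rather than taken from the PR:

```go
package main

// desiredSecondarySize sketches the failover rule: when the primary (spot)
// node group is unhealthy, the secondary (on-demand) group grows to cover
// the capacity gap; when the primary is healthy, the secondary shrinks
// back to its minimum size to minimize costs.
func desiredSecondarySize(primaryHealthy bool, desiredCapacity, primaryCapacity, secondaryMin int) int {
	if primaryHealthy {
		return secondaryMin // primary covers demand; keep costs minimal
	}
	gap := desiredCapacity - primaryCapacity
	if gap < secondaryMin {
		return secondaryMin // never scale below the group's minimum size
	}
	return gap
}
```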

3. Code Refactoring

  • Extracted autoscaling logic from main.go into a new autoscaler.go file
  • Created an Autoscaler struct that holds reusable clients (Prometheus, AWS, Kubernetes)
  • NewAutoscaler() constructor initializes all clients once
  • RunOnce() method executes a single autoscaling cycle
  • Prevents resource leaks from recreating HTTP clients in continuous mode
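A rough shape of the refactoring, assuming illustrative field types (the real struct holds Prometheus, AWS, and Kubernetes clients rather than a bare `http.Client`):

```go
package main

import "net/http"

// Autoscaler holds clients that are expensive to create. Building them
// once in the constructor and reusing them on every cycle is what keeps
// continuous mode from leaking HTTP clients.
type Autoscaler struct {
	httpClient *http.Client // stand-in for the real API clients
	cycles     int          // completed cycles, for illustration only
}

// NewAutoscaler initializes all clients a single time.
func NewAutoscaler() *Autoscaler {
	return &Autoscaler{httpClient: &http.Client{}}
}

// RunOnce executes one autoscaling cycle using the stored clients.
func (a *Autoscaler) RunOnce() error {
	a.cycles++ // a real cycle would query metrics and scale node groups
	return nil
}
```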

Files Changed

  • cmd/bb_autoscaler/autoscaler.go: New file with the Autoscaler struct. Holds reusable clients (Prometheus, AWS, Kubernetes) and prevents resource leaks from recreating HTTP clients in continuous mode.
  • cmd/bb_autoscaler/main.go: Simplified to use the Autoscaler.
  • pkg/proto/.../bb_autoscaler.proto: Added the new configuration fields.
  • cmd/bb_autoscaler/BUILD.bazel: Added the new source file and dependency.

aspect-workflows bot commented Jan 13, 2026

Test

⚠️ Buildkite build #8 is currently failing.

@@gazelle++go_deps+org_golang_google_grpc_cmd_protoc_gen_go_grpc//:protoc-gen-go-grpc failed to build

no such package '@@gazelle++go_deps+org_golang_google_grpc_cmd_protoc_gen_go_grpc//': no such package
'@@gazelle++non_module_deps+bazel_gazelle_go_repository_tools//': failed to build tools:

💡 To reproduce the build failures, run

bazel build @@gazelle++go_deps+org_golang_google_grpc_cmd_protoc_gen_go_grpc//:protoc-gen-go-grpc

// the autoscaler runs once and exits (suitable for running as a cron job).
//
// Example: 300 for 5 minutes
int32 execution_interval_seconds = 6;
Member

I'm not really convinced we should add a feature like this. What platform are you trying to run this on, which doesn't support cron jobs? Plain UNIX has crontab, Kubernetes has cron jobs, etc.

Author

We are on Kubernetes. I used to run it as a cron job every minute, but I found I want it to run every 10 seconds so that it can bring up workers faster. We have no builds at night, so we let the workers scale down to 0, and our build volumes fluctuate quite a lot during the day.

Totally okay and understandable if this feature is not accepted; we will just keep a fork. Let me know what you think.


// Optional: Secondary node group that scales up to fill capacity gaps
// when the primary has health issues (e.g., spot capacity shortage).
string secondary_node_group_name = 3;
Member

I'm always a big fan of the "zero one infinity" rule: https://en.wikipedia.org/wiki/Zero_one_infinity_rule

Instead of going down this route, can't we turn node_group_name into something like this:

// Names of the managed node groups in the EKS cluster, specified in the
// order in which resource allocation should be attempted.
repeated string node_group_names = 2;
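A minimal sketch of how the repeated field could generalize the two-group failover: walk the groups in order, giving each as much of the desired capacity as its maximum size allows before spilling into the next. The function and parameter names are assumptions for illustration.

```go
package main

// scaleAcrossGroups distributes the desired capacity across node groups
// in priority order, capping each group at its maximum size. With two
// groups this reduces to the primary/secondary failover behavior.
func scaleAcrossGroups(desired int, maxSizes []int) []int {
	sizes := make([]int, len(maxSizes))
	remaining := desired
	for i, max := range maxSizes {
		n := remaining
		if n > max {
			n = max // group is full; spill the rest to the next one
		}
		sizes[i] = n
		remaining -= n
	}
	return sizes
}
```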

Author

This is a good idea.
