Problem
The CreateFleet batcher (pkg/batcher/createfleet.go) collects up to 1,000 individual instance requests into a single EC2 CreateFleet API call. If that call receives RequestLimitExceeded, all N batched requests receive the same error at once. There is no rate limiting before calling EC2 and no batcher-level retry. The error goes straight to all callers.
One rate-limited API call becomes N simultaneous NodeClaim failures. All N retry together on the same schedule, re-triggering the rate limit on the next attempt.
Evidence
During scale testing with 500+ pending pods, the batcher created a single CreateFleet call with TotalTargetCapacity=500. The call was rate-limited. All 500 NodeClaims failed and retried together. This repeated across retry cycles, sustaining the throttle until claims were manually deleted.
Root Causes
1. One API call failure fans out to the entire batch
File: pkg/batcher/createfleet.go
Batcher configuration:
IdleTimeout = 35ms
MaxTimeout = 1s
MaxItems = 1000
Requests arriving within the timeout window go into one CreateFleet call. If that call errors (including transient throttle), every request in the batch gets the same error. Blast radius equals batch size.
2. No rate limiting before EC2 API calls
The batcher calls CreateFleet as fast as batches fill. No token bucket, no adaptive limiter, no concurrency cap. It reacts to RequestLimitExceeded after the fact rather than staying within limits proactively.
3. No batcher-level retry for transient errors
When CreateFleet returns RequestLimitExceeded, the error goes to all waiting callers immediately. The batcher does not retry. Retry is delegated to the lifecycle controller, which retries each of the N failed NodeClaims independently, recreating the thundering herd at the controller level.
4. Throttle errors are not classified differently from permanent failures
File: pkg/providers/instance/instance.go:376-384
Two error paths exist for CreateFleet:
- API-level error (the call itself fails, e.g., RequestLimitExceeded): The batcher returns err to all callers. The instance provider wraps it at line 384:

      return ec2types.CreateFleetInstance{}, cloudprovider.NewCreateError(
          fmt.Errorf("creating fleet request, %w", err), reason, message)

- Fleet-level error (the call succeeds but returns errors in the response): combineFleetErrors (line 766) classifies UnfulfillableCapacity as InsufficientCapacityError, but other errors (including throttle) become generic CreateError.
In both paths, RequestLimitExceeded has no special classification. The upstream lifecycle controller cannot tell that this is transient.
Proposed Improvements
A. Add a token bucket rate limiter before CreateFleet calls
Where: pkg/batcher/createfleet.go
Cap the rate of CreateFleet API calls with a rate.Limiter or similar. Parameters could be fixed or adaptive (slow down when throttle errors appear, speed up when calls succeed).
B. Batcher-level retry with backoff and jitter
Where: pkg/batcher/createfleet.go
When CreateFleet returns RequestLimitExceeded, retry the call within the batcher (2-3 attempts, jittered backoff at 500ms/1s/2s base) before surfacing the error to callers. This keeps the blast radius contained and avoids triggering N independent retry paths upstream.
C. Reduce maximum batch size (tradeoff)
Cap effective batch size below 1,000 (e.g., 100-200) or split large batches into staggered sub-batches. Fewer callers affected per failure. Tradeoff: more API calls total, but smaller blast radius when one fails.
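As a rough illustration of the splitting option, the hypothetical helper below computes sub-batch sizes for a given cap. How the sub-batches are staggered in time is left out, and the cap of 200 is just one of the values suggested above.

```go
package main

import "fmt"

// splitBatch divides n queued requests into sub-batch sizes of at most
// limit each. Hypothetical helper; the batcher has no such function today.
func splitBatch(n, limit int) []int {
	if n <= 0 || limit <= 0 {
		return nil
	}
	sizes := make([]int, 0, (n+limit-1)/limit)
	for n > 0 {
		size := limit
		if n < limit {
			size = n
		}
		sizes = append(sizes, size)
		n -= size
	}
	return sizes
}

func main() {
	fmt.Println(splitBatch(500, 200)) // [200 200 100]
}
```

With the 500-claim batch from the Evidence section and a cap of 200, a single throttled call would fail at most 200 claims instead of all 500, at the cost of three API calls instead of one.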
D. Classify RequestLimitExceeded as retryable
Where: pkg/providers/instance/instance.go:384 and pkg/providers/instance/instance.go:766-780
Wrap RequestLimitExceeded errors with the RetryableError interface proposed in kubernetes-sigs/karpenter#3034. This lets the lifecycle controller apply different retry policies and avoid marking the NodePool unhealthy on transient throttle.
Update combineFleetErrors to check for throttle error codes alongside the existing UnfulfillableCapacity check.
Code References
| File | Relevance |
| --- | --- |
| pkg/batcher/createfleet.go | Batcher config, error fan-out to all callers |
| pkg/providers/instance/instance.go:376-384 | API-level error wrapping as generic CreateError |
| pkg/providers/instance/instance.go:766-780 | combineFleetErrors, only classifies UnfulfillableCapacity specially |
Related
Companion to kubernetes-sigs/karpenter#3034, which addresses lifecycle-level handling of transient errors (stuck Unknown NodeClaims, jittered backoff, health condition separation). Fix D here would implement the RetryableError interface proposed there.
Suggested labels:
kind/bug
priority/important-longterm
area/provisioning