CreateFleet batcher amplifies rate limit failures and lacks internal throttle protection #9175


## Problem

The CreateFleet batcher (`pkg/batcher/createfleet.go`) collects up to 1,000 individual instance requests into a single EC2 CreateFleet API call. If that call receives `RequestLimitExceeded`, every request in the batch (N of them) receives the same error at once. The batcher applies no rate limiting before calling EC2 and performs no retry of its own, so the error goes straight to all N callers.

One rate-limited API call becomes N simultaneous NodeClaim failures. All N retry together on the same schedule, re-triggering the rate limit on the next attempt.

## Evidence

During scale testing with 500+ pending pods, the batcher created a single CreateFleet call with TotalTargetCapacity=500. The call was rate-limited. All 500 NodeClaims failed and retried together. This repeated across retry cycles, sustaining the throttle until claims were manually deleted.

## Root Causes

### 1. One API call failure fans out to the entire batch

**File:** `pkg/batcher/createfleet.go`

Batcher configuration:
- `IdleTimeout = 35ms`
- `MaxTimeout = 1s`
- `MaxItems = 1000`

Requests arriving within the timeout window go into one CreateFleet call. If that call errors (including transient throttle), every request in the batch gets the same error. Blast radius equals batch size.

### 2. No rate limiting before EC2 API calls

The batcher calls CreateFleet as fast as batches fill. No token bucket, no adaptive limiter, no concurrency cap. It reacts to `RequestLimitExceeded` after the fact rather than staying within limits proactively.

### 3. No batcher-level retry for transient errors

When CreateFleet returns `RequestLimitExceeded`, the error goes to all waiting callers immediately. The batcher does not retry. Retry is delegated to the lifecycle controller, which retries each of the N failed NodeClaims independently, recreating the thundering herd at the controller level.

### 4. Throttle errors are not classified differently from permanent failures

**File:** `pkg/providers/instance/instance.go:376-384`

Two error paths exist for CreateFleet:

1. **API-level error** (the call itself fails, e.g., `RequestLimitExceeded`): The batcher returns `err` to all callers. The instance provider wraps it at line 384:
   ```go
   return ec2types.CreateFleetInstance{}, cloudprovider.NewCreateError(
       fmt.Errorf("creating fleet request, %w", err), reason, message)
   ```

2. **Fleet-level error** (the call succeeds but returns errors in the response): `combineFleetErrors` (line 766) classifies `UnfulfillableCapacity` as `InsufficientCapacityError`, but other errors (including throttle) become generic `CreateError`.

In both paths, `RequestLimitExceeded` has no special classification. The upstream lifecycle controller cannot tell that this is transient.

## Proposed Improvements

### A. Add a token bucket rate limiter before CreateFleet calls

**Where:** `pkg/batcher/createfleet.go`

Cap the rate of CreateFleet API calls with a `rate.Limiter` or similar. Parameters could be fixed or adaptive (slow down when throttle errors appear, speed up when calls succeed).

### B. Batcher-level retry with backoff and jitter

**Where:** `pkg/batcher/createfleet.go`

When CreateFleet returns `RequestLimitExceeded`, retry the call within the batcher before surfacing the error to callers: 2-3 attempts with jittered exponential backoff (nominal delays of 500ms, 1s, 2s). This keeps the blast radius contained and avoids triggering N independent retry paths upstream.

### C. Reduce maximum batch size (tradeoff)

Cap effective batch size below 1,000 (e.g., 100-200) or split large batches into staggered sub-batches. Fewer callers affected per failure. Tradeoff: more API calls total, but smaller blast radius when one fails.
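Splitting a large batch could look roughly like the following. `splitBatch` is a hypothetical helper, not an existing function in the batcher, and the staggering between sub-batches is left out.

```go
package main

// splitBatch divides items into consecutive sub-batches of at most maxSize
// elements. A non-positive maxSize leaves the batch unsplit.
func splitBatch[T any](items []T, maxSize int) [][]T {
	if maxSize <= 0 {
		maxSize = len(items)
	}
	var out [][]T
	for len(items) > maxSize {
		// The three-index slice caps capacity so that appending to one
		// sub-batch cannot overwrite elements of the next.
		out = append(out, items[:maxSize:maxSize])
		items = items[maxSize:]
	}
	if len(items) > 0 {
		out = append(out, items)
	}
	return out
}
```

For the 500-request case from the Evidence section and a cap of 200, this yields sub-batches of 200, 200, and 100, so a single throttled call would affect at most 200 callers.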

### D. Classify RequestLimitExceeded as retryable

**Where:** `pkg/providers/instance/instance.go:384` and `pkg/providers/instance/instance.go:766-780`

Wrap `RequestLimitExceeded` errors with the `RetryableError` interface proposed in [kubernetes-sigs/karpenter#3034](https://github.com/kubernetes-sigs/karpenter/issues/3034). This lets the lifecycle controller apply different retry policies and avoid marking the NodePool unhealthy on transient throttle.

Update `combineFleetErrors` to check for throttle error codes alongside the existing `UnfulfillableCapacity` check.

## Code References

| File | Relevance |
|------|-----------|
| `pkg/batcher/createfleet.go` | Batcher config, error fan-out to all callers |
| `pkg/providers/instance/instance.go:376-384` | API-level error wrapping as generic `CreateError` |
| `pkg/providers/instance/instance.go:766-780` | `combineFleetErrors`, only classifies `UnfulfillableCapacity` specially |

## Related

Companion to [kubernetes-sigs/karpenter#3034](https://github.com/kubernetes-sigs/karpenter/issues/3034), which addresses lifecycle-level handling of transient errors (stuck Unknown NodeClaims, jittered backoff, health condition separation). The `RetryableError` interface proposed there is what improvement D here would implement.

---

**Suggested labels:**
- `kind/bug`
- `priority/important-longterm`
- `area/provisioning`
