CreateFleet batcher amplifies rate limit failures and lacks internal throttle protection #9175

@nathangeology

Problem

The CreateFleet batcher (pkg/batcher/createfleet.go) collects up to 1,000 individual instance requests into a single EC2 CreateFleet API call. If that call receives RequestLimitExceeded, all N batched requests receive the same error at once. There is no rate limiting before calling EC2 and no batcher-level retry. The error goes straight to all callers.

One rate-limited API call becomes N simultaneous NodeClaim failures. All N retry together on the same schedule, re-triggering the rate limit on the next attempt.

Evidence

During scale testing with 500+ pending pods, the batcher created a single CreateFleet call with TotalTargetCapacity=500. The call was rate-limited. All 500 NodeClaims failed and retried together. This repeated across retry cycles, sustaining the throttle until claims were manually deleted.

Root Causes

1. One API call failure fans out to the entire batch

File: pkg/batcher/createfleet.go

Batcher configuration:

  • IdleTimeout = 35ms
  • MaxTimeout = 1s
  • MaxItems = 1000

Requests arriving within the timeout window go into one CreateFleet call. If that call errors (including transient throttle), every request in the batch gets the same error. Blast radius equals batch size.
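The fan-out can be illustrated with a minimal sketch. This is not the code in pkg/batcher/createfleet.go; `fanOut`, `countFailures`, and `errThrottled` are hypothetical names used only to show how one failed call turns into N failed callers.

```go
package main

import (
	"errors"
	"fmt"
)

// errThrottled is a hypothetical sentinel standing in for EC2's
// RequestLimitExceeded error.
var errThrottled = errors.New("RequestLimitExceeded")

// fanOut makes one batched call on behalf of n callers and hands every
// caller the same result.
func fanOut(n int, call func(total int) error) []error {
	err := call(n) // a single API call covers the whole batch
	results := make([]error, n)
	for i := range results {
		results[i] = err // each caller receives the identical error
	}
	return results
}

// countFailures reports how many callers received a non-nil error.
func countFailures(results []error) int {
	failed := 0
	for _, err := range results {
		if err != nil {
			failed++
		}
	}
	return failed
}

func main() {
	results := fanOut(500, func(int) error { return errThrottled })
	fmt.Printf("1 throttled call, %d failed callers\n", countFailures(results))
}
```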

2. No rate limiting before EC2 API calls

The batcher calls CreateFleet as fast as batches fill. There is no token bucket, no adaptive limiter, and no concurrency cap. It only learns that it has exceeded the limit when EC2 returns RequestLimitExceeded, rather than staying within limits proactively.

3. No batcher-level retry for transient errors

When CreateFleet returns RequestLimitExceeded, the error goes to all waiting callers immediately. The batcher does not retry. Retry is delegated to the lifecycle controller, which retries each of the N failed NodeClaims independently, recreating the thundering herd at the controller level.

4. Throttle errors are not classified differently from permanent failures

File: pkg/providers/instance/instance.go:376-384

Two error paths exist for CreateFleet:

  1. API-level error (the call itself fails, e.g., RequestLimitExceeded): The batcher returns err to all callers. The instance provider wraps it at line 384:

    return ec2types.CreateFleetInstance{}, cloudprovider.NewCreateError(
        fmt.Errorf("creating fleet request, %w", err), reason, message)

  2. Fleet-level error (the call succeeds but returns errors in the response): combineFleetErrors (line 766) classifies UnfulfillableCapacity as InsufficientCapacityError, but other errors (including throttle) become generic CreateError.

In both paths, RequestLimitExceeded has no special classification. The upstream lifecycle controller cannot tell that this is transient.

Proposed Improvements

A. Add a token bucket rate limiter before CreateFleet calls

Where: pkg/batcher/createfleet.go

Cap the rate of CreateFleet API calls with a rate.Limiter or similar. Parameters could be fixed or adaptive (slow down when throttle errors appear, speed up when calls succeed).
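As a rough sketch of the fixed-parameter variant, the block below implements a small token bucket using only the standard library. It is a stand-in for a ready-made limiter such as rate.Limiter from golang.org/x/time/rate, not a proposal for the final code. The names `tokenBucket`, `allow`, and `simulate`, and the rate and burst values, are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenBucket is a minimal, hypothetical limiter: it refills at a fixed
// rate and holds at most burst tokens.
type tokenBucket struct {
	mu     sync.Mutex
	rate   float64 // tokens added per second
	burst  float64 // maximum tokens held
	tokens float64
	last   time.Time
}

func newTokenBucket(ratePerSec float64, burst int, now time.Time) *tokenBucket {
	return &tokenBucket{rate: ratePerSec, burst: float64(burst), tokens: float64(burst), last: now}
}

// allow reports whether one call may proceed at time now and, if so,
// consumes a token.
func (b *tokenBucket) allow(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// simulate replays call attempts at the given millisecond offsets from a
// fixed start time and returns which attempts were allowed.
func simulate(ratePerSec float64, burst int, offsetsMs []int) []bool {
	start := time.Unix(0, 0)
	b := newTokenBucket(ratePerSec, burst, start)
	out := make([]bool, len(offsetsMs))
	for i, ms := range offsetsMs {
		out[i] = b.allow(start.Add(time.Duration(ms) * time.Millisecond))
	}
	return out
}

func main() {
	// 2 calls/second with a burst of 2 (illustrative values only).
	fmt.Println(simulate(2, 2, []int{0, 0, 0, 500, 500, 2000}))
}
```

In the batcher, a denied attempt would wait for the next token instead of being dropped. An adaptive variant could lower the rate when throttle errors appear and raise it again as calls succeed.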

B. Batcher-level retry with backoff and jitter

Where: pkg/batcher/createfleet.go

When CreateFleet returns RequestLimitExceeded, retry the call within the batcher (2-3 attempts, with jittered backoff around base delays of 500ms, 1s, and 2s) before surfacing the error to callers. This keeps the blast radius contained and avoids triggering N independent retry paths upstream.
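A minimal sketch of such a retry loop follows, assuming an "equal jitter" scheme in which each delay is drawn from the upper half of an exponentially growing window. `retryThrottled`, `backoffDelay`, `flakyCall`, `noSleep`, and `errThrottled` are hypothetical names; a real implementation would detect the throttle through the SDK's error code rather than a sentinel.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errThrottled is a hypothetical sentinel standing in for RequestLimitExceeded.
var errThrottled = errors.New("RequestLimitExceeded")

// backoffDelay returns a jittered delay for a 0-based attempt. The window
// doubles per attempt (base, 2*base, 4*base) and the result lies in
// [window/2, window].
func backoffDelay(attempt int, base time.Duration) time.Duration {
	window := base << uint(attempt)
	half := window / 2
	return half + time.Duration(rand.Int63n(int64(half)+1))
}

// retryThrottled calls fn and retries up to maxRetries times, but only
// when the error is a throttle. Other errors are returned immediately.
func retryThrottled(maxRetries int, base time.Duration, sleep func(time.Duration), fn func() error) error {
	for attempt := 0; ; attempt++ {
		err := fn()
		if err == nil || !errors.Is(err, errThrottled) || attempt >= maxRetries {
			return err
		}
		sleep(backoffDelay(attempt, base))
	}
}

// flakyCall returns a function that is throttled for the first `failures`
// invocations, plus a pointer to its invocation counter.
func flakyCall(failures int) (func() error, *int) {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errThrottled
		}
		return nil
	}, &calls
}

// noSleep skips waiting, which is useful in tests.
func noSleep(time.Duration) {}

func main() {
	call, calls := flakyCall(2)
	// A short base keeps the demo fast; the issue suggests 500ms.
	err := retryThrottled(3, 10*time.Millisecond, time.Sleep, call)
	fmt.Printf("err=%v after %d calls\n", err, *calls)
}
```

Because the retry happens once per batch, the N callers never see the transient error unless every attempt is throttled.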

C. Reduce maximum batch size (tradeoff)

Cap effective batch size below 1,000 (e.g., 100-200) or split large batches into staggered sub-batches. Fewer callers affected per failure. Tradeoff: more API calls total, but smaller blast radius when one fails.
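The splitting step might look like the sketch below. `splitBatch` is a hypothetical helper that only computes sub-batch sizes; staggering the resulting calls over time is left out.

```go
package main

import "fmt"

// splitBatch divides a total requested capacity into sub-batch sizes of
// at most maxPerCall each. It returns nil for non-positive inputs.
func splitBatch(total, maxPerCall int) []int {
	if total <= 0 || maxPerCall <= 0 {
		return nil
	}
	sizes := make([]int, 0, (total+maxPerCall-1)/maxPerCall)
	for remaining := total; remaining > 0; remaining -= maxPerCall {
		size := maxPerCall
		if remaining < maxPerCall {
			size = remaining
		}
		sizes = append(sizes, size)
	}
	return sizes
}

func main() {
	// 500 pending requests with a cap of 200 per CreateFleet call.
	fmt.Println(splitBatch(500, 200))
}
```

With a cap of 200, a throttled call in the 500-request scenario from the Evidence section would affect at most 200 callers instead of all 500.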

D. Classify RequestLimitExceeded as retryable

Where: pkg/providers/instance/instance.go:384 and pkg/providers/instance/instance.go:766-780

Wrap RequestLimitExceeded errors with the RetryableError interface proposed in kubernetes-sigs/karpenter#3034. This lets the lifecycle controller apply different retry policies and avoid marking the NodePool unhealthy on transient throttle.

Update combineFleetErrors to check for throttle error codes alongside the existing UnfulfillableCapacity check.
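The classification could be sketched as follows. The RetryableError interface is only proposed in kubernetes-sigs/karpenter#3034, so its shape here (a `Retryable() bool` method) is an assumption, as are `retryableError`, `classify`, `isThrottle`, and the `fakeAPIError` stand-in for an SDK error that exposes an `ErrorCode()` method.

```go
package main

import (
	"errors"
	"fmt"
)

// codedError is the assumed shape of an API error that carries an error code.
type codedError interface {
	error
	ErrorCode() string
}

// fakeAPIError is a hypothetical stand-in for an SDK API error.
type fakeAPIError struct{ code string }

func (e fakeAPIError) Error() string     { return "api error " + e.code }
func (e fakeAPIError) ErrorCode() string { return e.code }

// newAPIError builds a fakeAPIError for demonstration purposes.
func newAPIError(code string) error { return fakeAPIError{code: code} }

// retryableError is a hypothetical wrapper marking an error as transient.
type retryableError struct{ err error }

func (e retryableError) Error() string   { return e.err.Error() }
func (e retryableError) Unwrap() error   { return e.err }
func (e retryableError) Retryable() bool { return true }

// isThrottle reports whether err carries the RequestLimitExceeded code.
func isThrottle(err error) bool {
	var ce codedError
	return errors.As(err, &ce) && ce.ErrorCode() == "RequestLimitExceeded"
}

// classify wraps err with context and marks throttle errors as retryable.
func classify(err error) error {
	if err == nil {
		return nil
	}
	wrapped := fmt.Errorf("creating fleet request, %w", err)
	if isThrottle(err) {
		return retryableError{err: wrapped}
	}
	return wrapped
}

// isRetryable is what an upstream controller could call to pick a retry policy.
func isRetryable(err error) bool {
	var re interface{ Retryable() bool }
	return errors.As(err, &re) && re.Retryable()
}

func main() {
	for _, code := range []string{"RequestLimitExceeded", "InvalidParameterValue"} {
		fmt.Printf("%s retryable=%v\n", code, isRetryable(classify(newAPIError(code))))
	}
}
```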

Code References

| File | Relevance |
| --- | --- |
| pkg/batcher/createfleet.go | Batcher config, error fan-out to all callers |
| pkg/providers/instance/instance.go:376-384 | API-level error wrapping as generic CreateError |
| pkg/providers/instance/instance.go:766-780 | combineFleetErrors, only classifies UnfulfillableCapacity specially |

Related

Companion to kubernetes-sigs/karpenter#3034, which addresses lifecycle-level handling of transient errors (stuck Unknown NodeClaims, jittered backoff, health condition separation). The RetryableError interface proposed there is what improvement D here would implement.


Suggested labels:

  • kind/bug
  • priority/important-longterm
  • area/provisioning