StandardRetryPolicy: Full Jitter backoff can collapse to near-zero delays for throttling errors #4341

@markhasper

Describe the bug

StandardRetryPolicy.CalculateRetryDelay uses Full Jitter (random() * base_delay) for all error types, including throttling. When the random jitter value is low, retry delays can be near-zero, causing all retries to exhaust within seconds — too fast for rate limits that operate on per-minute windows.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

Throttling retries should have a meaningful minimum delay. AWS's own "Exponential Backoff and Jitter" architecture blog recommends "Equal Jitter" for this scenario: base/2 + random(0, base/2), which guarantees at least 50% of the base delay.
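
As a rough sketch of the difference between the two strategies, the helpers below take the random draw as a parameter so the bounds are easy to see. `JitterSketch`, `FullJitter`, and `EqualJitter` are illustrative names, not SDK members, and `u` is assumed to come from something like `Random.NextDouble()`, i.e. a value in [0.0, 1.0).

```csharp
using System;

// Illustrative helpers only; none of these names exist in the AWS SDK.
public static class JitterSketch
{
    // Full Jitter: result lies in [0, baseDelayMs).
    // A draw near 0.0 gives a delay near zero.
    public static double FullJitter(double baseDelayMs, double u) =>
        u * baseDelayMs;

    // Equal Jitter: result lies in [baseDelayMs / 2, baseDelayMs).
    // Even a draw of 0.0 keeps half of the base delay.
    public static double EqualJitter(double baseDelayMs, double u) =>
        baseDelayMs / 2 + u * (baseDelayMs / 2);
}
```

For a 1000ms base, a draw of 0.0 gives 0ms under Full Jitter and 500ms under Equal Jitter.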

Current Behavior

In StandardRetryPolicy.CalculateRetryDelay (StandardRetryPolicy.cs, line 243):

protected static int CalculateRetryDelay(int retries, int maxBackoffInMilliseconds)
{
    double jitter;
    lock (_randomJitter) {        
        jitter = _randomJitter.NextDouble(); // [0.0, 1.0)
    }
    return Convert.ToInt32(Math.Min(jitter * Math.Pow(2, retries - 1) * 1000.0, maxBackoffInMilliseconds));
}

When jitter = 0.05 on every attempt, 8 retries produce delays of approximately: 50ms, 100ms, 200ms, 400ms, 800ms, 1000ms, 1000ms, 1000ms (the last three figures assume the delay is capped at 1000ms), totalling ~4.5 seconds of backoff. A per-minute rate limit of 5 requests (see Reproduction Steps) needs 12+ seconds between requests to clear, so all retries fail.
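
The arithmetic above can be checked with a small sketch that mirrors the quoted formula but takes the jitter value as a parameter instead of drawing it. `BackoffScheduleSketch` is a hypothetical helper, not SDK code. It assumes `retries` counts from 1, and the 1000ms cap used in the usage note is an assumption inferred from the figures, since the value the SDK passes as `maxBackoffInMilliseconds` is not shown here.

```csharp
using System;
using System.Linq;

// Hypothetical helper for reproducing the delay figures; not part of the SDK.
public static class BackoffScheduleSketch
{
    // Same arithmetic as the quoted CalculateRetryDelay, with the jitter
    // supplied by the caller so the result is deterministic.
    public static int Delay(int retries, int maxBackoffMs, double jitter) =>
        Convert.ToInt32(Math.Min(jitter * Math.Pow(2, retries - 1) * 1000.0, maxBackoffMs));

    // Delays for retries 1..maxRetries, all using the same jitter value.
    public static int[] Schedule(int maxRetries, int maxBackoffMs, double jitter) =>
        Enumerable.Range(1, maxRetries)
                  .Select(r => Delay(r, maxBackoffMs, jitter))
                  .ToArray();
}
```

`BackoffScheduleSketch.Schedule(8, 1000, 0.05)` yields the eight values listed above, which sum to 4550ms. In the real method the jitter is drawn independently for each retry, so a run of uniformly low draws is one unlucky case rather than the typical one.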

This is fundamentally different from transient errors (network blips) where fast retries are fine. For throttling errors, the server has explicitly told the client to slow down.

Reproduction Steps

  1. Use AmazonBedrockRuntimeClient with MaxErrorRetry = 8 and ThrottleRetries = true
  2. Call ConverseStreamAsync against a model with a 5 requests/minute rate limit (e.g., eu.anthropic.claude-opus-4-6-v1 on Bedrock)
  3. When the rate limit is hit, the SDK retries 8 times
  4. With low jitter values (~0.05), all 8 retries complete in ~5 seconds total, never waiting long enough for the rate limit window to clear

Possible Solution

Use Equal Jitter for throttling errors, keep Full Jitter for transient errors:

protected static int CalculateRetryDelay(int retries, int maxBackoffInMilliseconds, 
    bool isThrottlingError = false)
{
    double jitter;
    lock (_randomJitter) {        
        jitter = _randomJitter.NextDouble();
    }
    
    double baseDelay = Math.Pow(2, retries - 1) * 1000.0;
    double delay = isThrottlingError
        ? baseDelay / 2 + jitter * baseDelay / 2  // Equal Jitter: guaranteed >= 50% of base
        : jitter * baseDelay;                       // Full Jitter: existing behavior
    
    return Convert.ToInt32(Math.Min(delay, maxBackoffInMilliseconds));
}

The isThrottlingError flag is already available in the retry pipeline — RetryForExceptionSync calls IsThrottlingError() and the result could be passed through to WaitBeforeRetry.
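
To gauge what the proposed change would guarantee, the sketch below computes the smallest possible cumulative backoff (every jitter draw equal to 0.0) under each strategy. `BackoffFloorSketch` is a hypothetical helper that follows the formula proposed above. It assumes `retries` counts from 1, and the cap values in the usage note are illustrative inputs, not confirmed SDK defaults.

```csharp
using System;

// Hypothetical helper for estimating worst-case (shortest) backoff; not SDK code.
public static class BackoffFloorSketch
{
    // Shortest delay a single retry can produce, i.e. a jitter draw of 0.0.
    public static int MinDelay(int retries, int maxBackoffMs, bool isThrottlingError)
    {
        double baseDelay = Math.Pow(2, retries - 1) * 1000.0;
        // Equal Jitter keeps half the base delay; Full Jitter can reach zero.
        double floor = isThrottlingError ? baseDelay / 2 : 0.0;
        return Convert.ToInt32(Math.Min(floor, maxBackoffMs));
    }

    // Sum of the shortest delays over retries 1..maxRetries.
    public static int MinTotal(int maxRetries, int maxBackoffMs, bool isThrottlingError)
    {
        int total = 0;
        for (int r = 1; r <= maxRetries; r++)
        {
            total += MinDelay(r, maxBackoffMs, isThrottlingError);
        }
        return total;
    }
}
```

With 8 retries and a 1000ms cap, the floor is 0ms for Full Jitter and 7500ms for Equal Jitter. The guaranteed minimum therefore depends on the cap as much as on the jitter strategy, which may be worth considering alongside the change.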

Additional Information/Context

No response

AWS .NET SDK and/or Package version used

  • Package: AWSSDK.BedrockRuntime 4.0.15.1 (also verified on latest main — CalculateRetryDelay is unchanged as of 4.0.16.1 / AWSSDK.Core 4.0.3.15)
  • Runtime: .NET 10
  • OS: Linux (Azure Container Apps)

Targeted .NET Platform

.NET 10

Operating System and version

Windows

Labels

  • bug: This issue is a bug.
  • module/sdk-core
  • p2: This is a standard priority issue
  • response-requested: Waiting on additional info and feedback. Will move to "closing-soon" in 7 days.