StandardRetryPolicy: Full Jitter backoff can collapse to near-zero delays for throttling errors #4341
Description
Describe the bug
StandardRetryPolicy.CalculateRetryDelay uses Full Jitter (random() * base_delay) for all error types, including throttling. When the random jitter value is low, retry delays can be near-zero, causing all retries to exhaust within seconds — too fast for rate limits that operate on per-minute windows.
Regression Issue
- Select this option if this issue appears to be a regression.
Expected Behavior
Throttling retries should have a meaningful minimum delay. AWS's own "Exponential Backoff and Jitter" architecture blog recommends "Equal Jitter" for this scenario: base/2 + random(0, base/2), which guarantees at least 50% of the base delay.
Current Behavior
In StandardRetryPolicy.CalculateRetryDelay (StandardRetryPolicy.cs, line 243):
protected static int CalculateRetryDelay(int retries, int maxBackoffInMilliseconds)
{
    double jitter;
    lock (_randomJitter) {
        jitter = _randomJitter.NextDouble(); // [0.0, 1.0)
    }
    return Convert.ToInt32(Math.Min(jitter * Math.Pow(2, retries - 1) * 1000.0, maxBackoffInMilliseconds));
}

When jitter = 0.05, 8 retries produce delays of approximately 50ms, 100ms, 200ms, 400ms, 800ms, 1000ms, 1000ms, 1000ms, totalling ~4.5 seconds of backoff. At 5 requests/minute, the rate limit window needs 12+ seconds between requests to clear, so all retries fail.
This is fundamentally different from transient errors (network blips) where fast retries are fine. For throttling errors, the server has explicitly told the client to slow down.
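To make the arithmetic above easy to check, here is a minimal sketch that evaluates the quoted formula with the jitter value passed in as a parameter instead of drawn from the shared Random. The class and method names (BackoffSchedule, DelayMs, Delays) are hypothetical and not part of the SDK, and the cap values used in the note below are assumptions, since the issue does not state which maxBackoffInMilliseconds the SDK passes in. Real retries also draw a fresh jitter value each time; holding it fixed only illustrates an unlucky run.

```csharp
using System;
using System.Linq;

public static class BackoffSchedule
{
    // Same expression as the quoted CalculateRetryDelay, but with the jitter
    // supplied by the caller so the result is reproducible.
    public static int DelayMs(double jitter, int retries, int maxBackoffInMilliseconds)
    {
        return Convert.ToInt32(
            Math.Min(jitter * Math.Pow(2, retries - 1) * 1000.0, maxBackoffInMilliseconds));
    }

    // Delays for retries 1..retryCount when every draw equals the same jitter value.
    public static int[] Delays(double jitter, int retryCount, int maxBackoffInMilliseconds)
    {
        return Enumerable.Range(1, retryCount)
            .Select(r => DelayMs(jitter, r, maxBackoffInMilliseconds))
            .ToArray();
    }
}
```

With an assumed cap of 1000 ms, BackoffSchedule.Delays(0.05, 8, 1000) yields 50, 100, 200, 400, 800, 1000, 1000, 1000 (4550 ms in total), which matches the figures quoted above. With an assumed cap of 20000 ms, the last three delays become 1600, 3200 and 6400 (12750 ms in total), so the exact total depends on the cap in effect; in both cases the early retries are far shorter than a 12-second window.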
Reproduction Steps
- Use AmazonBedrockRuntimeClient with MaxErrorRetry = 8 and ThrottleRetries = true
- Call ConverseStreamAsync against a model with a 5 requests/minute rate limit (e.g., eu.anthropic.claude-opus-4-6-v1 on Bedrock)
- When the rate limit is hit, the SDK retries 8 times
- With low jitter values (~0.05), all 8 retries complete in ~5 seconds total, never waiting long enough for the rate limit window to clear
Possible Solution
Use Equal Jitter for throttling errors, keep Full Jitter for transient errors:
protected static int CalculateRetryDelay(int retries, int maxBackoffInMilliseconds,
    bool isThrottlingError = false)
{
    double jitter;
    lock (_randomJitter) {
        jitter = _randomJitter.NextDouble();
    }
    double baseDelay = Math.Pow(2, retries - 1) * 1000.0;
    double delay = isThrottlingError
        ? baseDelay / 2 + jitter * baseDelay / 2  // Equal Jitter: guaranteed >= 50% of base
        : jitter * baseDelay;                     // Full Jitter: existing behavior
    return Convert.ToInt32(Math.Min(delay, maxBackoffInMilliseconds));
}

The isThrottlingError flag is already available in the retry pipeline: RetryForExceptionSync calls IsThrottlingError() and the result could be passed through to WaitBeforeRetry.
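As a rough check of the "at least 50% of base" guarantee, the sketch below repeats the proposed arithmetic with the jitter value injected, so the worst-case draw (jitter = 0.0) can be evaluated directly. JitterBounds and its members are hypothetical names used only for illustration, and the 20000 ms cap in the note is an assumed value, not one taken from the SDK.

```csharp
using System;

public static class JitterBounds
{
    // Un-jittered exponential base for a given retry number, in milliseconds.
    public static double BaseDelayMs(int retries) => Math.Pow(2, retries - 1) * 1000.0;

    // Same arithmetic as the proposed CalculateRetryDelay, with the jitter
    // passed in by the caller instead of drawn from a shared Random.
    public static int DelayMs(double jitter, int retries,
        int maxBackoffInMilliseconds, bool isThrottlingError)
    {
        double baseDelay = BaseDelayMs(retries);
        double delay = isThrottlingError
            ? baseDelay / 2 + jitter * baseDelay / 2  // Equal Jitter: [base/2, base)
            : jitter * baseDelay;                     // Full Jitter: [0, base)
        return Convert.ToInt32(Math.Min(delay, maxBackoffInMilliseconds));
    }
}
```

For retry 4 the base is 8000 ms, so with jitter = 0.0 JitterBounds.DelayMs(0.0, 4, 20000, true) gives 4000 while the Full Jitter path gives 0. The floor only holds up to the cap: for retry 8 the base is 128000 ms and the result is clamped to the assumed 20000 ms on the throttling path.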
Additional Information/Context
No response
AWS .NET SDK and/or Package version used
- Package: AWSSDK.BedrockRuntime 4.0.15.1 (also verified on latest main — CalculateRetryDelay is unchanged as of 4.0.16.1 / AWSSDK.Core 4.0.3.15)
- Runtime: .NET 10
- OS: Linux (Azure Container Apps)
Targeted .NET Platform
.NET 10
Operating System and version
Windows