Description
Under cascade contention (many children sharing one parent bucket), the speculative write path raises a false RateLimitExceeded with retry_after_seconds=0.0.
Speculative Cascade Path (traced from load test)
When speculative_writes=True (default) and cascade=True:
1. Child speculative UpdateItem → succeeds (single-item conditional update)
2. Parent speculative UpdateItem → fails (contention: 10 users hitting same parent bucket)
3. Falls back to parent-only slow path (_try_parent_only_acquire):
   - Reads parent buckets
   - Calls parent_lease._commit_initial() → issues transact_write()
   - transact_write() fails with TransactionCanceledException [None, TransactionConflict]
   - _is_condition_check_failure() returns True (treats all TransactionCanceledException as optimistic lock failures)
4. Enters consumption-only retry path (build_composite_retry)
5. Retry also fails: TransactionCanceledException [ConditionalCheckFailed, None] (child bucket version already updated)
6. Raises RateLimitExceeded with fabricated statuses from _build_retry_failure_statuses(): a false rejection
The key insight: TransactionConflict is a transient contention error, not a failed optimistic lock. The parent speculative UpdateItem (step 2) is the bottleneck — all child users do conditional updates on the same parent bucket item simultaneously.
Impact
During a 60s load test (10 users, cascade=True):
- 259 RPS with 23 failures (0.1%)
- All failures are false RateLimitExceeded from TransactionConflict, not real capacity exhaustion
- retry_after_seconds=0.0 tells callers "you are rate limited, retry in 0s", which is contradictory
Root Cause
1. _is_condition_check_failure() conflates TransactionConflict with ConditionalCheckFailed
# lease.py
def _is_condition_check_failure(exc: Exception) -> bool:
    exc_name = type(exc).__name__
    if exc_name in ("ConditionalCheckFailedException", "TransactionCanceledException"):
        return True  # ← treats ALL TransactionCanceledException the same
TransactionCanceledException includes a CancellationReasons array where each reason has a Code:
- ConditionalCheckFailed: optimistic lock miss → retry path is correct
- TransactionConflict: transient contention → should retry original transaction as-is
- None: item not involved in the failure
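To make the shape of that array concrete, here is a minimal sketch of extracting the per-item codes from the error response. It assumes the botocore ClientError layout, where exc.response carries a CancellationReasons list in the same order as the transaction items, and that uninvolved items report the string "None". FakeTransactionCanceled and cancellation_codes are hypothetical names used only for illustration; they are not part of zae-limiter.

```python
class FakeTransactionCanceled(Exception):
    """Hypothetical stand-in for a botocore ClientError (illustration only)."""

    def __init__(self, codes):
        super().__init__("Transaction cancelled")
        # Assumed layout: one reason per transaction item, in request order.
        self.response = {
            "Error": {"Code": "TransactionCanceledException"},
            "CancellationReasons": [{"Code": code} for code in codes],
        }


def cancellation_codes(exc):
    """Return the per-item reason codes, or [] if the exception carries none."""
    response = getattr(exc, "response", None) or {}
    return [reason.get("Code") for reason in response.get("CancellationReasons", [])]


# The two failures from the trace: first the contended parent item,
# then the stale child item on the consumption-only retry.
first = cancellation_codes(FakeTransactionCanceled(["None", "TransactionConflict"]))
second = cancellation_codes(FakeTransactionCanceled(["ConditionalCheckFailed", "None"]))
print(first)
print(second)
```

Only the second list contains ConditionalCheckFailed, so only the second failure is an optimistic lock miss.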
2. _build_retry_failure_statuses() hardcodes retry_after_seconds=0.0
def _build_retry_failure_statuses(entries):
    for entry in entries:
        statuses.append(LimitStatus(
            ...
            retry_after_seconds=0.0,  # ← always zero, never computed from bucket state
        ))
Proposed Fix
1. Distinguish TransactionConflict from ConditionalCheckFailed
Inspect CancellationReasons in the TransactionCanceledException response:
def _is_condition_check_failure(exc: Exception) -> bool:
    """Check if exception is a ConditionalCheckFailed (not TransactionConflict)."""
    response = getattr(exc, "response", {})
    error_code = response.get("Error", {}).get("Code", "")
    if error_code == "ConditionalCheckFailedException":
        return True
    if error_code == "TransactionCanceledException":
        reasons = response.get("CancellationReasons", [])
        return any(r.get("Code") == "ConditionalCheckFailed" for r in reasons)
    # botocore class name fallback
    if type(exc).__name__ == "ConditionalCheckFailedException":
        return True
    return False
2. Retry the original transaction on TransactionConflict
When TransactionCanceledException is raised but all cancellation reasons are TransactionConflict (no ConditionalCheckFailed), retry transact_write(items) as-is with backoff, rather than entering the consumption-only retry path.
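One possible shape for that retry, as a rough sketch rather than the library's actual code: the helper names (is_pure_transaction_conflict, transact_write_with_conflict_retry), the attempt count, the base delay, and the full-jitter backoff are all assumptions chosen for illustration. FakeClientError is a hypothetical stand-in for the botocore exception.

```python
import random
import time


class FakeClientError(Exception):
    """Hypothetical stand-in for botocore's ClientError (illustration only)."""

    def __init__(self, codes):
        super().__init__("Transaction cancelled")
        self.response = {
            "Error": {"Code": "TransactionCanceledException"},
            "CancellationReasons": [{"Code": code} for code in codes],
        }


def is_pure_transaction_conflict(exc):
    """True if some reason is TransactionConflict and none is ConditionalCheckFailed."""
    response = getattr(exc, "response", None) or {}
    if response.get("Error", {}).get("Code") != "TransactionCanceledException":
        return False
    codes = [r.get("Code") for r in response.get("CancellationReasons", [])]
    return "TransactionConflict" in codes and "ConditionalCheckFailed" not in codes


def transact_write_with_conflict_retry(
    transact_write, items, max_attempts=3, base_delay=0.01, sleep=time.sleep
):
    """Retry the unchanged transaction on pure conflicts; re-raise anything else."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return transact_write(items)
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_pure_transaction_conflict(exc):
                raise  # caller decides: consumption-only path or surface the error
            # Full-jitter exponential backoff: up to base_delay * 2**attempt seconds.
            sleep(random.uniform(0, base_delay * 2**attempt))
```

Mixed reasons (a conflict alongside a ConditionalCheckFailed) are deliberately not retried here, so they still reach the consumption-only path, which matches the proposed _is_condition_check_failure() returning True for them.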
3. Fix _build_retry_failure_statuses to compute meaningful retry_after_seconds
When the retry path does legitimately fail due to insufficient tokens, compute retry_after_seconds from bucket state using the same logic as try_consume() rather than hardcoding 0.0.
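As an illustration of what "computed from bucket state" could mean, the sketch below assumes a plain token bucket that refills at a constant rate. The function name and all four parameters are hypothetical; the real calculation should mirror whatever try_consume() does, which is not shown in this issue.

```python
def retry_after_seconds(
    tokens_available, tokens_requested, refill_amount, refill_period_seconds
):
    """Seconds until a constantly refilling bucket can cover the request (assumed model)."""
    if refill_amount <= 0 or refill_period_seconds <= 0:
        raise ValueError("refill_amount and refill_period_seconds must be positive")
    deficit = tokens_requested - tokens_available
    if deficit <= 0:
        # Enough tokens already: this status should not be reported as a rejection.
        return 0.0
    refill_rate = refill_amount / refill_period_seconds  # tokens per second
    return deficit / refill_rate


# A 60-per-minute limit with 2 tokens left and 5 requested is short by 3 tokens.
wait = retry_after_seconds(
    tokens_available=2, tokens_requested=5, refill_amount=60, refill_period_seconds=60
)
print(wait)  # 3.0
```

Under this model a genuine rejection always yields a positive value, so a 0.0 in a RateLimitExceeded would signal a bug rather than a usable hint.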
Steps to Reproduce
- Deploy a limiter stack
- Create a parent entity and 10 child entities with cascade=True
- Run zae-limiter load benchmark --user-classes MaxRpsCascadeUser -f locustfiles/max_rps.py --users 10 --duration 60
- Observe 0.1-0.2% false RateLimitExceeded failures with retry_after_seconds=0.0
CloudWatch log signature:
Rate limit exceeded for user-xxx/api: [rpm, rpm]. Retry after 0.0s
TransactionCanceledException: ... [None, TransactionConflict]
TransactionCanceledException: ... [ConditionalCheckFailed, None]
Environment
- zae-limiter version: 0.8.2.dev167 (perf/stress-test branch)
- Python: 3.12
- Discovered during load testing with cascade contention
Acceptance Criteria
- _is_condition_check_failure() returns False when all CancellationReasons are TransactionConflict
- _is_condition_check_failure() returns True only when at least one reason is ConditionalCheckFailed
- TransactionConflict during _commit_initial() retries the original transaction (not the consumption-only path)
- _build_retry_failure_statuses() computes retry_after_seconds from bucket state instead of hardcoding 0.0
- Tests cover (a) pure TransactionConflict, (b) pure ConditionalCheckFailed, (c) mixed reasons
- TransactionConflict during cascade does not raise RateLimitExceeded