Skip to content

🐛 TransactionConflict incorrectly treated as ConditionalCheckFailed in _commit_initial retry path #332

Description

@sodre

Description

Under cascade contention, the speculative write path raises false RateLimitExceeded with retry_after_seconds=0.0 when the parent bucket is contended.

Speculative Cascade Path (traced from load test)

When speculative_writes=True (default) and cascade=True:

  1. Child speculative UpdateItem → succeeds (single-item conditional update)
  2. Parent speculative UpdateItemfails (contention: 10 users hitting same parent bucket)
  3. Falls back to parent-only slow path (_try_parent_only_acquire):
    • Reads parent buckets
    • Calls parent_lease._commit_initial() → issues transact_write()
  4. transact_write() fails with TransactionCanceledException [None, TransactionConflict]
  5. _is_condition_check_failure() returns True (treats all TransactionCanceledException as optimistic lock failures)
  6. Enters consumption-only retry path (build_composite_retry)
  7. Retry also fails: TransactionCanceledException [ConditionalCheckFailed, None] (child bucket version already updated)
  8. Raises RateLimitExceeded with fabricated statuses from _build_retry_failure_statuses()false rejection

The key insight: TransactionConflict is a transient contention error, not a failed optimistic lock. The parent speculative UpdateItem (step 2) is the bottleneck — all child users do conditional updates on the same parent bucket item simultaneously.

Impact

During a 60s load test (10 users, cascade=True):

  • 259 RPS with 23 failures (0.1%)
  • All failures are false RateLimitExceeded from TransactionConflict, not real capacity exhaustion
  • retry_after_seconds=0.0 tells callers "you are rate limited, retry in 0s" — contradictory

Root Cause

1. _is_condition_check_failure() conflates TransactionConflict with ConditionalCheckFailed

# lease.py
def _is_condition_check_failure(exc: Exception) -> bool:
    exc_name = type(exc).__name__
    if exc_name in ("ConditionalCheckFailedException", "TransactionCanceledException"):
        return True  # ← treats ALL TransactionCanceledException the same

TransactionCanceledException includes a CancellationReasons array where each reason has a Code:

  • ConditionalCheckFailed — optimistic lock miss → retry path is correct
  • TransactionConflict — transient contention → should retry original transaction as-is
  • None — item not involved in the failure

2. _build_retry_failure_statuses() hardcodes retry_after_seconds=0.0

def _build_retry_failure_statuses(entries):
    for entry in entries:
        statuses.append(LimitStatus(
            ...
            retry_after_seconds=0.0,  # ← always zero, never computed from bucket state
        ))

Proposed Fix

1. Distinguish TransactionConflict from ConditionalCheckFailed

Inspect CancellationReasons in the TransactionCanceledException response:

def _is_condition_check_failure(exc: Exception) -> bool:
    """Check if exception is a ConditionalCheckFailed (not TransactionConflict)."""
    response = getattr(exc, "response", {})
    error_code = response.get("Error", {}).get("Code", "")

    if error_code == "ConditionalCheckFailedException":
        return True

    if error_code == "TransactionCanceledException":
        reasons = response.get("CancellationReasons", [])
        return any(r.get("Code") == "ConditionalCheckFailed" for r in reasons)

    # botocore class name fallback
    if type(exc).__name__ == "ConditionalCheckFailedException":
        return True

    return False

2. Retry the original transaction on TransactionConflict

When TransactionCanceledException is raised but all cancellation reasons are TransactionConflict (no ConditionalCheckFailed), retry transact_write(items) as-is with backoff, rather than entering the consumption-only retry path.

3. Fix _build_retry_failure_statuses to compute meaningful retry_after_seconds

When the retry path does legitimately fail due to insufficient tokens, compute retry_after_seconds from bucket state using the same logic as try_consume() rather than hardcoding 0.0.

Steps to Reproduce

  1. Deploy a limiter stack
  2. Create a parent entity and 10 child entities with cascade=True
  3. Run zae-limiter load benchmark --user-classes MaxRpsCascadeUser -f locustfiles/max_rps.py --users 10 --duration 60
  4. Observe 0.1-0.2% false RateLimitExceeded failures with retry_after_seconds=0.0

CloudWatch log signature:

Rate limit exceeded for user-xxx/api: [rpm, rpm]. Retry after 0.0s
TransactionCanceledException: ... [None, TransactionConflict]
TransactionCanceledException: ... [ConditionalCheckFailed, None]

Acceptance Criteria

  • _is_condition_check_failure() returns False when all CancellationReasons are TransactionConflict
  • _is_condition_check_failure() returns True only when at least one reason is ConditionalCheckFailed
  • TransactionConflict during _commit_initial() retries the original transaction (not the consumption-only path)
  • _build_retry_failure_statuses() computes retry_after_seconds from bucket state instead of hardcoding 0.0
  • Unit tests cover: (a) pure TransactionConflict, (b) pure ConditionalCheckFailed, (c) mixed reasons
  • Unit test verifies that TransactionConflict during cascade does not raise RateLimitExceeded

Environment

  • zae-limiter version: 0.8.2.dev167 (perf/stress-test branch)
  • Python: 3.12
  • Discovered during load testing with cascade contention

Metadata

Metadata

Assignees

Labels

area/limiterCore rate limiting logic

Type

Fields

No fields configured for Bug.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions