Early yield on 429 throttling on barrier requests by mbhaskar · Pull Request #48914 · Azure/azure-sdk-for-java

mbhaskar · 2026-04-23T18:04:53Z

Description

This PR introduces early yield on 429s during barrier requests.
When receiving 429s with strong consistency, quorum reader/ writer code does not yield early enough creating multiple stack traces resulting into resource constraints on the client side.

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

Pull request includes test coverage for the included changes.

Port of .NET PR #1667829: When receiving repeated 429 (Too Many Requests) responses with strong consistency, QuorumReader and ConsistencyWriter now handle throttling more efficiently. QuorumReader (reads): - waitForReadBarrierAsync: yield early when all replicas return 429 in both single-region and multi-region barrier loops - ensureQuorumSelectedStoreResponse: yield early when all replicas throttled during initial quorum read - All cases throw the 429 exception to let ResourceThrottleRetryPolicy handle retry with appropriate backoff ConsistencyWriter (writes): - waitForWriteBarrierAsync: track lastAttemptWasThrottled flag per iteration - Do NOT yield early (preserves idempotency guarantees) - When all retries exhausted due to consistent throttling, throw RequestTimeoutException (408) with substatus SERVER_WRITE_BARRIER_THROTTLED (21013) instead of returning barrier-not-met Other changes: - Added isThrottledException field to StoreResult - Added SERVER_WRITE_BARRIER_THROTTLED (21013) substatus code - Unit tests for all throttling scenarios Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ed replica yields early Port of .NET test ValidatesReadMultipleReplicaAsyncExcludesGoneReplicas. Validates that when replicas return a mix of 410 (Gone) and 429 (TooManyRequests): - Gone replicas are excluded from results by StoreReader (isValid=false for GONE) - The 429 replica with valid LSN headers is kept (isValid=true for non-GONE with lsn>=0) - Since all remaining replicas are throttled, early yield triggers - The 429 exception propagates to ResourceThrottleRetryPolicy Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR updates Cosmos direct connectivity quorum/barrier logic to “yield early” when replica reads are uniformly throttled (HTTP 429), allowing the existing ResourceThrottleRetryPolicy to apply appropriate backoff instead of progressing into additional quorum/primary/barrier attempts.

Changes:

Add StoreResult.isThrottledException to cheaply detect 429 responses.
In QuorumReader, propagate 429 immediately when all collected replica results are throttled (including barrier paths).
In ConsistencyWriter, track throttling during write barriers and, when retries are exhausted and the last attempt was fully throttled, throw a RequestTimeoutException with a new substatus code; add unit tests for the new behaviors.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/StoreResult.java	Adds a computed flag to identify throttling (429) on replica results.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/QuorumReader.java	Early-yields on replica-wide throttling to let throttle retry policy handle backoff.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/ConsistencyWriter.java	Tracks throttling during write barriers and surfaces a distinct timeout substatus when retries are exhausted.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/HttpConstants.java	Introduces a new substatus code for write-barrier throttling exhaustion.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/QuorumReaderTest.java	Adds unit tests covering 429 propagation and Gone+429 interactions.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/ConsistencyWriterTest.java	Adds unit tests for write-barrier behavior under sustained throttling and mixed outcomes.

Copilot · 2026-04-23T18:13:10Z

+                        return Mono.error(responseResult.get(0).getException());
+                    }
+
+                    int responseCount= (int) responseResult.stream().filter(response -> response.isValid).count();


WhitespaceAround checkstyle: add spaces around the '=' operator (e.g., int responseCount = ...). As written, this will likely fail checkstyle and break the build.

Suggested change

int responseCount= (int) responseResult.stream().filter(response -> response.isValid).count();

int responseCount = (int) responseResult.stream().filter(response -> response.isValid).count();

Fixed in e80964d.

xinlian12 · 2026-04-23T19:01:53Z

+                    // For writes, we do NOT yield early - continue retries to preserve idempotency.
+                    // If all retries are exhausted due to consistent throttling, we throw 408 (RequestTimeout).
+                    if (responses != null && !responses.isEmpty()
+                        && responses.stream().allMatch(r -> r.isThrottledException)) {


🔴 Blocking · Correctness: Stale State

lastAttemptWasThrottled not reset when avoidQuorumSelection path bypasses throttle tracking

The isAvoidQuorumSelectionStoreResult branch (lines 548—562) returns early without updating lastAttemptWasThrottled. If a prior iteration set this flag to true (all replicas returned 429), and a subsequent iteration enters the avoidQuorum path (replica returns 410/LeaseNotFound), the flag remains stale.

When isBarrierMeetPossibleInPresenceOfAvoidQuorumSelectionException returns Mono.empty() (continue loop) and retries exhaust, the check at line 520 sees lastAttemptWasThrottled == true and throws RequestTimeoutException(408, 21013) instead of the correct behavior of returning false → GoneException.

Concrete scenario:

Iteration N: all replicas return 429 → lastAttemptWasThrottled = true· retryCount decremented

Iteration N+1: a replica returns 410/LeaseNotFound → avoidQuorum path fires, lastAttemptWasThrottled NOT reset

isBarrierMeetPossibleInPresenceOfAvoidQuorumSelectionException → primary check returns insufficient LSN → Mono.empty() → loop continues

Retries exhaust → incorrect 408/21013 instead of the normal barrier-not-met flow

Suggested fix: Reset the flag before the avoidQuorum early return:

if (isAvoidQuorumSelectionStoreResult) { writeBarrierRetryCount.decrementAndGet(); lastAttemptWasThrottled.set(false); // This attempt was not a throttle return this.isBarrierMeetPossibleInPresenceOfAvoidQuorumSelectionException(...); }

_{⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.}

Good catch — fixed in e80964d. Added \lastAttemptWasThrottled.set(false)\ before the avoidQuorumSelection early return to prevent the stale flag scenario you described.

xinlian12 · 2026-04-23T19:01:53Z

            }

            if (writeBarrierRetryCount.get() == 0) {
+                if (lastAttemptWasThrottled.get()) {


🟡 Recommendation · Correctness: Behavioral Change

Write barrier throttle failure changes exception type from retryable to non-retryable

Before this PR, when write barrier retries exhausted (for any reason), waitForWriteBarrierAsync returned false, and the caller at barrierForWriteRequests (line 419—427) threw GoneException (substatus GLOBAL_STRONG_WRITE_BARRIER_NOT_MET). GoneException IS retried by GoneAndRetryWithRetryPolicy.

With this change, throttle-specific exhaustion throws RequestTimeoutException(408, 21013) directly, which is not retried by GoneAndRetryWithRetryPolicy (isNonRetryableException() returns true for RequestTimeoutException).

This is likely the intended behavior (avoid re-executing committed writes), but it is a significant behavioral change:

Customers under heavy throttling will now see terminal 408 errors instead of automatic retry

The .NET SDK (azure-cosmos-dotnet-v3#5155) takes the opposite approach — yielding early with 429 (which IS retried by the throttle policy)

Please confirm this is intentional. If so, consider documenting the new substatus code 21013 so customers can handle it in their retry logic. Also consider whether ClientRetryPolicy.shouldRetryOnRequestTimeout should avoid marking the endpoint as unavailable for this specific substatus (since it's a transient throttle issue, not an endpoint problem).

_{⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.}

This is intentional — the .NET SDK PR (1667829) takes the same approach for writes: continue barrier retries and throw \RequestTimeoutException(408, Server_WriteBarrierThrottled)\ when retries exhaust. The design rationale is to preserve idempotency guarantees for writes (avoiding re-execution of committed writes). The \ClientRetryPolicy.shouldRetryOnRequestTimeout\ point is worth investigating separately — will track.

xinlian12 · 2026-04-23T19:01:53Z

+                    if (!responseResult.isEmpty() && responseResult.stream().allMatch(r -> r.isThrottledException)) {
+                        logger.info("QuorumReader: ensureQuorumSelectedStoreResponse - All replicas returned 429 Too Many Requests. "
+                            + "Yielding early to ResourceThrottleRetryPolicy.");
+                        return Mono.error(responseResult.get(0).getException());


🟢 Suggestion · Correctness: Exception Selection

Exception selection uses first response without considering retryAfterInMs

responseResult.get(0).getException() propagates the exception from the first (arbitrary) replica. Different replicas may return different Retry-After header values. ResourceThrottleRetryPolicy uses this value for backoff timing. Picking the first response could use a too-short backoff, causing the retry to be immediately throttled again.

Suggested improvement:

StoreResult maxRetryResult = responseResult.stream() .max(Comparator.comparingDouble(r -> r.retryAfterInMs != null ? r.retryAfterInMs : 0)) .get(); return Mono.error(maxRetryResult.getException());

The same pattern applies at lines 708 and 803.

_{⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.}

Not acting on this for now — the .NET PR design notes explicitly state: \different replica may return different retry-after value, but it won't matter which one we use as load between replica should be relatively equal. The first response is acceptable.

xinlian12 · 2026-04-23T20:12:34Z

@sdkReviewAgent

xinlian12 · 2026-04-24T06:41:29Z

✅ Review complete (49:16)

Posted 1 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

Fixes from xinlian12 (blocking): - Fix lastAttemptWasThrottled stale state: reset flag before avoidQuorumSelection early return to prevent incorrect 408 when prior iteration was throttled but current iteration hits 410 - Fix readStrong_AllReplicasThrottled_Returns429 false positive: set LSN on exception so StoreReader marks isValid=true, ensuring the early yield path is actually exercised. Add transport invocation count assertions to verify primary read is NOT attempted. - Add readStrong_BarrierRequestsThrottled_Returns429 test covering the waitForReadBarrierAsync barrier path (quorum succeeds, then barrier HEAD requests return 429) Fixes from Copilot review: - Fix checkstyle: add missing spaces around = operator (2 places) - Fix log wording: 'All replicas' -> 'All contacted replicas' (more accurate since not all replicas may be contacted per attempt) - Fix ConsistencyWriter log: 'consistent throttling' -> 'last attempt was throttled' (flag only tracks last attempt, not all attempts) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…tling Unit tests (4 new, 10 total throttling tests): - writeBarrier_AvoidQuorumSelectionAfterThrottling_NoFalse408: validates lastAttemptWasThrottled reset on avoidQuorumSelection path (stale state fix) - writeBarrier_NRegionCommit_AllReplicasThrottled_Returns408: N-region synchronous commit barrier throttling produces 408/21013 - readStrong_QuorumNotSelected_PrimaryThrottled_Returns429: primary 429 propagates correctly through QuorumNotSelected → readPrimary path - readStrong_BarrierPartialThrottle_StillSucceeds: barrier succeeds when one replica is throttled but other meets LSN (no false-negative yield) Fault injection E2E tests (3 new, require strong consistency account): - faultInjection_readBarrierThrottled_yieldsEarly: inject 429 on HEAD_COLLECTION + GCLSN interceptor → verify early yield on reads - faultInjection_writeBarrierThrottled_returns408: inject 429 on HEAD_COLLECTION + GCLSN interceptor → verify 408 on writes - faultInjection_readBarrierThrottled_thenRecovers: inject 429 with hitLimit(2) → verify read succeeds after throttle clears Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Read and write barrier requests are only triggered on multi-region strong consistency accounts (numberOfReadRegions > 0). The emulator is single-region, so the GCLSN interceptor never triggers barriers and the tests fail with empty supplementalResponseStatisticsList. Added accountLevelReadRegions.size() > 1 skip check to all three E2E fault injection tests so they correctly skip on single-region environments. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mbhaskar · 2026-04-28T21:04:15Z

/azp run java - cosmos - tests

azure-pipelines · 2026-04-28T21:04:24Z

No pipelines are associated with this pull request.

mbhaskar · 2026-04-28T22:22:10Z

/azp run java - cosmos - tests

azure-pipelines · 2026-04-28T22:22:36Z

Azure Pipelines successfully started running 1 pipeline(s).

mbhaskar and others added 2 commits April 20, 2026 15:28

Copilot AI review requested due to automatic review settings April 23, 2026 18:04

mbhaskar requested review from a team and kirankumarkolli as code owners April 23, 2026 18:04

github-actions Bot added the Cosmos label Apr 23, 2026

Copilot started reviewing on behalf of mbhaskar April 23, 2026 18:06 View session

mbhaskar changed the title ~~Early yield on 429 throttling~~ Early yield on 429 throttling on barrier requests Apr 23, 2026

Copilot AI reviewed Apr 23, 2026

View reviewed changes

xinlian12 reviewed Apr 23, 2026

View reviewed changes

xinlian12 reviewed Apr 24, 2026

View reviewed changes

Comment thread ...tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/QuorumReaderTest.java

mbhaskar and others added 3 commits April 27, 2026 14:19

	int responseCount= (int) responseResult.stream().filter(response -> response.isValid).count();
	int responseCount = (int) responseResult.stream().filter(response -> response.isValid).count();

Conversation

mbhaskar commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

All SDK Contribution checklist:

General Guidelines and Best Practices

Testing Guidelines

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

mbhaskar Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xinlian12 Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mbhaskar Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

xinlian12 Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mbhaskar Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xinlian12 Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mbhaskar Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xinlian12 commented Apr 23, 2026

Uh oh!

Uh oh!

xinlian12 commented Apr 24, 2026

Uh oh!

mbhaskar commented Apr 28, 2026

Uh oh!

azure-pipelines Bot commented Apr 28, 2026

Uh oh!

mbhaskar commented Apr 28, 2026

Uh oh!

azure-pipelines Bot commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mbhaskar commented Apr 23, 2026 •

edited

Loading

xinlian12 Apr 23, 2026 •

edited

Loading

xinlian12 Apr 23, 2026 •

edited

Loading

xinlian12 Apr 23, 2026 •

edited

Loading