Metadata Retry: Adds Cross-Region Operation-level Retry for Metadata Request Failures #5780
Conversation
Azure Pipelines: Successfully started running 1 pipeline(s).

@sdkReviewAgent-2
```csharp
await this.synchronizer.CreateMissingLeasesAsync().ConfigureAwait(false);
await this.leaseStore.MarkInitializedAsync().ConfigureAwait(false);
}
catch (CosmosException ex) when (retryCount < MaxInitializationRetries)
```
🟡 Recommendation — Overbroad Exception Filter: Catches all CosmosException types including non-transient errors
The when guard only checks retryCount, but the XML doc and comment explicitly say this is for "regional error (e.g., CosmosException with 503 or HttpRequestException)". The code doesn't enforce this — a 400 BadRequest, 404 NotFound, or 409 Conflict will also be caught and retried 3 times with 15-second delays (up to 45 seconds wasted on a deterministic failure).
This is inconsistent with MetadataRequestThrottleRetryPolicy.ShouldRetryInternalAsync() in this same PR, which carefully discriminates by status code (503, 500, 410/LeaseNotFound, 403/DatabaseAccountNotFound).
Concrete failure scenario: If the lease container is deleted mid-initialization, CreateMissingLeasesAsync() throws a 404 CosmosException. This catch retries 3× against the same endpoint with the same result, adding ~45 seconds of latency before finally propagating.
Suggestion: Add a status code filter to align with the policy:
```csharp
catch (CosmosException ex) when (retryCount < MaxInitializationRetries
    && (ex.StatusCode == HttpStatusCode.ServiceUnavailable
        || ex.StatusCode == HttpStatusCode.InternalServerError
        || (ex.StatusCode == HttpStatusCode.Gone
            && ex.SubStatusCode == (int)SubStatusCodes.LeaseNotFound)
        || (ex.StatusCode == HttpStatusCode.Forbidden
            && ex.SubStatusCode == (int)SubStatusCodes.DatabaseAccountNotFound)))
```

```csharp
    throw;
    }
}
catch (CosmosException cosmosException)
```
🟡 Recommendation — Missing test coverage for the most critical change in this PR
This new catch (CosmosException) block is the key change enabling the entire operation-level cross-region retry for metadata failures. Without it, CosmosException escapes to SendAsync's outer catch at line 38 and is silently converted to a ResponseMessage without consulting ClientRetryPolicy.
However, there is no test in RetryHandlerTests.cs (or elsewhere) that verifies this path works. The existing RetryHandlerDoesNotRetryOnException test throws a generic Exception, and RetryHandlerHttpClientExceptionRefreshesLocations covers HttpRequestException — but neither exercises the CosmosException catch.
Suggestion: Add a test in RetryHandlerTests.cs following the RetryHandlerHttpClientExceptionRefreshesLocations pattern:
- Have the inner handler throw a `CosmosException` (503)
- Configure the mock retry policy to return `RetryAfter` on the first `ShouldRetryAsync(Exception)` call
- Verify the retry policy is consulted and the handler retries the operation
- Add a complementary test verifying that when the policy returns `NoRetry`, the `CosmosException` re-throws (not silently converted)
This is especially important because NamedCacheRetryHandler also extends AbstractRetryHandler — its InvalidPartitionExceptionRetryPolicy will now see CosmosException too, and should be verified to handle it safely (it returns NoRetry, causing re-throw, which is correct).
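A minimal sketch of such a test, following the `RetryHandlerHttpClientExceptionRefreshesLocations` pattern. The `TestHandler`, `CreateRetryHandler`, `CreateRequest`, and `CreateCosmosException503` helpers are illustrative stand-ins for whatever setup utilities `RetryHandlerTests.cs` actually provides, not the SDK's exact internal API:

```csharp
// Sketch only: helper names are hypothetical; adapt to the existing test fixtures.
[TestMethod]
public async Task RetryHandlerCosmosExceptionConsultsRetryPolicy()
{
    int attempts = 0;

    Mock<IDocumentClientRetryPolicy> mockPolicy = new Mock<IDocumentClientRetryPolicy>();
    mockPolicy
        .SetupSequence(p => p.ShouldRetryAsync(It.IsAny<Exception>(), It.IsAny<CancellationToken>()))
        .ReturnsAsync(ShouldRetryResult.RetryAfter(TimeSpan.Zero)) // first 503: retry
        .ReturnsAsync(ShouldRetryResult.NoRetry());                // second 503: give up

    // Inner handler always throws a 503 CosmosException.
    RequestHandler handler = CreateRetryHandler(
        mockPolicy.Object,
        innerHandler: new TestHandler((request, token) =>
        {
            attempts++;
            throw CreateCosmosException503(); // illustrative factory for a 503 CosmosException
        }));

    // When the policy finally returns NoRetry, the CosmosException must
    // re-throw rather than being silently converted to a ResponseMessage.
    await Assert.ThrowsExceptionAsync<CosmosException>(
        () => handler.SendAsync(CreateRequest(), default));

    Assert.AreEqual(2, attempts, "handler should have retried exactly once");
}
```

The two-step `SetupSequence` covers both suggested assertions (retry on `RetryAfter`, re-throw on `NoRetry`) in one test; splitting them into two tests would also be reasonable.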
```csharp
mock.Setup(gem => gem.ResolveServiceEndpoint(It.IsAny<DocumentServiceRequest>()))
    .Returns(primaryEndpoint);
// Default PreferredLocationCount = 0 → maxRetries = Max(1,0) = 1
mock.Setup(gem => gem.PreferredLocationCount).Returns(0);
```
🟡 Recommendation — All tests use PreferredLocationCount = 0, missing multi-region cycling coverage
Every test in this file uses CreateMockEndpointManager with PreferredLocationCount = 0, which yields maxUnavailableEndpointRetryCount = Max(1, 0) = 1. This means only one retry is ever exercised before exhaustion.
In production, most accounts have multiple preferred regions (e.g., 3). The core value proposition of this PR — cycling through regions on metadata failure — is never tested with more than a single retry. If IncrementRetryIndexOnUnavailableEndpointForMetadataRead() had an off-by-one error at higher retry counts, or if OnBeforeSendRequest failed to properly resolve different endpoints on successive retries, no test would catch it.
Suggestion: Add a test with PreferredLocationCount = 3 that:
- Returns different endpoints from `ResolveServiceEndpoint` on each call (region1, region2, region3)
- Simulates 3 sequential 503 failures (calling `OnBeforeSendRequest` + `ShouldRetryAsync` in a loop)
- Asserts each returns `ShouldRetry = true` with `MarkEndpointUnavailableForRead` called for the correct endpoint
- Asserts the 4th attempt returns `NoRetry`
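The loop above can be sketched as follows. The policy constructor arguments, `CreateMetadataReadRequest`, and `Create503Exception` are illustrative placeholders for the fixtures this test file already uses:

```csharp
// Sketch only: helper names and endpoint URIs are hypothetical.
[TestMethod]
public async Task ShouldRetryCyclesThroughThreePreferredRegionsThenExhausts()
{
    Uri[] regions =
    {
        new Uri("https://account-region1.documents.azure.com"),
        new Uri("https://account-region2.documents.azure.com"),
        new Uri("https://account-region3.documents.azure.com"),
    };
    int resolveCalls = 0;

    Mock<IGlobalEndpointManager> mock = new Mock<IGlobalEndpointManager>();
    mock.Setup(gem => gem.PreferredLocationCount).Returns(3);
    // Hand out a different endpoint on each resolution, simulating region cycling.
    mock.Setup(gem => gem.ResolveServiceEndpoint(It.IsAny<DocumentServiceRequest>()))
        .Returns(() => regions[resolveCalls++ % regions.Length]);

    MetadataRequestThrottleRetryPolicy policy = CreateRetryPolicy(mock.Object); // illustrative factory

    for (int i = 0; i < 3; i++)
    {
        policy.OnBeforeSendRequest(CreateMetadataReadRequest());
        ShouldRetryResult result = await policy.ShouldRetryAsync(Create503Exception(), default);

        Assert.IsTrue(result.ShouldRetry, $"attempt {i + 1} should retry into the next region");
        mock.Verify(gem => gem.MarkEndpointUnavailableForRead(regions[i]), Times.Once);
    }

    // 4th failure: all preferred regions consumed, policy must give up.
    policy.OnBeforeSendRequest(CreateMetadataReadRequest());
    ShouldRetryResult final = await policy.ShouldRetryAsync(Create503Exception(), default);
    Assert.IsFalse(final.ShouldRetry, "4th attempt should exhaust and return NoRetry");
}
```

The per-iteration `Times.Once` verification is what would catch an off-by-one in `IncrementRetryIndexOnUnavailableEndpointForMetadataRead()` or a stale endpoint in `OnBeforeSendRequest`.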
```csharp
return Task.FromResult(this.HandleRegionalFailure());
}

if (exception is OperationCanceledException && !cancellationToken.IsCancellationRequested)
```
🟢 Suggestion — Novel pattern: treating non-user OperationCanceledException as a regional failure
This is a new behavioral pattern not present in any sibling retry policy. ClientRetryPolicy.ShouldRetryAsync(Exception) does not treat OperationCanceledException as a regional failure requiring endpoint marking and cross-region retry — it only marks per-partition endpoints and falls through to the throttling policy (which returns NoRetry).
The rationale here (transport timeout = region issue) is reasonable for metadata requests, since a timeout on a metadata read is a strong signal of regional unavailability. But this creates a behavioral divergence between the metadata-level and operation-level policies interpreting the same exception type.
Questions to confirm:
- Is this intentional? If so, consider adding a comment explaining why metadata-level OCE handling differs from `ClientRetryPolicy`.
- Should `ClientRetryPolicy` be aligned in a follow-up PR to also treat non-user OCE as a regional failure for metadata requests?
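If the first option is taken, the explanatory comment could look something like this (wording is a suggestion, not the PR's code):

```csharp
// The user's token has NOT fired, so this OperationCanceledException came from
// an internal transport timeout (e.g., an HttpClient request timeout), not user
// cancellation. For a metadata read, a transport timeout is treated as a strong
// signal of regional unavailability, so we mark the endpoint and retry
// cross-region. Note: ClientRetryPolicy intentionally does NOT interpret OCE
// this way at the operation level; see PR #5780 discussion before aligning.
if (exception is OperationCanceledException && !cancellationToken.IsCancellationRequested)
{
    return Task.FromResult(this.HandleRegionalFailure());
}
```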
```csharp
}

[TestMethod]
public async Task InitializeAsync_WithCosmosException_ShouldThrowAfterMaxRetries()
```
🟢 Suggestion — Missing test for HttpRequestException retry exhaustion in BootstrapperCore
The CosmosException path has an exhaustion test (InitializeAsync_WithCosmosException_ShouldThrowAfterMaxRetries) that verifies the exception propagates after MaxInitializationRetries + 1 attempts. The HttpRequestException path has no equivalent exhaustion test — only a retry-and-succeed test.
While the two catch blocks are structurally symmetric, they are separate code paths with independent when guards. A bug in the HttpRequestException catch's guard or retry logic would not be caught by the existing CosmosException exhaustion test.
Suggestion: Add InitializeAsync_WithHttpRequestException_ShouldThrowAfterMaxRetries mirroring the CosmosException exhaustion test.
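A sketch of the mirrored test, assuming the same mock lease-store wiring the existing exhaustion test uses (`CreateBootstrapper` and the `IsInitializedAsync` setup are illustrative):

```csharp
// Sketch only: mirrors InitializeAsync_WithCosmosException_ShouldThrowAfterMaxRetries;
// fixture helpers are hypothetical.
[TestMethod]
public async Task InitializeAsync_WithHttpRequestException_ShouldThrowAfterMaxRetries()
{
    int attempts = 0;
    Mock<DocumentServiceLeaseStore> leaseStore = new Mock<DocumentServiceLeaseStore>();
    leaseStore
        .Setup(store => store.IsInitializedAsync())
        .Callback(() => attempts++)
        .ThrowsAsync(new HttpRequestException("simulated transport failure"));

    BootstrapperCore bootstrapper = CreateBootstrapper(leaseStore.Object); // illustrative factory

    await Assert.ThrowsExceptionAsync<HttpRequestException>(
        () => bootstrapper.InitializeAsync());

    // MaxInitializationRetries (3) retries plus the initial attempt.
    Assert.AreEqual(4, attempts);
}
```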
```csharp
/// regions left to try; <see cref="ShouldRetryResult"/> with <c>ShouldRetry = false</c> otherwise,
/// allowing the exception to propagate to the operation-level retry policy.
/// </returns>
private ShouldRetryResult HandleRegionalFailure()
```
💬 Observation — Clean consolidation with a subtle behavioral change
HandleRegionalFailure() cleanly consolidates the "mark + increment" pattern and replaces 4 separate code blocks. The structural improvement is welcome.
One subtle behavioral change worth noting: previously, when IncrementRetryIndexOnUnavailableEndpointForMetadataRead() returned false (retries exhausted), the code fell through to this.throttlingRetryPolicy.ShouldRetryAsync(exception, cancellationToken). Now it returns ShouldRetryResult.NoRetry() directly.
This is functionally equivalent today because ResourceThrottleRetryPolicy only retries 429s and returns NoRetry for the status codes handled here (503, 500, 410/LeaseNotFound, 403/DatabaseAccountNotFound). However, it changes the implicit contract: the throttling policy is no longer consulted after region exhaustion. If ResourceThrottleRetryPolicy ever gains broader retry logic in the future, this path would be missed.
Not blocking — just calling it out for awareness.
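If the team prefers to keep the old implicit contract, one hedged alternative is an async variant that falls back to the throttling policy on exhaustion. The method shape and field names below are illustrative, inferred from the identifiers quoted in this review:

```csharp
// Hypothetical alternative, not the PR's code: preserve the pre-refactor
// behavior of consulting the throttling policy after region exhaustion.
private Task<ShouldRetryResult> HandleRegionalFailureAsync(
    Exception exception,
    CancellationToken cancellationToken)
{
    // Mark the failing region so LocationCache deprioritizes it.
    this.globalEndpointManager.MarkEndpointUnavailableForRead(this.locationEndpoint);

    if (this.globalEndpointManager.IncrementRetryIndexOnUnavailableEndpointForMetadataRead())
    {
        // More preferred regions remain: retry immediately against the next one.
        return Task.FromResult(ShouldRetryResult.RetryAfter(TimeSpan.Zero));
    }

    // Regions exhausted: give the throttling policy the final say, as the
    // pre-refactor code did. Today this returns NoRetry for 503/500/410/403,
    // so behavior is unchanged, but the contract is preserved.
    return this.throttlingRetryPolicy.ShouldRetryAsync(exception, cancellationToken);
}
```

Whether this indirection is worth the extra async plumbing is a judgment call; the current direct `NoRetry()` is simpler and correct today.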
```csharp
}

[TestMethod]
public async Task InitializeAsync_WithCosmosException_ShouldThrowAfterMaxRetries()
```
🟢 Suggestion — Add test for non-retryable CosmosException to document intent
The catch (CosmosException) in BootstrapperCore catches all status codes (see separate comment). Whether this is intentional or a bug, a test explicitly exercising a non-regional CosmosException (e.g., 404 NotFound) would document the expected behavior:
- If retrying all exceptions is intended: the test asserts 4 attempts and adds a comment explaining why
- If only regional errors should retry: the test asserts the exception propagates immediately, and the `when` guard gets a status code filter
Either way, a test closes the ambiguity.
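For the second interpretation (only regional errors retry), the test could be sketched as follows; `CreateBootstrapper` and `CreateCosmosException` are illustrative fixture helpers:

```csharp
// Sketch only: documents the "only regional errors retry" interpretation,
// which assumes the when guard gains a status code filter.
[TestMethod]
public async Task InitializeAsync_WithNotFoundCosmosException_ShouldThrowImmediately()
{
    int attempts = 0;
    Mock<DocumentServiceLeaseStore> leaseStore = new Mock<DocumentServiceLeaseStore>();
    leaseStore
        .Setup(store => store.IsInitializedAsync())
        .Callback(() => attempts++)
        .ThrowsAsync(CreateCosmosException(HttpStatusCode.NotFound)); // illustrative factory

    BootstrapperCore bootstrapper = CreateBootstrapper(leaseStore.Object);

    CosmosException ex = await Assert.ThrowsExceptionAsync<CosmosException>(
        () => bootstrapper.InitializeAsync());

    Assert.AreEqual(HttpStatusCode.NotFound, ex.StatusCode);
    // A 404 is deterministic; retrying it only adds latency.
    Assert.AreEqual(1, attempts);
}
```

Under the first interpretation, the final assertion flips to `Assert.AreEqual(4, attempts)` with a comment justifying the blanket retry.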
```csharp
/// the endpoint as unavailable in the <see cref="IGlobalEndpointManager"/> when
/// a regional failure is detected.
/// </summary>
private Uri locationEndpoint;
```
💬 Observation — Design: Marking endpoint unavailable affects ALL reads, not just metadata
MarkEndpointUnavailableForRead() signals the LocationCache to deprioritize this endpoint for all subsequent read operations (queries, point reads, change feed), not just metadata requests. The endpoint moves to the end of the preferred list and self-heals after the 5-minute TTL (DefaultUnavailableLocationsExpirationTimeInSeconds = 300).
This is the correct trade-off: if a region is failing for metadata reads (partition key range lookups), it's very likely failing for data reads too. The impact is soft (deprioritized, not blocked) and self-healing. This aligns with how ClientRetryPolicy.ShouldRetryOnEndpointFailureAsync already works.
Also confirmed: double-marking (from both this policy and ClientRetryPolicy when the exception propagates) is benign — LocationCache.MarkEndpointUnavailable uses ConcurrentDictionary.AddOrUpdate, which is idempotent (just refreshes the TTL timestamp).
```csharp
    MaxInitializationRetries,
    this.sleepTime);

await Task.Delay(this.sleepTime).ConfigureAwait(false);
```
🟢 Suggestion — Task.Delay without CancellationToken
The review rules require verifying cancellation token propagation on all async methods. Both Task.Delay(this.sleepTime) calls in the new catch blocks don't accept a cancellation token — if the caller wants to cancel during the 15-second retry delay, they can't.
The InitializeAsync() method itself doesn't take a CancellationToken parameter (pre-existing design), so there's no token to pass here. This is a pre-existing limitation of the Bootstrapper interface, not introduced by this PR.
Consider adding CancellationToken support to InitializeAsync() in a follow-up if the team wants cancellable bootstrap initialization.
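Such a follow-up might look like the sketch below. The signature change and the `InitializeCoreAsync` helper are hypothetical; only the `Task.Delay(TimeSpan, CancellationToken)` overload is the concrete point:

```csharp
// Hypothetical follow-up, not the PR's code: accept a CancellationToken (an
// interface change for Bootstrapper) and thread it into the retry backoff.
public async Task InitializeAsync(CancellationToken cancellationToken = default)
{
    for (int retryCount = 0; ; retryCount++)
    {
        try
        {
            await this.InitializeCoreAsync().ConfigureAwait(false); // illustrative
            return;
        }
        catch (CosmosException) when (retryCount < MaxInitializationRetries)
        {
            // Cancellable backoff: throws TaskCanceledException if the caller
            // cancels during the 15-second delay instead of blocking it out.
            await Task.Delay(this.sleepTime, cancellationToken).ConfigureAwait(false);
        }
    }
}
```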
✅ Review complete (50:07). Posted 9 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
Description
When a Cosmos DB region becomes unavailable, metadata requests (e.g., partition key range reads) fail with 503/20003 and are retried only within the same region until exhausted, with no cross-region failover. This change enables cross-region retry at both the metadata request level and the operation level.
Problem
`MetadataRequestThrottleRetryPolicy` retried metadata requests in the same region using a location index, but never marked the failing endpoint as unavailable in the `LocationCache`. If all retries within the policy were exhausted, the exception propagated without any signal to route future requests to a different region. Additionally, the `CosmosException` thrown from failed metadata requests was not caught in `AbstractRetryHandler`, preventing `ClientRetryPolicy` from evaluating it for cross-region failover. The Change Feed Processor's `BootstrapperCore` bypasses the handler pipeline entirely and had no retry logic for regional failures.

Changes
`MetadataRequestThrottleRetryPolicy` - On regional failures (503, 500, 410/LeaseNotFound, 403/DatabaseAccountNotFound, `HttpRequestException`, non-user `OperationCanceledException`):
- Calls `GlobalEndpointManager.MarkEndpointUnavailableForRead()` so the `LocationCache` deprioritizes that region for future requests
- Retries against the next preferred region (existing `IncrementRetryIndex` logic, limited to `PreferredLocationCount`)
- On exhaustion, returns `NoRetry` so the exception propagates to the operation-level retry policy

`AbstractRetryHandler` - Added `catch (CosmosException)` in `ExecuteHttpRequestAsync` so that metadata failures (which throw `CosmosException` from `CosmosHttpClientCore`) reach `ClientRetryPolicy.ShouldRetryAsync(Exception)` for operation-level cross-region retry.

`BootstrapperCore` - Added retry logic for `CosmosException` and `HttpRequestException` in `InitializeAsync()` (up to `MaxInitializationRetries = 3`), since the Change Feed Processor bypasses the handler pipeline and has no `ClientRetryPolicy` coverage.

Testing
- New tests for `MetadataRequestThrottleRetryPolicy` covering: first-attempt retry, retry exhaustion, regional failure status codes, `HttpRequestException`, non-user `OperationCanceledException`, user cancellation, and 429 throttling
- New tests for `BootstrapperCore` covering: `CosmosException` retry + success, `HttpRequestException` retry + success, and retry exhaustion
- Existing `RetryHandlerTests` continue to pass

Type of change
Closing issues
closes #5660