Metadata Retry: Fixes cross-region failover preempted by caller cancellation#5806
Metadata Retry: Fixes cross-region failover preempted by caller cancellation#5806NaluTripician wants to merge 11 commits intomainfrom
Conversation
ClientCollectionCache's cold-path ReadCollection metadata call goes through TaskHelper.InlineIfPossible -> BackoffRetryUtility<T>.ExecuteAsync, which evaluates the caller's CancellationToken before invoking IDocumentClientRetryPolicy.ShouldRetryAsync. When the control-plane HTTP timeout policy burns 0.5s/5s/30s against an unhealthy region and the caller's ~36.5s timeout trips, the cross-region failover that ClientRetryPolicy would otherwise execute is silently preempted and the operation surfaces OperationCanceledException instead of retrying to the next preferred region. Introduces MetadataRetryHelper with a bespoke retry loop that: - Always consults the retry policy on exception, before honoring the caller token. - Grants a single bounded grace window (default 10s) when the caller token is already cancelled so availability-critical cross-region failover can execute. - Surfaces the original exception (not the grace-timeout OCE) if the grace attempt expires or fails. Wires the helper into ClientCollectionCache.GetByRidAsync / GetByNameAsync. PartitionKeyRangeCache is unaffected (its call does not pass a caller cancellation token into BackoffRetryUtility). Adds MetadataRetryHelperTests including a companion regression test (LegacyBackoffRetryUtility_CrossRegionRetryIsPreempted_ByCancelledCallerToken) that pins the preempted behavior of the shared utility so that if it is ever fixed upstream the bespoke helper becomes removable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…est, negative grace validation - Document contract in MetadataRetryHelper that the operation lambda must honor the passed CancellationToken and not close over outer tokens. - Reject negative crossRegionRetryGrace with ArgumentOutOfRangeException. - Add regression test that pins 'policy consulted before surfacing OCE' even when the very first attempt throws OperationCanceledException. - Add test covering ArgumentOutOfRangeException for negative grace. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
…d, attempt cap, test hardening - MetadataRetryHelper: honor ShouldRetryResult.ExceptionToThrow when policy specifies a translated exception (matches BackoffRetryUtility contract). - MetadataRetryHelper: add MaxAttemptsHardCap (20) defensive bound so a misconfigured retry policy cannot spin the loop indefinitely. - MetadataRetryHelper: TraceWarning on grace-window expiration and grace-region failure to improve cross-region outage debuggability. - MetadataRetryHelper: rename graceTokenUsed -> graceAttempted; document that the grace CTS controls when the attempt STARTS, not total duration. - MetadataRetryHelper: tighten accessibility of ExecuteAsync overloads (public -> internal on internal class). - ClientCollectionCache.GetByRidAsync / GetByNameAsync: wrap MetadataRetryHelper.ExecuteAsync in TaskHelper.RunInlineIfNeededAsync to preserve the SynchronizationContext guard that TaskHelper.InlineIfPossible previously provided for .NET Framework sync-over-async call chains. - Tests: add detached-grace-token contract assertion in ExecuteAsync_CrossRegionRetryExecutes_EvenWhenCallerTokenCancelled; pins the invariant that the grace attempt receives a fresh, non-cancelled token decoupled from the caller's cancelled token. - Tests: replace Task.Delay(30s, ct) with TaskCompletionSource + ct.Register in ExecuteAsync_GraceExpires_SurfacesOriginalException — deterministic and removes 30s worst-case cliff on slow CI runners. - Tests: add ExecuteAsync_PolicySpecifiesExceptionToThrow regression test covering the new ExceptionToThrow code path. - Tests: remove dead SetupSequence NoRetry() stub in ExecuteAsync_RetriesOnTransient_WhenPolicyAllows; simplify to single Setup. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
@sdkReviewAgent-2 |
|
✅ Review complete (48:55) Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
- Preserve original stack trace when ShouldRetryResult.ExceptionToThrow references the same instance as the captured exception (mirrors ShouldRetryResult.ThrowIfDoneTrying in BackoffRetryUtility). - Add test for MaxAttemptsHardCap defensive guard. - Add test for ShouldRetryAsync-throws path surfacing the original operation exception. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
✅ Review complete (40:03) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
Covers the production trigger for the grace path: caller's token is live at attempt start and trips during Task.Delay(backoff, cancellationToken), exercising the OperationCanceledException catch and the second IsCancellationRequested check (lines 159-178). Existing cancellation tests use a pre-cancelled token and never reach this branch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
| try | ||
| { | ||
| shouldRetry = await retryPolicy | ||
| .ShouldRetryAsync(capturedException.SourceException, CancellationToken.None) |
There was a problem hiding this comment.
ShouldRetryAsync called with CancellationToken.None: This ensures the policy isn't preempted, but it also means a policy that internally relies on the caller's token for its own timeout won't get it.
Thinking out loud: Could this have any implication on any unbounded retry attempts ?
There was a problem hiding this comment.
Bounded by two structural guards: MaxAttemptsHardCap = 20 (covered by ExecuteAsync_ExceedsMaxAttemptsHardCap_ThrowsInvalidOperationException) caps total iterations even if a misconfigured policy returns ShouldRetry=true indefinitely, and the single-shot graceAttempted flag (line 193) ensures at most one cross-region attempt under the bounded crossRegionRetryGrace window after caller cancellation.
The CancellationToken.None is deliberate — ShouldRetryAsync is the policy decision call, not a network op. Threading the caller token into it would re-introduce the exact defect this PR fixes (caller cancellation preempts the cross-region failover decision). None of the in-tree retry policies use ShouldRetryAsync's CT for an internal timeout; they use it only for incidental awaits.
Addresses review feedback from @kundadebdatta: the entry-event 'granting Nms grace' message is now TraceVerbose since it is a mid-flow event. Terminal paths (grace already used / surfacing original exception, grace expired, grace attempt failed) remain at TraceInformation/TraceWarning so they remain in default logs during incident triage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
ananth7592
left a comment
There was a problem hiding this comment.
Is 10s hardcoded grace not fundamentally changing the timeout contract? When a customer sets a CancellationToken with a 10s timeout, they
expect the operation to terminate within that window. With this change, it can silently extend to ~20s.
On the flip side, customers who have intentional fail-fast patterns, cannot opt out of this.
The change looks good but the contract is where my comment is
The users request will timeout and cancel with their cancellation token, the metadata request to populate the cache will still go on in the background here. This is because of the customer has an aggressive fail fast policy and there is some gateway delay, the caches will never be populated. |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
Summary
Fixes #5805.
When the .NET Cosmos SDK has to warm a cold collection-metadata cache (
ClientCollectionCache.GetByRidAsync/GetByNameAsync), theReadCollectioncall goes throughTaskHelper.InlineIfPossible→BackoffRetryUtility<T>.ExecuteAsync. That utility checks the caller's cancellation token before invokingShouldRetryAsync. When the control-plane HTTP timeout policy burns 0.5 s → 5 s → 30 s (~36 s) against an unhealthy region and the caller's ~36.5 s timeout trips,ClientRetryPolicy's "retry to next preferred region" decision is silently preempted and the operation surfacesOperationCanceledException.Repro was landed in #5787.
Fix
MetadataRetryHelper.ExecuteAsync<T>(Func<CancellationToken, Task<T>>, IDocumentClientRetryPolicy, CancellationToken).OperationCanceledException) when the grace window expires or the grace attempt fails.ClientCollectionCache.GetByRidAsyncandGetByNameAsync.Scope
PartitionKeyRangeCacheis not affected by this defect (its call does not pass a caller CT intoBackoffRetryUtility); no change needed.AddressCacheis intentionally unchanged.Tests
Microsoft.Azure.Cosmos.Tests/MetadataRetryHelperTests.cs— 10 tests:ExecuteAsync_SucceedsFirstAttempt_NoRetryExecuteAsync_RetriesOnTransient_WhenPolicyAllowsExecuteAsync_CrossRegionRetryExecutes_EvenWhenCallerTokenCancelled(primary fix validation)ExecuteAsync_GraceIsBounded_SurfacesOriginalExceptionOnSecondFailureExecuteAsync_PolicyDeniesRetry_SurfacesOriginalExceptionEvenWhenCancelledExecuteAsync_ZeroGrace_DoesNotRetryAfterCancellationExecuteAsync_GraceExpires_SurfacesOriginalExceptionLegacyBackoffRetryUtility_CrossRegionRetryIsPreempted_ByCancelledCallerToken(pins the pre-fix buggy behavior so accidental revert is caught)ExecuteAsync_FirstAttemptOCE_PolicyIsConsultedBeforeSurfacingExecuteAsync_NegativeGrace_ThrowsAll pass. The companion "legacy" test demonstrates that the pre-fix code path preempts the retry and asserts the buggy behavior — this is the proof that the defect exists and that the fix is required.
Risk / rollout
DefaultCrossRegionRetryGrace) when the cross-region retry grace is invoked. This is opt-in per call via existing cancellation semantics: callers that want the previous behavior can continue to pass any CancellationToken; the grace only triggers when the retry policy says "retry" and the caller token is already cancelled.PartitionKeyRangeCachebehavior unchanged.Review
This PR went through two rounds of agent review before opening. Round 1 surfaced three items (operation-token contract comment, first-attempt-OCE regression test, negative-grace validation) which are committed in the second commit. Round 2 returned
APPROVEwith no findings.Closes #5805.
Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com