Metadata Retry: Fixes cross-region failover preempted by caller cancellation #5806

Open
NaluTripician wants to merge 11 commits into main from users/ntripician/metadata-retry-fix

@NaluTripician
Contributor

Summary

Fixes #5805.

When the .NET Cosmos SDK has to warm a cold collection-metadata cache (ClientCollectionCache.GetByRidAsync / GetByNameAsync), the ReadCollection call goes through TaskHelper.InlineIfPossible → BackoffRetryUtility<T>.ExecuteAsync. That utility checks the caller's cancellation token before invoking ShouldRetryAsync. When the control-plane HTTP timeout policy burns 0.5 s → 5 s → 30 s (~36 s in total) against an unhealthy region and the caller's ~36.5 s timeout trips, ClientRetryPolicy's "retry to next preferred region" decision is silently preempted and the operation surfaces OperationCanceledException.

Repro was landed in #5787.

Fix

  • New internal MetadataRetryHelper.ExecuteAsync<T>(Func<CancellationToken, Task<T>>, IDocumentClientRetryPolicy, CancellationToken).
  • Consults the retry policy on every exception before honoring caller cancellation.
  • On caller-cancelled + retryable exception: grants one bounded grace window (default 10 s) for a cross-region retry attempt on a detached token. The grace token is threaded through the operation lambda, so the underlying HTTP call is not preempted by the (already cancelled) caller token.
  • Surfaces the original exception (not a grace-timeout OperationCanceledException) when the grace window expires or the grace attempt fails.
  • Wired into ClientCollectionCache.GetByRidAsync and GetByNameAsync.
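
The behavior listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual MetadataRetryHelper: `IRetryDecider`, `AlwaysRetry`, and `GraceRetrySketch` are hypothetical names, the policy is reduced to a boolean decision, and backoff between attempts is omitted.

```csharp
using System;
using System.Runtime.ExceptionServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a retry policy; NOT the SDK's IDocumentClientRetryPolicy.
public interface IRetryDecider
{
    Task<bool> ShouldRetryAsync(Exception exception);
}

// Hypothetical policy that always asks for a retry (for the demo only).
public sealed class AlwaysRetry : IRetryDecider
{
    public Task<bool> ShouldRetryAsync(Exception exception) => Task.FromResult(true);
}

public static class GraceRetrySketch
{
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        IRetryDecider policy,
        TimeSpan grace,
        CancellationToken callerToken)
    {
        if (grace < TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(nameof(grace));
        }

        const int maxAttempts = 20; // defensive cap against a policy that always says "retry"
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            ExceptionDispatchInfo captured;
            try
            {
                return await operation(callerToken);
            }
            catch (Exception ex)
            {
                captured = ExceptionDispatchInfo.Capture(ex);
            }

            // Consult the policy first, so caller cancellation cannot preempt the decision.
            if (!await policy.ShouldRetryAsync(captured.SourceException))
            {
                captured.Throw();
            }

            if (callerToken.IsCancellationRequested)
            {
                if (grace == TimeSpan.Zero)
                {
                    captured.Throw();
                }

                // One bounded attempt on a token detached from the caller's cancelled token.
                using var graceCts = new CancellationTokenSource(grace);
                try
                {
                    return await operation(graceCts.Token);
                }
                catch (Exception)
                {
                    // Grace attempt failed or expired: surface the original failure.
                    captured.Throw();
                }
            }
        }

        throw new InvalidOperationException("Retry attempt cap exceeded.");
    }
}

public static class Program
{
    public static async Task Main()
    {
        using var callerCts = new CancellationTokenSource();
        callerCts.Cancel(); // simulate a caller timeout that has already tripped

        int calls = 0;
        string result = await GraceRetrySketch.ExecuteAsync<string>(
            token =>
            {
                calls++;
                // Fails on the cancelled caller token, succeeds on the grace token.
                token.ThrowIfCancellationRequested();
                return Task.FromResult("served by next region");
            },
            new AlwaysRetry(),
            TimeSpan.FromSeconds(10),
            callerCts.Token);

        Console.WriteLine($"{result} after {calls} calls");
    }
}
```

In this demo the first call observes the cancelled caller token and throws, the policy says "retry", and the second call runs on the grace token, so the program should print "served by next region after 2 calls".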

Scope

  • PartitionKeyRangeCache is not affected by this defect (its call does not pass a caller CT into BackoffRetryUtility); no change needed.
  • AddressCache is intentionally unchanged.
  • Tuning the HTTP timeout escalation (0.5 s / 5 s / 30 s) is tracked separately.
  • Extending the hedging availability strategy to metadata reads is tracked separately as a design doc.

Tests

Microsoft.Azure.Cosmos.Tests/MetadataRetryHelperTests.cs — 10 tests:

  • ExecuteAsync_SucceedsFirstAttempt_NoRetry
  • ExecuteAsync_RetriesOnTransient_WhenPolicyAllows
  • ExecuteAsync_CrossRegionRetryExecutes_EvenWhenCallerTokenCancelled (primary fix validation)
  • ExecuteAsync_GraceIsBounded_SurfacesOriginalExceptionOnSecondFailure
  • ExecuteAsync_PolicyDeniesRetry_SurfacesOriginalExceptionEvenWhenCancelled
  • ExecuteAsync_ZeroGrace_DoesNotRetryAfterCancellation
  • ExecuteAsync_GraceExpires_SurfacesOriginalException
  • LegacyBackoffRetryUtility_CrossRegionRetryIsPreempted_ByCancelledCallerToken (pins the pre-fix buggy behavior so accidental revert is caught)
  • ExecuteAsync_FirstAttemptOCE_PolicyIsConsultedBeforeSurfacing
  • ExecuteAsync_NegativeGrace_Throws

All pass. The companion "legacy" test demonstrates that the pre-fix code path preempts the retry and asserts the buggy behavior — this is the proof that the defect exists and that the fix is required.

Risk / rollout

  • Caller's effective timeout extends by up to 10 s (DefaultCrossRegionRetryGrace) when the cross-region retry grace is invoked. This is opt-in per call via existing cancellation semantics: callers that want the previous behavior can continue to pass any CancellationToken; the grace only triggers when the retry policy says "retry" and the caller token is already cancelled.
  • No API surface changes — new helper is internal.
  • PartitionKeyRangeCache behavior unchanged.

Review

This PR went through two rounds of agent review before opening. Round 1 surfaced three items (operation-token contract comment, first-attempt-OCE regression test, negative-grace validation) which are committed in the second commit. Round 2 returned APPROVE with no findings.

Closes #5805.


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

NaluTripician and others added 2 commits April 22, 2026 13:05
ClientCollectionCache's cold-path ReadCollection metadata call goes through
TaskHelper.InlineIfPossible -> BackoffRetryUtility<T>.ExecuteAsync, which
evaluates the caller's CancellationToken before invoking
IDocumentClientRetryPolicy.ShouldRetryAsync. When the control-plane HTTP
timeout policy burns 0.5s/5s/30s against an unhealthy region and the
caller's ~36.5s timeout trips, the cross-region failover that
ClientRetryPolicy would otherwise execute is silently preempted and the
operation surfaces OperationCanceledException instead of retrying to the
next preferred region.

Introduces MetadataRetryHelper with a bespoke retry loop that:
- Always consults the retry policy on exception, before honoring the
  caller token.
- Grants a single bounded grace window (default 10s) when the caller
  token is already cancelled so availability-critical cross-region
  failover can execute.
- Surfaces the original exception (not the grace-timeout OCE) if the
  grace attempt expires or fails.

Wires the helper into ClientCollectionCache.GetByRidAsync /
GetByNameAsync. PartitionKeyRangeCache is unaffected (its call does not
pass a caller cancellation token into BackoffRetryUtility).

Adds MetadataRetryHelperTests including a companion regression test
(LegacyBackoffRetryUtility_CrossRegionRetryIsPreempted_ByCancelledCallerToken)
that pins the preempted behavior of the shared utility so that if it is
ever fixed upstream the bespoke helper becomes removable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…est, negative grace validation

- Document contract in MetadataRetryHelper that the operation lambda must
  honor the passed CancellationToken and not close over outer tokens.
- Reject negative crossRegionRetryGrace with ArgumentOutOfRangeException.
- Add regression test that pins 'policy consulted before surfacing OCE'
  even when the very first attempt throws OperationCanceledException.
- Add test covering ArgumentOutOfRangeException for negative grace.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@github-actions github-actions Bot left a comment

All good!

@azure-pipelines

Azure Pipelines:
Successfully started running 1 pipeline(s).

@NaluTripician NaluTripician changed the title from "Metadata retry: consult retry policy before honoring caller cancellation (fixes #5805)" to "Metadata Retry: Fixes cross-region failover preempted by caller cancellation" on Apr 22, 2026
@NaluTripician NaluTripician marked this pull request as ready for review April 22, 2026 20:32

…d, attempt cap, test hardening

- MetadataRetryHelper: honor ShouldRetryResult.ExceptionToThrow when policy
  specifies a translated exception (matches BackoffRetryUtility contract).
- MetadataRetryHelper: add MaxAttemptsHardCap (20) defensive bound so a
  misconfigured retry policy cannot spin the loop indefinitely.
- MetadataRetryHelper: TraceWarning on grace-window expiration and
  grace-region failure to improve cross-region outage debuggability.
- MetadataRetryHelper: rename graceTokenUsed -> graceAttempted; document
  that the grace CTS controls when the attempt STARTS, not total duration.
- MetadataRetryHelper: tighten accessibility of ExecuteAsync overloads
  (public -> internal on internal class).
- ClientCollectionCache.GetByRidAsync / GetByNameAsync: wrap
  MetadataRetryHelper.ExecuteAsync in TaskHelper.RunInlineIfNeededAsync to
  preserve the SynchronizationContext guard that TaskHelper.InlineIfPossible
  previously provided for .NET Framework sync-over-async call chains.
- Tests: add detached-grace-token contract assertion in
  ExecuteAsync_CrossRegionRetryExecutes_EvenWhenCallerTokenCancelled; pins
  the invariant that the grace attempt receives a fresh, non-cancelled token
  decoupled from the caller's cancelled token.
- Tests: replace Task.Delay(30s, ct) with TaskCompletionSource + ct.Register
  in ExecuteAsync_GraceExpires_SurfacesOriginalException — deterministic and
  removes 30s worst-case cliff on slow CI runners.
- Tests: add ExecuteAsync_PolicySpecifiesExceptionToThrow regression test
  covering the new ExceptionToThrow code path.
- Tests: remove dead SetupSequence NoRetry() stub in
  ExecuteAsync_RetriesOnTransient_WhenPolicyAllows; simplify to single Setup.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
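
The Task.Delay replacement mentioned in the test-hardening notes above can be illustrated with a small sketch. It assumes a hypothetical helper name (`CancellationWaitSketch.NeverCompletesUntilCancelled`) and is not the test code from this PR; it only shows how a TaskCompletionSource plus CancellationToken.Register yields a task whose completion depends solely on the token, with no wall-clock delay.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationWaitSketch
{
    // Returns a task that never completes on its own and is cancelled only when the token fires.
    // The registration is not disposed here, which is acceptable for a short-lived test helper.
    public static Task<T> NeverCompletesUntilCancelled<T>(CancellationToken token)
    {
        var tcs = new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);
        // Register invokes the callback immediately if the token is already cancelled.
        token.Register(() => tcs.TrySetCanceled(token));
        return tcs.Task;
    }
}

public static class Program
{
    public static async Task Main()
    {
        using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(50));
        try
        {
            await CancellationWaitSketch.NeverCompletesUntilCancelled<int>(cts.Token);
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("cancelled by token");
        }
    }
}
```

Because the task has no timer of its own, a slow CI runner cannot push the test toward a fixed worst-case wait; the only clock involved is the one driving the token.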

@NaluTripician
Contributor Author

@sdkReviewAgent-2

Comment thread Microsoft.Azure.Cosmos/src/MetadataRetryHelper.cs
Comment thread Microsoft.Azure.Cosmos/src/MetadataRetryHelper.cs
@xinlian12
Member

Review complete (48:55)

Posted 3 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

- Preserve original stack trace when ShouldRetryResult.ExceptionToThrow
  references the same instance as the captured exception (mirrors
  ShouldRetryResult.ThrowIfDoneTrying in BackoffRetryUtility).
- Add test for MaxAttemptsHardCap defensive guard.
- Add test for ShouldRetryAsync-throws path surfacing the original
  operation exception.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
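
The stack-trace point in the commit above relies on a general .NET behavior: `throw ex;` resets an exception's stack trace, while ExceptionDispatchInfo.Throw() rethrows with the original trace retained. A minimal sketch of the same-instance check, using a hypothetical helper name (`RethrowSketch.ThrowPreservingStack`) rather than the PR's code:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.ExceptionServices;

public static class RethrowSketch
{
    // Throws `toThrow`. When it is the same instance that was captured, rethrow through the
    // dispatch info so the original stack trace is kept instead of being reset.
    public static void ThrowPreservingStack(ExceptionDispatchInfo captured, Exception toThrow)
    {
        if (ReferenceEquals(captured.SourceException, toThrow))
        {
            captured.Throw();
        }

        throw toThrow;
    }
}

public static class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Fail() => throw new InvalidOperationException("region unavailable");

    public static void Main()
    {
        ExceptionDispatchInfo captured;
        try
        {
            Fail();
            return;
        }
        catch (Exception ex)
        {
            captured = ExceptionDispatchInfo.Capture(ex);
        }

        try
        {
            RethrowSketch.ThrowPreservingStack(captured, captured.SourceException);
        }
        catch (Exception ex)
        {
            // Reports whether the frame of the original throw site is still in the trace.
            Console.WriteLine(ex.StackTrace.Contains(nameof(Fail)));
        }
    }
}
```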

Comment thread Microsoft.Azure.Cosmos/src/MetadataRetryHelper.cs
@xinlian12
Member

Review complete (40:03)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

Covers the production trigger for the grace path: caller's token is live
at attempt start and trips during Task.Delay(backoff, cancellationToken),
exercising the OperationCanceledException catch and the second
IsCancellationRequested check (lines 159-178). Existing cancellation
tests use a pre-cancelled token and never reach this branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

try
{
    shouldRetry = await retryPolicy
        .ShouldRetryAsync(capturedException.SourceException, CancellationToken.None)
Member

ShouldRetryAsync called with CancellationToken.None: This ensures the policy isn't preempted, but it also means a policy that internally relies on the caller's token for its own timeout won't get it.

Thinking out loud: could this have any implications for unbounded retry attempts?

Contributor Author

Bounded by two structural guards: MaxAttemptsHardCap = 20 (covered by ExecuteAsync_ExceedsMaxAttemptsHardCap_ThrowsInvalidOperationException) caps total iterations even if a misconfigured policy returns ShouldRetry=true indefinitely, and the single-shot graceAttempted flag (line 193) ensures at most one cross-region attempt under the bounded crossRegionRetryGrace window after caller cancellation.

The CancellationToken.None is deliberate — ShouldRetryAsync is the policy decision call, not a network op. Threading the caller token into it would re-introduce the exact defect this PR fixes (caller cancellation preempts the cross-region failover decision). None of the in-tree retry policies use ShouldRetryAsync's CT for an internal timeout; they use it only for incidental awaits.

Comment thread Microsoft.Azure.Cosmos/src/MetadataRetryHelper.cs
Addresses review feedback from @kundadebdatta: the entry-event 'granting Nms
grace' message is now TraceVerbose since it is a mid-flow event. Terminal
paths (grace already used / surfacing original exception, grace expired,
grace attempt failed) remain at TraceInformation/TraceWarning so they remain
in default logs during incident triage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Member

@ananth7592 ananth7592 left a comment

Doesn't the hardcoded 10 s grace fundamentally change the timeout contract? When a customer sets a CancellationToken with a 10 s timeout, they expect the operation to terminate within that window. With this change, it can silently extend to ~20 s.

On the flip side, customers with intentional fail-fast patterns cannot opt out of this.

The change looks good; my comment is about the contract.

@NaluTripician
Contributor Author

The user's request will still time out and cancel with their cancellation token; the metadata request to populate the cache will continue in the background. This is because, if the customer has an aggressive fail-fast policy and there is some gateway delay, the caches would otherwise never be populated.

Development

Successfully merging this pull request may close these issues.

Metadata retry: cross-region failover preempted by caller cancellation during control-plane HTTP timeout escalation

4 participants