Summary
When the .NET Cosmos SDK warms a cold collection-metadata cache (ClientCollectionCache.GetByRidAsync / GetByNameAsync), the ReadCollection call is routed through TaskHelper.InlineIfPossible(..., retryPolicy, cancellationToken), which delegates to BackoffRetryUtility<T>.ExecuteAsync in Cosmos.Direct. That utility evaluates cancellationToken.ThrowIfCancellationRequested() before consulting IDocumentClientRetryPolicy.ShouldRetryAsync.
The control-plane HTTP timeout policy escalates at 0.5 s → 5 s → 30 s (~36 s total) against an unhealthy region. When the caller's request-level timeout (typically ~36.5 s) trips during that escalation, ThrowIfCancellationRequested fires on the next iteration and throws OperationCanceledException without ever consulting ClientRetryPolicy — which would otherwise have returned "retry to next preferred region". The customer sees a CosmosOperationCanceledException instead of a successful cross-region failover.
This is the bug described in the diagnostic bundle attached to PR #5787.
Reproduction
The repro is included in PR #5787 (ClientRetryPolicyTests + ControlPlaneHedgingReproTests). A concise end-to-end repro:
- Preferred regions = [East US 2, Central US], where East US 2 is unhealthy (gateway returns 503 after slow timeouts).
- Cold collection cache.
- Issue QueryDocument with a ~36.5 s client-side timeout.
- Observe: CosmosOperationCanceledException; no HTTP call to Central US.
Expected: the cross-region retry against Central US executes and the query succeeds.
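For orientation, the client side of these steps might look roughly like the sketch below. It is not the test code from PR #5787. The database and container names ("db", "coll"), the query text, and the ColdCacheRepro class are placeholders, and making East US 2 unhealthy needs fault injection or a test double that the sketch does not include. A freshly constructed client starts with a cold collection cache.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class ColdCacheRepro
{
    // endpoint/key are supplied by the caller; "db" and "coll" are placeholder names.
    public static async Task RunAsync(string endpoint, string key)
    {
        var options = new CosmosClientOptions
        {
            // East US 2 first, so the cold-cache ReadCollection targets it before Central US.
            ApplicationPreferredRegions = new List<string> { Regions.EastUS2, Regions.CentralUS },
        };

        // A new client has not resolved any collection metadata yet (cold cache).
        using var client = new CosmosClient(endpoint, key, options);
        Container container = client.GetContainer("db", "coll");

        // Client-side timeout of ~36.5 s, as in the steps above.
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(36.5));
        using FeedIterator<dynamic> iterator =
            container.GetItemQueryIterator<dynamic>("SELECT * FROM c");

        try
        {
            while (iterator.HasMoreResults)
            {
                FeedResponse<dynamic> page = await iterator.ReadNextAsync(cts.Token);
                Console.WriteLine($"Read {page.Count} item(s)");
            }
        }
        catch (CosmosOperationCanceledException ex)
        {
            // Observed with the defect: cancellation surfaces and the diagnostics
            // show no request to Central US.
            Console.WriteLine(ex.Diagnostics);
        }
    }
}
```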
Root cause
BackoffRetryUtility<T>.ExecuteAsync (in Cosmos.Direct, not editable in this repo) honors the caller token before calling ShouldRetryAsync. See ClientRetryPolicy.ShouldRetryAsync lines 141-156: on OperationCanceledException (OCE) it explicitly returns non-retry, but in the preempted path the policy is never consulted at all.
PartitionKeyRangeCache is not affected (its BackoffRetryUtility.ExecuteAsync call does not thread a caller cancellation token).
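The ordering problem can be modeled with a small stand-alone loop. This is a simplified sketch, not the Cosmos.Direct source: PreemptingRetryLoop, the delegate-based policy, and the exact position of the cancellation check are assumptions made for illustration, and backoff delays are omitted.

```csharp
using System;
using System.Runtime.ExceptionServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical model of a retry loop that honors the caller token before the policy.
public static class PreemptingRetryLoop
{
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        Func<Exception, Task<bool>> shouldRetryAsync,
        CancellationToken cancellationToken)
    {
        while (true)
        {
            Exception failure;
            try
            {
                return await operation(cancellationToken);
            }
            catch (Exception ex)
            {
                failure = ex;
            }

            // The caller token wins: if the deadline elapsed during the attempt,
            // this throws and the policy below is never asked.
            cancellationToken.ThrowIfCancellationRequested();

            if (!await shouldRetryAsync(failure))
            {
                ExceptionDispatchInfo.Capture(failure).Throw();
            }
        }
    }
}

public static class Program
{
    public static async Task Main()
    {
        using var cts = new CancellationTokenSource();
        int policyCalls = 0;

        try
        {
            await PreemptingRetryLoop.ExecuteAsync<int>(
                operation: _ =>
                {
                    cts.Cancel(); // simulate the caller deadline elapsing mid-attempt
                    throw new InvalidOperationException("503 from unhealthy region");
                },
                shouldRetryAsync: _ => { policyCalls++; return Task.FromResult(true); },
                cancellationToken: cts.Token);
        }
        catch (OperationCanceledException)
        {
            // prints "cancelled; policy consulted 0 time(s)"
            Console.WriteLine($"cancelled; policy consulted {policyCalls} time(s)");
        }
    }
}
```

Even though the stand-in policy would always answer "retry", it is never invoked once the token has been cancelled, which mirrors the preempted path described above.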
Fix
See PR #5806. New internal MetadataRetryHelper runs a bespoke retry loop that:
- Always consults the retry policy on exception, before honoring the caller token.
- On caller-cancelled + retryable: grants ONE bounded grace window (default 10 s) for a cross-region retry attempt using a detached token.
- Surfaces the original exception (not grace-timeout OCE) if grace expires or the grace attempt fails.
Wired into ClientCollectionCache.GetByRidAsync and GetByNameAsync.
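The three rules above can be sketched as a loop like the following. This is an illustration only, not the MetadataRetryHelper code from PR #5806: GraceRetrySketch, its parameters, and the choice to rethrow the pre-grace failure as "the original exception" are assumptions, and backoff between normal retries is left out.

```csharp
using System;
using System.Runtime.ExceptionServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a policy-first retry loop with one bounded grace attempt.
public static class GraceRetrySketch
{
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        Func<Exception, Task<bool>> shouldRetryAsync,
        TimeSpan graceWindow,
        CancellationToken callerToken)
    {
        while (true)
        {
            Exception failure;
            try
            {
                return await operation(callerToken);
            }
            catch (Exception ex)
            {
                failure = ex;
            }

            // Rule 1: consult the policy before looking at the caller token.
            if (!await shouldRetryAsync(failure))
            {
                ExceptionDispatchInfo.Capture(failure).Throw();
            }

            if (!callerToken.IsCancellationRequested)
            {
                continue; // ordinary retry under the caller's token
            }

            // Rule 2: caller cancelled but the failure is retryable, so make ONE
            // attempt under a detached token bounded by the grace window.
            using (var graceCts = new CancellationTokenSource(graceWindow))
            {
                try
                {
                    return await operation(graceCts.Token);
                }
                catch (Exception)
                {
                    // Rule 3: surface the original failure, not the grace-attempt error.
                    ExceptionDispatchInfo.Capture(failure).Throw();
                    throw; // not reached at runtime; Throw() above never returns
                }
            }
        }
    }
}

public static class Program
{
    public static async Task Main()
    {
        using var cts = new CancellationTokenSource();
        int attempt = 0;

        string result = await GraceRetrySketch.ExecuteAsync<string>(
            operation: token =>
            {
                if (attempt++ == 0)
                {
                    cts.Cancel(); // caller deadline elapses during the first attempt
                    throw new InvalidOperationException("503 from East US 2");
                }
                token.ThrowIfCancellationRequested();
                return Task.FromResult("served by Central US");
            },
            shouldRetryAsync: _ => Task.FromResult(true),
            graceWindow: TimeSpan.FromSeconds(10),
            callerToken: cts.Token);

        Console.WriteLine(result); // prints "served by Central US"
    }
}
```

Bounding the grace attempt with its own CancellationTokenSource keeps the extra latency past the caller's deadline capped at the grace window.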
Related
Cross-SDK tracking
The same class of defect was investigated across all maintained Cosmos SDKs. Sibling issues:
retryWhen operator structurally isolates subscription cancellation from the retry decision, so the defect does not reproduce.