Metadata retry: cross-region failover preempted by caller cancellation during control-plane HTTP timeout escalation #5805

@NaluTripician

Summary

When the .NET Cosmos SDK warms a cold collection-metadata cache (ClientCollectionCache.GetByRidAsync / GetByNameAsync), the ReadCollection call is routed through TaskHelper.InlineIfPossible(..., retryPolicy, cancellationToken), which delegates to BackoffRetryUtility<T>.ExecuteAsync in Cosmos.Direct. That utility evaluates cancellationToken.ThrowIfCancellationRequested() before consulting IDocumentClientRetryPolicy.ShouldRetryAsync.

The control-plane HTTP timeout policy escalates at 0.5 s → 5 s → 30 s (~36 s total) against an unhealthy region. When the caller's request-level timeout (typically ~36.5 s) trips during that escalation, ThrowIfCancellationRequested fires on the next loop iteration and throws OperationCanceledException without ever consulting ClientRetryPolicy, which would otherwise have returned "retry to next preferred region". The customer sees a CosmosOperationCanceledException instead of a successful cross-region failover.

This is the bug described in the diagnostic bundle attached to PR #5787.

Reproduction

The repro is included in PR #5787 (ClientRetryPolicyTests + ControlPlaneHedgingReproTests). A concise end-to-end repro:

  1. Preferred regions = [East US 2, Central US] where East US 2 is unhealthy (gateway returns 503 after slow timeouts).
  2. Cold collection cache.
  3. Issue QueryDocument with a ~36.5 s client-side timeout.
  4. Observe: CosmosOperationCanceledException; no HTTP call to Central US.

Expected: the cross-region retry against Central US executes and the query succeeds.
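
For orientation, the client side of steps 1-3 might look roughly like the sketch below. It is not the repro from PR #5787. The endpoint, key, database, and container values are placeholders, the query text is arbitrary, and the sketch assumes a freshly constructed client is enough to get a cold collection cache. Making East US 2 unhealthy (for example through fault injection) is outside what this snippet does.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

static class ColdCacheQueryReproSketch
{
    static async Task Main()
    {
        var options = new CosmosClientOptions
        {
            // Step 1: East US 2 is tried first, Central US is the failover target.
            ApplicationPreferredRegions = new[] { Regions.EastUS2, Regions.CentralUS },
        };

        // Placeholder values: substitute a real account before running.
        using var client = new CosmosClient("<account-endpoint>", "<account-key>", options);

        // Step 2 (assumption): a new client has not resolved this container yet,
        // so the first query has to warm the collection-metadata cache.
        Container container = client.GetContainer("<database>", "<container>");

        // Step 3: caller-side timeout of ~36.5 s, passed as the request token.
        using var timeout = new CancellationTokenSource(TimeSpan.FromSeconds(36.5));
        try
        {
            using FeedIterator<dynamic> iterator =
                container.GetItemQueryIterator<dynamic>("SELECT * FROM c");
            while (iterator.HasMoreResults)
            {
                FeedResponse<dynamic> page = await iterator.ReadNextAsync(timeout.Token);
                Console.WriteLine($"Read {page.Count} item(s)");
            }
        }
        catch (CosmosOperationCanceledException ex)
        {
            // Step 4: the diagnostics list which regions were actually contacted.
            Console.WriteLine(ex.Diagnostics);
        }
    }
}
```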

Root cause

BackoffRetryUtility<T>.ExecuteAsync (in Cosmos.Direct, not editable in this repo) honors the caller token before calling ShouldRetryAsync. See ClientRetryPolicy.ShouldRetryAsync lines 141-156: when it is handed an OperationCanceledException (OCE) it explicitly returns non-retry, but in the preempted path the policy is never consulted at all.

PartitionKeyRangeCache is not affected: its BackoffRetryUtility.ExecuteAsync call does not pass a caller cancellation token.
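
The ordering problem can be illustrated with a small self-contained simulation. This is a simplified sketch, not the Cosmos.Direct source: TokenFirstRetrySketch and NextRegionPolicySketch are hypothetical stand-ins, and the "unhealthy region" is simulated by a task that fails with a TimeoutException while the caller's token is cancelled.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a retry policy that would fail over to the next region.
sealed class NextRegionPolicySketch
{
    public int Consultations { get; private set; }

    public Task<bool> ShouldRetryAsync(Exception error)
    {
        Consultations++;
        return Task.FromResult(true); // "retry to next preferred region"
    }
}

static class TokenFirstRetrySketch
{
    // Simplified token-first loop: the caller token is checked before the policy.
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,
        NextRegionPolicySketch policy,
        CancellationToken callerToken)
    {
        while (true)
        {
            callerToken.ThrowIfCancellationRequested();
            try
            {
                return await operation();
            }
            catch (Exception error)
            {
                // The preemption: a cancelled caller token wins here,
                // so the policy below is never asked about 'error'.
                callerToken.ThrowIfCancellationRequested();
                if (!await policy.ShouldRetryAsync(error))
                {
                    throw;
                }
            }
        }
    }
}

static class TokenFirstDemo
{
    public static async Task<string> RunAsync()
    {
        using var callerTimeout = new CancellationTokenSource();
        var policy = new NextRegionPolicySketch();
        int attempts = 0;

        Task<string> ReadCollectionAsync()
        {
            attempts++;
            if (attempts == 1)
            {
                // Simulate the caller's timeout tripping while the first region times out.
                callerTimeout.Cancel();
                return Task.FromException<string>(
                    new TimeoutException("simulated control-plane timeout"));
            }
            return Task.FromResult("collection metadata from the next region");
        }

        try
        {
            return await TokenFirstRetrySketch.ExecuteAsync<string>(
                ReadCollectionAsync, policy, callerTimeout.Token);
        }
        catch (OperationCanceledException)
        {
            return $"cancelled after {attempts} attempt(s); " +
                   $"policy consulted {policy.Consultations} time(s)";
        }
    }

    static async Task Main() => Console.WriteLine(await RunAsync());
}
```

Running the demo prints `cancelled after 1 attempt(s); policy consulted 0 time(s)`: the second attempt, which would have succeeded, is never made because the policy is never asked.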

Fix

See PR #5806. New internal MetadataRetryHelper runs a bespoke retry loop that:

  • Always consults the retry policy on exception, before honoring the caller token.
  • If the caller token is cancelled and the policy says the failure is retryable: grants ONE bounded grace window (default 10 s) for a cross-region retry attempt using a detached token.
  • Surfaces the original exception (not the grace-timeout OCE) if the grace window expires or the grace attempt fails.

Wired into ClientCollectionCache.GetByRidAsync and GetByNameAsync.
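
The three rules above can be sketched as a standalone loop. This is an illustration of the described behavior, not the MetadataRetryHelper code from PR #5806: PolicyFirstRetrySketch, its delegate-based policy parameter, and the simulated failure are all assumptions made for the example. A real policy would also bound the number of retries, which this sketch leaves out.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class PolicyFirstRetrySketch
{
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        Func<Exception, Task<bool>> shouldRetryAsync, // hypothetical stand-in for the retry policy
        TimeSpan graceWindow,
        CancellationToken callerToken)
    {
        while (true)
        {
            try
            {
                return await operation(callerToken);
            }
            catch (Exception original)
            {
                // 1. Consult the policy first, whatever the state of the caller token.
                if (!await shouldRetryAsync(original))
                {
                    throw;
                }

                // 2. Caller is still waiting: ordinary retry.
                if (!callerToken.IsCancellationRequested)
                {
                    continue;
                }

                // 3. Caller cancelled but the failure is retryable: one bounded
                //    grace attempt on a token that is not linked to the caller's.
                using var grace = new CancellationTokenSource(graceWindow);
                try
                {
                    return await operation(grace.Token);
                }
                catch (Exception)
                {
                    // Swallow the grace failure (including a grace-timeout OCE).
                }

                throw; // rethrows 'original', not the grace failure
            }
        }
    }
}

static class PolicyFirstDemo
{
    public static async Task<string> RunAsync()
    {
        using var callerTimeout = new CancellationTokenSource();
        int attempts = 0;

        Task<string> ReadCollectionAsync(CancellationToken token)
        {
            attempts++;
            if (attempts == 1)
            {
                // Simulate the caller's timeout tripping while the first region times out.
                callerTimeout.Cancel();
                return Task.FromException<string>(
                    new TimeoutException("simulated control-plane timeout"));
            }
            return Task.FromResult(
                $"attempt {attempts} succeeded; token cancelled: {token.IsCancellationRequested}");
        }

        return await PolicyFirstRetrySketch.ExecuteAsync<string>(
            ReadCollectionAsync,
            error => Task.FromResult(error is TimeoutException),
            TimeSpan.FromSeconds(10), // grace window default quoted in this issue
            callerTimeout.Token);
    }

    static async Task Main() => Console.WriteLine(await RunAsync());
}
```

With the same simulated failure as in the defect, the demo prints `attempt 2 succeeded; token cancelled: False`: the grace attempt runs on its own token and completes even though the caller's token is already cancelled.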

Related

Cross-SDK tracking

The same class of defect was investigated across all maintained Cosmos SDKs. Sibling issues:
