Metadata Retry [Spike]: Adds detached-cancellation model exploration alongside grace-window fix by NaluTripician · Pull Request #5828 · Azure/azure-cosmos-dotnet-v3

NaluTripician · 2026-05-04T22:42:04Z

Metadata Retry [Spike]: Adds detached-cancellation model exploration alongside grace-window fix

This PR is a design spike, not a competing fix.
It targets users/ntripician/metadata-retry-fix (PR #5806) — not master — so reviewers can see the
detached-cancellation alternative as a pure delta on top of the bounded-grace fix.
The recommendation at the bottom is "ship #5806 as-is, treat this as the next-step direction." It is not
"merge this instead of #5806."

Background

PR #5806 fixes the bug where a caller CancellationToken timing out at the boundary of the cross-region
failover decision in BackoffRetryUtility.ExecuteAsync silently preempts the failover and surfaces an
OperationCanceledException. The fix is a MetadataRetryHelper that adds a bounded 10-second grace
window: if the caller's CT trips, we still allow the in-flight metadata read up to 10s on a detached token
so cross-region failover can complete.

A senior reviewer asked the natural follow-up:

"why we do not choose the latter [fully detached cache refresh] instead of extending 10s for graceful retry?"

The answer in the PR thread covered the trade-offs at a high level (caller intent, blast radius, surgical fix).
This branch turns that conversation into running code so the team can pick a direction with a concrete
implementation in front of them, including a Java-SDK-parity check.

What's in this branch

- Microsoft.Azure.Cosmos/src/MetadataDetachedExecutor.cs: alternative executor that is fully detached
  from the caller CancellationToken from the start. Caller CT only short-circuits the response path
  (via Task.WhenAny); the underlying retry loop runs on a CTS owned by the executor with a 2-minute
  internal deadline.
- Microsoft.Azure.Cosmos/src/Routing/ClientCollectionCache.cs: both GetByRidAsync and GetByNameAsync
  re-pointed at the new executor. Same call sites, same arguments, no public API change.
- Microsoft.Azure.Cosmos/tests/.../MetadataDetachedExecutorTests.cs: 10 unit tests covering success,
  retry, mid-flight cancellation (the production bug scenario), no-retry policy, deadline, attempt cap,
  and argument validation. All pass.
- MetadataRetryHelper.cs is left in place verbatim so reviewers can compare both approaches without
  flipping branches.

Verification

dotnet build Microsoft.Azure.Cosmos\src\Microsoft.Azure.Cosmos.csproj -c Release   # clean
dotnet test  Microsoft.Azure.Cosmos\tests\...Tests.csproj
   --filter "FullyQualifiedName~MetadataDetachedExecutorTests"                      # 10 / 10 pass
   --filter "FullyQualifiedName~MetadataRetryHelperTests|...|AsyncCache|..."        # 83 / 83 pass

Approach comparison

A. Bounded grace window (PR #5806 — the production fix)

caller CT trips → executor opens a 10s grace CTS linked to nothing →
  one more retry iteration runs on the grace token → success or grace expires →
  if grace expires: OCE bubbles to caller; in-flight HTTP eventually completes or
  gets disposed when the grace CTS disposes; AsyncLazy entry is replaced.
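
The flow above can be sketched in a few lines of C#. This is a minimal illustration under stated assumptions, not the MetadataRetryHelper from #5806: `GraceWindowSketch`, the `attempt` delegate (one retry iteration), and the `shouldRetry` predicate (a stand-in for the retry policy) are hypothetical names, and backoff delays are omitted.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of approach A (bounded grace window).
public static class GraceWindowSketch
{
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> attempt,   // one retry iteration (assumed shape)
        Func<Exception, bool> shouldRetry,          // stand-in for the retry policy
        TimeSpan grace,                             // 10s in the description above
        CancellationToken callerToken)
    {
        while (true)
        {
            try
            {
                return await attempt(callerToken);
            }
            catch (OperationCanceledException) when (callerToken.IsCancellationRequested)
            {
                // Caller CT tripped: run exactly one more iteration on a token
                // that is linked to nothing except the grace timer.
                using var graceCts = new CancellationTokenSource(grace);

                // If the grace window expires, the OCE thrown here reaches the caller.
                return await attempt(graceCts.Token);
            }
            catch (Exception ex) when (shouldRetry(ex))
            {
                // Retriable failure while the caller token is still live: loop again.
            }
        }
    }
}
```

Note that the sketch still threads two tokens (caller and grace) through the same delegate, which is the regression risk listed under Cons.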

Pros

- Surgical: behavior identical to today on the happy path; only changes behavior in the exact bug window.
- Bounded blast radius: at most one extra retry iteration, capped at 10s wall-clock.
- Caller intent is mostly respected: a caller passing a 30s CT does not see metadata work running for
  minutes after they cancelled.
- Easy to reason about for SREs reading traces: one extra "grace retry" is visible in diagnostics.
- Backportable to other SDKs without rethinking their cancellation model.

Cons

- 10s is a magic number. The actual cross-region retry sequence (0.5s + 5s + 30s = 35.5s) can exceed it,
  so pathological cases still fail. We're shrinking the bug window, not closing it.
- Two cancellation tokens floating around (caller + grace): easy to introduce a future regression by
  threading the wrong one.
- Not parity with Java, which has been operating in the detached model effectively since day one.

B. Detached executor (this branch)

caller invokes ExecuteAsync(callerCT) →
  executor creates internalCTS linked to internalDeadline (2 min, hard cap) — NOT linked to callerCT →
  retry loop runs on internalCTS.Token →
  Task.WhenAny(operationTask, callerCT.WhenCanceled()) →
    if callerCT trips first: OCE to caller, operationTask continues to completion in background →
    AsyncCache.AsyncLazy still resolves, so any subsequent caller for the same key
    awaits the in-flight result for free (no duplicate HTTP).
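
A minimal C# sketch of the flow above, assuming the retry loop is passed in as a delegate. `DetachedExecutorSketch` and its parameter names are hypothetical and are not the MetadataDetachedExecutor in this branch; the sketch only illustrates the separation between "tell the caller we cancelled" and "cancel the work", plus the deferred CTS disposal discussed under Cons.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of approach B (detached executor).
public static class DetachedExecutorSketch
{
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> retryLoop, // assumed: the whole retry loop
        TimeSpan internalDeadline,                  // 2 minutes in the description above
        CancellationToken callerToken)
    {
        callerToken.ThrowIfCancellationRequested();

        // Owned by the executor and deliberately NOT linked to callerToken.
        var internalCts = new CancellationTokenSource(internalDeadline);
        Task<T> operation = Task.Run(() => retryLoop(internalCts.Token));

        // The operation may outlive this call, so dispose the CTS only once it
        // has finished. Reading t.Exception marks a fault as observed.
        _ = operation.ContinueWith(
            t => { _ = t.Exception; internalCts.Dispose(); },
            CancellationToken.None,
            TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);

        // The caller token only short-circuits the response path.
        var callerCancelled = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        using (callerToken.Register(() => callerCancelled.TrySetResult(true)))
        {
            Task first = await Task.WhenAny(operation, callerCancelled.Task);
            if (first != operation)
            {
                // OCE to the caller; the operation keeps running in the background.
                throw new OperationCanceledException(callerToken);
            }
        }

        return await operation;
    }
}
```

In this shape the retry loop only ever observes the internal token, which is the "single cancellation source" property claimed below.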

Pros

- Closes the bug window completely. No magic 10s number; the in-flight read is never preempted by a
  caller.
- Parity with Java SDK. BackoffRetryUtility.executeRetry does not take a CT; cancellation is Reactor
  subscription disposal (lazy). This branch brings .NET to the same operating model. Cross-SDK behavior
  becomes consistent for the same regional-outage scenario.
- AsyncCache deduplication becomes a feature, not a coincidence. A caller who timed out and retries
  the operation 100ms later attaches to the same AsyncLazy and gets the result without firing a second
  metadata read.
- Single cancellation source inside the executor. The retry loop only ever observes the internal token.
  Less foot-gunning for future maintainers.
- Internal deadline is explicit. 2 minutes is a hard ceiling sized to "all four regions exhausted twice"
  rather than a tactical 10s. Configurable via overload.

Cons

- In-flight HTTP is not aborted on caller cancel. During a regional outage, callers who time out at
  35s leave a 30s read still pending against the gateway/backend. If the caller re-enters the cache the
  AsyncLazy dedups them, but if they don't (different CT, fire-and-forget shutdown), there is a
  resource cost.
- Behavior change is broader than #5806. Callers who today rely on "I cancel my CT, the metadata read
  stops" will see the read continue in the background. We believe this is correct (metadata reads are
  shared infrastructure, not user-owned work), but it is a real semantic change.
- Trace ownership is fuzzier. The ITrace flows through to a request that may outlive the caller's
  trace scope. We didn't change the trace plumbing; the activity ID is still the original caller's.
  Diagnostics for that orphan tail land on the original ITrace after the caller has stopped looking
  at it. Not incorrect, but unfamiliar.
- CTS disposal is non-trivial. The detached CTS cannot be disposed eagerly when the caller cancels
  (the operation is still using it). We schedule disposal on the operation's continuation, which works
  but is the kind of code that gets misread on a future PR. See MetadataDetachedExecutor.cs
  finally-block comment.
- Harder to backport to data-plane operations. Data-plane reads (point reads, queries) should
  abort on caller cancellation; the current BackoffRetryUtility semantic is correct for them. So this
  pattern only fits the "shared infrastructure refresh" call sites: ClientCollectionCache, eventually
  PartitionKeyRangeCache, address resolver. It's a metadata-reads-only tool.

Side-by-side matrix

Property	A. Grace window (#5806)	B. Detached (this branch)
Closes the failover-preemption bug	Mostly (10s window)	Yes
Adds a magic timeout number	Yes (10s)	Yes (2min, but explicit)
In-flight HTTP cancelled when caller cancels	Yes (after grace)	No
Caller sees OCE on their CT	Yes	Yes
AsyncCache dedup benefits subsequent callers	Sometimes	Always
Parity with Java SDK behavior	No	Yes
Pattern is generalizable beyond metadata reads	Yes	No (metadata only)
Diff size on top of master	Small	Small (+1 file, 1 swap)
Existing test surface	New helper tests	New executor tests + same
Risk of regressing non-metadata callers	Low	None (call sites scoped)

Impact on `ClientCollectionCache`

No API change. Both GetByRidAsync and GetByNameAsync continue to take CancellationToken cancellationToken
and to throw OperationCanceledException when it trips. What changes:

- Before (with #5806): caller CT is honored after the bounded grace; in-flight HTTP eventually
  observes cancellation when the grace CTS expires.
- After (this branch): caller CT short-circuits the response path (Task.WhenAny). The in-flight
  HTTP completes on its own schedule, populating the AsyncLazy. A caller that retries within the
  internal deadline (2 min) gets the in-flight result for free.

This composes cleanly with AsyncCache<TKey,TValue>'s existing AsyncLazy + CAS pattern (see
AsyncCache.cs): the detached factory delegate produces the value the AsyncLazy is already going to
hand out to every other awaiter. There is no new state machine.

PartitionKeyRangeCache is unaffected — its call sites do not flow caller cancellation through
BackoffRetryUtility today, so neither approach has any work to do there.

Cross-SDK applicability

SDK	Today's behavior	Applicability of detached model
Java	`BackoffRetryUtility.executeRetry` takes no CT. `RxClientCollectionCache` wraps reads in `ObservableHelper.inlineIfPossible(callbackMethod, retryPolicyInstance)` with no caller CT. Cancellation is Reactor subscription disposal, which is lazy and does not preempt in-flight retries.	Already there. No change needed.
Rust (`release/azure_data_cosmos-previews`)	`handler/retry_handler.rs::BackOffRetryHandler::send` runs an open-ended `loop { sender(request).await; should_retry().await; sleep(after).await; }` with no cancellation parameter and no caller-CT check between iterations. Already has a dedicated `MetadataRequestRetryPolicy` split out from `ClientRetryPolicy` in `retry_policies/mod.rs`.	*Already there for the bug shape*, but with a caveat. The bug we're fixing requires an explicit CT-check in the retry path; Rust has none, so the cross-region failover decision can never be silently preempted between iterations. *However*, Rust async cancellation is "drop-the-future" at the next `.await`. A caller using `tokio::time::timeout(d, op)` will, if `d` expires, drop the entire future at the next yield, which interrupts the in-flight HTTP *and* the retry loop. There is no `Task.WhenAny`-style separation of "tell the caller we cancelled while leaving the work running." So Rust matches Java on the retry-pipeline design but is closer to today's .NET on the user-visible cancellation shape: callers who compose with `timeout` will still preempt cross-region failover. The pure detached model (this branch) maps to wrapping `BackOffRetryHandler::send` in a `tokio::spawn`-detached task fed by a `oneshot` channel back to the caller. That is feasible but more invasive in Rust because spawning forces `'static + Send` bounds on captured state.
Python (`azure-cosmos` async)	`_retry_utility_async.ExecuteAsync` accepts a token and does check it between retries. Same class of bug as .NET.	Yes — same fix shape. asyncio.CancelledError propagation has the same trade-offs.
Go	Uses `context.Context` deadlines; metadata-cache refresh path observes the caller context directly.	Yes — would require introducing an internal context for the retry loop and observing the caller context only on the response. Same trade-off (orphan request body).
JS/TS	`AbortSignal`-based; metadata cache wires `abortSignal` through to fetch.	Yes — same shape; `AbortController` swap analogous to the .NET CTS swap.

The detached model is the right answer for the metadata refresh path on every SDK we ship, and the .NET
implementation here is the most complex one to land (because we have the strictest "operation honors its
CT" idiom). Java and Rust are already half-way there at the retry-pipeline layer (no CT threaded into
the loop); Python / Go / JS / Rust would all benefit from a follow-up that adds the explicit
"caller-cancel ≠ in-flight-cancel" separation that this PR introduces for .NET.

Side effects (Skeptic Lens)

- Resource: during a regional brownout, callers timing out leave orphan requests in flight. With
  AsyncCache dedup, the orphan still serves any subsequent caller, so this is bounded by "one in-flight
  read per cold cache key per gateway." Acceptable.
- Observability: as noted, the orphan tail logs against the original caller's trace. If we want to
  cut that, we'd need to introduce a "background trace" concept, which is out of scope for this spike.
- Memory: one extra CancellationTokenSource and one TaskCompletionSource per call. Same order
  of magnitude as the existing MetadataRetryHelper.
- Concurrency: the AsyncLazy contract was already designed for this pattern; we're just leaning into
  it more deliberately. No new locks.
- Cancellation propagation: verified by tests. The caller sees OCE promptly, the detached task
  continues, the next caller for the same key awaits the same AsyncLazy.

Recommendation

Ship PR #5806 as the production fix for the immediate cross-region-failover bug. It is small,
bounded, and reversible.

Treat this branch as the agreed direction for the next iteration. Once #5806 is in master, follow up
with this detached executor (or an evolution of it) and delete MetadataRetryHelper.cs. That follow-up
should be paired with the Python/Go/JS analogous changes so all SDKs land in the same operating model
that Java already enjoys.

If reviewers prefer to skip the intermediate step and merge the detached model directly, this branch is
test-clean and ready — but the recommendation above is the conservative path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…alongside grace-window fix

Adds MetadataDetachedExecutor as an alternative to PR #5806's MetadataRetryHelper. Caller CancellationToken is fully decoupled from the retry loop; it only short-circuits the response path via Task.WhenAny while the underlying read continues on a CTS owned by the executor (2-minute internal deadline). AsyncCache's AsyncLazy dedup ensures any subsequent caller for the same key gets the in-flight result for free.

Re-points ClientCollectionCache.GetByRid/GetByName at the new executor so the alternative is exercised end-to-end. MetadataRetryHelper.cs is left in place for side-by-side comparison; deletion deferred until a direction is picked.

This branch targets users/ntripician/metadata-retry-fix (PR #5806), not master. The PR description contains the full grace-window vs. detached comparison, cross-SDK applicability (Java is already in the detached model), side-effect analysis, and a conservative recommendation: ship #5806 now, follow up with this.

Tests: 10 new MetadataDetachedExecutorTests; 83 existing AsyncCache/retry/cache tests still pass. Build clean with TreatWarningsAsErrors=true.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

azure-pipelines · 2026-05-04T22:42:15Z

Azure Pipelines: 1 pipeline(s) were filtered out due to trigger conditions.

jeet1995 · 2026-05-05T18:46:07Z

+        /// </summary>
+        [TestMethod]
+        [Owner("ntripician")]
+        public async Task ExecuteAsync_CrossRegionRetryExecutes_EvenWhenCallerTokenCancelsMidFlight()


Can we do an e2e test with fault injection?

- document operation has an aggressive cancellation token
- no collection cache or pkrange cache populated
- collection/pkrange resolution times out only from region A; the cross-region retry executes and populates the AsyncCacheNonBlocking entry (irrespective of the document operation hitting the cancellation token)

jeet1995 · 2026-05-05T18:47:12Z

                this.sessionContainer, this.retryPolicy.GetRequestPolicy());
            return TaskHelper.RunInlineIfNeededAsync(
-                () => MetadataRetryHelper.ExecuteAsync(
+                () => MetadataDetachedExecutor.ExecuteAsync(


is this issue only applicable to collection resources?

jeet1995

AsyncCache integration gap — detached task result is orphaned

The executor works correctly in isolation, but the detached model's core claim — "a caller who timed out and retries 100ms later attaches to the same AsyncLazy and gets the result for free" — doesn't hold when composed with AsyncCache.

Call chain when caller CT fires:

CollectionCache.ResolveByRidAsync
  → AsyncCache.GetAsync(key, initFunc, callerCT)
    → initFunc: GetByRidAsync
      → MetadataDetachedExecutor.ExecuteAsync(op, policy, callerCT)
        → detached retry loop continues in background
        → caller gets OCE
      ← OCE propagates up through initFunc
    ← initFunc faults the AsyncLazy
  ← AsyncCache catches fault → TryRemoveValue(key) → cache entry DELETED

When the executor throws OCE to the caller, that exception propagates through initFunc, faults the AsyncLazy<T> task, and AsyncCache.GetAsync removes the entry (AsyncCache.cs line 159: this.TryRemoveValue(key, actualValue)). The detached background task is now orphaned — its result has nowhere to land.

A follow-up caller hits a cold cache miss and fires a brand new HTTP metadata read, which is the redundant work the detached model is supposed to eliminate.

Possible fix direction: The detachment needs to happen at the AsyncCache/CollectionCache layer, not inside the executor. The initFunc passed to AsyncCache.GetAsync should always resolve via the detached task (never fault with OCE), so the AsyncLazy stays alive in the cache. Caller cancellation should be observed outside the cache initialization path — e.g., in CollectionCache.ResolveByRidAsync itself, wrapping the AsyncCache.GetAsync call with Task.WhenAny.
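
One way to picture this fix direction is the C# sketch below. It is an illustration under assumptions, not AsyncCache or CollectionCache code: `DetachedCacheSketch` is a hypothetical stand-in built on a `ConcurrentDictionary` of `Lazy<Task<T>>`, and its eviction on failure is simplified. The point it shows is that the shared initializer never sees the caller token, so a caller timeout cannot fault the cache entry.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the cache layer; AsyncCache itself is not modeled.
public sealed class DetachedCacheSketch<TValue>
{
    private readonly ConcurrentDictionary<string, Lazy<Task<TValue>>> entries =
        new ConcurrentDictionary<string, Lazy<Task<TValue>>>();

    public async Task<TValue> GetAsync(
        string key,
        Func<CancellationToken, Task<TValue>> initFunc,
        TimeSpan internalDeadline,
        CancellationToken callerToken)
    {
        // The shared initializer runs on an internal token only.
        Lazy<Task<TValue>> entry = this.entries.GetOrAdd(
            key,
            k => new Lazy<Task<TValue>>(
                () => this.RunDetachedAsync(k, initFunc, internalDeadline)));
        Task<TValue> shared = entry.Value;

        // Caller cancellation is observed outside the initialization path.
        var cancelled = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        using (callerToken.Register(() => cancelled.TrySetResult(true)))
        {
            if (await Task.WhenAny(shared, cancelled.Task) != shared)
            {
                // The entry stays in the cache; only this caller gives up.
                throw new OperationCanceledException(callerToken);
            }
        }

        return await shared;
    }

    private async Task<TValue> RunDetachedAsync(
        string key,
        Func<CancellationToken, Task<TValue>> initFunc,
        TimeSpan deadline)
    {
        using var cts = new CancellationTokenSource(deadline);
        try
        {
            return await Task.Run(() => initFunc(cts.Token));
        }
        catch
        {
            // Evict only when the shared read itself fails (simplified).
            this.entries.TryRemove(key, out _);
            throw;
        }
    }
}
```

With this shape, a caller that times out and retries shortly afterwards attaches to the same in-flight task instead of triggering a second metadata read.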

Two smaller items:

- Missing await in test (MetadataDetachedExecutorTests.cs:329): ExecuteAsync_NonPositiveDeadline_Throws is void (synchronous), so Assert.ThrowsExceptionAsync is fire-and-forget. The test always passes. Should be async Task with await.
- Unreachable code (MetadataDetachedExecutor.cs:127): throw new OperationCanceledException(callerCancellationToken) is dead code after callerCancellationToken.ThrowIfCancellationRequested() on line 126.
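
The missing-await item can be reproduced without a test framework. In the sketch below, `AssertThrowsAsync` is a hypothetical stand-in for an async assertion helper such as Assert.ThrowsExceptionAsync, and both "tests" wrap an action that does not throw, so the assertion ought to fail in both.

```csharp
using System;
using System.Threading.Tasks;

public static class MissingAwaitDemo
{
    // Hypothetical stand-in for an async assertion helper.
    public static async Task AssertThrowsAsync<TException>(Func<Task> action)
        where TException : Exception
    {
        try
        {
            await action();
        }
        catch (TException)
        {
            return; // expected exception observed
        }

        throw new InvalidOperationException(
            $"Expected {typeof(TException).Name} but nothing was thrown.");
    }

    // Broken shape: the returned Task is dropped, so the failed assertion
    // is never observed and the method returns normally.
    public static void SyncTestThatAlwaysPasses()
    {
        _ = AssertThrowsAsync<ArgumentOutOfRangeException>(() => Task.CompletedTask);
    }

    // Fixed shape: async Task plus await lets the failure reach the caller.
    public static async Task AsyncTestThatCanFail()
    {
        await AssertThrowsAsync<ArgumentOutOfRangeException>(() => Task.CompletedTask);
    }
}
```

Calling `SyncTestThatAlwaysPasses()` completes without error, while awaiting `AsyncTestThatCanFail()` throws, which is the difference a test runner would see.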

NaluTripician · 2026-05-06T20:13:15Z

Superseded by #5844, which ships the detached-executor approach as the production fix from a fresh worktree, with magic numbers replaced by documented derivations + ConfigurationManager clamps, the full unit-test matrix per the master work-item template, and a Phase 3b fresh-eyes review (.coding-harness/review-feedback-1.json) addressed.

jeet1995 reviewed May 5, 2026


NaluTripician mentioned this pull request May 6, 2026

Routing: Adds detached metadata executor decoupling caller cancellation from cross-region failover #5844

Open

NaluTripician closed this May 6, 2026

NaluTripician wants to merge 1 commit into users/ntripician/metadata-retry-fix from users/ntripician/metadata-retry-detached-exploration
