Metadata Retry [Spike]: Adds detached-cancellation model exploration alongside grace-window fix #5828

Closed
NaluTripician wants to merge 1 commit into users/ntripician/metadata-retry-fix from users/ntripician/metadata-retry-detached-exploration

NaluTripician (Contributor) commented May 4, 2026

Metadata Retry [Spike]: Adds detached-cancellation model exploration alongside grace-window fix

This PR is a design spike, not a competing fix.
It targets users/ntripician/metadata-retry-fix (PR #5806) — not master — so reviewers can see the
detached-cancellation alternative as a pure delta on top of the bounded-grace fix.
The recommendation at the bottom is "ship #5806 as-is, treat this as the next-step direction." It is not
"merge this instead of #5806."

Background

PR #5806 fixes the bug where a caller CancellationToken timing out at the boundary of the cross-region
failover decision in BackoffRetryUtility.ExecuteAsync silently preempts the failover and surfaces an
OperationCanceledException. The fix is a MetadataRetryHelper that adds a bounded 10 second grace
window — if the caller's CT trips, we still allow the in-flight metadata read up to 10s on a detached token
so cross-region failover can complete.

A senior reviewer asked the natural follow-up:

"why we do not choose the latter [fully detached cache refresh] instead of extending 10s for graceful retry?"

The answer in the PR thread covered the trade-offs at a high level (caller intent, blast radius, surgical fix).
This branch turns that conversation into running code so the team can pick a direction with a concrete
implementation in front of them, including a Java-SDK-parity check.

What's in this branch

  • Microsoft.Azure.Cosmos/src/MetadataDetachedExecutor.cs — alternative executor that is fully detached
    from the caller CancellationToken from the start. Caller CT only short-circuits the response path
    (via Task.WhenAny); the underlying retry loop runs on a CTS owned by the executor with a 2-minute
    internal deadline.
  • Microsoft.Azure.Cosmos/src/Routing/ClientCollectionCache.cs — both GetByRidAsync and GetByNameAsync
    re-pointed at the new executor. Same call sites, same arguments, no public API change.
  • Microsoft.Azure.Cosmos/tests/.../MetadataDetachedExecutorTests.cs — 10 unit tests covering success,
    retry, mid-flight cancellation (the production bug scenario), no-retry policy, deadline, attempt cap,
    and argument validation. All pass.
  • MetadataRetryHelper.cs is left in place verbatim so reviewers can compare both approaches without
    flipping branches.

Verification

dotnet build Microsoft.Azure.Cosmos\src\Microsoft.Azure.Cosmos.csproj -c Release   # clean
dotnet test  Microsoft.Azure.Cosmos\tests\...Tests.csproj
   --filter "FullyQualifiedName~MetadataDetachedExecutorTests"                      # 10 / 10 pass
   --filter "FullyQualifiedName~MetadataRetryHelperTests|...|AsyncCache|..."        # 83 / 83 pass

Approach comparison

A. Bounded grace window (PR #5806 — the production fix)

caller CT trips → executor opens a 10s grace CTS linked to nothing →
  one more retry iteration runs on the grace token → success or grace expires →
  if grace expires: OCE bubbles to caller; in-flight HTTP eventually completes or
  gets disposed when the grace CTS disposes; AsyncLazy entry is replaced.

Pros

  • Surgical: behavior identical to today on the happy path; only changes behavior in the exact bug window.
  • Bounded blast radius: at most one extra retry iteration, capped at 10s wall-clock.
  • Caller intent is mostly respected: a caller passing a 30s CT does not see metadata work running for
    minutes after they cancelled.
  • Easy to reason about for SREs reading traces: one extra "grace retry" is visible in diagnostics.
  • Backportable to other SDKs without rethinking their cancellation model.

Cons

  • 10s is a magic number. The actual cross-region retry sequence (0.5s + 5s + 30s = ~35s) can exceed it,
    so pathological cases still fail. We're shrinking the bug window, not closing it.
  • Two cancellation tokens floating around (caller + grace) — easy to introduce a future regression by
    threading the wrong one.
  • Not parity with Java, which has been operating in the detached model effectively since day one.
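
The grace-window flow above can be sketched in a few lines. This is a minimal illustration, not the MetadataRetryHelper from #5806: the class name GraceWindowSketch, the ExecuteAsync signature, and the single retry-on-cancel are assumptions made for the example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class GraceWindowSketch
{
    // Hypothetical helper (not the PR's MetadataRetryHelper): one attempt on the
    // caller token, then at most one more attempt on a bounded grace token.
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> attempt,
        CancellationToken callerToken,
        TimeSpan grace)
    {
        try
        {
            return await attempt(callerToken);
        }
        catch (OperationCanceledException) when (callerToken.IsCancellationRequested)
        {
            // The grace CTS is deliberately not linked to the caller token.
            using var graceCts = new CancellationTokenSource(grace);
            return await attempt(graceCts.Token);
        }
    }

    public static async Task Main()
    {
        using var caller = new CancellationTokenSource(TimeSpan.FromMilliseconds(50));
        int calls = 0;

        string result = await ExecuteAsync(
            async token =>
            {
                // The first attempt outlives the caller token; the second returns quickly.
                int n = Interlocked.Increment(ref calls);
                await Task.Delay(n == 1 ? 5000 : 10, token);
                return "collection-metadata (attempt " + n + ")";
            },
            caller.Token,
            grace: TimeSpan.FromSeconds(10));

        Console.WriteLine(result); // collection-metadata (attempt 2)
    }
}
```

If the second attempt also outlives the grace token, the OperationCanceledException from that attempt reaches the caller, which is the "pathological cases still fail" con listed above.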

B. Detached executor (this branch)

caller invokes ExecuteAsync(callerCT) →
  executor creates internalCTS linked to internalDeadline (2 min, hard cap) — NOT linked to callerCT →
  retry loop runs on internalCTS.Token →
  Task.WhenAny(operationTask, callerCT.WhenCanceled()) →
    if callerCT trips first: OCE to caller, operationTask continues to completion in background →
    AsyncCache.AsyncLazy still resolves, so any subsequent caller for the same key
    awaits the in-flight result for free (no duplicate HTTP).
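
A minimal sketch of the flow above, assuming a simplified shape: DetachedExecutorSketch and its ExecuteAsync signature are hypothetical and omit the retry policy, tracing, and attempt cap that the branch's MetadataDetachedExecutor handles. It shows only the separation between the internal token and the caller token, plus disposal of the internal CTS on the operation's continuation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class DetachedExecutorSketch
{
    // Hypothetical sketch, not the branch's MetadataDetachedExecutor.
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        TimeSpan internalDeadline,
        CancellationToken callerToken)
    {
        // Owned by the executor and deliberately NOT linked to callerToken.
        var internalCts = new CancellationTokenSource(internalDeadline);
        Task<T> operationTask = Task.Run(() => operation(internalCts.Token));

        // Dispose the CTS only once the operation has finished, never on caller
        // cancel. Reading Exception marks a background fault as observed.
        _ = operationTask.ContinueWith(
            t => { _ = t.Exception; internalCts.Dispose(); },
            CancellationToken.None,
            TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);

        // Completes as cancelled when the caller token trips.
        var callerCancelled = new TaskCompletionSource<T>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        using (callerToken.Register(() => callerCancelled.TrySetCanceled(callerToken)))
        {
            Task<T> winner = await Task.WhenAny(operationTask, callerCancelled.Task);
            return await winner; // throws OperationCanceledException if the caller won
        }
    }

    public static async Task Main()
    {
        var background = new TaskCompletionSource<string>();
        using var caller = new CancellationTokenSource(TimeSpan.FromMilliseconds(50));
        try
        {
            await ExecuteAsync(
                async token =>
                {
                    await Task.Delay(300, token); // stands in for the metadata read
                    background.TrySetResult("background read finished");
                    return "value";
                },
                TimeSpan.FromMinutes(2),
                caller.Token);
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("caller saw OCE");
        }

        Console.WriteLine(await background.Task);
    }
}
```

In this sketch the caller's cancellation only ends the caller's wait; whether anything can still pick up the background result depends on how the surrounding cache treats the cancelled call.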

Pros

  • Closes the bug window completely. No magic 10s number; the in-flight read is never preempted by a
    caller.
  • Parity with Java SDK. BackoffRetryUtility.executeRetry does not take a CT; cancellation is Reactor
    subscription disposal (lazy). This branch brings .NET to the same operating model. Cross-SDK behavior
    becomes consistent for the same regional-outage scenario.
  • AsyncCache deduplication becomes a feature, not a coincidence. A caller who timed out and retries
    the operation 100ms later attaches to the same AsyncLazy and gets the result without firing a second
    metadata read.
  • Single cancellation source inside the executor. The retry loop only ever observes the internal token.
    Less foot-gunning for future maintainers.
  • Internal deadline is explicit. 2 minutes is a hard ceiling sized to "all four regions exhausted twice"
    rather than a tactical 10s. Configurable via overload.

Cons

  • In-flight HTTP is not aborted on caller cancel. During a regional outage, callers who time out at
    35s leave a 30s read still pending against the gateway/backend. If the caller re-enters the cache the
    AsyncLazy dedups them, but if they don't (different CT, fire-and-forget shutdown), there is a
    resource cost.
  • Behavior change is broader than #5806. Callers who today rely on "I cancel my CT, the metadata read
    stops" will see the read continue in the background. We believe this is correct (metadata reads are
    shared infrastructure, not user-owned work), but it is a real semantic change.
  • Trace ownership is fuzzier. The ITrace flows through to a request that may outlive the caller's
    trace scope. We didn't change the trace plumbing; the activity ID is still the original caller's.
    Diagnostics for that orphan tail land on the original ITrace after the caller has stopped looking
    at it. Not incorrect, but unfamiliar.
  • CTS disposal is non-trivial. The detached CTS cannot be disposed eagerly when the caller cancels
    (the operation is still using it). We schedule disposal on the operation's continuation, which works
    but is the kind of code that gets misread on a future PR. See MetadataDetachedExecutor.cs
    finally-block comment.
  • Harder to backport to data-plane operations. Data-plane reads (point reads, queries) should
    abort on caller cancellation — the current BackoffRetryUtility semantic is correct for them. So this
    pattern only fits the "shared infrastructure refresh" call sites: ClientCollectionCache, eventually
    PartitionKeyRangeCache, address resolver. It's a metadata-reads-only tool.

Side-by-side matrix

| Property | A. Grace window (#5806) | B. Detached (this branch) |
|---|---|---|
| Closes the failover-preemption bug | Mostly (10s window) | Yes |
| Adds a magic timeout number | Yes (10s) | Yes (2min, but explicit) |
| In-flight HTTP cancelled when caller cancels | Yes (after grace) | No |
| Caller sees OCE on their CT | Yes | Yes |
| AsyncCache dedup benefits subsequent callers | Sometimes | Always |
| Parity with Java SDK behavior | No | Yes |
| Pattern is generalizable beyond metadata reads | Yes | No (metadata only) |
| Diff size on top of master | Small | Small (+1 file, 1 swap) |
| Existing test surface | New helper tests | New executor tests + same |
| Risk of regressing non-metadata callers | Low | None (call sites scoped) |

Impact on ClientCollectionCache

No API change. Both GetByRidAsync and GetByNameAsync continue to take CancellationToken cancellationToken
and to throw OperationCanceledException when it trips. What changes:

  • Before (with #5806): caller CT is honored after the bounded grace; in-flight HTTP eventually
    observes cancellation when the grace CTS expires.
  • After (this branch): caller CT short-circuits the response path (Task.WhenAny). The in-flight
    HTTP completes on its own schedule, populating the AsyncLazy. A caller that retries within the
    internal deadline (2 min) gets the in-flight result for free.

This composes cleanly with AsyncCache<TKey,TValue>'s existing AsyncLazy + CAS pattern (see
AsyncCache.cs): the detached factory delegate produces the value the AsyncLazy is already going to
hand out to every other awaiter. There is no new state machine.
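
The deduplication this paragraph relies on can be illustrated with a small stand-in. DedupCacheSketch is hypothetical and is not the SDK's AsyncCache; it uses ConcurrentDictionary.GetOrAdd with Lazy<Task<T>> as an assumed approximation of the AsyncLazy + CAS idea, so that concurrent callers for one key share a single in-flight task.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for AsyncCache; the real class has more responsibilities
// (eviction, stale-value refresh, fault handling).
public sealed class DedupCacheSketch<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, Lazy<Task<TValue>>> entries =
        new ConcurrentDictionary<TKey, Lazy<Task<TValue>>>();

    public Task<TValue> GetAsync(TKey key, Func<Task<TValue>> factory)
    {
        // GetOrAdd may construct several Lazy wrappers under a race, but only the
        // stored one is ever evaluated, so the factory runs once per key.
        return this.entries
            .GetOrAdd(key, _ => new Lazy<Task<TValue>>(factory))
            .Value;
    }
}

public static class DedupDemo
{
    public static async Task Main()
    {
        var cache = new DedupCacheSketch<string, string>();
        int reads = 0;

        Func<Task<string>> read = async () =>
        {
            Interlocked.Increment(ref reads); // counts simulated metadata reads
            await Task.Delay(100);
            return "collection-rid";
        };

        Task<string> first = cache.GetAsync("coll", read);
        Task<string> second = cache.GetAsync("coll", read);
        await Task.WhenAll(first, second);

        Console.WriteLine(reads);                          // 1
        Console.WriteLine(ReferenceEquals(first, second)); // True
    }
}
```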

PartitionKeyRangeCache is unaffected — its call sites do not flow caller cancellation through
BackoffRetryUtility today, so neither approach has any work to do there.

Cross-SDK applicability

| SDK | Today's behavior | Applicability of detached model |
|---|---|---|
| Java | BackoffRetryUtility.executeRetry takes no CT. RxClientCollectionCache wraps reads in ObservableHelper.inlineIfPossible(callbackMethod, retryPolicyInstance) with no caller CT. Cancellation is Reactor subscription disposal, which is lazy and does not preempt in-flight retries. | Already there. No change needed. |
| Rust (release/azure_data_cosmos-previews) | handler/retry_handler.rs::BackOffRetryHandler::send runs an open-ended loop { sender(request).await; should_retry().await; sleep(after).await; } with no cancellation parameter and no caller-CT check between iterations. Already has a dedicated MetadataRequestRetryPolicy split out from ClientRetryPolicy in retry_policies/mod.rs. | Already there for the bug shape, but with a caveat. The bug we're fixing requires an explicit CT-check in the retry path; Rust has none, so the cross-region failover decision can never be silently preempted between iterations. However, Rust async cancellation is "drop-the-future" at the next .await. A caller using tokio::time::timeout(d, op) will, if d expires, drop the entire future at the next yield, which interrupts the in-flight HTTP and the retry loop. There is no Task.WhenAny-style separation of "tell the caller we cancelled while leaving the work running." So Rust matches Java on the retry-pipeline design but is closer to today's .NET on the user-visible cancellation shape: callers who compose with timeout will still preempt cross-region failover. The pure detached model (this branch) maps to wrapping BackOffRetryHandler::send in a tokio::spawn-detached task fed by a oneshot channel back to the caller; feasible but more invasive in Rust because spawning forces 'static + Send bounds on captured state. |
| Python (azure-cosmos async) | _retry_utility_async.ExecuteAsync accepts a token and does check it between retries. Same class of bug as .NET. | Yes, same fix shape. asyncio.CancelledError propagation has the same trade-offs. |
| Go | Uses context.Context deadlines; metadata-cache refresh path observes the caller context directly. | Yes. Would require introducing an internal context for the retry loop and observing the caller context only on the response. Same trade-off (orphan request body). |
| JS/TS | AbortSignal-based; metadata cache wires abortSignal through to fetch. | Yes, same shape; AbortController swap analogous to the .NET CTS swap. |

The detached model is the right answer for the metadata refresh path on every SDK we ship, and the .NET
implementation here is the most complex one to land (because we have the strictest "operation honors its
CT" idiom). Java and Rust are already half-way there at the retry-pipeline layer (no CT threaded into
the loop); Python / Go / JS / Rust would all benefit from a follow-up that adds the explicit
"caller-cancel ≠ in-flight-cancel" separation that this PR introduces for .NET.

Side effects (Skeptic Lens)

  • Resource: during a regional brownout, callers timing out leave orphan requests in flight. With
    AsyncCache dedup, the orphan still serves any subsequent caller, so this is bounded by "one in-flight
    read per cold cache key per gateway." Acceptable.
  • Observability: as noted, the orphan tail logs against the original caller's trace. If we want to
    cut that, we'd need to introduce a "background trace" concept — out of scope for this spike.
  • Memory: one extra CancellationTokenSource and one TaskCompletionSource per call. Same order
    of magnitude as the existing MetadataRetryHelper.
  • Concurrency: the AsyncLazy contract was already designed for this pattern; we're just leaning into
    it more deliberately. No new locks.
  • Cancellation propagation: verified by tests — the caller sees OCE promptly, the detached task
    continues, the next caller for the same key awaits the same AsyncLazy.

Recommendation

Ship PR #5806 as the production fix for the immediate cross-region-failover bug. It is small,
bounded, and reversible.

Treat this branch as the agreed direction for the next iteration. Once #5806 is in master, follow up
with this detached executor (or an evolution of it) and delete MetadataRetryHelper.cs. That follow-up
should be paired with the Python/Go/JS analogous changes so all SDKs land in the same operating model
that Java already enjoys.

If reviewers prefer to skip the intermediate step and merge the detached model directly, this branch is
test-clean and ready — but the recommendation above is the conservative path.


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…alongside grace-window fix

Adds MetadataDetachedExecutor as an alternative to PR #5806's MetadataRetryHelper.
Caller CancellationToken is fully decoupled from the retry loop; it only short-
circuits the response path via Task.WhenAny while the underlying read continues
on a CTS owned by the executor (2-minute internal deadline). AsyncCache's
AsyncLazy dedup ensures any subsequent caller for the same key gets the in-flight
result for free.

Re-points ClientCollectionCache.GetByRid/GetByName at the new executor so the
alternative is exercised end-to-end. MetadataRetryHelper.cs is left in place for
side-by-side comparison; deletion deferred until a direction is picked.

This branch targets users/ntripician/metadata-retry-fix (PR #5806), not master.
The PR description contains the full grace-window vs. detached comparison,
cross-SDK applicability (Java is already in the detached model), side-effect
analysis, and a conservative recommendation: ship #5806 now, follow up with this.

Tests: 10 new MetadataDetachedExecutorTests; 83 existing AsyncCache/retry/cache
tests still pass. Build clean with TreatWarningsAsErrors=true.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azure-pipelines

Azure Pipelines:
1 pipeline(s) were filtered out due to trigger conditions.

/// </summary>
[TestMethod]
[Owner("ntripician")]
public async Task ExecuteAsync_CrossRegionRetryExecutes_EvenWhenCallerTokenCancelsMidFlight()
jeet1995 (Member) commented May 5, 2026

can we do an e2e test with fault injection?

  • document operation has an aggressive cancellation token
  • no collection cache or pkrange cache populated
  • collection/pkrange resolution times out only from region A; the x-region retry executes and populates the AsyncCacheNonBlocking entry (irrespective of the document operation hitting the cancellation token)

  this.sessionContainer, this.retryPolicy.GetRequestPolicy());
  return TaskHelper.RunInlineIfNeededAsync(
- () => MetadataRetryHelper.ExecuteAsync(
+ () => MetadataDetachedExecutor.ExecuteAsync(

is this issue only applicable to collection resources?

jeet1995 (Member) left a comment

AsyncCache integration gap — detached task result is orphaned

The executor works correctly in isolation, but the detached model's core claim — "a caller who timed out and retries 100ms later attaches to the same AsyncLazy and gets the result for free" — doesn't hold when composed with AsyncCache.

Call chain when caller CT fires:

CollectionCache.ResolveByRidAsync
  → AsyncCache.GetAsync(key, initFunc, callerCT)
    → initFunc: GetByRidAsync
      → MetadataDetachedExecutor.ExecuteAsync(op, policy, callerCT)
        → detached retry loop continues in background
        → caller gets OCE
      ← OCE propagates up through initFunc
    ← initFunc faults the AsyncLazy
  ← AsyncCache catches fault → TryRemoveValue(key) → cache entry DELETED

When the executor throws OCE to the caller, that exception propagates through initFunc, faults the AsyncLazy<T> task, and AsyncCache.GetAsync removes the entry (AsyncCache.cs line 159: this.TryRemoveValue(key, actualValue)). The detached background task is now orphaned — its result has nowhere to land.

A follow-up caller hits a cold cache miss and fires a brand new HTTP metadata read, which is the redundant work the detached model is supposed to eliminate.

Possible fix direction: The detachment needs to happen at the AsyncCache/CollectionCache layer, not inside the executor. The initFunc passed to AsyncCache.GetAsync should always resolve via the detached task (never fault with OCE), so the AsyncLazy stays alive in the cache. Caller cancellation should be observed outside the cache initialization path — e.g., in CollectionCache.ResolveByRidAsync itself, wrapping the AsyncCache.GetAsync call with Task.WhenAny.
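
A rough sketch of that direction, under stated assumptions: CacheLayerDetachSketch, ResolveAsync, and the dictionary-of-Lazy storage are hypothetical stand-ins for CollectionCache and AsyncCache, and CancellationToken.None stands in for an internal deadline token. The point is only that the cached task never sees the caller token, while the caller's wait does.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the CollectionCache/AsyncCache pair.
public sealed class CacheLayerDetachSketch
{
    private readonly ConcurrentDictionary<string, Lazy<Task<string>>> entries =
        new ConcurrentDictionary<string, Lazy<Task<string>>>();

    public async Task<string> ResolveAsync(
        string key,
        Func<CancellationToken, Task<string>> read,
        CancellationToken callerToken)
    {
        // The cached task is started without the caller token, so caller
        // cancellation can never fault (and thereby evict) the shared entry.
        // A real implementation would pass an internal deadline token here.
        Task<string> shared = this.entries
            .GetOrAdd(key, _ => new Lazy<Task<string>>(() => read(CancellationToken.None)))
            .Value;

        // Caller cancellation is observed outside the cache initialization path.
        var cancelled = new TaskCompletionSource<string>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        using (callerToken.Register(() => cancelled.TrySetCanceled(callerToken)))
        {
            return await await Task.WhenAny(shared, cancelled.Task);
        }
    }

    public static async Task Main()
    {
        var cache = new CacheLayerDetachSketch();
        int reads = 0;
        Func<CancellationToken, Task<string>> read = async token =>
        {
            Interlocked.Increment(ref reads); // counts simulated metadata reads
            await Task.Delay(200, token);
            return "collection-rid";
        };

        using var impatient = new CancellationTokenSource(TimeSpan.FromMilliseconds(50));
        try
        {
            await cache.ResolveAsync("coll", read, impatient.Token);
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("first caller cancelled");
        }

        // The follow-up caller attaches to the entry the first caller started.
        string value = await cache.ResolveAsync("coll", read, CancellationToken.None);
        Console.WriteLine(value + ", reads = " + reads); // collection-rid, reads = 1
    }
}
```

The sketch never evicts an entry, including one whose read genuinely failed; a real cache would still need to remove faulted entries that were not caused by caller cancellation.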


Two smaller items:

  1. Missing await in test (MetadataDetachedExecutorTests.cs:329): ExecuteAsync_NonPositiveDeadline_Throws is void (synchronous), so Assert.ThrowsExceptionAsync is fire-and-forget. The test always passes. Should be async Task with await.

  2. Unreachable code (MetadataDetachedExecutor.cs:127): throw new OperationCanceledException(callerCancellationToken) is dead code after callerCancellationToken.ThrowIfCancellationRequested() on line 126.
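
The first item can be reproduced without a test framework. In the sketch below, AssertThrowsAsync is a hypothetical stand-in for an async assertion helper such as Assert.ThrowsExceptionAsync: like any async method, it reports failure by faulting the Task it returns, so a synchronous caller that drops that Task never observes the failure.

```csharp
using System;
using System.Threading.Tasks;

public static class MissingAwaitSketch
{
    // Hypothetical stand-in for an async assertion helper: a failed assertion
    // faults the returned Task instead of throwing synchronously.
    public static async Task AssertThrowsAsync<TException>(Func<Task> action)
        where TException : Exception
    {
        try
        {
            await action();
        }
        catch (TException)
        {
            return; // expected exception observed
        }

        throw new InvalidOperationException("Expected " + typeof(TException).Name);
    }

    // Synchronous "test": the Task is dropped, so the failure is never observed.
    public static void BrokenTest()
    {
        _ = AssertThrowsAsync<ArgumentOutOfRangeException>(() => Task.CompletedTask);
    }

    // async Task with await: the failure surfaces to whoever awaits the test.
    public static async Task FixedTest()
    {
        await AssertThrowsAsync<ArgumentOutOfRangeException>(() => Task.CompletedTask);
    }

    public static async Task Main()
    {
        BrokenTest();
        Console.WriteLine("BrokenTest passed (false positive)");

        try
        {
            await FixedTest();
        }
        catch (InvalidOperationException e)
        {
            Console.WriteLine("FixedTest failed: " + e.Message);
        }
    }
}
```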

NaluTripician (Contributor, Author) commented

Superseded by #5844, which ships the detached-executor approach as the production fix from a fresh worktree, with magic numbers replaced by documented derivations + ConfigurationManager clamps, the full unit-test matrix per the master work-item template, and a Phase 3b fresh-eyes review (.coding-harness/review-feedback-1.json) addressed.
