Metadata Retry [Spike]: Adds detached-cancellation model exploration alongside grace-window fix#5828
Conversation
…alongside grace-window fix Adds MetadataDetachedExecutor as an alternative to PR #5806's MetadataRetryHelper. Caller CancellationToken is fully decoupled from the retry loop; it only short- circuits the response path via Task.WhenAny while the underlying read continues on a CTS owned by the executor (2-minute internal deadline). AsyncCache's AsyncLazy dedup ensures any subsequent caller for the same key gets the in-flight result for free. Re-points ClientCollectionCache.GetByRid/GetByName at the new executor so the alternative is exercised end-to-end. MetadataRetryHelper.cs is left in place for side-by-side comparison; deletion deferred until a direction is picked. This branch targets users/ntripician/metadata-retry-fix (PR #5806), not master. The PR description contains the full grace-window vs. detached comparison, cross-SDK applicability (Java is already in the detached model), side-effect analysis, and a conservative recommendation: ship #5806 now, follow up with this. Tests: 10 new MetadataDetachedExecutorTests; 83 existing AsyncCache/retry/cache tests still pass. Build clean with TreatWarningsAsErrors=true. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: 1 pipeline(s) were filtered out due to trigger conditions. |
| /// </summary> | ||
| [TestMethod] | ||
| [Owner("ntripician")] | ||
| public async Task ExecuteAsync_CrossRegionRetryExecutes_EvenWhenCallerTokenCancelsMidFlight() |
There was a problem hiding this comment.
can we do an e2e test with fault injection?
- document operation has an aggressive cancellation token
- no collection cache or pkrange cache populated
- collection/pkrange resolution times out only from region a , x-region retry executes and populates the asynccachenonblocking entry (irrespective of document operation hitting the cancellation token)
| this.sessionContainer, this.retryPolicy.GetRequestPolicy()); | ||
| return TaskHelper.RunInlineIfNeededAsync( | ||
| () => MetadataRetryHelper.ExecuteAsync( | ||
| () => MetadataDetachedExecutor.ExecuteAsync( |
There was a problem hiding this comment.
is this issue only applicable to collection resources?
jeet1995
left a comment
There was a problem hiding this comment.
AsyncCache integration gap — detached task result is orphaned
The executor works correctly in isolation, but the detached model's core claim — "a caller who timed out and retries 100ms later attaches to the same AsyncLazy and gets the result for free" — doesn't hold when composed with AsyncCache.
Call chain when caller CT fires:
CollectionCache.ResolveByRidAsync → AsyncCache.GetAsync(key, initFunc, callerCT) → initFunc: GetByRidAsync → MetadataDetachedExecutor.ExecuteAsync(op, policy, callerCT) → detached retry loop continues in background → caller gets OCE ← OCE propagates up through initFunc ← initFunc faults the AsyncLazy ← AsyncCache catches fault → TryRemoveValue(key) → cache entry DELETED
When the executor throws OCE to the caller, that exception propagates through initFunc, faults the AsyncLazy<T> task, and AsyncCache.GetAsync removes the entry (AsyncCache.cs line 159: this.TryRemoveValue(key, actualValue)). The detached background task is now orphaned — its result has nowhere to land.
A follow-up caller hits a cold cache miss and fires a brand new HTTP metadata read, which is the redundant work the detached model is supposed to eliminate.
Possible fix direction: The detachment needs to happen at the AsyncCache/CollectionCache layer, not inside the executor. The initFunc passed to AsyncCache.GetAsync should always resolve via the detached task (never fault with OCE), so the AsyncLazy stays alive in the cache. Caller cancellation should be observed outside the cache initialization path — e.g., in CollectionCache.ResolveByRidAsync itself, wrapping the AsyncCache.GetAsync call with Task.WhenAny.
Two smaller items:
-
Missing
awaitin test —MetadataDetachedExecutorTests.cs:329:ExecuteAsync_NonPositiveDeadline_Throwsisvoid(synchronous), soAssert.ThrowsExceptionAsyncis fire-and-forget. The test always passes. Should beasync Taskwithawait. -
Unreachable code —
MetadataDetachedExecutor.cs:127:throw new OperationCanceledException(callerCancellationToken)is dead code aftercallerCancellationToken.ThrowIfCancellationRequested()on line 126.
|
Superseded by #5844, which ships the detached-executor approach as the production fix from a fresh worktree, with magic numbers replaced by documented derivations + ConfigurationManager clamps, the full unit-test matrix per the master work-item template, and a Phase 3b fresh-eyes review (.coding-harness/review-feedback-1.json) addressed. |
Metadata Retry [Spike]: Adds detached-cancellation model exploration alongside grace-window fix
Background
PR #5806 fixes the bug where a caller
CancellationTokentiming out at the boundary of the cross-regionfailover decision in
BackoffRetryUtility.ExecuteAsyncsilently preempts the failover and surfaces anOperationCanceledException. The fix is aMetadataRetryHelperthat adds a bounded 10 second gracewindow — if the caller's CT trips, we still allow the in-flight metadata read up to 10s on a detached token
so cross-region failover can complete.
A senior reviewer asked the natural follow-up:
The answer in the PR thread covered the trade-offs at a high level (caller intent, blast radius, surgical fix).
This branch turns that conversation into running code so the team can pick a direction with a concrete
implementation in front of them, including a Java-SDK-parity check.
What's in this branch
Microsoft.Azure.Cosmos/src/MetadataDetachedExecutor.cs— alternative executor that is fully detachedfrom the caller
CancellationTokenfrom the start. Caller CT only short-circuits the response path(via
Task.WhenAny); the underlying retry loop runs on a CTS owned by the executor with a 2-minuteinternal deadline.
Microsoft.Azure.Cosmos/src/Routing/ClientCollectionCache.cs— bothGetByRidAsyncandGetByNameAsyncre-pointed at the new executor. Same call sites, same arguments, no public API change.
Microsoft.Azure.Cosmos/tests/.../MetadataDetachedExecutorTests.cs— 10 unit tests covering success,retry, mid-flight cancellation (the production bug scenario), no-retry policy, deadline, attempt cap,
and argument validation. All pass.
MetadataRetryHelper.csis left in place verbatim so reviewers can compare both approaches withoutflipping branches.
Verification
Approach comparison
A. Bounded grace window (PR #5806 — the production fix)
Pros
minutes after they cancelled.
Cons
so pathological cases still fail. We're shrinking the bug window, not closing it.
threading the wrong one.
B. Detached executor (this branch)
Pros
caller.
BackoffRetryUtility.executeRetrydoes not take a CT; cancellation is Reactorsubscription disposal (lazy). This branch brings .NET to the same operating model. Cross-SDK behavior
becomes consistent for the same regional-outage scenario.
the operation 100ms later attaches to the same AsyncLazy and gets the result without firing a second
metadata read.
Less foot-gunning for future maintainers.
rather than a tactical 10s. Configurable via overload.
Cons
35s leave a 30s read still pending against the gateway/backend. If the caller re-enters the cache the
AsyncLazy dedups them, but if they don't (different CT, fire-and-forget shutdown), there is a
resource cost.
stops" will see the read continue in the background. We believe this is correct (metadata reads are
shared infrastructure, not user-owned work), but it is a real semantic change.
ITraceflows through to a request that may outlive the caller'strace scope. We didn't change the trace plumbing; the activity ID is still the original caller's.
Diagnostics for that orphan tail land on the original
ITraceafter the caller has stopped lookingat it. Not incorrect, but unfamiliar.
(the operation is still using it). We schedule disposal on the operation's continuation, which works
but is the kind of code that gets misread on a future PR. See
MetadataDetachedExecutor.csfinally-block comment.
abort on caller cancellation — the current
BackoffRetryUtilitysemantic is correct for them. So thispattern only fits the "shared infrastructure refresh" call sites:
ClientCollectionCache, eventuallyPartitionKeyRangeCache, address resolver. It's a metadata-reads-only tool.Side-by-side matrix
Impact on
ClientCollectionCacheNo API change. Both
GetByRidAsyncandGetByNameAsynccontinue to takeCancellationToken cancellationTokenand to throw
OperationCanceledExceptionwhen it trips. What changes:observes cancellation when the grace CTS expires.
HTTP completes on its own schedule, populating the AsyncLazy. A caller that retries within the
internal deadline (2 min) gets the in-flight result for free.
This composes cleanly with
AsyncCache<TKey,TValue>'s existing AsyncLazy + CAS pattern (seeAsyncCache.cs): the detached factory delegate produces the value the AsyncLazy is already going tohand out to every other awaiter. There is no new state machine.
PartitionKeyRangeCacheis unaffected — its call sites do not flow caller cancellation throughBackoffRetryUtilitytoday, so neither approach has any work to do there.Cross-SDK applicability
BackoffRetryUtility.executeRetrytakes no CT.RxClientCollectionCachewraps reads inObservableHelper.inlineIfPossible(callbackMethod, retryPolicyInstance)with no caller CT. Cancellation is Reactor subscription disposal, which is lazy and does not preempt in-flight retries.release/azure_data_cosmos-previews)handler/retry_handler.rs::BackOffRetryHandler::sendruns an open-endedloop { sender(request).await; should_retry().await; sleep(after).await; }with no cancellation parameter and no caller-CT check between iterations. Already has a dedicatedMetadataRequestRetryPolicysplit out fromClientRetryPolicyinretry_policies/mod.rs..await. A caller usingtokio::time::timeout(d, op)will, ifdexpires, drop the entire future at the next yield — which interrupts the in-flight HTTP and the retry loop. There is noTask.WhenAny-style separation of "tell the caller we cancelled while leaving the work running." So Rust matches Java on the retry-pipeline design but is closer to today's .NET on the user-visible cancellation shape: callers who compose withtimeoutwill still preempt cross-region failover. The pure detached model (this branch) maps to wrappingBackOffRetryHandler::sendin atokio::spawn-detached task fed by aoneshotchannel back to the caller — feasible but more invasive in Rust because spawning forces'static + Sendbounds on captured state.azure-cosmosasync)_retry_utility_async.ExecuteAsyncaccepts a token and does check it between retries. Same class of bug as .NET.context.Contextdeadlines; metadata-cache refresh path observes the caller context directly.AbortSignal-based; metadata cache wiresabortSignalthrough to fetch.AbortControllerswap analogous to the .NET CTS swap.The detached model is the right answer for the metadata refresh path on every SDK we ship, and the .NET
implementation here is the most complex one to land (because we have the strictest "operation honors its
CT" idiom). Java and Rust are already half-way there at the retry-pipeline layer (no CT threaded into
the loop); Python / Go / JS / Rust would all benefit from a follow-up that adds the explicit
"caller-cancel ≠ in-flight-cancel" separation that this PR introduces for .NET.
Side effects (Skeptic Lens)
AsyncCache dedup, the orphan still serves any subsequent caller, so this is bounded by "one in-flight
read per cold cache key per gateway." Acceptable.
cut that, we'd need to introduce a "background trace" concept — out of scope for this spike.
CancellationTokenSourceand oneTaskCompletionSourceper call. Same orderof magnitude as the existing
MetadataRetryHelper.it more deliberately. No new locks.
continues, the next caller for the same key awaits the same AsyncLazy.
Recommendation
Ship PR #5806 as the production fix for the immediate cross-region-failover bug. It is small,
bounded, and reversible.
Treat this branch as the agreed direction for the next iteration. Once #5806 is in master, follow up
with this detached executor (or an evolution of it) and delete
MetadataRetryHelper.cs. That follow-upshould be paired with the Python/Go/JS analogous changes so all SDKs land in the same operating model
that Java already enjoys.
If reviewers prefer to skip the intermediate step and merge the detached model directly, this branch is
test-clean and ready — but the recommendation above is the conservative path.
Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com