Routing: Adds detached metadata executor decoupling caller cancellation from cross-region failover #5844
NaluTripician wants to merge 18 commits into
Introduces an internal MetadataDetachedExecutor that runs metadata-cache reads on a detached, internally-bounded CancellationToken and observes the caller's CancellationToken only on the response path. The retry-policy decision is therefore never preempted by caller-cancel, fixing the cross-region-failover preemption bug from issue #5805. ConfigurationManager exposes a configurable hard deadline (AZURE_COSMOS_METADATA_DETACHED_HARD_DEADLINE_SECONDS, default 5 min) so the detached attempt cannot leak background work indefinitely. A defensive 50-attempt cap guards against a misbehaving retry policy returning ShouldRetry=true with zero backoff. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tByRid/GetByName Repoints both metadata-cache-feeder factories from TaskHelper.InlineIfPossible (which delegates to BackoffRetryUtility, the source of the caller-cancel preemption) to MetadataDetachedExecutor. TaskHelper.RunInlineIfNeededAsync still wraps for NETFX SynchronizationContext safety. Caller CancellationToken is preserved at the entry-side ThrowIfCancellationRequested() gate and observed by the executor only on the response path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pins behavior of detached-cancellation execution model: success path, transient retry, primary-fix scenario where cross-region retry executes on detached token after caller-cancel mid-flight, caller OCE surfacing while detached task continues, already-cancelled caller token, CancellationToken.None fast path, policy NoRetry/ExceptionToThrow/throws, internal-deadline bound, hard attempt cap, first-attempt OCE consults policy, null-arg validation, non-positive deadline, backoff honored, SyncContext smoke test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…p per fresh-eyes review

Fresh-eyes review (.coding-harness/review-feedback-1.json) flagged:
- R1.1 (major): GetMetadataDetachedHardDeadline returned an unbounded TimeSpan. An envvar value larger than ~uint.MaxValue-1 ms (~49.7 days) would make new CancellationTokenSource(TimeSpan) throw ArgumentOutOfRangeException, breaking every metadata read. Added MaxMetadataDetachedHardDeadlineInSeconds=86400 (24h) clamp; new test verifies a 60-day envvar value is clamped and the resulting TimeSpan constructs a CancellationTokenSource without throwing.
- R1.3 (minor): the attempt-cap throw discarded last-failure context. Hoisted lastCapturedException above the loop; cap path now passes its SourceException as InnerException and traces type+message so the cap and the underlying failure are both diagnosable.
- R1.5 (nit): doc comment referenced a literal '5 minutes' default; now points at ConfigurationManager.DefaultMetadataDetachedHardDeadlineInSeconds so the doc cannot drift from the constant.
- R1.2 (nit): renamed ExecuteAsync_CancellationTokenNone_FastPath_NoCallerOcePropagation to ExecuteAsync_CancellationTokenNone_SucceedsAndOperationReceivesNonCanceledToken so the test name matches what it actually proves; the fast-path micro-optimization is not directly observable from outside the executor.

Tests: 19/19 MetadataDetachedExecutor pass (was 17; +2 clamp tests); 60/60 in CollectionCache/ClientRetry/ConfigurationManager regression slice.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
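The clamp described in R1.1 can be sketched roughly as below. This is a minimal illustration, not the SDK's ConfigurationManager code: the class and method names are hypothetical, the 30 s lower bound is taken from the clamp range quoted in this PR's discussion, and falling back to the default on unparsable input is an assumption.

```csharp
using System;
using System.Globalization;

// Hypothetical sketch; names are illustrative and not taken from the SDK.
public static class DeadlineClampSketch
{
    public const int DefaultSeconds = 300;  // 5 min default quoted in the PR
    public const int MinSeconds = 30;       // lower clamp bound quoted in the PR
    public const int MaxSeconds = 86400;    // 24 h upper clamp bound from R1.1

    // Parses a raw environment-variable value and clamps it into
    // [MinSeconds, MaxSeconds]. Unparsable input falls back to the
    // default (an assumption made for this sketch).
    public static TimeSpan ParseHardDeadline(string rawValue)
    {
        if (!long.TryParse(rawValue, NumberStyles.Integer,
                CultureInfo.InvariantCulture, out long seconds))
        {
            return TimeSpan.FromSeconds(DefaultSeconds);
        }

        long clamped = Math.Min(Math.Max(seconds, MinSeconds), MaxSeconds);
        return TimeSpan.FromSeconds(clamped);
    }
}
```

With this shape, a 60-day value such as "5184000" yields a 24 h TimeSpan, which is safe to pass to new CancellationTokenSource(TimeSpan).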
Phase 7: Cross-SDK parity research
Research artifact: Of the four non-.NET Cosmos DB SDKs, three are genuinely vulnerable to the same bug class:
Remediation sketches (idiomatic per language) are captured in |
Addresses iteration-2 deep review findings on PR #5844:
- R2.4 (Correctness): Surface underlying exception when internal deadline trips during the operation lambda, not just during Task.Delay backoff. Adds a top-of-loop OCE-due-to-detached-token guard that surfaces the prior captured exception, preserving the design contract that callers see the failure mode that drove the retry (not a hard-deadline artifact).
- R2.5 (Documentation): Reword the AsyncCache caveat doc-comment to accurately describe in-flight reuse semantics. Concurrent callers do not share the eventual successful result after the first caller cancels; AsyncCache discards the OCE-faulted lazy and the second caller starts a fresh detached attempt. The real benefit is side-effect accrual (LocationCache region marking, session clearing), not result reuse.
- R2.9 (Style): Change ConfigurationManager.GetMetadataDetachedHardDeadline accessibility from public to internal for consistency with the internal-static class containing it.
- R2.10 (Concurrency): Add Task.Yield() when the retry policy returns BackoffTime <= TimeSpan.Zero, bounding CPU and giving the threadpool a chance to schedule other work. Limits amplification of a misbehaving policy that returns ShouldRetry=true with zero backoff.
- R2.11 (Documentation): Add comment explaining ContinueWith inline-completion ordering for disposeWhenDone, warning future maintainers not to read detachedCts after the registration.
- R2.13 (Testing): Add [DoNotParallelize] to MetadataDetachedExecutorTests so the env-var clamp tests are isolated from MSTest class-level parallelism.
- R2.1 (Diagnostics): Document the post-cancel trace/stats mutation as a known limitation in the executor's doc-comment. The full fix (isolate detached task into a child trace tree, merge only on success) is a follow-up tracked separately to keep this fix scoped.

Adds regression test ExecuteAsync_DeadlineTripsDuringOperation_SurfacesUnderlyingException pinning the R2.4 contract: when the deadline trips during operation execution with prior failures, the underlying DocumentClientException surfaces, not the deadline OCE. Test results: 20/20 MetadataDetachedExecutor tests pass; 60/60 regression slice (CollectionCache | ClientRetry | ConfigurationManager) pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
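The attempt cap and the zero-backoff yield from R2.10 can be sketched together as follows. This is a simplified illustration under stated assumptions, not the executor's actual loop: the delegate-based shouldRetry parameter is a hypothetical stand-in for IDocumentClientRetryPolicy, the exception type thrown at the cap is an assumption, and the deadline-trip handling from R2.4 is deliberately left out.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; not the SDK's ExecuteRetryLoopAsync.
public static class BoundedRetrySketch
{
    public const int MaxAttempts = 200; // mirrors the MaxAttemptsHardCap value quoted in the PR

    // shouldRetry returns null to stop retrying, or a backoff to wait before
    // the next attempt (stand-in for a retry-policy decision).
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        Func<Exception, TimeSpan?> shouldRetry,
        CancellationToken detachedToken)
    {
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= MaxAttempts; attempt++)
        {
            try
            {
                return await operation(detachedToken).ConfigureAwait(false);
            }
            catch (Exception ex)
            {
                lastFailure = ex;
                TimeSpan? backoff = shouldRetry(ex);
                if (backoff == null)
                {
                    throw; // policy said stop: surface the failure unchanged
                }

                if (backoff.Value <= TimeSpan.Zero)
                {
                    // Zero backoff: yield so a misbehaving policy cannot
                    // monopolize the thread between attempts.
                    await Task.Yield();
                }
                else
                {
                    await Task.Delay(backoff.Value, detachedToken).ConfigureAwait(false);
                }
            }
        }

        // Cap reached: keep the last failure as InnerException so both the
        // cap and the underlying cause stay diagnosable.
        throw new InvalidOperationException(
            $"Retry cap of {MaxAttempts} attempts reached.", lastFailure);
    }
}
```

The yield does not slow a tight loop down much; the attempt cap is what bounds it, which is why the sketch keeps both.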
…er third deep-review pass

Addresses iteration-3 deep review findings on PR #5844 (merge_recommendation: ready, 0 blocking):
- R3.1 (Recommendation): Add comment in ExecuteRetryLoopAsync explaining the asymmetry of the OCE-during-operation guard's third filter clause. When previousException is itself OCE, the filter intentionally falls through to the general catch path because swapping one OCE for another offers no diagnostic gain and the general path correctly funnels through policy/hard-cap/backoff-catch termination.
- R3.2 (Recommendation): Add 'Retry-policy invariant' paragraph to executor's <summary> documenting that the supplied IDocumentClientRetryPolicy MUST be a per-call instance because ShouldRetryAsync is intentionally NOT invoked on either OCE termination path. A future refactor that caches policies must preserve this invariant or move OCE termination paths through ShouldRetryAsync.
- R3.3 (Suggestion): Bump ExecuteAsync_DeadlineTripsDuringOperation_SurfacesUnderlyingException internal-deadline from 200ms to 2s to remove CI flakiness risk on saturated runners. The test still verifies the same R2.4 contract; only the wall-clock generosity changes.

Skipped (non-blocking, deferred or rejected):
- R3.4 (Suggestion): Split tests into two classes for parallelism. Over-engineering for zero observed flakes; class-level [DoNotParallelize] is acceptable for a 20-test class that completes in <2s.
- R3.5 (Observation): AsyncCache fall-through coalescer. Tracked as follow-up work item.
- R3.6 (Observation): Split <summary> doc-comment into <remarks>. Cosmetic; current structure renders correctly in IDE tooltips.

Test results: 20/20 MetadataDetachedExecutor tests pass (~2s); build clean (0 errors).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Deep Review Summary (3 iterations)
This PR has been through three iterations of local deep-review using a Sonnet/Opus emulation of the cosmos-sdk-copilot-toolkit's PR Deep Reviewer agent. All findings have been triaged and the relevant ones addressed in code; the rest are tracked as documented follow-ups.
Iteration 1 (5 findings → resolved in 22cbcf4)
Iteration 2 (13 findings → 7 resolved in 09bd57e, 6 deferred)
Resolved in code:
Deferred to follow-ups:
Iteration 3 (6 findings → 3 addressed in 8b1ec79, 3 deferred; merge_recommendation: ready, 0 blocking)
Audit of iteration-2 fixes: R2.4, R2.5, R2.9, R2.10, R2.11, R2.13 all verified resolved. R2.1 verified as documented partial per the deferred plan.
Resolved:
Skipped:
Final state
…ation
- Replace object.ReferenceEquals(Exception, Exception) with reference equality '==' on typed locals to avoid the CDX1000 analyzer error (boxing Exception to object on the metadata hot path).
- Re-derive MaxAttemptsHardCap against the actual SDK retry policies: the dominant per-call retry ceiling is ClientRetryPolicy.MaxRetryCount = 120 (cross-region failover counter), not the previously-claimed '5 preferred regions x 10 in-region retries = 50'. Bump the cap to 200 (120 + ~80 headroom for stacked throttling/session/serviceUnavailable retries) and rewrite the doc comment to cite the real source constants.
- Update the matching DefaultMetadataDetachedHardDeadlineInSeconds doc comment to derive 300 s from the per-region 1+5+65 s timeout ladder and a typical ~3-5 region failover sweep, rather than the wrong '5 x 36 s' rationale.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nd policy distinction
Tighten DefaultMetadataDetachedHardDeadlineInSeconds doc comment to:
- explicitly cite the wrapped call site (ClientCollectionCache.ReadCollectionAsync) and the GetTimeoutPolicy branch that routes it to the HotPath policy.
- name the slower HttpTimeoutPolicyControlPlaneRead ladder (5+10+20 = 35 s/region) used by GatewayAccountReader, with a note that the executor does not wrap that path today.

Keeps the comment self-correcting if the executor's surface ever expands to account reads.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@sdkReviewAgent
✅ Review complete (46:58)
No new comments; existing review coverage is sufficient.
Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
…hed CT Adds two mock-based regression tests asserting that ClientCollectionCache.GetByRidAsync and GetByNameAsync route through MetadataDetachedExecutor and pass the executor-owned detached CancellationToken (NOT the caller's token) into the inner ReadCollectionAsync lambda. Addresses SDK review agent feedback on PR #5844: a regression that reverts either lambda to the caller's CancellationToken would silently reintroduce the cross-region failover preemption bug (issue #5805); the existing MetadataDetachedExecutorTests would still pass because they exercise the executor directly with synthetic operations. Mechanism: hold the first storeModel.ProcessMessageAsync call on a gate, cancel the caller mid-flight, release the gate so the in-flight attempt fails transiently, and assert the retry policy drives a second ProcessMessageAsync invocation. Verified by temporarily reverting the lambda to caller-token passthrough -- both tests fail with timeout-on-second-invocation. Restored to detached wiring -- both pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addressed the test-coverage gap from comment r3204019960 in commit 107ef6e. Added
If anyone reverts the lambda back to passing 22/22 Skipping the other three threads as nits/observations per evaluation. |
kundadebdatta left a comment
A few questions; please also add more test coverage.
The previous comment incorrectly attributed CDX1000 to boxing. Exception is a reference type so no boxing occurs in either form. The DontConvertExceptionToObject analyzer flags type-information loss when typed Exception references are converted to object — that is the actual concern the comment now describes. Addresses @xinlian12 review comment 3204019954 on PR #5844. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review feedback addressed
Pushing two commits to address all open review feedback. Detailed inline replies are posted on each thread.
Code changes
Discussion responses
Local verification
Re-review requested. Thanks!
    {
        // Pass CancellationToken.None: the retry policy must not be canceled
        // mid-decision. This is the entire point of the detached design.
        shouldRetry = await retryPolicy

Question: why not just use something like the following? Why do the try/catch here?

    CancellationTokenSource detachedCts = new CancellationTokenSource(internalDeadline);
    Task<T> detachedTask = TaskHelper.InlineIfPossible(
        () => operation(detachedCts.Token),
        retryPolicy,
        detachedCts.Token);

Another question: why do we have to compute an internal-deadline cancellation token at all? Why not use CancellationToken.None?
Since the values configured here are so high, the metadata read will probably complete its original loop anyway, but I'm curious why.

Great question. The simplification works for the headline bug (caller CT no longer reaches the retry loop), but the executor is enforcing four contracts, and TaskHelper.InlineIfPossible → BackoffRetryUtility<T>.ExecuteAsync (TaskHelper.cs:66-69) only preserves the first:
- ✅ Caller CT ∉ retry loop: both approaches give us this.
- ❌ ShouldRetryAsync always invoked with CancellationToken.None. The custom loop hard-codes this at line 293 with the comment "the retry policy must not be canceled mid-decision. This is the entire point of the detached design." BackoffRetryUtility doesn't make that guarantee: when detachedCts trips at the 300s mark, the policy can be preempted mid-decision the same way the caller CT used to preempt it (top-of-loop ThrowIfCancellationRequested fires before ShouldRetryAsync even runs).
- ❌ Deadline-trip surfaces the underlying failure, not a generic OCE. With the simplification, a customer hitting the ceiling sees OperationCanceledException at attempt N. With the explicit loop (filters at lines 260-280 + 334-340), they see the actual ServiceUnavailableException/HttpRequestException from the region that was failing. That's the diagnostics value: without it, a deadline trip looks like "the SDK gave up" rather than "region X kept failing for 300s".
- ❌ MaxAttemptsHardCap = 200: BackoffRetryUtility has no attempt cap, so a misbehaving policy returning ShouldRetry=true with BackoffTime=Zero would burn CPU until the time-based deadline. The cap bounds the burst rate.

~30 LOC to buy invariants 2-4. The XML doc at the top of MetadataDetachedExecutor covers this in prose, but I take your point that the rationale isn't obvious from the loop body itself. Happy to add a 4-bullet invariants comment at the top of ExecuteRetryLoopAsync in a follow-up if you'd like.
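For readers following the thread, the detach boundary itself (the first contract above) can be sketched like this. It is a minimal illustration under stated assumptions, not the SDK's MetadataDetachedExecutor: the class and method names are hypothetical, there is no retry loop, and error handling is reduced to the essentials.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of the detach boundary; not the SDK implementation.
public static class DetachedSketch
{
    public static async Task<T> RunDetachedAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        TimeSpan internalDeadline,
        CancellationToken callerToken)
    {
        // Entry-side gate: an already-canceled caller starts no work.
        callerToken.ThrowIfCancellationRequested();

        // The operation only ever sees this internally bounded token.
        CancellationTokenSource detachedCts = new CancellationTokenSource(internalDeadline);
        Task<T> detachedTask = operation(detachedCts.Token);

        // Dispose the CTS when the detached work ends, even if the caller
        // has already walked away. Do not touch detachedCts after this line:
        // the continuation may run inline if the task is already complete.
        _ = detachedTask.ContinueWith(
            t => { _ = t.Exception; detachedCts.Dispose(); },
            CancellationToken.None,
            TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);

        // Fast path: a token that can never cancel needs no race.
        if (!callerToken.CanBeCanceled)
        {
            return await detachedTask.ConfigureAwait(false);
        }

        // Response path: race the detached work against caller cancellation.
        TaskCompletionSource<bool> callerCanceled =
            new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        using (callerToken.Register(() => callerCanceled.TrySetResult(true)))
        {
            Task winner = await Task.WhenAny(detachedTask, callerCanceled.Task).ConfigureAwait(false);
            if (winner != detachedTask)
            {
                // Caller gets its OCE; the detached task keeps running.
                throw new OperationCanceledException(callerToken);
            }
        }

        return await detachedTask.ConfigureAwait(false);
    }
}
```

The caller's token is checked once on entry and then only raced against the result, so nothing inside the operation can be interrupted by it.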

Defense-in-depth backstop: the value is intentionally high enough to almost never trip in production. The attempt cap (200) handles "tight loop with zero backoff," but it does not handle two pathologies:
- Single attempt that hangs forever, e.g., a future regression in the HttpTimeoutPolicy ladder, or a custom IDocumentClientRetryPolicy that swallows OCE. Attempt 1 never returns → attempt cap never advances → leaked task forever.
- Diagnostics-mutation horizon: per the caveat at MetadataDetachedExecutor.cs:78-90, the detached task continues mutating the caller's ITrace and ClientSideRequestStatistics until something stops it. Without a deadline, "something" is only the policy itself.

With CancellationToken.None, one future code change = process-lifetime fire-and-forget task pinning HTTP connection-pool slots and growing a Trace whose owner has long since GC'd its references. The deadline guarantees that some upper bound exists.
300s is grounded in the actual ladder math (HttpTimeoutPolicyControlPlaneRetriableHotPath ≈ 72s/region × ~5 preferred regions ≈ 360s ceiling, rounded down with ClientRetryPolicy.RetryIntervalInMS = 1000ms per failover). Tunable via AZURE_COSMOS_METADATA_DETACHED_HARD_DEADLINE_SECONDS (clamped [30s, 86400s]) for ops who want to dial it.
So: yes, in steady-state behavior identical to CancellationToken.None. The deadline only matters when something else is already broken, which is exactly when you want a backstop.
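The ladder arithmetic behind the 300 s figure, written out from the numbers quoted in this PR (three attempts with 1 s, 5 s and 65 s timeouts plus a single 1 s inter-attempt delay). The region count $N$ is the only free parameter, and the per-failover ClientRetryPolicy.RetryIntervalInMS delay is left out for simplicity:

$$
\begin{aligned}
T_{\text{region}} &= (1 + 5 + 65)\,\text{s} + 1\,\text{s} = 72\,\text{s} \\
T_{\text{sweep}}(N) &= N \cdot T_{\text{region}}, \qquad T_{\text{sweep}}(3) = 216\,\text{s}, \quad T_{\text{sweep}}(5) = 360\,\text{s}
\end{aligned}
$$

Under these assumptions the 300 s default sits between the three-region and five-region worst cases: it covers four full regions (288 s) but not a fifth.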
In the PR description: "PartitionKeyRangeCache: its BackoffRetryUtility usage does not thread caller CT, so no parity fix needed there." Does this mean that even if the customer passes a cancellation token, the SDK will not honor it, so the customer's request can in theory run past their configured CT? If so, that sounds like a bug. Also, have we validated the query plan path?
kushagraThapar left a comment
Requesting changes for changelog entry.
Adds a new Unreleased Preview section (per the pattern established in PR #5815) with a Fixed entry for PR #5844. Customer-facing description focuses on the symptom (premature OperationCanceledException preempting cross-region failover during metadata-cache reads) rather than the implementation detail (the new MetadataDetachedExecutor). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Added in commit Entry placed under Happy to adjust the wording, section choice, or move it into |
Three good questions; the PR description's framing was slightly imprecise and I want to clear it up. In order:
1. "Customer's CT is silently ignored, sounds like a bug?"
Yes, but it's a separate, pre-existing gap, not something this PR changes. The PR description's bullet was technically true but understated the cause. The actual structure:
2. "Did you validate the query plan path?"
Partially. The query plan path is structurally vulnerable to the same bug class, and PR #5844 does NOT fix it:
The PR description listed a deferred test for the Linux Query Plan path. Your push is fair: if the bug class exists there, that should be elevated from a missing test to a potentially-missing fix.
3. "What is the PK range scenario in this PR?"
There isn't one, and that's the intentional scope. PR #5844 only fixes the collection metadata read path (
Net
Filed #5862 to track both. The query-plan one is the more urgent of the two (because caller CT does reach the retry path there); the PKRange one requires a public-API-shape change so probably needs a separate design pass. Happy to tighten the PR description wording on the PKRange bullet too. Want me to push that as a small follow-up commit, or is the resolution in this thread sufficient?
…Executor for Java parity

Generalizes MetadataDetachedExecutor with a no-retry-loop ExecuteDetachedAsync overload and wires QueryPlanRetriever.GetQueryPlanThroughGatewayAsync through it. The internal RequestInvokerHandler pipeline keeps its own retry semantics; the wrap only ensures those decisions cannot be preempted by caller CancellationToken, mirroring Java's error-signal-only retryWhen contract.
- MetadataDetachedExecutor.cs: add ExecuteDetachedAsync; refactor ExecuteAsync to compose on top of it; expanded XML doc covering both overloads and the Java alignment matrix (PKRange + GatewayAccountReader already aligned).
- QueryPlanRetriever.GetQueryPlanThroughGatewayAsync: route the gateway call through ExecuteDetachedAsync so caller CT is observed only on the response path. GetQueryPlanWithServiceInteropAsync is unchanged (local CPU work, not a metadata retry path).
- MetadataDetachedExecutorTests: 9 new tests covering ExecuteDetachedAsync invariants (success, no outer retry, mid-flight cancel detaches, internal deadline, sync factory throw, null operation/task validation, fast path).
- changelog.md: expand Unreleased Preview entry to mention query-plan path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kundadebdatta @xinlian12: heads up that I extended this PR in commit
TL;DR
Java's metadata-read paths are structurally immune to the bug class this PR closes (#5805), thanks to three independent layers:
After this PR, .NET matches the same outcome via
What changed in
…data-detached # Conflicts: # changelog.md
…ian/metadata-detached # Conflicts: # .gitignore # changelog.md
…into pr-5844 # Conflicts: # changelog.md
Resolves changelog.md conflict: keeps PR 5844 entry in Unreleased; drops 5870 entry now released in 3.60.0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses the findings produced by the PR Deep Reviewer on PR #5920 (excluding Finding #1, which assumed PR #5844 `MetadataDetachedExecutor` would merge into main; per author guidance, this design proceeds without it).

Structural changes
- §5.3 `ExecuteAsync` rewritten: per-branch CancellationTokenSources (no shared linkedCts) so the loser's OperationCanceledException is contained inside `BackgroundCleanupAsync` and cannot reach `MetadataRequestThrottleRetryPolicy` (Finding #2; protects healthy secondary from spurious `MarkEndpointUnavailableForRead` post-PR #5780).
- §5.3: primary fault before threshold no longer bypasses hedge. Adds `primaryTask.Status == RanToCompletion` guard so fast-fail-on-degraded primary triggers the hedge (Finding #3).
- §5.3: wait-for-winner is now a loop that filters transient/faulted completions; a fast 503 from the hedge can no longer beat a healthy 200 from the primary (Finding #4).
- §5.3 + §5.4 + §5.5: `as` cast (not hard cast) for `MetadataRequestThrottleRetryPolicy`; wrapped/test-double policies no longer throw `InvalidCastException` (Finding #7).
- §5.3: added `BackgroundCleanupAsync` that awaits the loser, disposes its `DocumentServiceResponse` body (handle-leak fix), records outcome via volatile field, and disposes the loser CTS (Finding #11).

Correctness/factual fixes
- §5.10 + §5.6: corrected. `ClientCollectionCache` uses `AsyncCache` (not `AsyncCacheNonBlocking`); base-class abstract signature change must be defaulted for subclass compat; forbid inferring cold-start from `previousValue == null` inside the factory (Finding #5).
- §5.2 + §6.1: added `HasHedgedThisOperation` flag (set via `Interlocked.Exchange`); fixes the broken §6.1 claim that retries wouldn't re-hedge because the cache had a `previousValue` (false: cache is only populated when the loop exits) (Finding #8).
- §5.2: `ConcurrentDictionary<Uri, byte>` replaces `HashSet<Uri>`; volatile `LoserOutcome` field for cross-thread updates (Finding #6).
- §5.9: added `HttpTimeoutPolicy.FirstAttemptTimeout` accessor design. `TimeoutsAndDelays` is private today (Finding #9).
- §5.7.4: sketched the per-index resolve loop for `IncrementRetryIndexOnUnavailableEndpointForMetadataRead`. Today it's a 1-line counter that never resolves an endpoint (Finding #10).

New sections
- §5.7 (4 subsections): coordination with PR #5780, structural invariant that hedge-loser OCE never reaches retry policy, shared `RetryUtility.IsRegionalFailure` helper, attempted-endpoints skip loop.
- §5.12: net472 stack-unwind discipline (`SendOneAsync` middle-layer seam + `ExceptionDispatchInfo`); adopts the PR #5870 lesson (Finding #12).
- §5.13: per-auth-mode handling in `CloneForHedge`; hedge-401/403 guard for RBAC-role-assignment-missing-in-secondary case (Finding #13).
- §7.1: wiring step for `isHedgingDisabledByGateway` from `DocumentClient` into the cache constructors via `Func<bool>` (Finding #15 bundle).
- §9.1: `EventSource`/`Meter` counters for fire-rate, win-rate, budget-exhaustion, late-loser, hedge-fired-elapsed-ms (Finding #15 bundle).

API/rollout
- §5.1: `EnableMetadataHedgingForColdStart` becomes tri-state `bool?`; `MetadataHedgingOptions` promoted to public so customers can tune `PerClientConcurrencyBudget` for high-container-cardinality startups (Finding #14).
- §12: Phase 3 no longer removes the opt-in (binary break avoided); only the phase default changes (Finding #14).

Smaller items (Finding #15 bundle)
- §5.3: drop `closest secondary` framing (SDK has no proximity measure); use `Wait(TimeSpan.Zero)` instead of `WaitAsync(TimeSpan.Zero)` (no Task allocation); add `EvaluateEligibility`-vs-budget-check note.
- §5.4: defaulted `isColdStart = false` on the abstract method to avoid breaking subclass overrides (e.g., encryption-mirrored caches).
- §6: added eligibility rules 8 (`ExcludeRegions` hard filter), 9 (`HasHedgedThisOperation`), 10 (single-master account guard).
- §10: reconciled `Both branches fault` with §5.3 (consistent `ExceptionDispatchInfo` semantics).
- §11: tests added for loser-cancellation-doesn't-poison-secondary, loser-disposal, no-re-hedge-across-retries, cross-policy-type, net472 SO regression; mirrors PR #5787 `senderCallCount` assertions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Closes the metadata-cancellation race that lets a caller-side CancellationToken timeout silently preempt a cross-region failover decision in IDocumentClientRetryPolicy.ShouldRetryAsync, for both metadata-cache reads and the gateway query-plan path.

Customers occasionally see CosmosOperationCanceledException when their CancellationToken deadline lines up with the SDK's control-plane HTTP timeout policy ladder. At that boundary BackoffRetryUtility<T>.ExecuteAsync's iteration-top ThrowIfCancellationRequested fires before ShouldRetryAsync runs, so the SDK never gets to mark the failing region or fail over.

PR #5806 added a 10 s grace window that narrowed the bug. PR #5828 spike validated that fully decoupling caller cancellation from the retry loop closes it. This PR ships that approach as the production fix and supersedes both, and extends it to the gateway query-plan path so the .NET pipeline gets the same structural guarantee Java has had from its inception (Reactor retryWhen is error-signal-only).

Issue: #5805. Closes #5806, closes #5828. Tracking follow-up alignment audit: #5862.
What changed
MetadataDetachedExecutor (Microsoft.Azure.Cosmos/src/MetadataDetachedExecutor.cs)
Two overloads on one isolation primitive:
- ExecuteAsync<T>(operation, retryPolicy, callerCancellationToken): runs the operation inside a self-contained retry loop driven by the supplied IDocumentClientRetryPolicy. Used where the leaf operation has no inner pipeline retry (e.g. ClientCollectionCache.ReadCollectionAsync, which calls storeModel.ProcessMessageAsync directly).
- ExecuteDetachedAsync<T>(operation, callerCancellationToken): provides only the detach + caller-CT-on-response-path isolation, with no outer retry loop. Used where the operation already runs through the standard request pipeline (RequestInvokerHandler → BackoffRetryUtility → ClientRetryPolicy) and therefore has its own retry semantics (e.g. the gateway query-plan request). The wrap only ensures the pipeline's retry decisions cannot be preempted by caller cancellation.

Both overloads share the same core isolation contract:
- Runs on a detached CancellationTokenSource bounded only by an SDK-internal hard deadline. Caller CancellationToken never enters this scope.
- ExecuteAsync: always calls policy.ShouldRetryAsync(ex, CancellationToken.None) so cross-region failover decisions and their side-effects (LocationCache region marking, ClearingSessionContainerClientRetryPolicy session clearing, HTTP connection-pool warming) run to completion.
- Observes the caller's CancellationToken only on the response path via Task.WhenAny(detachedTask, callerCancellationTcs). Caller cancel → caller surfaces OCE; detached task continues.
- Skips the Task.WhenAny scaffolding when callerCT.CanBeCanceled == false.
- Bounded by MaxAttemptsHardCap (for the retry-loop overload) and a configurable hard deadline (env var AZURE_COSMOS_METADATA_DETACHED_HARD_DEADLINE_SECONDS, clamped into [30 s, 86400 s]).

ExecuteAsync is internally implemented in terms of ExecuteDetachedAsync, so there is exactly one detach primitive in the codebase.

Wire-ups
- ClientCollectionCache (Microsoft.Azure.Cosmos/src/Routing/ClientCollectionCache.cs):
  - GetByRidAsync → MetadataDetachedExecutor.ExecuteAsync (retry-loop overload, with ClearingSessionContainerClientRetryPolicy wrapping the standard ClientRetryPolicy).
  - GetByNameAsync → same.
  - Routed through TaskHelper.RunInlineIfNeededAsync to preserve NETFX SynchronizationContext safety.
- QueryPlanRetriever (Microsoft.Azure.Cosmos/src/Query/Core/QueryPlan/QueryPlanRetriever.cs):
  - GetQueryPlanThroughGatewayAsync → MetadataDetachedExecutor.ExecuteDetachedAsync (no outer retry loop; RequestInvokerHandler provides its own retry through BackoffRetryUtility + ClientRetryPolicy).
  - GetQueryPlanWithServiceInteropAsync is intentionally unchanged: that path is local CPU work via service-interop P/Invoke, not a metadata retry path.

Java alignment
This PR closes the cross-SDK alignment gap with Java for both metadata-cache reads and the gateway query-plan path. Investigation artifact at ~/.copilot/session-state/<session>/files/pr5844-pkrange-account-investigation.md; key findings:

| Path | .NET | Java |
|---|---|---|
| ClientCollectionCache.GetBy{Rid,Name}Async | MetadataDetachedExecutor.ExecuteAsync | RxClientCollectionCache: no CT in API; BackoffRetryUtility.executeRetry → Reactor retryWhen (error-signal-only) |
| QueryPlanRetriever.GetQueryPlanThroughGatewayAsync | MetadataDetachedExecutor.ExecuteDetachedAsync | retryWhen + executeFeedOperationWithAvailabilityStrategy: no CT preemption vector |
| PartitionKeyRangeCache.* | No CancellationToken at any layer; internal BackoffRetryUtility uses the 2-arg overload (CancellationToken.None) | RxPartitionKeyRangeCache + AsyncCacheNonBlocking.getAsync with Mono.fromFuture(..., suppressCancel=true) |
| GatewayAccountReader + GlobalEndpointManager | this.cancellationTokenSource.Token only (canceled on Dispose); HTTP call passes cancellationToken: default | GLOBAL_ENDPOINT_MANAGER_BOUNDED_ELASTIC scheduler + Flux.concatDelayError regional sweep |

Mechanism: Java's BackoffRetryUtility.executeRetry uses Mono.defer(...).retryWhen(Retry.withThrowable(RetryUtils.toRetryWhenFunc(policy))). Reactor's retryWhen is an error-signal-only operator: policy.shouldRetry(e) is called unconditionally on every onError; downstream cancellation is a separate cancel() signal that bypasses retryWhen entirely. And ClientRetryPolicy.shouldRetry(Exception e) takes no token in Java.

MetadataDetachedExecutor is the .NET equivalent: a per-call detach boundary that makes the .NET imperative retry loop behave like Java's error-signal-only reactive retry, with the caller's CancellationToken observable only on the response path.

Magic numbers: derivations (grounded in actual SDK retry policies)
- DefaultMetadataDetachedHardDeadlineInSeconds = 300 s: ClientCollectionCache.ReadCollectionAsync routes via HttpTimeoutPolicy.GetTimeoutPolicy to HttpTimeoutPolicyControlPlaneRetriableHotPath with ladder (1 s, 0) → (5 s, 1 s) → (65 s, 0) = 71 s timeouts + 1 s inter-attempt delay ≈ 72 s/region. A typical cross-region failover sweep visits ~3-5 regions, so ~3-5 × 72 s ≈ 216 s to 360 s, plus ClientRetryPolicy.RetryIntervalInMS = 1000 ms per failover. 300 s covers the common-case multi-region failover with margin. The gateway query-plan path uses the same RequestInvokerHandler pipeline and therefore the same ladder.
- MaxMetadataDetachedHardDeadlineInSeconds = 86400 s (24 h): well below CancellationTokenSource(TimeSpan)'s ~uint.MaxValue-1 ms (~49.7 days) overflow point and far above any realistic metadata-read budget. Without this clamp, an unbounded user value would throw ArgumentOutOfRangeException at every metadata read.
- MaxAttemptsHardCap = 200: the dominant per-call retry ceiling is ClientRetryPolicy.MaxRetryCount = 120 (cross-region failover counter, ClientRetryPolicy.cs:24). On top of that, MaxServiceUnavailableRetryCount = 1, MaxSessionTokenRetryCount = 2, plus the default ResourceThrottleRetryPolicy budget (~9 retries) can stack. 200 = 120 + ~80 headroom for stacked retries. Defensive only; a well-behaved ClientRetryPolicy trips its own 120-retry limit before this cap is reached. The cap protects against a misbehaving policy returning ShouldRetry=true with BackoffTime=TimeSpan.Zero in a tight loop, which the time-based deadline alone cannot prevent without burning CPU. The detach-only overload (ExecuteDetachedAsync) does not run an outer retry loop and therefore has no cap.

Out of scope
- PartitionKeyRangeCache: already structurally aligned with Java. The public API surface (TryGetOverlappingRangesAsync, TryLookupAsync, etc.) takes no CancellationToken at any layer, and the internal BackoffRetryUtility<T>.ExecuteAsync invocation uses the 2-arg overload (no CT). There is no caller-CT vector by which the retry policy could be preempted. No code change required for parity. Tracking issue: "Metadata retry: extend detached cancellation pattern to PartitionKeyRangeCache and Query Plan retrieval" (#5862).
- AddressCache: intentionally unchanged.
- GatewayAccountReader / GlobalEndpointManager: already detached from caller CT pre-PR; mirrors Java's dedicated GLOBAL_ENDPOINT_MANAGER_BOUNDED_ELASTIC scheduler.
- GetQueryPlanWithServiceInteropAsync: local CPU work via P/Invoke, not a metadata retry path.

Tests
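The primary isolation invariant pinned by the tests in this section (caller cancellation surfaces an OperationCanceledException while the detached operation keeps running) can be illustrated with a minimal sketch. RunDetachedAsync is a hypothetical helper written for this illustration and is not the SDK's MetadataDetachedExecutor; it has no retry loop, and it assumes .NET 6 or later for Task.WaitAsync.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not the SDK implementation): runs the operation on an
// internally bounded token; the caller's token can only abandon the wait.
static async Task<T> RunDetachedAsync<T>(
    Func<CancellationToken, Task<T>> operation,
    TimeSpan hardDeadline,
    CancellationToken callerToken)
{
    callerToken.ThrowIfCancellationRequested();   // entry-side gate

    var detachedCts = new CancellationTokenSource(hardDeadline);
    Task<T> detached = operation(detachedCts.Token);

    // Dispose the CTS when the detached work ends, not when the caller gives up.
    _ = detached.ContinueWith(t => detachedCts.Dispose(), TaskScheduler.Default);

    // Response path: cancelling callerToken ends this wait, not the operation.
    return await detached.WaitAsync(callerToken);
}

var operationFinished = new TaskCompletionSource<bool>();
using var callerCts = new CancellationTokenSource();

Task<string> call = RunDetachedAsync(
    async detachedToken =>
    {
        await Task.Delay(TimeSpan.FromMilliseconds(300), detachedToken);
        operationFinished.SetResult(true);        // reached after the caller has cancelled
        return "collection-metadata";
    },
    hardDeadline: TimeSpan.FromSeconds(5),
    callerToken: callerCts.Token);

callerCts.CancelAfter(TimeSpan.FromMilliseconds(50));

bool callerSawCancellation = false;
try
{
    await call;
}
catch (OperationCanceledException)
{
    callerSawCancellation = true;
}

bool detachedCompleted = await operationFinished.Task;
Console.WriteLine($"caller cancelled: {callerSawCancellation}, detached work completed: {detachedCompleted}");
```

The caller's token is checked once at entry and then only affects the wait on the result, so the operation never receives it. The internal deadline is the only bound on how long the detached work can run.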
- MetadataDetachedExecutorTests pass ([DoNotParallelize] to avoid cross-test interference).
  - ExecuteAsync (retry-loop) overload: see "Review history" for the full list and additions during deep-review iterations.
  - ExecuteDetachedAsync:
    - ExecuteDetachedAsync_SucceedsFirstAttempt
    - ExecuteDetachedAsync_OperationFaults_NoOuterRetry (pins the no-retry-loop contract)
    - ExecuteDetachedAsync_CallerCancelMidFlight_SurfacesOCE_DetachedOperationKeepsRunning (primary isolation invariant)
    - ExecuteDetachedAsync_AlreadyCancelledCallerToken_ThrowsBeforeOperation
    - ExecuteDetachedAsync_InternalDeadlineTripsDuringOperation_SurfacesOCE
    - ExecuteDetachedAsync_NullOperation_Throws
    - ExecuteDetachedAsync_OperationFactoryThrowsSync_PropagatesAndDoesNotHang
    - ExecuteDetachedAsync_OperationReturnsNullTask_Throws
    - ExecuteDetachedAsync_CancellationTokenNone_SucceedsAndOperationReceivesNonCanceledToken
- ClientCollectionCacheDetachedWiringTests pass.
- QueryPlanRetriever tests + 36/36 query-pipeline / thin-client tests pass; no regression on the query plan path.
- CollectionCache | ClientRetry | ConfigurationManager pass.

Follow-up integration tests (deferred to emulator CI iteration)
ClientCollectionCache and gateway query-plan paths.

Review history
- code-review agent): 0 blocker, 1 major + 3 minor + 1 nit. Resolved in 22cbcf428.
- 09bd57e66: top-of-loop OCE-due-to-detached-token guard surfaces the underlying exception when the deadline trips during the operation lambda; AsyncCache caveat reworded; GetMetadataDetachedHardDeadline made internal; Task.Yield on zero-backoff branch; CTS lifetime comment; [DoNotParallelize] on test class.
- 8b1ec7982: comment on OCE filter asymmetry, per-call retry-policy invariant doc in <summary>, deadline test stability bumped to 2 s for CI.
- c3f092849 and 07d94b6a4:
  - MetadataDetachedExecutor.cs:304: replaced object.ReferenceEquals(Exception, Exception) (which takes both args as object) with reference equality == on typed Exception locals. The analyzer was unset locally during the prior review iterations because of a duplicate .globalconfig between repo root and the worktree, which is why CI caught it after 3 clean local builds.
  - MaxAttemptsHardCap re-grounded against actual retry policies. Previous derivation ("5 regions × 10 in-region = 50") was hand-waved and below ClientRetryPolicy.MaxRetryCount = 120; cap bumped to 200 with the derivation citing the real source constants.
  - DefaultMetadataDetachedHardDeadlineInSeconds doc corrected: old comment cited HotPath but used the ControlPlaneRead ladder math (35 s); now correctly cites the 72 s/region HotPath ladder and the ClientCollectionCache.ReadCollectionAsync call site.
- TaskHelper.InlineIfPossible and the internal-deadline choice addressed in PR thread; Kushagra's changelog request fulfilled in 1c4fb4482.
- 60c2c77dc): per Annie's follow-up question on broader CT-preemption coverage. Java SDK confirmed structurally immune to the bug class via Reactor retryWhen (error-signal-only) + no caller CT in metadata APIs + Mono.fromFuture(..., suppressCancel=true) for PKRange. The .NET gap was the gateway query-plan path; closed in this commit. PKRange and account-info paths confirmed already aligned.
Investigation artifact preserved in session workspace.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>