Skip to content

Routing: Adds detached metadata executor decoupling caller cancellation from cross-region failover#5844

Open
NaluTripician wants to merge 18 commits into
mainfrom
users/ntripician/metadata-detached
Open

Routing: Adds detached metadata executor decoupling caller cancellation from cross-region failover#5844
NaluTripician wants to merge 18 commits into
mainfrom
users/ntripician/metadata-detached

Conversation

@NaluTripician
Copy link
Copy Markdown
Contributor

@NaluTripician NaluTripician commented May 6, 2026

Summary

Closes the metadata-cancellation race that lets a caller-side CancellationToken timeout silently preempt a cross-region failover decision in IDocumentClientRetryPolicy.ShouldRetryAsync, for both metadata-cache reads and the gateway query-plan path.

Customers occasionally see CosmosOperationCanceledException when their CancellationToken deadline lines up with the SDK's control-plane HTTP timeout policy ladder. At that boundary BackoffRetryUtility<T>.ExecuteAsync's iteration-top ThrowIfCancellationRequested fires before ShouldRetryAsync runs, so the SDK never gets to mark the failing region or fail over.

PR #5806 added a 10 s grace window that shrunk the bug. PR #5828 spike validated that fully decoupling caller cancellation from the retry loop closes it. This PR ships that approach as the production fix and supersedes both, and extends it to the gateway query-plan path so the .NET pipeline gets the same structural guarantee Java has had from its inception (Reactor retryWhen is error-signal-only).

Issue: #5805. Closes #5806, closes #5828. Tracking follow-up alignment audit: #5862.

What changed

MetadataDetachedExecutor (Microsoft.Azure.Cosmos/src/MetadataDetachedExecutor.cs)

Two overloads on one isolation primitive:

  1. ExecuteAsync<T>(operation, retryPolicy, callerCancellationToken) — runs the operation inside a self-contained retry loop driven by the supplied IDocumentClientRetryPolicy. Used where the leaf operation has no inner pipeline retry (e.g. ClientCollectionCache.ReadCollectionAsync which calls storeModel.ProcessMessageAsync directly).
  2. ExecuteDetachedAsync<T>(operation, callerCancellationToken) — provides only the detach + caller-CT-on-response-path isolation, with no outer retry loop. Used where the operation already runs through the standard request pipeline (RequestInvokerHandlerBackoffRetryUtilityClientRetryPolicy) and therefore has its own retry semantics (e.g. the gateway query-plan request). The wrap only ensures the pipeline's retry decisions cannot be preempted by caller cancellation.

Both overloads share the same core isolation contract:

  • Retry / pipeline retry runs on an executor-owned CancellationTokenSource bounded only by an SDK-internal hard deadline. Caller CancellationToken never enters this scope.
  • For ExecuteAsync: always calls policy.ShouldRetryAsync(ex, CancellationToken.None) so cross-region failover decisions and their side-effects (LocationCache region marking, ClearingSessionContainerClientRetryPolicy session clearing, HTTP connection-pool warming) run to completion.
  • Observes the caller CancellationToken only on the response path via Task.WhenAny(detachedTask, callerCancellationTcs). Caller cancel → caller surfaces OCE; detached task continues.
  • Fast path skips Task.WhenAny scaffolding when callerCT.CanBeCanceled == false.
  • Defensive: MaxAttemptsHardCap (for the retry-loop overload) and a configurable hard deadline (env var AZURE_COSMOS_METADATA_DETACHED_HARD_DEADLINE_SECONDS, clamped into [30 s, 86400 s]).

ExecuteAsync is internally implemented in terms of ExecuteDetachedAsync, so there is exactly one detach primitive in the codebase.

Wire-ups

ClientCollectionCache (Microsoft.Azure.Cosmos/src/Routing/ClientCollectionCache.cs):

  • GetByRidAsyncMetadataDetachedExecutor.ExecuteAsync (retry-loop overload, with ClearingSessionContainerClientRetryPolicy wrapping the standard ClientRetryPolicy).
  • GetByNameAsync → same.

Routed through TaskHelper.RunInlineIfNeededAsync to preserve NETFX SynchronizationContext safety.

QueryPlanRetriever (Microsoft.Azure.Cosmos/src/Query/Core/QueryPlan/QueryPlanRetriever.cs):

  • GetQueryPlanThroughGatewayAsyncMetadataDetachedExecutor.ExecuteDetachedAsync (no outer retry loop; RequestInvokerHandler provides its own retry through BackoffRetryUtility + ClientRetryPolicy).
  • GetQueryPlanWithServiceInteropAsync is intentionally unchanged — that path is local CPU work via service-interop P/Invoke, not a metadata retry path.

Java alignment

This PR closes the cross-SDK alignment gap with Java for both metadata-cache reads and the gateway query-plan path. Investigation artifact at ~/.copilot/session-state/<session>/files/pr5844-pkrange-account-investigation.md; key findings:

Metadata path .NET post-PR Java Aligned?
Collection cache (ClientCollectionCache.GetBy{Rid,Name}Async) Detached via MetadataDetachedExecutor.ExecuteAsync RxClientCollectionCache — no CT in API; BackoffRetryUtility.executeRetry → Reactor retryWhen (error-signal-only) ✅ Yes
Query plan gateway (QueryPlanRetriever.GetQueryPlanThroughGatewayAsync) Detached via MetadataDetachedExecutor.ExecuteDetachedAsync Same Reactor retryWhen + executeFeedOperationWithAvailabilityStrategy — no CT preemption vector ✅ Yes
PKRange cache (PartitionKeyRangeCache.*) Unchanged — API surface takes no CancellationToken at any layer; internal BackoffRetryUtility uses the 2-arg overload (CancellationToken.None) RxPartitionKeyRangeCache + AsyncCacheNonBlocking.getAsync with Mono.fromFuture(..., suppressCancel=true) ✅ Yes (structural; no caller-CT vector exists in either SDK)
Account info (GatewayAccountReader + GlobalEndpointManager) Unchanged — already detached from caller CT (uses this.cancellationTokenSource.Token only, canceled on Dispose); HTTP call passes cancellationToken: default Dedicated GLOBAL_ENDPOINT_MANAGER_BOUNDED_ELASTIC scheduler + Flux.concatDelayError regional sweep ✅ Yes (both stronger than collection-cache path; pre-existing)

Mechanism: Java's BackoffRetryUtility.executeRetry uses Mono.defer(...).retryWhen(Retry.withThrowable(RetryUtils.toRetryWhenFunc(policy))). Reactor's retryWhen is an error-signal-only operator — policy.shouldRetry(e) is called unconditionally on every onError; downstream cancellation is a separate cancel() signal that bypasses retryWhen entirely. And ClientRetryPolicy.shouldRetry(Exception e) takes no token in Java. MetadataDetachedExecutor is the .NET equivalent: a per-call detach boundary that makes the .NET imperative retry loop behave like Java's error-signal-only reactive retry, with the caller's CancellationToken observable only on the response path.

Magic numbers — derivations (grounded in actual SDK retry policies)

  • Default hard deadline = 300 s. Sized for the heavier of the two wrap points: ClientCollectionCache.ReadCollectionAsync routes via HttpTimeoutPolicy.GetTimeoutPolicy to HttpTimeoutPolicyControlPlaneRetriableHotPath with ladder (1 s, 0) → (5 s, 1 s) → (65 s, 0) = 71 s timeouts + 1 s inter-attempt delay ≈ 72 s/region. A typical cross-region failover sweep visits ~3-5 regions, so ~3-5 × 72 ≈ 215 s to 360 s + ClientRetryPolicy.RetryIntervalInMS = 1000 ms per failover. 300 s covers the common-case multi-region failover with margin. The gateway query-plan path uses the same RequestInvokerHandler pipeline and therefore the same ladder.
  • Min clamp = 30 s. Below this, the deadline cuts inside the first HTTP ladder rung, defeating the purpose of the executor.
  • Max clamp = 86400 s (24 h). Below CancellationTokenSource(TimeSpan)'s ~uint.MaxValue-1 ms (~49.7 days) overflow point and far above any realistic metadata-read budget. Without this clamp, an unbounded user value would throw ArgumentOutOfRangeException at every metadata read.
  • Attempt cap = 200 (retry-loop overload only). The dominant per-call retry ceiling is ClientRetryPolicy.MaxRetryCount = 120 (cross-region failover counter, ClientRetryPolicy.cs:24). On top of that, MaxServiceUnavailableRetryCount = 1, MaxSessionTokenRetryCount = 2, plus the default ResourceThrottleRetryPolicy budget (~9 retries) can stack. 200 = 120 + ~80 headroom for stacked retries. Defensive only; a well-behaved ClientRetryPolicy trips its own 120-retry limit before this cap is reached. The cap protects against a misbehaving policy returning ShouldRetry=true with BackoffTime=TimeSpan.Zero in a tight loop, which the time-based deadline alone cannot prevent without burning CPU. The detach-only overload (ExecuteDetachedAsync) does not run an outer retry loop and therefore has no cap.

Note: the slower HttpTimeoutPolicyControlPlaneRead ladder (5+10+20 = 35 s/region) used by GatewayAccountReader does not apply here — the account-discovery path is already detached from caller CT pre-existingly.

Out of scope

  • Metadata Hedging (separate work item).
  • PartitionKeyRangeCache: already structurally aligned with Java. The public API surface (TryGetOverlappingRangesAsync, TryLookupAsync, etc.) takes no CancellationToken at any layer, and the internal BackoffRetryUtility<T>.ExecuteAsync invocation uses the 2-arg overload (no CT). There is no caller-CT vector by which the retry policy could be preempted. No code change required for parity. Tracking issue: Metadata retry: extend detached cancellation pattern to PartitionKeyRangeCache and Query Plan retrieval #5862.
  • AddressCache: intentionally unchanged.
  • GatewayAccountReader / GlobalEndpointManager: already detached from caller CT pre-PR; mirrors Java's dedicated GLOBAL_ENDPOINT_MANAGER_BOUNDED_ELASTIC scheduler.
  • GetQueryPlanWithServiceInteropAsync: local CPU work via P/Invoke, not a metadata retry path.
  • HTTP timeout escalation tuning.

Tests

  • 30/30 unit tests in MetadataDetachedExecutorTests pass ([DoNotParallelize] to avoid cross-test interference).
    • 20 covering the original ExecuteAsync (retry-loop) overload — see "Review history" for the full list and additions during deep-review iterations.
    • 9 new tests covering ExecuteDetachedAsync:
      • ExecuteDetachedAsync_SucceedsFirstAttempt
      • ExecuteDetachedAsync_OperationFaults_NoOuterRetry (pins the no-retry-loop contract)
      • ExecuteDetachedAsync_CallerCancelMidFlight_SurfacesOCE_DetachedOperationKeepsRunning (primary isolation invariant)
      • ExecuteDetachedAsync_AlreadyCancelledCallerToken_ThrowsBeforeOperation
      • ExecuteDetachedAsync_InternalDeadlineTripsDuringOperation_SurfacesOCE
      • ExecuteDetachedAsync_NullOperation_Throws
      • ExecuteDetachedAsync_OperationFactoryThrowsSync_PropagatesAndDoesNotHang
      • ExecuteDetachedAsync_OperationReturnsNullTask_Throws
      • ExecuteDetachedAsync_CancellationTokenNone_SucceedsAndOperationReceivesNonCanceledToken
  • 2/2 wiring regression tests in ClientCollectionCacheDetachedWiringTests pass.
  • 5/5 QueryPlanRetriever tests + 36/36 query-pipeline / thin-client tests pass — no regression on the query plan path.
  • 60/60 regression slice in CollectionCache | ClientRetry | ConfigurationManager pass.

Follow-up integration tests (deferred to emulator CI iteration)

  • Linux Query Plan managed-path regression test (work-item template explicitly calls this out so we don't depend on ServiceInterop).
  • Cross-region failover end-to-end with FaultInjection HTTP delays + tight caller CT timeout, exercised on both ClientCollectionCache and gateway query-plan paths.
  • AsyncCache dedup under concurrent partial-cancel callers.

Review history

  • Phase 3b fresh-eyes review (code-review agent): 0 blocker, 1 major + 3 minor + 1 nit. Resolved in 22cbcf428.
  • Deep-review iteration 2 (Opus high): 13 findings (0 blocking). All addressed in 09bd57e66 — top-of-loop OCE-due-to-detached-token guard surfaces the underlying exception when the deadline trips during the operation lambda; AsyncCache caveat reworded; GetMetadataDetachedHardDeadline made internal; Task.Yield on zero-backoff branch; CTS lifetime comment; [DoNotParallelize] on test class.
  • Deep-review iteration 3 (Opus high): merge_recommendation = ready, 0 blocking, 6 polish findings. Addressed in 8b1ec7982 — comment on OCE filter asymmetry, per-call retry-policy invariant doc in <summary>, deadline test stability bumped to 2 s for CI.
  • CI fix-up in c3f092849 and 07d94b6a4:
    • CDX1000 boxing fix at MetadataDetachedExecutor.cs:304: replaced object.ReferenceEquals(Exception, Exception) (which boxes both args to object) with reference equality == on typed Exception locals. The analyzer was unset locally during the prior review iterations because of a duplicate .globalconfig between repo root and the worktree, which is why CI caught it after 3 clean local builds.
    • MaxAttemptsHardCap re-grounded against actual retry policies. Previous derivation ("5 regions × 10 in-region = 50") was hand-waved and below ClientRetryPolicy.MaxRetryCount = 120; cap bumped to 200 with the derivation citing the real source constants.
    • DefaultMetadataDetachedHardDeadlineInSeconds doc corrected — old comment cited HotPath but used the ControlPlaneRead ladder math (35 s); now correctly cites the 72 s/region HotPath ladder and the ClientCollectionCache.ReadCollectionAsync call site.
  • Annie + Kushagra review pass (2026-05-13): Annie's questions on the custom retry loop vs TaskHelper.InlineIfPossible and the internal-deadline choice addressed in PR thread; Kushagra's changelog request fulfilled in 1c4fb4482.
  • Cross-SDK alignment audit + query-plan extension (2026-05-13, commit 60c2c77dc): per Annie's follow-up question on broader CT-preemption coverage. Java SDK confirmed structurally immune to the bug class via Reactor retryWhen (error-signal-only) + no caller CT in metadata APIs + Mono.fromFuture(..., suppressCancel=true) for PKRange. The .NET gap was the gateway query-plan path; closed in this commit. PKRange and account-info paths confirmed already aligned. Investigation artifact preserved in session workspace.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

NaluTripician and others added 4 commits May 6, 2026 12:54
Introduces an internal MetadataDetachedExecutor that runs metadata-cache reads on a
detached, internally-bounded CancellationToken and observes the caller's CancellationToken
only on the response path. The retry-policy decision is therefore never preempted by
caller-cancel, fixing the cross-region-failover preemption bug from issue #5805.

ConfigurationManager exposes a configurable hard deadline
(AZURE_COSMOS_METADATA_DETACHED_HARD_DEADLINE_SECONDS, default 5 min) so the detached
attempt cannot leak background work indefinitely. A defensive 50-attempt cap guards
against a misbehaving retry policy returning ShouldRetry=true with zero backoff.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tByRid/GetByName

Repoints both metadata-cache-feeder factories from TaskHelper.InlineIfPossible (which
delegates to BackoffRetryUtility, the source of the caller-cancel preemption) to
MetadataDetachedExecutor. TaskHelper.RunInlineIfNeededAsync still wraps for NETFX
SynchronizationContext safety. Caller CancellationToken is preserved at the entry-side
ThrowIfCancellationRequested() gate and observed by the executor only on the response path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pins behavior of detached-cancellation execution model: success path, transient retry, primary-fix scenario where cross-region retry executes on detached token after caller-cancel mid-flight, caller OCE surfacing while detached task continues, already-cancelled caller token, CancellationToken.None fast path, policy NoRetry/ExceptionToThrow/throws, internal-deadline bound, hard attempt cap, first-attempt OCE consults policy, null-arg validation, non-positive deadline, backoff honored, SyncContext smoke test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…p per fresh-eyes review

Fresh-eyes review (.coding-harness/review-feedback-1.json) flagged:

R1.1 (major): GetMetadataDetachedHardDeadline returned an unbounded TimeSpan.

An envvar value larger than ~uint.MaxValue-1 ms (~49.7 days) would make new

CancellationTokenSource(TimeSpan) throw ArgumentOutOfRangeException, breaking

every metadata read. Added MaxMetadataDetachedHardDeadlineInSeconds=86400

(24h) clamp; new test verifies a 60-day envvar value is clamped and the

resulting TimeSpan constructs a CancellationTokenSource without throwing.

R1.3 (minor): the attempt-cap throw discarded last-failure context. Hoisted

lastCapturedException above the loop; cap path now passes its SourceException

as InnerException and traces type+message so the cap and the underlying

failure are both diagnosable.

R1.5 (nit): doc comment referenced a literal '5 minutes' default; now points

at ConfigurationManager.DefaultMetadataDetachedHardDeadlineInSeconds so the

doc cannot drift from the constant.

R1.2 (nit): renamed ExecuteAsync_CancellationTokenNone_FastPath_NoCallerOcePropagation

to ExecuteAsync_CancellationTokenNone_SucceedsAndOperationReceivesNonCanceledToken

so the test name matches what it actually proves; the fast-path micro-

optimization is not directly observable from outside the executor.

Tests: 19/19 MetadataDetachedExecutor pass (was 17; +2 clamp tests);

60/60 in CollectionCache/ClientRetry/ConfigurationManager regression slice.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician
Copy link
Copy Markdown
Contributor Author

Phase 7 — Cross-SDK parity research

Research artifact: .coding-harness/cross_sdk_status.json. Research-only; no PRs filed in other repos.

Of the four non-.NET Cosmos DB SDKs, three are genuinely vulnerable to the same bug class:

SDK Status Mechanism Tracked
Java ✅ parity (not vulnerable) BackoffRetryUtility.executeRetry() uses Reactor retryWhen. The retry decision (shouldRetry flatMap) is entered synchronously upon error emission; subscription cancellation cannot preempt an already-executing flatMap. Parity is structural, not coincidental. n/a — no fix needed (assessment depends on shouldRetry remaining synchronous)
Python ❌ bug-present asyncio.CancelledError (a BaseException since Python 3.8) escapes the except CosmosHttpResponseError guard in async ExecuteAsync (_retry_utility_async.py:139,187) without retry_policy.ShouldRetry being consulted. Sync path also affected via top-of-loop wall-clock timeout check. azure-sdk-for-python#46471
Rust ❌ bug-present BackOffRetryHandler::send() (retry_handler.rs:113-145) is a bare loop { sender(req).await; retry_policy.should_retry(&result).await; }. Dropping the caller future drops the in-flight retry decision; no tokio::spawn or detached scope. azure-sdk-for-rust#4253
Go ❌ bug-present clientRetryPolicy.Do() (cosmos_client_retry_policy.go:96-117) passes req.Raw().Context() (the caller's cancellable context) to gem.Update() inside attemptRetryOnNetworkError / attemptRetryOnEndpointFailure. Ironically, cosmos_global_endpoint_manager_policy.go:20-30 already uses context.WithoutCancel for the exact same purpose — the fix pattern just wasn't propagated. azure-sdk-for-go#26649

Remediation sketches (idiomatic per language) are captured in .coding-harness/cross_sdk_status.json under each SDK's remediation_sketch field. Open questions and follow-up recommendations are also persisted there. Do not block this .NET PR on those — they are tracked under their respective issue numbers.

@NaluTripician NaluTripician marked this pull request as draft May 6, 2026 20:20
@NaluTripician NaluTripician marked this pull request as ready for review May 6, 2026 20:21
NaluTripician and others added 2 commits May 6, 2026 13:44
Addresses iteration-2 deep review findings on PR #5844:

- R2.4 (Correctness): Surface underlying exception when internal deadline trips during the operation lambda, not just during Task.Delay backoff. Adds a top-of-loop OCE-due-to-detached-token guard that surfaces the prior captured exception, preserving the design contract that callers see the failure mode that drove the retry (not a hard-deadline artifact).

- R2.5 (Documentation): Reword the AsyncCache caveat doc-comment to accurately describe in-flight reuse semantics. Concurrent callers do not share the eventual successful result after the first caller cancels; AsyncCache discards the OCE-faulted lazy and the second caller starts a fresh detached attempt. The real benefit is side-effect accrual (LocationCache region marking, session clearing), not result reuse.

- R2.9 (Style): Change ConfigurationManager.GetMetadataDetachedHardDeadline accessibility from public to internal for consistency with the internal-static class containing it.

- R2.10 (Concurrency): Add Task.Yield() when the retry policy returns BackoffTime <= TimeSpan.Zero, bounding CPU and giving the threadpool a chance to schedule other work. Limits amplification of a misbehaving policy that returns ShouldRetry=true with zero backoff.

- R2.11 (Documentation): Add comment explaining ContinueWith inline-completion ordering for disposeWhenDone, warning future maintainers not to read detachedCts after the registration.

- R2.13 (Testing): Add [DoNotParallelize] to MetadataDetachedExecutorTests so the env-var clamp tests are isolated from MSTest class-level parallelism.

- R2.1 (Diagnostics): Document the post-cancel trace/stats mutation as a known limitation in the executor's doc-comment. The full fix (isolate detached task into a child trace tree, merge only on success) is a follow-up tracked separately to keep this fix scoped.

Adds regression test ExecuteAsync_DeadlineTripsDuringOperation_SurfacesUnderlyingException pinning the R2.4 contract: when the deadline trips during operation execution with prior failures, the underlying DocumentClientException surfaces, not the deadline OCE.

Test results: 20/20 MetadataDetachedExecutor tests pass; 60/60 regression slice (CollectionCache | ClientRetry | ConfigurationManager) pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…er third deep-review pass

Addresses iteration-3 deep review findings on PR #5844 (merge_recommendation: ready, 0 blocking):

- R3.1 (Recommendation): Add comment in ExecuteRetryLoopAsync explaining the asymmetry of the OCE-during-operation guard's third filter clause. When previousException is itself OCE, the filter intentionally falls through to the general catch path because swapping one OCE for another offers no diagnostic gain and the general path correctly funnels through policy/hard-cap/backoff-catch termination.

- R3.2 (Recommendation): Add 'Retry-policy invariant' paragraph to executor's <summary> documenting that the supplied IDocumentClientRetryPolicy MUST be a per-call instance because ShouldRetryAsync is intentionally NOT invoked on either OCE termination path. A future refactor that caches policies must preserve this invariant or move OCE termination paths through ShouldRetryAsync.

- R3.3 (Suggestion): Bump ExecuteAsync_DeadlineTripsDuringOperation_SurfacesUnderlyingException internal-deadline from 200ms to 2s to remove CI flakiness risk on saturated runners. The test still verifies the same R2.4 contract; only the wall-clock generosity changes.

Skipped (non-blocking, deferred or rejected):

- R3.4 (Suggestion): Split tests into two classes for parallelism — over-engineering for zero observed flakes; class-level [DoNotParallelize] is acceptable for a 20-test class that completes in <2s.

- R3.5 (Observation): AsyncCache fall-through coalescer — tracked as follow-up work item.

- R3.6 (Observation): Split <summary> doc-comment into <remarks> — cosmetic; current structure renders correctly in IDE tooltips.

Test results: 20/20 MetadataDetachedExecutor tests pass (~2s); build clean (0 errors).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician
Copy link
Copy Markdown
Contributor Author

Deep Review Summary (3 iterations)

This PR has been through three iterations of local deep-review using a Sonnet/Opus emulation of the cosmos-sdk-copilot-toolkit's PR Deep Reviewer agent. All findings have been triaged and the relevant ones addressed in code; the rest are tracked as documented follow-ups.

Iteration 1 (5 findings → resolved in 22cbcf4)

  • R1.1 Hard-deadline upper clamp added (24h ceiling stays under CTS overflow): MaxMetadataDetachedHardDeadlineInSeconds = 86400.
  • R1.2 Renamed CT-None fast-path test to honest behavioral assertion (the fast-path is non-observable in mock-based tests).
  • R1.3 MaxAttemptsHardCap throw now passes InnerException to the InvalidOperationException so the underlying failure isn't lost.
  • R1.4 / R1.5 left as polish backlog.

Iteration 2 (13 findings → 7 resolved in 09bd57e, 6 deferred)

Resolved in code:

  • R2.4 (Correctness): Surface underlying exception when internal deadline trips during the operation lambda (not just during Task.Delay backoff). Top-of-loop OCE-due-to-detached-token guard preserves the design contract that callers see the failure mode that drove the retry. Pinned by new regression test.
  • R2.5 (Documentation): Reworded AsyncCache caveat — concurrent callers do not share the eventual successful result after the first caller cancels; the real benefit is side-effect accrual.
  • R2.9 (Style): GetMetadataDetachedHardDeadline accessibility public → internal.
  • R2.10 (Concurrency): Task.Yield() when retry policy returns BackoffTime <= TimeSpan.Zero, bounding CPU.
  • R2.11 (Documentation): Comment near disposeWhenDone explaining ContinueWith inline-completion ordering.
  • R2.13 (Testing): [DoNotParallelize] on test class isolating env-var clamp tests.
  • R2.1 (Diagnostics): Documented as known limitation in executor's (full structural fix — isolating detached task into a child trace tree — is a follow-up).

Deferred to follow-ups:

  • R2.2 (CosmosClientOptions surface for the deadline) — env-var-only retained for now.
  • R2.3 (real ClientCollectionCache+ClientRetryPolicy+LocationCache integration test) — emulator CI work.
  • R2.6 (caller-cancel + deadline race stress test) — nice-to-have.
  • R2.7 (IMetadataExecutor interface for substitution).
  • R2.8 (Cosmos.Direct work item to add 'consult policy first' overload to BackoffRetryUtility).
  • R2.12 (customer-facing docs in Cross Region Request Hedging.md).

Iteration 3 (6 findings → 3 addressed in 8b1ec79, 3 deferred; merge_recommendation: ready, 0 blocking)

Audit of iteration-2 fixes: R2.4, R2.5, R2.9, R2.10, R2.11, R2.13 all verified resolved. R2.1 verified as documented partial per the deferred plan.

Resolved:

  • R3.1 (Recommendation): Comment explaining the OCE-during-operation guard's third filter clause asymmetry.
  • R3.2 (Recommendation): Documented per-call retry-policy invariant in executor — ShouldRetryAsync is intentionally not invoked on either OCE termination path.
  • R3.3 (Suggestion): Test deadline 200ms → 2s for CI stability.

Skipped:

  • R3.4 (split tests for parallelism) — over-engineering for zero observed flakes.
  • R3.5 (AsyncCache coalescer) — follow-up work item.
  • R3.6 (split into ) — cosmetic.

Final state

  • Build: 0 errors.
  • Tests: 20/20 MetadataDetachedExecutor tests pass; 60/60 regression slice (CollectionCache | ClientRetry | ConfigurationManager) pass.
  • 4 commits ahead of origin/main; PR remains in draft awaiting human flip-to-ready.
  • merge_recommendation: ready per iteration 3 deep review.

NaluTripician and others added 2 commits May 6, 2026 14:11
…ation

- Replace object.ReferenceEquals(Exception, Exception) with reference
  equality '==' on typed locals to avoid the CDX1000 analyzer error
  (boxing Exception to object on the metadata hot path).
- Re-derive MaxAttemptsHardCap against the actual SDK retry policies:
  the dominant per-call retry ceiling is ClientRetryPolicy.MaxRetryCount
  = 120 (cross-region failover counter), not the previously-claimed
  '5 preferred regions x 10 in-region retries = 50'. Bump the cap to 200
  (120 + ~80 headroom for stacked throttling/session/serviceUnavailable
  retries) and rewrite the doc comment to cite the real source constants.
- Update the matching DefaultMetadataDetachedHardDeadlineInSeconds doc
  comment to derive 300 s from the per-region 1+5+65 s timeout ladder
  and a typical ~3-5 region failover sweep, rather than the wrong
  '5 x 36 s' rationale.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nd policy distinction

Tighten DefaultMetadataDetachedHardDeadlineInSeconds doc comment to:
- explicitly cite the wrapped call site (ClientCollectionCache.ReadCollectionAsync)
  and the GetTimeoutPolicy branch that routes it to the HotPath policy.
- name the slower HttpTimeoutPolicyControlPlaneRead ladder (5+10+20 = 35 s/region)
  used by GatewayAccountReader, with a note that the executor does not wrap that
  path today. Keeps the comment self-correcting if the executor's surface ever
  expands to account reads.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12
Copy link
Copy Markdown
Member

@sdkReviewAgent

Comment thread Microsoft.Azure.Cosmos/src/MetadataDetachedExecutor.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Routing/ClientCollectionCache.cs
@xinlian12
Copy link
Copy Markdown
Member

Review complete (46:58)

No new comments — existing review coverage is sufficient.

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

…hed CT

Adds two mock-based regression tests asserting that ClientCollectionCache.GetByRidAsync and GetByNameAsync route through MetadataDetachedExecutor and pass the executor-owned detached CancellationToken (NOT the caller's token) into the inner ReadCollectionAsync lambda.

Addresses SDK review agent feedback on PR #5844: a regression that reverts either lambda to the caller's CancellationToken would silently reintroduce the cross-region failover preemption bug (issue #5805); the existing MetadataDetachedExecutorTests would still pass because they exercise the executor directly with synthetic operations.

Mechanism: hold the first storeModel.ProcessMessageAsync call on a gate, cancel the caller mid-flight, release the gate so the in-flight attempt fails transiently, and assert the retry policy drives a second ProcessMessageAsync invocation. Verified by temporarily reverting the lambda to caller-token passthrough -- both tests fail with timeout-on-second-invocation. Restored to detached wiring -- both pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician
Copy link
Copy Markdown
Contributor Author

Addressed the test-coverage gap from comment r3204019960 in commit 107ef6e.

Added ClientCollectionCacheDetachedWiringTests with two mock-based regression tests (GetByRidAsync_CallerCancelMidFlight_DetachedReadProceedsToRetry and GetByNameAsync_CallerCancelMidFlight_DetachedReadProceedsToRetry) that pin the wiring contract:

  • Hold the first storeModel.ProcessMessageAsync on a gate
  • Cancel the caller mid-flight
  • Verify caller surfaces OperationCanceledException (response-path observer)
  • Release the gate; first attempt throws transient
  • Verify a second ProcessMessageAsync invocation occurs (mocked retry policy returns RetryAfter(Zero) then NoRetry)

If anyone reverts the lambda back to passing cancellationToken instead of the executor's detached ct, ReadCollectionAsync's top-of-method ThrowIfCancellationRequested fires on the retry iteration before ProcessMessageAsync is touched a second time and the test times out at attempt 2. Verified locally — temporarily reverting the wiring causes both tests to fail with the expected timeout; restoring it makes them pass.

22/22 MetadataDetachedExecutor* + ClientCollectionCacheDetachedWiringTests pass.

Skipping the other three threads as nits/observations per evaluation.

Comment thread Microsoft.Azure.Cosmos/src/Routing/ClientCollectionCache.cs
Comment thread Microsoft.Azure.Cosmos/src/Routing/ClientCollectionCache.cs
Copy link
Copy Markdown
Member

@kundadebdatta kundadebdatta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few questions and add more test coverage.

The previous comment incorrectly attributed CDX1000 to boxing. Exception
is a reference type so no boxing occurs in either form. The
DontConvertExceptionToObject analyzer flags type-information loss when
typed Exception references are converted to object — that is the actual
concern the comment now describes.

Addresses @xinlian12 review comment 3204019954 on PR #5844.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician
Copy link
Copy Markdown
Contributor Author

Review feedback addressed

Pushing two commits to address all open review feedback. Detailed inline replies are posted on each thread.

Code changes

Commit Addresses Change
107ef6e @xinlian12 #3204019960, @kundadebdatta #3207384055 Added ClientCollectionCacheDetachedWiringTests: mocks IStoreModel.ProcessMessageAsync, gates the first attempt, cancels the caller mid-flight, asserts a second ProcessMessageAsync invocation on the detached token for both GetByRidAsync and GetByNameAsync (a regression that re-couples the caller token to the operation lambda would fail with one invocation).
77f3ac0 @xinlian12 #3204019954 Corrected CDX1000 rationale: Exception is a reference type so no boxing occurs; the analyzer flags type-info loss when typed Exception references are converted to object.

Discussion responses

  • @kundadebdatta's AsyncCache question: detailed reply on #3207390957 — the preemption lives inside Cosmos.Direct's BackoffRetryUtility (one layer below AsyncCache); fixing it at the AsyncCache layer would require either modifying Cosmos.Direct or changing AsyncCache's contract for all callers. The wrapper-at-the-leaf-call-site is the right boundary.
  • @kundadebdatta's live multi-region tests: deferred to a follow-up tracking issue per the PR plan; spec captured in §3.2-3.4 of the design doc. The cache-boundary wiring is now pinned at the unit level.
  • @xinlian12's AsyncCache concurrent-caller scenario (#3204019958): noted as a future integration test target.

Local verification

  • Microsoft.Azure.Cosmos.Tests: 22 tests pass (20 MetadataDetachedExecutorTests + 2 ClientCollectionCacheDetachedWiringTests).
  • The push auto-triggered build 61123 which will rerun the previously-failing EmulatorTests Release - ThinClient job.

Re-review requested. Thanks!

{
// Pass CancellationToken.None: the retry policy must not be canceled
// mid-decision. This is the entire point of the detached design.
shouldRetry = await retryPolicy
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question;
why not just use something like the following, why do the try catch here?
CancellationTokenSource detachedCts = new CancellationTokenSource(internalDeadline); Task<T> detachedTask = TaskHelper.InlineIfPossible( () => operation(detachedCts.Token), retryPolicy, detachedCts.Token);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also another question - why we have to calculate an internal deadline cancellation token? do we have to? why not use CancellationToken.None?

since the values configured here are so high, so probably the metadata will complete its original loop, but curious why.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question — the simplification works for the headline bug (caller CT no longer reaches the retry loop), but the executor is enforcing three contracts and TaskHelper.InlineIfPossibleBackoffRetryUtility<T>.ExecuteAsync (TaskHelper.cs:66-69) only preserves the first:

  1. Caller CT ∉ retry loop — both approaches give us this.
  2. ShouldRetryAsync always invoked with CancellationToken.None. The custom loop hard-codes this at line 293 with the comment "the retry policy must not be canceled mid-decision. This is the entire point of the detached design." BackoffRetryUtility doesn't make that guarantee — when detachedCts trips at the 300s mark, the policy can be preempted mid-decision the same way the caller CT used to preempt it (top-of-loop ThrowIfCancellationRequested fires before ShouldRetryAsync even runs).
  3. Deadline-trip surfaces the underlying failure, not a generic OCE. With the simplification, a customer hitting the ceiling sees OperationCanceledException at attempt N. With the explicit loop (filters at lines 260-280 + 334-340), they see the actual ServiceUnavailableException/HttpRequestException from the region that was failing. That's the diagnostics value — without it, a deadline trip looks like "the SDK gave up" rather than "region X kept failing for 300s".
  4. MaxAttemptsHardCap = 200BackoffRetryUtility has no attempt cap, so a misbehaving policy returning ShouldRetry=true with BackoffTime=Zero would burn CPU until the time-based deadline. The cap bounds the burst rate.

~30 LOC to buy invariants 2-4. The XML doc at the top of MetadataDetachedExecutor covers this in prose, but I take your point that the rationale isn't obvious from the loop body itself — happy to add a 4-bullet invariants comment at the top of ExecuteRetryLoopAsync in a follow-up if you'd like.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defense-in-depth backstop — the value is intentionally high enough to almost never trip in production. The attempt cap (200) handles "tight loop with zero backoff," but it does not handle two pathologies:

  • Single attempt that hangs forever — e.g., a future regression in the HttpTimeoutPolicy ladder, or a custom IDocumentClientRetryPolicy that swallows OCE. Attempt 1 never returns → attempt cap never advances → leaked task forever.
  • Diagnostics-mutation horizon — per the caveat at MetadataDetachedExecutor.cs:78-90, the detached task continues mutating the caller's ITrace and ClientSideRequestStatistics until something stops it. Without a deadline, "something" is only the policy itself.

With CancellationToken.None, one future code change = process-lifetime fire-and-forget task pinning HTTP connection-pool slots and growing a Trace whose owner has long since GC'd its references. The deadline guarantees that some upper bound exists.

300s is grounded in the actual ladder math (HttpTimeoutPolicyControlPlaneRetriableHotPath ≈ 72s/region × ~5 preferred regions ≈ 360s ceiling, rounded down with ClientRetryPolicy.RetryIntervalInMS = 1000ms per failover). Tunable via AZURE_COSMOS_METADATA_DETACHED_HARD_DEADLINE_SECONDS (clamped [30s, 86400s]) for ops who want to dial it.

So: yes, in steady-state behavior identical to CancellationToken.None. The deadline only matters when something else is already broken — which is exactly when you want a backstop.

@xinlian12
Copy link
Copy Markdown
Member

In the PR description: "PartitionKeyRangeCache: its BackoffRetryUtility usage does not thread caller CT, so no parity fix needed there."

does this mean even if customer passed a cancellation token, but SDK will not honor it? so customer's request by theory can go beyond their configured CT? If this is the case, sounds like a bug?

And do we also validated the query plan path?

xinlian12
xinlian12 previously approved these changes May 13, 2026
Copy link
Copy Markdown
Member

@xinlian12 xinlian12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes for changelog entry.

Adds a new Unreleased Preview section (per the pattern established in PR #5815) with a Fixed entry for PR #5844. Customer-facing description focuses on the symptom (premature OperationCanceledException preempting cross-region failover during metadata-cache reads) rather than the implementation detail (the new MetadataDetachedExecutor).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician
Copy link
Copy Markdown
Contributor Author

Added in commit 1c4fb4482. Followed the Unreleased Preview section pattern from PR #5815 rather than appending to the already-released 3.60.0-preview.0 section.

Entry placed under #### Fixed with a customer-facing description that focuses on the symptom rather than the implementation:

- [5844](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5844) Routing: Fixes premature OperationCanceledException that preempted cross-region failover during metadata-cache reads

Happy to adjust the wording, section choice, or move it into 3.60.0-preview.0 directly (the PR #5825 pattern) if you'd prefer.

@NaluTripician
Copy link
Copy Markdown
Contributor Author

Three good questions; the PR description's framing was slightly imprecise and I want to clear it up. In order:

1. "Customer's CT is silently ignored — sounds like a bug?"

Yes — but it's a separate, pre-existing gap, not something this PR changes. The PR description's bullet was technically true but understated the cause. The actual structure:

  • PartitionKeyRangeCache.TryGetOverlappingRangesAsync (PartitionKeyRangeCache.cs:54), TryGetPartitionKeyRangeByIdAsync (:89), TryLookupAsync (:121), and GetRoutingMapForCollectionAsync (:185) all lack a CancellationToken parameter. The caller's CT is dropped at the PKRange public API surface, well above BackoffRetryUtility.
  • The BackoffRetryUtility call at PartitionKeyRangeCache.cs:212 uses the 2-arg overload (no CT) consistently with that.
  • The original preemption bug class from Metadata retry: cross-region failover preempted by caller cancellation during control-plane HTTP timeout escalation #5805 therefore cannot occur in PKRangeCache (CT was already gone before BackoffRetryUtility ran). That's the only claim the PR description was trying to make.
  • But you're right that there's a separate latent gap: a caller's CT cannot cancel an in-flight PKRange cache refresh. In pathological cases (unreachable region that drops connections rather than RSTing) the customer's request can extend beyond their configured CT. The same MetadataDetachedExecutor pattern would fit there — let the caller surface OCE on the response path while the background refresh continues. Scoped out of this PR to keep blast radius small.

2. "Did you validate the query plan path?"

Partially. The query plan path is structurally vulnerable to the same bug class, and PR #5844 does NOT fix it:

  • QueryPlanRetriever.GetQueryPlanThroughGatewayAsync (QueryPlanRetriever.cs:118-162) does accept and thread the caller's CancellationToken through to queryContext.ExecuteQueryPlanRequestAsync (line 161).
  • GetQueryPlanWithServiceInteropAsync (:57-115) also threads caller CT through to queryPlanHandler.TryGetQueryPlanAsync (line 100).
  • On Linux/OSX/32-bit-Windows (CosmosQueryExecutionContextFactory.cs:586) the gateway path runs through the standard request pipeline with retry policies — needs tracing through ExecuteQueryPlanRequestAsyncRequestInvokerHandler to confirm the caller CT actually reaches BackoffRetryUtility there. My read is "probably yes, same bug class as Metadata retry: cross-region failover preempted by caller cancellation during control-plane HTTP timeout escalation #5805."

The PR description listed a deferred test for the Linux Query Plan path. Your push is fair: if the bug class exists there, that should be elevated from a missing test to a potentially-missing fix.

3. "What is the PK range scenario in this PR?"

There isn't one — and that's the intentional scope. PR #5844 only fixes the collection metadata read path (ClientCollectionCache.GetByRidAsync / GetByNameAsync), which is what ClientRetryPolicy consults when it needs to decide on cross-region failover for a collection cache miss. The PKRange bullet in the PR description is purely a scope-exclusion note (explaining why PKRange code is not changed), but the wording made it read like an active claim about the absence of a bug there. Apologies for the confusion.

Net

  • ✅ This PR's fix to ClientCollectionCache is correct and complete for its scope.
  • ⚠️ Two pre-existing gaps remain — PKRange cache refresh and Query Plan retrieval — both of which deserve evaluation and potentially the same MetadataDetachedExecutor treatment.

Filed #5862 to track both. The query-plan one is the more urgent of the two (because caller CT does reach the retry path there); the PKRange one requires a public-API-shape change so probably needs a separate design pass.

Happy to tighten the PR description wording on the PKRange bullet too — want me to push that as a small follow-up commit, or is the resolution in this thread sufficient?

…Executor for Java parity

Generalizes MetadataDetachedExecutor with a no-retry-loop ExecuteDetachedAsync
overload and wires QueryPlanRetriever.GetQueryPlanThroughGatewayAsync through
it. The internal RequestInvokerHandler pipeline keeps its own retry semantics;
the wrap only ensures those decisions cannot be preempted by caller
CancellationToken, mirroring Java's error-signal-only retryWhen contract.

- MetadataDetachedExecutor.cs: add ExecuteDetachedAsync; refactor ExecuteAsync
  to compose on top of it; expanded XML doc covering both overloads and the
  Java alignment matrix (PKRange + GatewayAccountReader already aligned).
- QueryPlanRetriever.GetQueryPlanThroughGatewayAsync: route the gateway call
  through ExecuteDetachedAsync so caller CT is observed only on the response
  path. GetQueryPlanWithServiceInteropAsync is unchanged (local CPU work, not
  a metadata retry path).
- MetadataDetachedExecutorTests: 9 new tests covering ExecuteDetachedAsync
  invariants (success, no outer retry, mid-flight cancel detaches, internal
  deadline, sync factory throw, null operation/task validation, fast path).
- changelog.md: expand Unreleased Preview entry to mention query-plan path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician
Copy link
Copy Markdown
Contributor Author

@kundadebdatta @xinlian12 — heads up that I extended this PR in commit 60c2c77dc to bring the gateway query-plan path into the same detached-execution model, after a cross-SDK alignment audit that compared every metadata read path on .NET against Java.

TL;DR

Java's metadata-read paths are structurally immune to the bug class this PR closes (#5805), thanks to three independent layers:

  1. Reactor's retryWhen is an error-signal-only operator — policy.shouldRetry(e) is called unconditionally on every onError; downstream cancellation is a separate cancel() signal that bypasses retryWhen entirely.
  2. ClientRetryPolicy.shouldRetry(Exception e) takes no CancellationToken — it cannot be preempted mid-decision.
  3. For PKRange specifically, AsyncCacheNonBlocking.getAsync uses Mono.fromFuture(..., suppressCancel=true) to suppress downstream cancellation propagation.

After this PR, .NET matches the same outcome via MetadataDetachedExecutor. Path-by-path:

Path .NET post-PR Java Aligned?
Collection cache (ClientCollectionCache.GetBy{Rid,Name}Async) Detached via MetadataDetachedExecutor.ExecuteAsync (retry-loop overload) RxClientCollectionCache — no CT in API + Reactor retryWhen
Query plan gateway (QueryPlanRetriever.GetQueryPlanThroughGatewayAsync) Detached via MetadataDetachedExecutor.ExecuteDetachedAsync (no outer retry loop — RequestInvokerHandler keeps its own retry) Same retryWhen model + availability strategy wrapper
PKRange cache (PartitionKeyRangeCache.*) Unchanged — API takes no CancellationToken at any layer; internal BackoffRetryUtility uses the 2-arg overload RxPartitionKeyRangeCache + AsyncCacheNonBlocking.getAsync with suppressCancel=true ✅ (structural; no caller-CT vector exists)
Account info (GatewayAccountReader + GlobalEndpointManager) Unchanged — already detached pre-PR (uses this.cancellationTokenSource.Token only, HTTP call passes cancellationToken: default) Dedicated GLOBAL_ENDPOINT_MANAGER_BOUNDED_ELASTIC scheduler + Flux.concatDelayError regional sweep ✅ (pre-existing)

What changed in 60c2c77dc

  1. MetadataDetachedExecutor generalized with two overloads, one isolation primitive.

    • ExecuteAsync<T>(operation, retryPolicy, callerCT) — original retry-loop overload, used where the leaf operation has no inner pipeline retry (ClientCollectionCache.ReadCollectionAsync calls storeModel.ProcessMessageAsync directly).
    • ExecuteDetachedAsync<T>(operation, callerCT)new detach-only overload, no outer retry loop. Used where the operation already runs through the standard request pipeline (RequestInvokerHandlerBackoffRetryUtilityClientRetryPolicy) and therefore has its own retry semantics. The wrap only ensures those decisions cannot be preempted by caller cancellation.
    • ExecuteAsync is internally implemented in terms of ExecuteDetachedAsync, so there is exactly one detach primitive in the codebase.
  2. QueryPlanRetriever.GetQueryPlanThroughGatewayAsync routes queryContext.ExecuteQueryPlanRequestAsync(...) through ExecuteDetachedAsync. The internal pipeline retry runs on the detached CT (bounded by the SDK-internal 300 s deadline); the caller's CT is observed only on the response path via Task.WhenAny. Caller cancel → caller surfaces OCE immediately; the underlying retry continues so its side-effects (LocationCache region marking, etc.) take hold for subsequent callers. This is the same contract ClientCollectionCache gets from the original ExecuteAsync wrap.

  3. GetQueryPlanWithServiceInteropAsync is intentionally unchanged — that path is local CPU work via service-interop P/Invoke, not a metadata retry path, and not part of the bug class.

Annie — to your specific questions from earlier (#issuecomment-4437372264)

  • "PartitionKeyRangeCache: its BackoffRetryUtility usage does not thread caller CT, so no parity fix needed there — does this mean even if customer passed a CT, the SDK will not honor it?" — The wording in the original PR description was imprecise. The real statement is: the entire PartitionKeyRangeCache API surface drops the CancellationToken at the public boundary. TryGetOverlappingRangesAsync, TryLookupAsync, TryGetPartitionKeyRangeByIdAsync, GetRoutingMapForCollectionAsync — none of them accept a CancellationToken. Customer-supplied tokens never reach this code in the first place, so there is no preemption vector. This matches Java exactly (RxPartitionKeyRangeCache.tryLookupAsync also takes no token). Whether this is "the right design" is a separate question (Metadata retry: extend detached cancellation pattern to PartitionKeyRangeCache and Query Plan retrieval #5862 tracks it as a Java-parity hardening item, though the PKRange side is parity-tightening rather than a real bug fix because no caller CT can enter).
  • "Did we also validate the query plan path?" — Yes; that was the gap. Pre-PR, QueryPlanRetriever.GetQueryPlanThroughGatewayAsync threaded the caller CT through queryContext.ExecuteQueryPlanRequestAsyncRequestInvokerHandlerBackoffRetryUtility.ExecuteAsync(3-arg) which checks the caller CT at the top of every retry-loop iteration. Same bug class as Metadata retry: cross-region failover preempted by caller cancellation during control-plane HTTP timeout escalation #5805. Closed in 60c2c77dc.
  • "What is the PK range scenario?" — There isn't one for this bug class. PKRange takes no caller CT, and its internal BackoffRetryUtility call uses the 2-arg overload (CT.None). The interesting cross-SDK observation is that Java protects this path with an additional layer (AsyncCacheNonBlocking.getAsync uses Mono.fromFuture(supplier, suppressCancel: true) so that even Reactor-level subscriber disposal cannot cancel the underlying CompletableFuture). .NET's AsyncCache does not make that explicit promise, but because the API takes no CT, there is no exploit vector. The aspiration of true 1:1 parity here is tracked in Metadata retry: extend detached cancellation pattern to PartitionKeyRangeCache and Query Plan retrieval #5862 but is not a bug fix.

Debatta — what's new for review

The relevant new diff is in 60c2c77dc. Aside from the wire-up, the doc comment on MetadataDetachedExecutor was rewritten to:

  • Describe both overloads and explicitly call out which one is appropriate for which kind of call site (leaf-without-pipeline vs. pipeline-with-internal-retry).
  • Document the Java alignment matrix inside the type, so future readers see why two paths get the executor wrap, two don't, and PKRange is parity-correct without a wrap.

Test coverage

  • 30/30 MetadataDetachedExecutorTests pass — 9 new tests cover ExecuteDetachedAsync invariants:
    • Pinned: no outer retry on fault (OperationFaults_NoOuterRetry), caller-cancel mid-flight surfaces OCE while detached operation keeps running, already-cancelled caller token short-circuits, internal deadline trip, null-operation / null-task validation, sync factory throw doesn't hang, CT.None fast path.
  • 2/2 ClientCollectionCacheDetachedWiringTests regression — wiring intact for the original PR scope.
  • 5/5 QueryPlanRetriever tests + 36/36 query-pipeline / thin-client tests pass — no regression on the query plan path.

What's now considered out of scope (and tracked in #5862)

  • PartitionKeyRangeCache: already aligned (no caller-CT vector). A defensive wrap would not change customer-visible behavior; would only mirror Java's explicit suppressCancel=true promise.
  • AddressCache: intentionally unchanged for this PR.
  • GatewayAccountReader / GlobalEndpointManager: pre-existingly detached.
  • Metadata Hedging: separate work item.

Happy to walk through any specific path on a quick sync if helpful. The full cross-SDK investigation artifact (with file:line citations from Azure/azure-sdk-for-java@3ed5b63f) lives in my session workspace; can paste it into a comment here if useful.

NaluTripician and others added 5 commits May 14, 2026 09:47
…ian/metadata-detached

# Conflicts:
#	.gitignore
#	changelog.md
Resolves changelog.md conflict: keeps PR 5844 entry in Unreleased; drops 5870 entry now released in 3.60.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kundadebdatta added a commit that referenced this pull request Jun 2, 2026
Addresses the findings produced by the PR Deep Reviewer on PR #5920
(excluding Finding #1, which assumed PR #5844 `MetadataDetachedExecutor`
would merge into main; per author guidance, this design proceeds without it).

Structural changes
- §5.3 `ExecuteAsync` rewritten: per-branch CancellationTokenSources
  (no shared linkedCts) so the loser`s OperationCanceledException is
  contained inside `BackgroundCleanupAsync` and cannot reach
  `MetadataRequestThrottleRetryPolicy` (Finding #2 — protects healthy
  secondary from spurious `MarkEndpointUnavailableForRead` post-PR #5780).
- §5.3: primary fault before threshold no longer bypasses hedge — adds
  `primaryTask.Status == RanToCompletion` guard so fast-fail-on-degraded
  primary triggers the hedge (Finding #3).
- §5.3: wait-for-winner is now a loop that filters transient/faulted
  completions; a fast 503 from the hedge can no longer beat a healthy
  200 from the primary (Finding #4).
- §5.3 + §5.4 + §5.5: `as` cast (not hard cast) for
  `MetadataRequestThrottleRetryPolicy`; wrapped/test-double policies
  no longer throw `InvalidCastException` (Finding #7).
- §5.3: added `BackgroundCleanupAsync` that awaits the loser, disposes
  its `DocumentServiceResponse` body (handle-leak fix), records outcome
  via volatile field, and disposes the loser CTS (Finding #11).

Correctness/factual fixes
- §5.10 + §5.6: corrected — `ClientCollectionCache` uses `AsyncCache`
  (not `AsyncCacheNonBlocking`); base-class abstract signature change
  must be defaulted for subclass compat; forbid inferring cold-start from
  `previousValue == null` inside the factory (Finding #5).
- §5.2 + §6.1: added `HasHedgedThisOperation` flag (set via
  `Interlocked.Exchange`); fixes the broken §6.1 claim that retries
  wouldn`t re-hedge because the cache had a `previousValue` (false —
  cache is only populated when the loop exits) (Finding #8).
- §5.2: `ConcurrentDictionary<Uri, byte>` replaces `HashSet<Uri>`;
  volatile `LoserOutcome` field for cross-thread updates (Finding #6).
- §5.9: added `HttpTimeoutPolicy.FirstAttemptTimeout` accessor design
  — `TimeoutsAndDelays` is private today (Finding #9).
- §5.7.4: sketched the per-index resolve loop for
  `IncrementRetryIndexOnUnavailableEndpointForMetadataRead` — today
  it`s a 1-line counter that never resolves an endpoint (Finding #10).

New sections
- §5.7 (4 subsections): coordination with PR #5780, structural invariant
  that hedge-loser OCE never reaches retry policy, shared
  `RetryUtility.IsRegionalFailure` helper, attempted-endpoints skip loop.
- §5.12: net472 stack-unwind discipline (`SendOneAsync` middle-layer
  seam + `ExceptionDispatchInfo`) — adopts the PR #5870 lesson
  (Finding #12).
- §5.13: per-auth-mode handling in `CloneForHedge`; hedge-401/403
  guard for RBAC-role-assignment-missing-in-secondary case (Finding #13).
- §7.1: wiring step for `isHedgingDisabledByGateway` from
  `DocumentClient` into the cache constructors via `Func<bool>`
  (Finding #15 bundle).
- §9.1: `EventSource`/`Meter` counters for fire-rate, win-rate,
  budget-exhaustion, late-loser, hedge-fired-elapsed-ms (Finding #15 bundle).

API/rollout
- §5.1: `EnableMetadataHedgingForColdStart` becomes tri-state `bool?`;
  `MetadataHedgingOptions` promoted to public so customers can tune
  `PerClientConcurrencyBudget` for high-container-cardinality startups
  (Finding #14).
- §12: Phase 3 no longer removes the opt-in (binary break avoided);
  only the phase default changes (Finding #14).

Smaller items (Finding #15 bundle)
- §5.3: drop `closest secondary` framing (SDK has no proximity measure);
  use `Wait(TimeSpan.Zero)` instead of `WaitAsync(TimeSpan.Zero)`
  (no Task allocation); add `EvaluateEligibility`-vs-budget-check note.
- §5.4: defaulted `isColdStart = false` on the abstract method to avoid
  breaking subclass overrides (e.g., encryption-mirrored caches).
- §6: added eligibility rules 8 (`ExcludeRegions` hard filter), 9
  (`HasHedgedThisOperation`), 10 (single-master account guard).
- §10: reconciled `Both branches fault` with §5.3 (consistent
  `ExceptionDispatchInfo` semantics).
- §11: tests added for loser-cancellation-doesn`t-poison-secondary,
  loser-disposal, no-re-hedge-across-retries, cross-policy-type,
  net472 SO regression; mirrors PR #5787 `senderCallCount` assertions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants