[Internal] CancellationToken: Fixes operation-level token being ignored in gateway HTTP retry delays and query pagination paths#5913
Draft
tvaron3 wants to merge 16 commits into
Conversation
…t types + CT-honor repros for issue #5906 - Add FaultInjectionServerErrorType.LeaseNotFound (410/1022) for injecting the lease-handoff failure mode observed during partition merge cutovers. - Implement the new error type in both Direct (StoreResponse) and Gateway (HttpResponseMessage) result paths, modeled on PartitionIsMigrating (410/1008). - Add FaultInjectionOperationType.MetadataRequest -> OperationType.Head so fault rules can target consistency-barrier Head/Collection requests issued by the SDK under Strong / Bounded-Staleness consistency. - Extend gateway and direct smoke [DataRow] tables with the new LeaseNotFound status/substatus pair. - Add two e2e tests that are EXPECTED TO FAIL on master until #5906 ships: * FaultInjectionLeaseNotFound_FeedIteratorHonorsCancellationToken: cross- partition FeedIterator under a 5 s CancellationToken with 410/1022 in the primary region — asserts wall-clock <= 7 s (currently runs much longer due to non-CT-aware ClientRetryPolicy + prefetch fan-out). * FaultInjectionMetadataRequestDelay_HonorsCancellationToken: Strong- consistency multi-region read with a 20 s delay injected on every Head barrier in the primary region — asserts the CT is honored within 7 s instead of waiting for the barrier to complete. Both tests reference customer ICM 21000001041367 and #5906. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…st comments Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6 tasks
Member
Author
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…Found CT-honor repro Forces the SDK to fail over from a primary-region LeaseNotFound (410/1022) into a secondary region that holds the response open with a 120-second delay. With both regions blocked, a CT-aware code path must surface OperationCanceledException within the user's CT window; a non-CT-aware path waits in the secondary delay. Verified locally against single-master account (Strong, CUS write + WUS2 read). Repro reliably exceeds the 5+2=7s budget by ~2s, demonstrating CosmosOperationCanceledException surfaces only after the channel-level read finally observes the CT \u2014 not when the operation-level token first fires. Also hardens TestCleanup to tolerate partially-initialized fiClient/client. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ader split Restructures the barrier (Head/Collection) test to maximize the chance of provoking a Strong-consistency barrier poll: a dedicated writer client pinned to the write region creates a fresh item to establish a new GCLSN target, then a reader client with the non-write region preferred immediately issues a Strong ReadItemAsync. The barrier delay rule fires on any region (no region filter) since the SDK picks which replica to poll dynamically. On idle test accounts the read will often satisfy quorum from the initial response without needing a separate Head poll. The test returns Assert.Inconclusive in that case; under realistic write load the rule will fire and the CT-honor assertion will execute. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rier CT-honor test ResponseHeaderOverride synthesizes a 200 response with caller-supplied response headers merged in. Useful for deterministically provoking SDK consistency / barrier paths by lying about LSN / GlobalCommittedLSN / QuorumAckedLSN / NumberOfReadRegions values without depending on real cross-region replication lag. Implemented in both Direct (StoreResponse) and Gateway (HttpResponseMessage) paths. Updates FaultInjectionMetadataRequestDelay_HonorsCancellationToken to use the new primitive plus a Direct-mode constraint. The test now mirrors the proven pattern from BarrierRequestReplicaRetryTests: trigger a write barrier by setting NumberOfReadRegions=2 + GCLSN < LSN, then fail every barrier Head/Collection poll with 410/Gone for 60 s. Verified locally against single-master multi-region account: 4 barrier polls fire, 5 s CT is honored, CosmosOperationCanceledException surfaces within 7 s budget. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…g retries ClientRetryPolicy.ShouldRetryAsync (both overloads) received the caller's CancellationToken but never consulted it before returning a ShouldRetryResult.RetryAfter(...) decision. As a result, the SDK could schedule additional cross-region retry attempts past the user's cancellation deadline (e.g., after 410/1022 LeaseNotFound during partition merge / lease handoff). Adds an early NoRetry() return when the CancellationToken is already cancelled in both overloads, plus three unit tests in ClientRetryPolicyTests that pin the contract. Also caps FaultInjection's RNTBD-mode ResponseDelay / SendDelay at the client's configured RequestTimeout. The IChaosInterceptor RNTBD callback signatures do not accept a CancellationToken (closed-source Direct package limitation), so a non-cancellable Task.Delay could otherwise block past the caller's CT. Capping at RequestTimeout bounds the worst case and mirrors the HTTP path behavior. Adds FaultInjectionLeaseNotFound_FeedIteratorHonorsCancellationToken e2e test (FaultInjection multi-region account, single-master, Strong) that injects 410/1022 in the primary region and ResponseDelay in the secondary, then asserts the FeedIterator surfaces OperationCanceledException within the user's CT + 2s grace. Verified locally: previously 9.36s wall-clock under a 5s CT; with the fix and a 3s RequestTimeout the operation now completes in ~5.75s. Adds LeaseNotFound smoke [DataRow] in Direct + Gateway smoke tests and README documentation row for the (upstream-added) LeaseNotFound fault type, plus TestCleanup hardening to tolerate partially-initialized fiClient when the test bails early. Reverts the MetadataRequest operation type, ResponseHeaderOverride server error type, and FaultInjectionMetadataRequestDelay test that were intermediate exploration artifacts and are not needed for this fix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
📝 Changelog reminder (non-blocking)This PR touches shipped source but does not appear to update the Touched (and missing entry for):
How to decideUse the rubric in
This check is non-blocking — merge is not gated on it. The |
…ional retry/backoff paths Extends PR #5913 ClientRetryPolicy CT fix to cover the rest of the live-site incident class: - ResourceThrottleRetryPolicy: CT guard at top of both ShouldRetryAsync overloads (default maxWaitTimeInSeconds=60; 429 backoffs could previously be issued past the caller's deadline). - WebExceptionRetryPolicy: CT guard at top (exponential backoff up to ~16s). - BulkExecutionRetryPolicy: CT guard at top of both overloads. - CosmosHttpClientCore.SendHttpHelperAsync: pass cancellationToken to Task.Delay(delayForNextRequest, ...) so the inter-attempt HttpTimeoutPolicy delay is interruptible. Updates changelog entry to cover the broader scope. Adds unit tests for the five guard sites. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run |
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
…nKeyRangeCache and query pagination boundaries Plumbs CancellationToken end-to-end through IRoutingMapProvider, PartitionKeyRangeCache (TryGetOverlappingRangesAsync, TryGetPartitionKeyRangeByIdAsync, GetRoutingMapForCollectionAsync), IRoutingMapProviderExtensions, CosmosQueryClient(Core), and the two NetworkAttachedDocumentContainer call sites used by query pagination (MonadicGetChildRangeAsync, MonadicRefreshProviderAsync). Also threads the token through BulkExecutionRetryPolicy and updates the StandByFeedContinuationToken delegate signature. Operations now surface OperationCanceledException within the caller's token deadline instead of blocking on PKRange cache refreshes between pages or during split/merge handling. Adds PartitionKeyRangeCacheTest.TryGetOverlappingRangesAsync_WithCancelledToken_ThrowsOperationCanceledException to lock in the behavior. Existing test mocks updated to match new signatures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…PartitionKeyRangeCache and query pagination boundaries" This reverts commit aea58a2.
…ss additional retry/backoff paths" This reverts commit 18b6bfd.
…cheduling retries" This reverts commit 66bf824.
… HTTP retry loop and query pagination entry points Replaces the previously-reverted PartitionKeyRangeCache and retry-policy CancellationToken plumbing with a narrowed, share-of-fate-safe fix: - CosmosHttpClientCore.ProcessMessageAsync: Task.Delay(delayForNextRequest, cancellationToken) so inter-attempt gateway-retry backoffs honor the caller's deadline. Background metadata refreshes that go through this path pass CancellationToken.None, so they are unaffected. - FeedRangeInternal.GetEffectiveRangesAsync (+ Epk / PartitionKey / PartitionKeyRange overrides): adds CancellationToken parameter and ThrowIfCancellationRequested at the top so an already-cancelled caller fails fast. - Caller-stack plumbing through ContainerCore, CosmosQueryClient(Core), NetworkAttachedDocumentContainer, and DefaultDocumentQueryExecutionContext so the operation token reaches the entry-point check above. Deliberately does NOT change: - PartitionKeyRangeCache populator (shared via AsyncCacheNonBlocking; one caller's token would cancel the refresh for every concurrent caller). - Retry policies (BackoffRetryUtility already honors CT before each attempt and during backoff; short-circuiting inside ShouldRetryAsync would preempt cross-region failover marking pinned by the existing CancellationTokenDoesNotCancelFailover tests). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion-leasenotfound-and-metada # Conflicts: # changelog.md
…n tests, removes CT param from GetTargetPartitionKeyRangesAsync - FaultInjection: rename MetadataRequest -> MetadataBarrierRequest across enum, processor, README, changelog, and direct-mode test. - FaultInjection: restore empty <param> doc stubs on FaultInjectionServerErrorResultInternal ctor (incl. new headerOverrides slot); rewrite FI changelog bullet to include PR link. - Tests: add FeedRangePKRangeId_GetEffectiveRangesAsync_WithCancelledToken_ThrowsOperationCanceledException and emulator-side GivenCancelledTokenIsFeedRangePartOfAsyncThrowsOperationCanceledException to cover the two CT-honor gaps. - Query: remove CancellationToken parameter from CosmosQueryClient.GetTargetPartitionKeyRangesAsync (abstract + override + 4 test mocks). The method routes straight into the shared PKRC populator, which is deliberately non-cancellable (share-of-fate). CT-honor for query routing comes from GetTargetPartitionKeyRangeByFeedRangeAsync -> GetEffectiveRangesAsync entry-point check. - Query: revert incidental forceRefresh: false literal back to no-arg form in TryGetOverlappingRangesAsync call site. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- changelog.md: removed the verbose Other Changes 5913 entry; shortened the Bugs Fixed bullet to keep just the customer-observable summary + entry-point fast-fail. - CancellationTokenTests.GatewayProcessMessageAsyncCancelsOnDeadline: now allows derived types on [ExpectedException(typeof(OperationCanceledException))]. Our Task.Delay(delayForNextRequest, cancellationToken) fix in CosmosHttpClientCore throws TaskCanceledException (a subtype of OperationCanceledException), which the pre-existing exact-type assertion rejected. - IsFeedRangePartOfAsyncTests.GivenCancelledTokenIsFeedRangePartOfAsync...: assert on CosmosOperationCanceledException (the SDK wrapper, also a subtype of OperationCanceledException) and verify the captured token matches the caller's. The container surface wraps the inner OCE in CosmosOperationCanceledException for diagnostics, so the base-type expectation was too narrow. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two independent improvements bundled in one PR:
LeaseNotFound (410/1022)and gateway metadata-read failures — adds end-to-end tests that inject the lease-not-found substatus on point operations and gateway PKRange/Address metadata reads, exercising cross-region retry and address-cache invalidation paths that previously had no fault-injection coverage.CancellationTokenhonor in the gateway HTTP retry loop and query pagination call stack — the customer-observable fix. Inter-attemptTask.Delaybackoffs and entry-point checks now respect the caller's token so a cancelled operation surfacesOperationCanceledExceptionwithin the caller's deadline.Why this matters — walkthrough of the customer scenario
ICM 21000001041367 (Costco Wallet Gateway Service): customer query with a 30 s
CancellationTokenreturnedTaskCanceledExceptionafter ~100 s wall-clock during a partition merge onpaymentshub.wallets. Backend telemetry showed the legitimateLeaseNotFound (410/1022)window during lease handoff cutover was only 18 s. The remaining ~80 s of post-deadline time was inside the SDK.Where the 80 s actually live
410/1022returnsShouldRetryResult.RetryAfter(TimeSpan.Zero)— there is no deliberate backoff for this substatus. So why ~80 s?The time isn't in
RetryAfter. It's in HTTP transport timeouts between regions that the SDK sits on inside its ownTask.Delays — andFeedIteratorprefetch fans out parallel retry chains, each unwinding independently.HttpTimeoutPolicyretry wait insideCosmosHttpClientCore.ProcessMessageAsyncTask.Delay(delay)had no token)Task.Delay(delay, ct))Task.DelayinsideStoreReaderFaultInjectionMetadataBarrierRequestDelay_HonorsCancellationToken)GetEffectiveRangesAsync/ routingThrowIfCancellationRequested)Multiply that uncancellable per-attempt budget by N parallel prefetch chains × M retry rounds before the policy gives up = the ~80 s Costco saw.
Before vs. after
sequenceDiagram autonumber participant App as Customer App (CT = 30s) participant SDK as SDK Retry Loop participant E as East US (source, merging) participant W as West US 2 (target) Note over App,W: t=0s Query starts, CT armed for 30s App->>SDK: ReadFeedAsync(ct) SDK->>E: GET (attempt 1) E-->>SDK: 410.1022 (LeaseNotFound) - fast Note over SDK: RetryAfter(Zero) -> mark EUS<br/>unavailable, retry on WUS2 SDK->>W: GET (attempt 2) W-->>SDK: 410.1022 SDK->>E: GET (attempt 3, prefetch chain 2) E-->>SDK: 410.1022 Note over App: t=30s CT FIRES rect rgb(255, 230, 230) Note over SDK: BEFORE FIX:<br/>SDK is inside await Task.Delay(delayForNextRequest)<br/>with NO token. Sleeps the full HttpTimeoutPolicy budget. SDK->>W: GET (attempt 4) W-->>SDK: 410.1022 (5s later) SDK->>E: GET (attempt 5) E-->>SDK: 410.1022 (5s later) Note over SDK: Parallel prefetch chains<br/>(FeedIterator fans out 4-8)<br/>each unwind independently end Note over App: t~100s All chains drain<br/>TaskCanceledException (70s LATE) SDK-->>App: TaskCanceledExceptionsequenceDiagram autonumber participant App as Customer App (CT = 30s) participant SDK as SDK Retry Loop participant E as East US participant W as West US 2 Note over App,W: t=0s Query starts, CT armed for 30s App->>SDK: ReadFeedAsync(ct) SDK->>E: GET (attempt 1) E-->>SDK: 410.1022 SDK->>W: GET (attempt 2) W-->>SDK: 410.1022 SDK->>E: GET (attempt 3) E-->>SDK: 410.1022 Note over App: t=30s CT FIRES rect rgb(220, 255, 220) Note over SDK: AFTER FIX:<br/>await Task.Delay(delay, ct) throws immediately.<br/>Next GetEffectiveRangesAsync also throws at entry-point. SDK--xSDK: OperationCanceledException Note over SDK: Each prefetch chain hits the same<br/>cancel point on its very next await end Note over App: t~30.x s SDK-->>App: TaskCanceledException (on time)Net: the fix doesn't shorten any individual attempt. It makes the inter-attempt
Task.Delays and the routing entry points respond to the CT — so the moment the customer's deadline fires, every chain bails out on its very nextawaitinstead of waiting out the full HTTP timeout first.Cancellation token fixes — what changed
Customer-facing fix (transport layer)
CosmosHttpClientCore.ProcessMessageAsync: the gateway HTTP retry loop's inter-attempt delay wasawait Task.Delay(delayForNextRequest). It is nowawait Task.Delay(delayForNextRequest, cancellationToken). Per-operation gateway requests (point ops, queries, etc.) flow the caller's token into this delay so a cancellation during backoff no longer overruns the caller's deadline. Background metadata refreshes that call this same path passCancellationToken.None, so they are unaffected.Entry-point fast-fail
FeedRangeInternal.GetEffectiveRangesAsync(virtual +Epk/PartitionKey/PartitionKeyRangeoverrides): addedcancellationToken.ThrowIfCancellationRequested()at the top so an already-cancelled caller fails fast before doing any cache work.Caller-stack plumbing
CT is now threaded through (and ultimately reaches the
FeedRangeInternal.GetEffectiveRangesAsyncentry-point check above):ContainerCore.GetFeedRangesAsync/GetPartitionKeyRangesAsyncCosmosQueryClient/CosmosQueryClientCore.TryGetTargetPartitionKeyRangeByFeedRangeAsyncNetworkAttachedDocumentContainer.MonadicGetChildRangeAsync/ pagination pathsFeedRangeCompositeContinuation.HandleSplitAsyncStandByFeedContinuationTokendelegate signatureDeliberately NOT changed
PartitionKeyRangeCachepopulator (thepopulatordelegates insideAsyncCacheNonBlocking): the shared metadata refresh task is fetched once and awaited by all concurrent callers. Letting one caller's token cancel this shared task would cancel the refresh for every other caller (share-of-fate). Callers fast-fail at their own entry points instead; the populator itself runs withCancellationToken.None.CosmosQueryClient.GetTargetPartitionKeyRangesAsync: routes straight intoTryGetOverlappingRangesAsync/ the shared populator above. No CT parameter — same share-of-fate rationale.ClientRetryPolicy,BulkExecutionRetryPolicy,ResourceThrottleRetryPolicy,WebExceptionRetryPolicy):BackoffRetryUtility.ExecuteAsyncalready callscancellationToken.ThrowIfCancellationRequested()before each attempt and usesTask.Delay(backoffTime, cancellationToken)for backoff. Short-circuiting insideShouldRetryAsyncwould preempt important side-effects performed during exception classification — most notably the cross-region failover marking inClientRetryPolicy'sHttpRequestExceptionhandler (TryMarkEndpointUnavailableForPkRange/MarkEndpointUnavailableForRead). The existingCancellationTokenDoesNotCancelFailovertests pin that contract.Test coverage
New tests
Microsoft.Azure.Cosmos.EmulatorTests): lease-not-found injection on point operations and gateway metadata-read paths; verifies cross-region retry behavior and address-cache invalidation.FaultInjectionMetadataBarrierRequestDelay_HonorsCancellationTokencovers the consistency-barrier loop using the newMetadataBarrierRequestoperation type +ResponseHeaderOverrideserver-error result.Microsoft.Azure.Cosmos.Tests): entry-point fast-fail onFeedRangeInternal.GetEffectiveRangesAsyncoverrides (Epk / PartitionKey / PartitionKeyRange); CT propagation throughCosmosHttpClientCoreretry-delay path; CT propagation through query pagination call sites;Container.IsFeedRangePartOfAsyncend-to-end CT honor (emulator).Existing tests preserved
CancellationTokenDoesNotCancelFailover (Eventual)/(Session)— pins the rule that cross-region failover marking runs to completion even when the caller's token has fired during a retry classification step.CancellationToken|WithCancelledToken|PropagatesCancellation|AvailabilityStrategyUnitpass.Risk
Task.Delay(ct)is the only behavioral change for in-flight retries on the gateway path. Cancelled-during-retry-backoff now surfacesOperationCanceledException; previously, it would complete the delay and either retry or surface the underlying exception. This matches the documentedCancellationTokencontract.