Skip to content

[Internal] CancellationToken: Fixes operation-level token being ignored in gateway HTTP retry delays and query pagination paths#5913

Draft
tvaron3 wants to merge 16 commits into
mainfrom
users/tvaron3/fault-injection-leasenotfound-metadata
Draft

[Internal] CancellationToken: Fixes operation-level token being ignored in gateway HTTP retry delays and query pagination paths#5913
tvaron3 wants to merge 16 commits into
mainfrom
users/tvaron3/fault-injection-leasenotfound-metadata

Conversation

@tvaron3
Copy link
Copy Markdown
Member

@tvaron3 tvaron3 commented May 29, 2026

Summary

Two independent improvements bundled in one PR:

  1. Fault injection coverage for LeaseNotFound (410/1022) and gateway metadata-read failures — adds end-to-end tests that inject the lease-not-found substatus on point operations and gateway PKRange/Address metadata reads, exercising cross-region retry and address-cache invalidation paths that previously had no fault-injection coverage.
  2. Operation-level CancellationToken honor in the gateway HTTP retry loop and query pagination call stack — the customer-observable fix. Inter-attempt Task.Delay backoffs and entry-point checks now respect the caller's token so a cancelled operation surfaces OperationCanceledException within the caller's deadline.

Why this matters — walkthrough of the customer scenario

ICM 21000001041367 (Costco Wallet Gateway Service): customer query with a 30 s CancellationToken returned TaskCanceledException after ~100 s wall-clock during a partition merge on paymentshub.wallets. Backend telemetry showed the legitimate LeaseNotFound (410/1022) window during lease handoff cutover was only 18 s. The remaining ~80 s of post-deadline time was inside the SDK.

Where the 80 s actually live

410/1022 returns ShouldRetryResult.RetryAfter(TimeSpan.Zero) — there is no deliberate backoff for this substatus. So why ~80 s?

The time isn't in RetryAfter. It's in HTTP transport timeouts between regions that the SDK sits on inside its own Task.Delays — and FeedIterator prefetch fans out parallel retry chains, each unwinding independently.

Per retry attempt Time Cancellable before fix? Cancellable after fix?
DNS / TCP / TLS reuse ~1 ms yes (socket await) yes
Cross-region HTTP round-trip 50–80 ms yes (socket await) yes
HttpTimeoutPolicy retry wait inside CosmosHttpClientCore.ProcessMessageAsync up to 5 s per attempt ❌ NO (Task.Delay(delay) had no token) ✅ YES (Task.Delay(delay, ct))
Consistency-barrier Task.Delay inside StoreReader up to several seconds ❌ NO ✅ YES (covered by FaultInjectionMetadataBarrierRequestDelay_HonorsCancellationToken)
GetEffectiveRangesAsync / routing ms, but uncancellable ❌ NO ✅ YES (entry-point ThrowIfCancellationRequested)

Multiply that uncancellable per-attempt budget by N parallel prefetch chains × M retry rounds before the policy gives up = the ~80 s Costco saw.

Before vs. after

sequenceDiagram
    autonumber
    participant App as Customer App (CT = 30s)
    participant SDK as SDK Retry Loop
    participant E as East US (source, merging)
    participant W as West US 2 (target)

    Note over App,W: t=0s  Query starts, CT armed for 30s
    App->>SDK: ReadFeedAsync(ct)
    SDK->>E: GET (attempt 1)
    E-->>SDK: 410.1022 (LeaseNotFound) - fast
    Note over SDK: RetryAfter(Zero) -> mark EUS<br/>unavailable, retry on WUS2
    SDK->>W: GET (attempt 2)
    W-->>SDK: 410.1022
    SDK->>E: GET (attempt 3, prefetch chain 2)
    E-->>SDK: 410.1022

    Note over App: t=30s  CT FIRES
    rect rgb(255, 230, 230)
        Note over SDK: BEFORE FIX:<br/>SDK is inside await Task.Delay(delayForNextRequest)<br/>with NO token. Sleeps the full HttpTimeoutPolicy budget.
        SDK->>W: GET (attempt 4)
        W-->>SDK: 410.1022 (5s later)
        SDK->>E: GET (attempt 5)
        E-->>SDK: 410.1022 (5s later)
        Note over SDK: Parallel prefetch chains<br/>(FeedIterator fans out 4-8)<br/>each unwind independently
    end
    Note over App: t~100s  All chains drain<br/>TaskCanceledException (70s LATE)
    SDK-->>App: TaskCanceledException
Loading
sequenceDiagram
    autonumber
    participant App as Customer App (CT = 30s)
    participant SDK as SDK Retry Loop
    participant E as East US
    participant W as West US 2

    Note over App,W: t=0s  Query starts, CT armed for 30s
    App->>SDK: ReadFeedAsync(ct)
    SDK->>E: GET (attempt 1)
    E-->>SDK: 410.1022
    SDK->>W: GET (attempt 2)
    W-->>SDK: 410.1022
    SDK->>E: GET (attempt 3)
    E-->>SDK: 410.1022

    Note over App: t=30s  CT FIRES
    rect rgb(220, 255, 220)
        Note over SDK: AFTER FIX:<br/>await Task.Delay(delay, ct) throws immediately.<br/>Next GetEffectiveRangesAsync also throws at entry-point.
        SDK--xSDK: OperationCanceledException
        Note over SDK: Each prefetch chain hits the same<br/>cancel point on its very next await
    end
    Note over App: t~30.x s
    SDK-->>App: TaskCanceledException (on time)
Loading

Net: the fix doesn't shorten any individual attempt. It makes the inter-attempt Task.Delays and the routing entry points respond to the CT — so the moment the customer's deadline fires, every chain bails out on its very next await instead of waiting out the full HTTP timeout first.

Cancellation token fixes — what changed

Customer-facing fix (transport layer)

  • CosmosHttpClientCore.ProcessMessageAsync: the gateway HTTP retry loop's inter-attempt delay was await Task.Delay(delayForNextRequest). It is now await Task.Delay(delayForNextRequest, cancellationToken). Per-operation gateway requests (point ops, queries, etc.) flow the caller's token into this delay so a cancellation during backoff no longer overruns the caller's deadline. Background metadata refreshes that call this same path pass CancellationToken.None, so they are unaffected.

Entry-point fast-fail

  • FeedRangeInternal.GetEffectiveRangesAsync (virtual + Epk / PartitionKey / PartitionKeyRange overrides): added cancellationToken.ThrowIfCancellationRequested() at the top so an already-cancelled caller fails fast before doing any cache work.

Caller-stack plumbing

CT is now threaded through (and ultimately reaches the FeedRangeInternal.GetEffectiveRangesAsync entry-point check above):

  • ContainerCore.GetFeedRangesAsync / GetPartitionKeyRangesAsync
  • CosmosQueryClient / CosmosQueryClientCore.TryGetTargetPartitionKeyRangeByFeedRangeAsync
  • NetworkAttachedDocumentContainer.MonadicGetChildRangeAsync / pagination paths
  • FeedRangeCompositeContinuation.HandleSplitAsync
  • StandByFeedContinuationToken delegate signature

Deliberately NOT changed

  • PartitionKeyRangeCache populator (the populator delegates inside AsyncCacheNonBlocking): the shared metadata refresh task is fetched once and awaited by all concurrent callers. Letting one caller's token cancel this shared task would cancel the refresh for every other caller (share-of-fate). Callers fast-fail at their own entry points instead; the populator itself runs with CancellationToken.None.
  • CosmosQueryClient.GetTargetPartitionKeyRangesAsync: routes straight into TryGetOverlappingRangesAsync / the shared populator above. No CT parameter — same share-of-fate rationale.
  • Retry policies (ClientRetryPolicy, BulkExecutionRetryPolicy, ResourceThrottleRetryPolicy, WebExceptionRetryPolicy): BackoffRetryUtility.ExecuteAsync already calls cancellationToken.ThrowIfCancellationRequested() before each attempt and uses Task.Delay(backoffTime, cancellationToken) for backoff. Short-circuiting inside ShouldRetryAsync would preempt important side-effects performed during exception classification — most notably the cross-region failover marking in ClientRetryPolicy's HttpRequestException handler (TryMarkEndpointUnavailableForPkRange / MarkEndpointUnavailableForRead). The existing CancellationTokenDoesNotCancelFailover tests pin that contract.

Test coverage

New tests

  • Fault injection (Microsoft.Azure.Cosmos.EmulatorTests): lease-not-found injection on point operations and gateway metadata-read paths; verifies cross-region retry behavior and address-cache invalidation. FaultInjectionMetadataBarrierRequestDelay_HonorsCancellationToken covers the consistency-barrier loop using the new MetadataBarrierRequest operation type + ResponseHeaderOverride server-error result.
  • CT honor regression tests (Microsoft.Azure.Cosmos.Tests): entry-point fast-fail on FeedRangeInternal.GetEffectiveRangesAsync overrides (Epk / PartitionKey / PartitionKeyRange); CT propagation through CosmosHttpClientCore retry-delay path; CT propagation through query pagination call sites; Container.IsFeedRangePartOfAsync end-to-end CT honor (emulator).

Existing tests preserved

  • CancellationTokenDoesNotCancelFailover (Eventual) / (Session) — pins the rule that cross-region failover marking runs to completion even when the caller's token has fired during a retry classification step.
  • All 30 tests matching CancellationToken|WithCancelledToken|PropagatesCancellation|AvailabilityStrategyUnit pass.

Risk

  • Transport-layer Task.Delay(ct) is the only behavioral change for in-flight retries on the gateway path. Cancelled-during-retry-backoff now surfaces OperationCanceledException; previously, it would complete the delay and either retry or surface the underlying exception. This matches the documented CancellationToken contract.
  • No shared-cache behavior changes (PKRC populator unchanged).
  • No retry-policy classification or failover behavior changes.

tvaron3 and others added 3 commits May 28, 2026 17:37
…t types + CT-honor repros for issue #5906

- Add FaultInjectionServerErrorType.LeaseNotFound (410/1022) for injecting the
  lease-handoff failure mode observed during partition merge cutovers.
- Implement the new error type in both Direct (StoreResponse) and Gateway
  (HttpResponseMessage) result paths, modeled on PartitionIsMigrating (410/1008).
- Add FaultInjectionOperationType.MetadataRequest -> OperationType.Head so
  fault rules can target consistency-barrier Head/Collection requests issued
  by the SDK under Strong / Bounded-Staleness consistency.
- Extend gateway and direct smoke [DataRow] tables with the new LeaseNotFound
  status/substatus pair.
- Add two e2e tests that are EXPECTED TO FAIL on master until #5906 ships:
  * FaultInjectionLeaseNotFound_FeedIteratorHonorsCancellationToken: cross-
    partition FeedIterator under a 5 s CancellationToken with 410/1022 in the
    primary region — asserts wall-clock <= 7 s (currently runs much longer due
    to non-CT-aware ClientRetryPolicy + prefetch fan-out).
  * FaultInjectionMetadataRequestDelay_HonorsCancellationToken: Strong-
    consistency multi-region read with a 20 s delay injected on every Head
    barrier in the primary region — asserts the CT is honored within 7 s
    instead of waiting for the barrier to complete.

Both tests reference customer ICM 21000001041367 and #5906.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…st comments

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3
Copy link
Copy Markdown
Member Author

tvaron3 commented May 29, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

tvaron3 and others added 3 commits May 28, 2026 21:51
…Found CT-honor repro

Forces the SDK to fail over from a primary-region LeaseNotFound (410/1022) into
a secondary region that holds the response open with a 120-second delay. With
both regions blocked, a CT-aware code path must surface OperationCanceledException
within the user's CT window; a non-CT-aware path waits in the secondary delay.

Verified locally against single-master account (Strong, CUS write + WUS2 read).
Repro reliably exceeds the 5+2=7s budget by ~2s, demonstrating
CosmosOperationCanceledException surfaces only after the channel-level read
finally observes the CT \u2014 not when the operation-level token first fires.

Also hardens TestCleanup to tolerate partially-initialized fiClient/client.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ader split

Restructures the barrier (Head/Collection) test to maximize the chance of
provoking a Strong-consistency barrier poll: a dedicated writer client pinned
to the write region creates a fresh item to establish a new GCLSN target, then
a reader client with the non-write region preferred immediately issues a
Strong ReadItemAsync. The barrier delay rule fires on any region (no region
filter) since the SDK picks which replica to poll dynamically.

On idle test accounts the read will often satisfy quorum from the initial
response without needing a separate Head poll. The test returns
Assert.Inconclusive in that case; under realistic write load the rule will
fire and the CT-honor assertion will execute.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rier CT-honor test

ResponseHeaderOverride synthesizes a 200 response with caller-supplied
response headers merged in. Useful for deterministically provoking
SDK consistency / barrier paths by lying about LSN / GlobalCommittedLSN /
QuorumAckedLSN / NumberOfReadRegions values without depending on real
cross-region replication lag. Implemented in both Direct (StoreResponse)
and Gateway (HttpResponseMessage) paths.

Updates FaultInjectionMetadataRequestDelay_HonorsCancellationToken to use
the new primitive plus a Direct-mode constraint. The test now mirrors the
proven pattern from BarrierRequestReplicaRetryTests: trigger a write
barrier by setting NumberOfReadRegions=2 + GCLSN < LSN, then fail every
barrier Head/Collection poll with 410/Gone for 60 s. Verified locally
against single-master multi-region account: 4 barrier polls fire, 5 s CT
is honored, CosmosOperationCanceledException surfaces within 7 s budget.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3 tvaron3 marked this pull request as ready for review May 29, 2026 06:47
@tvaron3 tvaron3 marked this pull request as draft May 29, 2026 06:48
…g retries

ClientRetryPolicy.ShouldRetryAsync (both overloads) received the caller's
CancellationToken but never consulted it before returning a
ShouldRetryResult.RetryAfter(...) decision. As a result, the SDK could
schedule additional cross-region retry attempts past the user's
cancellation deadline (e.g., after 410/1022 LeaseNotFound during partition
merge / lease handoff).

Adds an early NoRetry() return when the CancellationToken is already
cancelled in both overloads, plus three unit tests in ClientRetryPolicyTests
that pin the contract.

Also caps FaultInjection's RNTBD-mode ResponseDelay / SendDelay at the
client's configured RequestTimeout. The IChaosInterceptor RNTBD callback
signatures do not accept a CancellationToken (closed-source Direct package
limitation), so a non-cancellable Task.Delay could otherwise block past
the caller's CT. Capping at RequestTimeout bounds the worst case and
mirrors the HTTP path behavior.

Adds FaultInjectionLeaseNotFound_FeedIteratorHonorsCancellationToken
e2e test (FaultInjection multi-region account, single-master, Strong)
that injects 410/1022 in the primary region and ResponseDelay in the
secondary, then asserts the FeedIterator surfaces
OperationCanceledException within the user's CT + 2s grace.
Verified locally: previously 9.36s wall-clock under a 5s CT; with the
fix and a 3s RequestTimeout the operation now completes in ~5.75s.

Adds LeaseNotFound smoke [DataRow] in Direct + Gateway smoke tests and
README documentation row for the (upstream-added) LeaseNotFound fault
type, plus TestCleanup hardening to tolerate partially-initialized
fiClient when the test bails early.

Reverts the MetadataRequest operation type, ResponseHeaderOverride
server error type, and FaultInjectionMetadataRequestDelay test that
were intermediate exploration artifacts and are not needed for this fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3
Copy link
Copy Markdown
Member Author

tvaron3 commented May 29, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@github-actions
Copy link
Copy Markdown

📝 Changelog reminder (non-blocking)

This PR touches shipped source but does not appear to update the
corresponding changelog.md.

Touched (and missing entry for):

  • Microsoft.Azure.Cosmos/FaultInjection/src/** ⇒ expected an entry in Microsoft.Azure.Cosmos/FaultInjection/changelog.md (### Unreleased)

How to decide

Use the rubric in .github/copilot-instructions.md ("Changelog
classifier") or in CONTRIBUTING.md ("Changelog entry"). Quick
version:

  • Customer-observable change (behavior, perf, memory, API) ⇒ add an entry, even if the PR is [Internal].
  • Test-only / CI-only / doc-only / pure-internal-refactor with zero customer-observable effect ⇒ no entry, add the no-changelog-needed label to silence this check.
  • Unsure? Default to adding a one-line entry under #### Other Changes. Reviewer will adjust.

This check is non-blocking — merge is not gated on it. The
reviewer is responsible for the final classification.

…ional retry/backoff paths

Extends PR #5913 ClientRetryPolicy CT fix to cover the rest of the live-site incident class:

- ResourceThrottleRetryPolicy: CT guard at top of both ShouldRetryAsync overloads
  (default maxWaitTimeInSeconds=60; 429 backoffs could previously be issued past
  the caller's deadline).
- WebExceptionRetryPolicy: CT guard at top (exponential backoff up to ~16s).
- BulkExecutionRetryPolicy: CT guard at top of both overloads.
- CosmosHttpClientCore.SendHttpHelperAsync: pass cancellationToken to
  Task.Delay(delayForNextRequest, ...) so the inter-attempt HttpTimeoutPolicy
  delay is interruptible.

Updates changelog entry to cover the broader scope. Adds unit tests for the
five guard sites.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3
Copy link
Copy Markdown
Member Author

tvaron3 commented Jun 1, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

…nKeyRangeCache and query pagination boundaries

Plumbs CancellationToken end-to-end through IRoutingMapProvider,
PartitionKeyRangeCache (TryGetOverlappingRangesAsync,
TryGetPartitionKeyRangeByIdAsync, GetRoutingMapForCollectionAsync),
IRoutingMapProviderExtensions, CosmosQueryClient(Core), and the two
NetworkAttachedDocumentContainer call sites used by query pagination
(MonadicGetChildRangeAsync, MonadicRefreshProviderAsync). Also threads
the token through BulkExecutionRetryPolicy and updates the
StandByFeedContinuationToken delegate signature.

Operations now surface OperationCanceledException within the caller's
token deadline instead of blocking on PKRange cache refreshes between
pages or during split/merge handling.

Adds PartitionKeyRangeCacheTest.TryGetOverlappingRangesAsync_WithCancelledToken_ThrowsOperationCanceledException
to lock in the behavior. Existing test mocks updated to match new
signatures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3 tvaron3 changed the title FaultInjection: Adds LeaseNotFound (410/1022) and MetadataRequest fault types Pagination + RetryPolicies: Fixes CancellationToken not being honored across PartitionKeyRangeCache, query pagination, and retry/backoff paths Jun 4, 2026
@tvaron3 tvaron3 changed the title Pagination + RetryPolicies: Fixes CancellationToken not being honored across PartitionKeyRangeCache, query pagination, and retry/backoff paths Pagination + RetryPolicies: Fixes CancellationToken not honored across retry, pagination, and PartitionKeyRangeCache entry points Jun 4, 2026
@tvaron3 tvaron3 changed the title Pagination + RetryPolicies: Fixes CancellationToken not honored across retry, pagination, and PartitionKeyRangeCache entry points [Internal] CancellationToken: Fixes operation-level token being ignored in gateway HTTP retry delays and query pagination paths Jun 4, 2026
tvaron3 and others added 7 commits June 5, 2026 11:55
…PartitionKeyRangeCache and query pagination boundaries"

This reverts commit aea58a2.
…ss additional retry/backoff paths"

This reverts commit 18b6bfd.
… HTTP retry loop and query pagination entry points

Replaces the previously-reverted PartitionKeyRangeCache and retry-policy
CancellationToken plumbing with a narrowed, share-of-fate-safe fix:

- CosmosHttpClientCore.ProcessMessageAsync: Task.Delay(delayForNextRequest, cancellationToken)
  so inter-attempt gateway-retry backoffs honor the caller's deadline.
  Background metadata refreshes that go through this path pass
  CancellationToken.None, so they are unaffected.

- FeedRangeInternal.GetEffectiveRangesAsync (+ Epk / PartitionKey /
  PartitionKeyRange overrides): adds CancellationToken parameter and
  ThrowIfCancellationRequested at the top so an already-cancelled
  caller fails fast.

- Caller-stack plumbing through ContainerCore, CosmosQueryClient(Core),
  NetworkAttachedDocumentContainer, and DefaultDocumentQueryExecutionContext
  so the operation token reaches the entry-point check above.

Deliberately does NOT change:
- PartitionKeyRangeCache populator (shared via AsyncCacheNonBlocking;
  one caller's token would cancel the refresh for every concurrent caller).
- Retry policies (BackoffRetryUtility already honors CT before each
  attempt and during backoff; short-circuiting inside ShouldRetryAsync
  would preempt cross-region failover marking pinned by the existing
  CancellationTokenDoesNotCancelFailover tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion-leasenotfound-and-metada

# Conflicts:
#	changelog.md
…n tests, removes CT param from GetTargetPartitionKeyRangesAsync

- FaultInjection: rename MetadataRequest -> MetadataBarrierRequest across enum, processor, README, changelog, and direct-mode test.
- FaultInjection: restore empty <param> doc stubs on FaultInjectionServerErrorResultInternal ctor (incl. new headerOverrides slot); rewrite FI changelog bullet to include PR link.
- Tests: add FeedRangePKRangeId_GetEffectiveRangesAsync_WithCancelledToken_ThrowsOperationCanceledException and emulator-side GivenCancelledTokenIsFeedRangePartOfAsyncThrowsOperationCanceledException to cover the two CT-honor gaps.
- Query: remove CancellationToken parameter from CosmosQueryClient.GetTargetPartitionKeyRangesAsync (abstract + override + 4 test mocks). The method routes straight into the shared PKRC populator, which is deliberately non-cancellable (share-of-fate). CT-honor for query routing comes from GetTargetPartitionKeyRangeByFeedRangeAsync -> GetEffectiveRangesAsync entry-point check.
- Query: revert incidental forceRefresh: false literal back to no-arg form in TryGetOverlappingRangesAsync call site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- changelog.md: removed the verbose Other Changes 5913 entry; shortened the Bugs Fixed bullet to keep just the customer-observable summary + entry-point fast-fail.
- CancellationTokenTests.GatewayProcessMessageAsyncCancelsOnDeadline: now allows derived types on [ExpectedException(typeof(OperationCanceledException))]. Our Task.Delay(delayForNextRequest, cancellationToken) fix in CosmosHttpClientCore throws TaskCanceledException (a subtype of OperationCanceledException), which the pre-existing exact-type assertion rejected.
- IsFeedRangePartOfAsyncTests.GivenCancelledTokenIsFeedRangePartOfAsync...: assert on CosmosOperationCanceledException (the SDK wrapper, also a subtype of OperationCanceledException) and verify the captured token matches the caller's. The container surface wraps the inner OCE in CosmosOperationCanceledException for diagnostics, so the base-type expectation was too narrow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3
Copy link
Copy Markdown
Member Author

tvaron3 commented Jun 6, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant