Direct: Fixes 410/1002 PartitionKeyRangeGone bubbling up to point-read callers without routing-map refresh #5924

@tvaron3

Summary

When a partition split / merge / migration causes a PartitionKeyRangeGoneException (410 / 1002) to be thrown from inside GatewayAddressCache.GetAddressesForRangeIdAsync, the exception escapes the SDK retry pipeline and surfaces to the caller as a raw 410 / 1002 on point operations. This is inconsistent with the SDK's own contract for the Gone family:

  1. Customer-facing surface should be 503, not 410. Every other exhaustion path in GoneAndRetryWithRequestRetryPolicy wraps the terminal exception as ServiceUnavailableException (with Server_PartitionKeyRangeGoneExceededRetryLimit = 21002, Server_GoneExceededRetryLimit = 21005, etc.). A bare 410 leaking through to user code makes the failure unclassifiable by standard SDK consumers — most retry / circuit-breaker logic at the application layer is wired for 503, not 410.
  2. The routing-map (pk-range) cache is never refreshed. The address cache for the (now-stale) partitionKeyRangeId is the only cache that gets refreshed on this codepath, leaving the routing map stale until another request happens to walk into AddressResolver.HandleRangeAddressResolutionFailure. This produces a cluster of failures right after a split until enough traffic has trickled through to converge the routing map.

The result for a customer is a hard, non-retryable-looking 410 on a point read, with a stale routing map left behind for the next caller to discover.

Real incident trace

Diagnostics summary surfaced to the app:

"Summary": {
  "DirectCalls":  { "(410, 21005)": 3 },
  "GatewayCalls": { "(200, 0)": 1 }
},
"Point Operation Statistics": {
  "StatusCode": 410,
  "SubStatusCode": 1002,
  "ErrorMessage": "PartitionKeyRange with id '4874' in collection 'some_rid' doesn't exist."
}

Sequence of events (verified against GatewayAddressCache.cs and GoneAndRetryWithRequestRetryPolicy.cs on main @ 0ca35ae)

  1. App issues ReadItemAsync against pk that routes to pkRange 4874.
  2. AddressResolver.ResolveAsync finds pkRange 4874 in its cached routing map (it's stale but still present) and hands it to StoreClient.
  3. ReplicatedResourceClient / GoneAndRetryWithRequestRetryPolicy enters the retry loop with a 30 s wall-clock budget.
  4. Attempt 1 — RNTBD to replica IN_27 → backend returns 410 with substatus 0 (E_REPLICA_RECONFIGURATION_PENDING). TransportClient stamps it 21005. Policy sets ForceRefreshPartitionAddresses = true, retries.
  5. Attempt 2 — RNTBD to replica IN_28 → same response, stamped 21005 again.
  6. Attempt 3 — RNTBD to replica IN_135 → same response, stamped 21005 again.
  7. In parallel, the address-cache refresh hits the gateway (GatewayCalls: (200, 0): 1, ~17 ms). The gateway's address feed comes back without an entry for pkRangeId 4874 (it's been split away).
  8. GatewayAddressCache.GetAddressesForRangeIdAsync (lines 661-669) throws PartitionKeyRangeGoneException (the "doesn't exist" message).
  9. This 1002 exception escapes TryGetAddressesAsync inside the retry loop's call to fetch new addresses — it is not a response from a backend replica, it's thrown synchronously by the address-resolution call itself.
  10. Back in GoneAndRetryWithRequestRetryPolicy.ShouldRetryAsyncInternal, the policy is just about to start iteration 4. It would normally evaluate remainingMilliseconds <= 0 and convert the exception to 503 — but only 19 ms has elapsed and the budget is 30 s, so that branch is not taken.
  11. The PartitionKeyRangeGoneException propagates up unchanged through ClientRetryPolicy (no 410/1002 branch), NamedCacheRetryHandler (only handles 410/1000), and surfaces to the caller as 410 / 1002.
  12. The routing-map cache is never refreshed on this entire codepath — ForceCollectionRoutingMapRefresh is never set anywhere in the gone-policy or anywhere downstream of it. The next request to the same logical pk will also resolve to (still-stale) pkRange 4874 from the cached routing map, and will go through the same dance.

Root cause

GoneAndRetryWithRequestRetryPolicy only wraps the exception as 503 in the remainingMilliseconds <= 0 branch (see GoneAndRetryWithRequestRetryPolicy.cs:218-310 in bluebird):

if (this.attemptCount++ > 1)
{
    if (remainingMilliseconds <= 0)           // only here
    {
        if (IsBaseGone(...) || IsPartitionKeyRangeGone(...) || ...)
        {
            exceptionToThrow = ServiceUnavailableException.Create(
                exceptionSubStatus, innerException: exception);
        }
    }
}

When PartitionKeyRangeGoneException is thrown by address resolution itself (not by a replica response), the policy never gets a chance to run the wrap path because the exception is raised inside the call that fetches the next batch of addresses, before the time-budget check fires for the next attempt.

Additionally, neither the gone-policy retry path nor the throw site at GatewayAddressCache.cs:669 sets ForceCollectionRoutingMapRefresh = true, so the (now demonstrably stale) routing map stays cached.

Proposed behavior changes

1. Wrap escaping PartitionKeyRangeGoneException from address resolution as 503

When TryGetAddressesAsync (or the equivalent direct path) throws PartitionKeyRangeGoneException, GoneAndRetryWithRequestRetryPolicy (or a thin wrapper) should convert it to ServiceUnavailableException.Create(SubStatusCodes.Server_PartitionKeyRangeGoneExceededRetryLimit /* 21002 */, innerException: …) before surfacing. The existing GetExceptionSubStatusForGoneRetryPolicy mapping already gives us the right substatus.

This makes the surfaced status code align with the rest of the gone-family terminal paths and lets standard client-side 503-retry / circuit-breaker logic kick in.
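A minimal sketch of the wrap described above, assuming stand-in types rather than the SDK's real ones. `PkRangeGoneSketchException`, `ServiceUnavailableSketchException`, and `AddressResolutionWrapper` are hypothetical names introduced only for illustration; the 21002 value is the one this issue quotes for Server_PartitionKeyRangeGoneExceededRetryLimit.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for PartitionKeyRangeGoneException (410 / 1002).
public sealed class PkRangeGoneSketchException : Exception
{
    public PkRangeGoneSketchException(string message) : base(message) { }
}

// Hypothetical stand-in for ServiceUnavailableException (503).
public sealed class ServiceUnavailableSketchException : Exception
{
    public int StatusCode => 503;
    public int SubStatusCode { get; }

    public ServiceUnavailableSketchException(int subStatusCode, Exception innerException)
        : base("Service unavailable", innerException)
    {
        SubStatusCode = subStatusCode;
    }
}

public static class AddressResolutionWrapper
{
    // Value quoted in this issue for Server_PartitionKeyRangeGoneExceededRetryLimit.
    public const int PartitionKeyRangeGoneExceededRetryLimit = 21002;

    // Runs the address-resolution delegate and converts an escaping
    // "range gone" exception into a 503, keeping the original as InnerException.
    public static async Task<T> ResolveOrWrapAsync<T>(Func<Task<T>> resolveAddresses)
    {
        try
        {
            return await resolveAddresses();
        }
        catch (PkRangeGoneSketchException ex)
        {
            throw new ServiceUnavailableSketchException(
                PartitionKeyRangeGoneExceededRetryLimit, ex);
        }
    }
}
```

Keeping the original exception as the inner exception preserves the "doesn't exist" message for diagnostics while the caller sees a 503 it can classify.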

2. Trigger a routing-map (pk-range) refresh when a 1002 is detected anywhere in the path

When GatewayAddressCache.GetAddressesForRangeIdAsync discovers an empty feed for the requested pkRangeId, it should also flag the routing-map cache for refresh — either by setting request.ForceCollectionRoutingMapRefresh = true before the throw, or by directly calling into PartitionKeyRangeCache.TryLookupAsync(..., forceRefreshCollectionRoutingMap: true, previousValue: <current map>).

Today, AddressResolver.HandleRangeAddressResolutionFailure is the only place that performs this refresh, and it's only entered when the cached routing map doesn't even contain a matching range. In the split-recently-completed scenario, the cached routing map still contains the soon-to-be-deleted range, so this safety net never fires.
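The first option (setting the flag before the throw) could look roughly like the sketch below. It models the address feed as a plain dictionary and the request as a one-property class; `RequestSketch`, `RangeGoneSketchException`, and `AddressFeedLookup` are hypothetical stand-ins, and only the `ForceCollectionRoutingMapRefresh` name is taken from this issue.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the request; only the flag named in this issue is modeled.
public sealed class RequestSketch
{
    public bool ForceCollectionRoutingMapRefresh { get; set; }
}

// Hypothetical stand-in for PartitionKeyRangeGoneException (410 / 1002).
public sealed class RangeGoneSketchException : Exception
{
    public RangeGoneSketchException(string message) : base(message) { }
}

public static class AddressFeedLookup
{
    // addressFeed maps pkRangeId -> replica addresses, as returned by a gateway refresh.
    public static IReadOnlyList<string> GetAddressesForRangeId(
        RequestSketch request,
        IReadOnlyDictionary<string, IReadOnlyList<string>> addressFeed,
        string collectionRid,
        string partitionKeyRangeId)
    {
        if (addressFeed.TryGetValue(partitionKeyRangeId, out IReadOnlyList<string> addresses)
            && addresses.Count > 0)
        {
            return addresses;
        }

        // The feed no longer lists this range, so the cached routing map that
        // produced the id is stale. Flag it for refresh before surfacing the error.
        request.ForceCollectionRoutingMapRefresh = true;

        throw new RangeGoneSketchException(
            $"PartitionKeyRange with id '{partitionKeyRangeId}' in collection '{collectionRid}' doesn't exist.");
    }
}
```

Setting the flag at the throw site means whichever layer handles the exception can see that the routing map needs a refresh, without having to infer it from the exception type.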

3. Optional — special-case PartitionKeyRangeGoneException higher in the client retry pipeline

ClientRetryPolicy could grow a 410/1002 branch that triggers one in-line routing-map refresh + one retry (mirroring PartitionKeyRangeGoneRetryPolicy.cs, which is currently wired only into v2 query / change feed). This would let point reads actually recover within the same request instead of failing the first request and recovering on the second.
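A rough sketch of the "one refresh, one retry" shape, written against plain delegates instead of the SDK's retry-policy interfaces. `StaleRangeSketchException` and `RefreshOnceRetry` are hypothetical names; the real change would need to fit into ClientRetryPolicy's existing structure, which this sketch does not attempt to reproduce.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for PartitionKeyRangeGoneException (410 / 1002).
public sealed class StaleRangeSketchException : Exception
{
    public StaleRangeSketchException(string message) : base(message) { }
}

public static class RefreshOnceRetry
{
    // Runs the operation; on a "range gone" failure, refreshes the routing map
    // once and retries once. A second failure propagates to the caller.
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,
        Func<Task> refreshRoutingMap)
    {
        try
        {
            return await operation();
        }
        catch (StaleRangeSketchException)
        {
            await refreshRoutingMap();
            return await operation();
        }
    }
}
```

Limiting the recovery to a single retry keeps the added latency bounded: if the refreshed routing map still resolves to a missing range, the failure surfaces instead of looping.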

Java comparison

Java's GoneAndRetryWithRetryPolicy.GoneRetryPolicy.isNonRetryableException similarly excludes PartitionKeyRangeGoneException — i.e., 410/1002 also bubbles up at the gone-policy in the Java SDK. The Java policy only sets forcePartitionKeyRangeRefresh = true for typed 1007 PartitionKeyRangeIsSplittingException, not for bare Gone or PartitionKeyRangeGone. So this fix would be worth coordinating across SDKs (track Java separately), but the .NET-side fix is independent.

Reproduction signal

Look in customer diagnostics for:

  • Summary.DirectCalls containing (410, 21005) ≥ 1
  • Summary.GatewayCalls containing exactly one (200, 0) (the address-cache refresh)
  • PointOperationStatistics.StatusCode = 410, SubStatusCode = 1002
  • Total elapsed time well under 30 s
  • Error message: "PartitionKeyRange with id '<n>' in collection '<rid>' doesn't exist."

That combination uniquely identifies the address-cache-says-gone, routing-map-still-stale code path.
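For triage across many diagnostics payloads, the checklist above can be expressed as a predicate. The sketch below assumes the relevant fields have already been extracted into a simple object; `DiagnosticsSnapshot` and `StaleRoutingMapSignal` are hypothetical, and "well under 30 s" is approximated here as any elapsed time below the 30 s budget.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, already-parsed view of one diagnostics payload.
public sealed class DiagnosticsSnapshot
{
    public Dictionary<string, int> DirectCalls { get; set; } = new Dictionary<string, int>();
    public Dictionary<string, int> GatewayCalls { get; set; } = new Dictionary<string, int>();
    public int StatusCode { get; set; }
    public int SubStatusCode { get; set; }
    public string ErrorMessage { get; set; } = string.Empty;
    public TimeSpan Elapsed { get; set; }
}

public static class StaleRoutingMapSignal
{
    // The gone-policy wall-clock budget described in this issue.
    private static readonly TimeSpan GoneRetryBudget = TimeSpan.FromSeconds(30);

    public static bool Matches(DiagnosticsSnapshot d)
    {
        // At least one direct call stamped (410, 21005).
        bool directGone = d.DirectCalls.TryGetValue("(410, 21005)", out int gone) && gone >= 1;

        // Exactly one successful gateway call: the address-cache refresh.
        bool oneGatewayRefresh = d.GatewayCalls.TryGetValue("(200, 0)", out int ok) && ok == 1;

        // The surfaced point-operation result is the raw 410 / 1002.
        bool surfaced1002 = d.StatusCode == 410 && d.SubStatusCode == 1002;

        // The request failed before the retry budget was spent.
        bool underBudget = d.Elapsed < GoneRetryBudget;

        // "PartitionKeyRange with id '<n>' in collection '<rid>' doesn't exist."
        bool rangeMissing =
            d.ErrorMessage.StartsWith("PartitionKeyRange with id", StringComparison.Ordinal)
            && d.ErrorMessage.EndsWith("doesn't exist.", StringComparison.Ordinal);

        return directGone && oneGatewayRefresh && surfaced1002 && underBudget && rangeMissing;
    }
}
```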

Affected versions

Verified present on main (commit 0ca35ae9) as of filing. Behavior is the same in 3.60.0P (the version involved in the incident) and unchanged in every later release cut from main. No mitigating PR identified since 3.60.0.

Acceptance criteria

  • Point-read against a pkRange that the gateway no longer lists surfaces as 503 (with substatus 21002 or equivalent), not 410 / 1002.
  • After a request hits this code path, the next request observes a fresh routing map (PartitionKeyRangeCache reflects the post-split state).
  • New / updated integration test under Microsoft.Azure.Cosmos.EmulatorTests simulating an empty address feed for a requested pkRangeId (mockable via IAddressResolver or test gateway).
  • Add a ### Unreleased Bugs Fixed entry to changelog.md.
