Summary
When a partition split / merge / migration causes a PartitionKeyRangeGoneException (410 / 1002) to be thrown from inside GatewayAddressCache.GetAddressesForRangeIdAsync, the exception escapes the SDK retry pipeline and surfaces to the caller as a raw 410 / 1002 on point operations. This is inconsistent with the SDK's own contract for the Gone family:
- Customer-facing surface should be 503, not 410. Every other exhaustion path in GoneAndRetryWithRequestRetryPolicy wraps the terminal exception as ServiceUnavailableException (with Server_PartitionKeyRangeGoneExceededRetryLimit = 21002, Server_GoneExceededRetryLimit = 21005, etc.). A bare 410 leaking through to user code makes the failure unclassifiable by standard SDK consumers: most retry / circuit-breaker logic at the application layer is wired for 503, not 410.
- The routing-map (pk-range) cache is never refreshed. The address cache for the (now-stale) partitionKeyRangeId is the only cache that gets refreshed on this codepath, leaving the routing map stale until another request happens to walk into AddressResolver.HandleRangeAddressResolutionFailure. This produces a cluster of failures right after a split until enough traffic has trickled through to converge the routing map.
The result for a customer is a hard, non-retryable-looking 410 on a point read, with a stale routing map left behind for the next caller to discover.
Real incident trace
Diagnostics summary surfaced to the app:
    "Summary": {
        "DirectCalls": { "(410, 21005)": 3 },
        "GatewayCalls": { "(200, 0)": 1 }
    },
    "Point Operation Statistics": {
        "StatusCode": 410,
        "SubStatusCode": 1002,
        "ErrorMessage": "PartitionKeyRange with id '4874' in collection 'some_rid' doesn't exist."
    }
Sequence of events (verified against GatewayAddressCache.cs and GoneAndRetryWithRequestRetryPolicy.cs on main @ 0ca35ae)
- App issues ReadItemAsync against pk that routes to pkRange 4874.
- AddressResolver.ResolveAsync finds pkRange 4874 in its cached routing map (it's stale but still present) and hands it to StoreClient.
- ReplicatedResourceClient → GoneAndRetryWithRequestRetryPolicy enters the retry loop with a 30 s wall-clock budget.
- Attempt 1: RNTBD to replica IN_27 → backend returns 410 with substatus 0 (E_REPLICA_RECONFIGURATION_PENDING). TransportClient stamps it 21005. Policy sets ForceRefreshPartitionAddresses = true, retries.
- Attempt 2: RNTBD to replica IN_28 → same response, stamped 21005 again.
- Attempt 3: RNTBD to replica IN_135 → same response, stamped 21005 again.
- In parallel, the address-cache refresh hits the gateway (GatewayCalls: (200, 0): 1, ~17 ms). The gateway's address feed comes back without an entry for pkRangeId 4874 (it's been split away).
- GatewayAddressCache.GetAddressesForRangeIdAsync, lines 661-669, throws PartitionKeyRangeGoneException (the "doesn't exist" message).
- This 1002 exception escapes TryGetAddressesAsync inside the retry loop's call to fetch new addresses. It is not a response from a backend replica; it's thrown synchronously by the address-resolution call itself.
- Back in GoneAndRetryWithRequestRetryPolicy.ShouldRetryAsyncInternal, the policy is just about to start iteration 4. It would normally evaluate remainingMilliseconds <= 0 and convert the exception to 503, but only 19 ms has elapsed and the budget is 30 s, so that branch is not taken.
- The PartitionKeyRangeGoneException propagates up unchanged through ClientRetryPolicy (no 410/1002 branch), NamedCacheRetryHandler (only handles 410/1000), and surfaces to the caller as 410 / 1002.
- The routing-map cache is never refreshed on this entire codepath: ForceCollectionRoutingMapRefresh is never set anywhere in the gone-policy or anywhere downstream of it. The next request to the same logical pk will also resolve to (still-stale) pkRange 4874 from the cached routing map, and will go through the same dance.
Root cause
GoneAndRetryWithRequestRetryPolicy only wraps the exception as 503 in the remainingMilliseconds <= 0 branch (see GoneAndRetryWithRequestRetryPolicy.cs:218-310 in bluebird):
    if (this.attemptCount++ > 1)
    {
        if (remainingMilliseconds <= 0) // only here
        {
            if (IsBaseGone(...) || IsPartitionKeyRangeGone(...) || ...)
            {
                exceptionToThrow = ServiceUnavailableException.Create(
                    exceptionSubStatus, innerException: exception);
            }
        }
    }
When PartitionKeyRangeGoneException is thrown by address resolution itself (not by a replica response), the policy never gets a chance to run the wrap path because the exception is raised inside the call that fetches the next batch of addresses, before the time-budget check fires for the next attempt.
Additionally, neither the gone-policy retry path nor the throw site at GatewayAddressCache.cs:669 sets ForceCollectionRoutingMapRefresh = true, so the (now demonstrably stale) routing map stays cached.
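The gating described above can be modelled in isolation. The following is a minimal sketch, not the SDK's code: SketchException and WrapGateSketch are hypothetical stand-ins, and the substatus mapping only covers the two values named in this issue. It shows that the same 410 / 1002 comes back unchanged while budget remains and is wrapped as 503 / 21002 only once the budget is spent.

```csharp
using System;

// Hypothetical stand-in for the SDK's status-carrying exceptions.
public sealed class SketchException : Exception
{
    public int Status { get; }
    public int SubStatus { get; }

    public SketchException(int status, int subStatus, Exception inner = null)
        : base($"{status}/{subStatus}", inner)
    {
        Status = status;
        SubStatus = subStatus;
    }
}

public static class WrapGateSketch
{
    // Mirrors the shape of the excerpt: the 503 wrap is reachable only
    // after the first attempts AND once the time budget is spent.
    public static SketchException ExceptionToSurface(
        SketchException exception, int attemptCount, double remainingMilliseconds)
    {
        bool isGoneFamily = exception.Status == 410;
        if (attemptCount > 1 && remainingMilliseconds <= 0 && isGoneFamily)
        {
            // 21002 for PartitionKeyRangeGone (1002), 21005 for base Gone.
            int subStatus = exception.SubStatus == 1002 ? 21002 : 21005;
            return new SketchException(503, subStatus, exception);
        }

        return exception; // budget not exhausted: surfaces unchanged
    }

    public static void Main()
    {
        var gone = new SketchException(410, 1002);

        // Incident timing: 19 ms into a 30 s budget.
        SketchException early = ExceptionToSurface(gone, 4, 30000 - 19);
        Console.WriteLine($"{early.Status}/{early.SubStatus}");

        // Same exception once the budget is spent.
        SketchException late = ExceptionToSurface(gone, 4, 0);
        Console.WriteLine($"{late.Status}/{late.SubStatus}");
    }
}
```

Running it prints 410/1002 for the incident timing and 503/21002 for the exhausted-budget case, which is the asymmetry this issue is about.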
Proposed behavior changes
1. Wrap escaping PartitionKeyRangeGoneException from address resolution as 503
When TryGetAddressesAsync (or the equivalent direct path) throws PartitionKeyRangeGoneException, GoneAndRetryWithRequestRetryPolicy (or a thin wrapper) should convert it to ServiceUnavailableException.Create(SubStatusCodes.Server_PartitionKeyRangeGoneExceededRetryLimit /* 21002 */, innerException: …) before surfacing. The existing GetExceptionSubStatusForGoneRetryPolicy mapping already gives us the right substatus.
This makes the surfaced status code align with the rest of the gone-family terminal paths and lets standard client-side 503-retry / circuit-breaker logic kick in.
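One possible shape for this conversion, as a sketch under stated assumptions: SketchException and AddressResolutionWrapSketch.Resolve are hypothetical names, and the delegate stands in for the address-resolution call. The real change would use ServiceUnavailableException.Create with the substatus named above; where exactly the wrapper sits in the pipeline is left open.

```csharp
using System;

// Hypothetical stand-in for the SDK's status-carrying exceptions.
public sealed class SketchException : Exception
{
    public int Status { get; }
    public int SubStatus { get; }

    public SketchException(int status, int subStatus, Exception inner = null)
        : base($"{status}/{subStatus}", inner)
    {
        Status = status;
        SubStatus = subStatus;
    }
}

public static class AddressResolutionWrapSketch
{
    // Runs the address-resolution delegate and converts an escaping
    // 410 / 1002 into 503 / 21002, independent of the remaining budget.
    public static T Resolve<T>(Func<T> tryGetAddresses)
    {
        try
        {
            return tryGetAddresses();
        }
        catch (SketchException ex) when (ex.Status == 410 && ex.SubStatus == 1002)
        {
            // Keep the original exception as the inner exception for diagnostics.
            throw new SketchException(503, 21002, ex);
        }
    }

    public static void Main()
    {
        try
        {
            Resolve<string>(() => throw new SketchException(410, 1002));
        }
        catch (SketchException ex)
        {
            Console.WriteLine($"{ex.Status}/{ex.SubStatus}, inner: {ex.InnerException?.Message}");
        }
    }
}
```

The exception filter leaves every other failure untouched, so only the 1002 case changes its surfaced status.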
2. Trigger a routing-map (pk-range) refresh when a 1002 is detected anywhere in the path
When GatewayAddressCache.GetAddressesForRangeIdAsync discovers an empty feed for the requested pkRangeId, it should also flag the routing-map cache for refresh — either by setting request.ForceCollectionRoutingMapRefresh = true before the throw, or by directly calling into PartitionKeyRangeCache.TryLookupAsync(..., forceRefreshCollectionRoutingMap: true, previousValue: <current map>).
Today, AddressResolver.HandleRangeAddressResolutionFailure is the only place that performs this refresh, and it's only entered when the cached routing map doesn't even contain a matching range. In the split-recently-completed scenario, the cached routing map still contains the soon-to-be-deleted range, so this safety net never fires.
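The first option (flag the request before the throw) could look roughly like the sketch below. RequestStandIn, SketchException, and the dictionary-shaped address feed are hypothetical simplifications; only the property name ForceCollectionRoutingMapRefresh is taken from this issue.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the SDK's status-carrying exceptions.
public sealed class SketchException : Exception
{
    public int Status { get; }
    public int SubStatus { get; }

    public SketchException(int status, int subStatus) : base($"{status}/{subStatus}")
    {
        Status = status;
        SubStatus = subStatus;
    }
}

// Hypothetical stand-in for the request object that carries the refresh flag.
public sealed class RequestStandIn
{
    public bool ForceCollectionRoutingMapRefresh { get; set; }
}

public static class RoutingMapFlagSketch
{
    // Looks up addresses for a pkRangeId. When the feed has no entry,
    // flags the routing map as stale before throwing 410 / 1002.
    public static IReadOnlyList<string> GetAddressesForRange(
        IReadOnlyDictionary<string, IReadOnlyList<string>> addressFeed,
        string pkRangeId,
        RequestStandIn request)
    {
        if (!addressFeed.TryGetValue(pkRangeId, out IReadOnlyList<string> addresses))
        {
            request.ForceCollectionRoutingMapRefresh = true;
            throw new SketchException(410, 1002);
        }

        return addresses;
    }

    public static void Main()
    {
        // Feed after a split: range 4874 is absent. The child id and the
        // address below are made up for the demo.
        var feed = new Dictionary<string, IReadOnlyList<string>>
        {
            ["4875"] = new[] { "rntbd://replica-a" },
        };
        var request = new RequestStandIn();

        try
        {
            GetAddressesForRange(feed, "4874", request);
        }
        catch (SketchException ex)
        {
            Console.WriteLine(
                $"{ex.Status}/{ex.SubStatus}, refresh flagged: {request.ForceCollectionRoutingMapRefresh}");
        }
    }
}
```

Setting the flag at the throw site means whichever layer handles the exception next sees that the routing map is known to be stale, without needing to infer it from the status code.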
3. Optional — special-case PartitionKeyRangeGoneException higher in the client retry pipeline
ClientRetryPolicy could grow a 410/1002 branch that triggers one in-line routing-map refresh + one retry (mirroring PartitionKeyRangeGoneRetryPolicy.cs, which is currently wired only into v2 query / change feed). This would let point reads actually recover within the same request instead of failing the first request and recovering on the second.
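A rough sketch of the "one refresh, one retry" behavior follows. It is not modelled on the actual ClientRetryPolicy or PartitionKeyRangeGoneRetryPolicy code; SingleRefreshRetrySketch, its delegates, and SketchException are hypothetical names used only to show the control flow.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the SDK's status-carrying exceptions.
public sealed class SketchException : Exception
{
    public int Status { get; }
    public int SubStatus { get; }

    public SketchException(int status, int subStatus) : base($"{status}/{subStatus}")
    {
        Status = status;
        SubStatus = subStatus;
    }
}

public static class SingleRefreshRetrySketch
{
    // Runs the operation; on 410 / 1002, refreshes the routing map once
    // and retries exactly once. A second failure propagates to the caller.
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation, Func<Task> refreshRoutingMap)
    {
        try
        {
            return await operation();
        }
        catch (SketchException ex) when (ex.Status == 410 && ex.SubStatus == 1002)
        {
            await refreshRoutingMap();
            return await operation();
        }
    }

    public static async Task Main()
    {
        bool mapIsStale = true;
        int refreshes = 0;

        int status = await ExecuteAsync(
            () => mapIsStale
                ? Task.FromException<int>(new SketchException(410, 1002))
                : Task.FromResult(200),
            () => { mapIsStale = false; refreshes++; return Task.CompletedTask; });

        Console.WriteLine($"status {status} after {refreshes} refresh");
    }
}
```

In the demo the first call fails against the stale map, the refresh clears it, and the retry returns 200 within the same request. Capping the retry at one keeps the worst case bounded if the refreshed map is still wrong.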
Java comparison
Java's GoneAndRetryWithRetryPolicy.GoneRetryPolicy.isNonRetryableException similarly excludes PartitionKeyRangeGoneException — i.e., 410/1002 also bubbles up at the gone-policy in the Java SDK. The Java policy only sets forcePartitionKeyRangeRefresh = true for typed 1007 PartitionKeyRangeIsSplittingException, not for bare Gone or PartitionKeyRangeGone. So this fix would be worth coordinating across SDKs (track Java separately), but the .NET-side fix is independent.
Reproduction signal
Look in customer diagnostics for:
- Summary.DirectCalls containing (410, 21005) ≥ 1
- Summary.GatewayCalls containing exactly one (200, 0) (the address-cache refresh)
- PointOperationStatistics.StatusCode = 410, SubStatusCode = 1002
- Total elapsed time well under 30 s
- Error message: "PartitionKeyRange with id '<n>' in collection '<rid>' doesn't exist."
That combination uniquely identifies the address-cache-says-gone, routing-map-still-stale code path.
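For triage across many diagnostics payloads, the status-code part of this signature can be checked mechanically. The sketch below uses System.Text.Json and assumes the payload is a JSON object with the key names shown in the incident trace ("Summary", "Point Operation Statistics"); real diagnostics may be nested or named differently, so treat the property paths as assumptions. The elapsed-time and error-message checks are left out because this issue does not show which fields carry them in a form that is safe to parse.

```csharp
using System;
using System.Text.Json;

public static class DiagnosticsSignatureSketch
{
    // Returns true when the payload shows >= 1 direct (410, 21005) call,
    // exactly one gateway (200, 0) call, and a 410 / 1002 point result.
    // Throws JsonException if the input is not valid JSON.
    public static bool Matches(string diagnosticsJson)
    {
        using JsonDocument doc = JsonDocument.Parse(diagnosticsJson);
        JsonElement root = doc.RootElement;

        if (root.ValueKind != JsonValueKind.Object
            || !TryGetObject(root, "Summary", out JsonElement summary)
            || !TryGetObject(root, "Point Operation Statistics", out JsonElement point))
        {
            return false;
        }

        int direct = TryGetObject(summary, "DirectCalls", out JsonElement d)
            ? ReadInt(d, "(410, 21005)", 0) : 0;
        int gateway = TryGetObject(summary, "GatewayCalls", out JsonElement g)
            ? ReadInt(g, "(200, 0)", 0) : 0;

        return direct >= 1
            && gateway == 1
            && ReadInt(point, "StatusCode", 0) == 410
            && ReadInt(point, "SubStatusCode", 0) == 1002;
    }

    private static bool TryGetObject(JsonElement parent, string name, out JsonElement value) =>
        parent.TryGetProperty(name, out value) && value.ValueKind == JsonValueKind.Object;

    private static int ReadInt(JsonElement obj, string name, int fallback) =>
        obj.TryGetProperty(name, out JsonElement v)
        && v.ValueKind == JsonValueKind.Number
        && v.TryGetInt32(out int n) ? n : fallback;

    public static void Main()
    {
        // The incident trace from this issue, wrapped in an object.
        string sample = @"{
          ""Summary"": {
            ""DirectCalls"": { ""(410, 21005)"": 3 },
            ""GatewayCalls"": { ""(200, 0)"": 1 }
          },
          ""Point Operation Statistics"": {
            ""StatusCode"": 410,
            ""SubStatusCode"": 1002,
            ""ErrorMessage"": ""PartitionKeyRange with id '4874' in collection 'some_rid' doesn't exist.""
          }
        }";

        Console.WriteLine(Matches(sample));
    }
}
```

Against the incident trace the matcher prints True. A payload whose point result is 503, or that shows two gateway calls, does not match.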
Affected versions
Verified present on main (commit 0ca35ae9) as of filing. Behavior is the same in 3.60.0P (the version on the incident) and unchanged in every release on main. No mitigating PR identified since 3.60.0.
Acceptance criteria
- … 503 (with substatus 21002 or equivalent), not 410 / 1002.
- … PartitionKeyRangeCache reflects the post-split state).
- … Microsoft.Azure.Cosmos.EmulatorTests simulating an empty address feed for a requested pkRangeId (mockable via IAddressResolver or test gateway).
- … ### Unreleased Bugs Fixed entry to changelog.md.