[COSMOS] Direct: Fixes 410/1002 PartitionKeyRangeGone bubbling up to point-read callers without routing-map refresh #49381

@tvaron3

Description

Summary

When a partition split / merge / migration causes a PartitionKeyRangeGoneException (410 / 1002) to be thrown from inside GatewayAddressCache while resolving addresses for a now-deleted partition key range, the exception escapes the SDK retry pipeline and surfaces to the caller as a raw 410 / 1002 on point operations. This is inconsistent with how the SDK handles every other Gone-family exhaustion path:

  1. Customer-facing surface should be 503, not 410. The terminal paths of GoneAndRetryWithRetryPolicy wrap exhausted Gone-family failures as ServiceUnavailableException with the corresponding "ExceededRetryLimit" substatus. A bare 410 leaking through to user code makes the failure unclassifiable by standard SDK consumers — most application-layer retry / circuit-breaker logic is wired for 503, not 410.
  2. The routing-map (pk-range) cache is never refreshed. The address cache for the (now-stale) partitionKeyRangeId is the only cache that gets refreshed on this codepath, leaving the routing map stale until another request happens to walk into the resolver's failure-handler path. This produces a cluster of failures right after a split until enough traffic has trickled through to converge the routing map.

This is a peer issue to the .NET v3 SDK report: Azure/azure-cosmos-dotnet-v3#5924. Both SDKs share the same architecture here and exhibit the same gap — coordinating the fix is the goal of filing it on both repos.

Affected code paths

GoneAndRetryWithRetryPolicy.GoneRetryPolicy (under sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/):

  • isNonRetryableException does not list PartitionKeyRangeGoneException among the retryable Gone-family types, so a 410/1002 thrown during address resolution is treated as non-retryable at this layer and bubbles up unchanged.
  • handleException dispatches by typed exception:
    • GoneException → Pair.of(null, true): address-cache refresh only
    • PartitionIsMigratingException → sets forceCollectionRoutingMapRefresh = true
    • PartitionKeyRangeIsSplittingException → sets forcePartitionKeyRangeRefresh = true
    • PartitionKeyRangeGoneException → not in the switch; no flag is set, exception is rethrown
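
The dispatch described above can be sketched as a small standalone model. This is not the SDK's code: the exception classes, the Flags holder, and the method body are simplified stand-ins written for illustration, and only the flag names are taken from this issue.

```java
// Simplified stand-ins for illustration; none of these are the SDK's real classes.
final class GoneDispatchSketch {
    static class GoneException extends RuntimeException {}
    static class PartitionIsMigratingException extends RuntimeException {}
    static class PartitionKeyRangeIsSplittingException extends RuntimeException {}
    static class PartitionKeyRangeGoneException extends RuntimeException {}

    /** Refresh flags the policy would apply before the next attempt (hypothetical holder). */
    static final class Flags {
        boolean forceRefreshAddressCache;
        boolean forceCollectionRoutingMapRefresh;
        boolean forcePartitionKeyRangeRefresh;
    }

    /** Returns the flags for a handled exception, or rethrows an unhandled one. */
    static Flags handleException(RuntimeException e) {
        Flags flags = new Flags();
        if (e instanceof GoneException) {
            flags.forceRefreshAddressCache = true;          // address cache only
        } else if (e instanceof PartitionIsMigratingException) {
            flags.forceCollectionRoutingMapRefresh = true;
        } else if (e instanceof PartitionKeyRangeIsSplittingException) {
            flags.forcePartitionKeyRangeRefresh = true;
        } else {
            // PartitionKeyRangeGoneException lands here: no flag is set, it is rethrown.
            throw e;
        }
        return flags;
    }

    public static void main(String[] args) {
        System.out.println(handleException(new GoneException()).forceRefreshAddressCache);
        try {
            handleException(new PartitionKeyRangeGoneException());
        } catch (PartitionKeyRangeGoneException ex) {
            System.out.println("PartitionKeyRangeGoneException rethrown without any refresh flag");
        }
    }
}
```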

GatewayAddressCache (under sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/):

  • When the gateway's address feed returns no entry for the requested pkRangeId, a PartitionKeyRangeGoneException is thrown with the message "PartitionKeyRange with id '<n>' in collection '<rid>' doesn't exist." This is the proximate source of the 410/1002 that escapes.

Sequence of events (typical scenario)

  1. App issues readItem / queryItems against a pk that routes to a recently-split pkRange.
  2. AddressResolverImpl (or GlobalAddressResolver) finds the soon-to-be-deleted pkRange in its cached routing map and hands it to the store client.
  3. StoreClient enters GoneAndRetryWithRetryPolicy with the configured wall-clock budget.
  4. RNTBD requests to the existing replicas return 410 with substatus 0 (E_REPLICA_RECONFIGURATION_PENDING, HRESULT 0x800A09CC) → typed as GoneException.
  5. Policy sets forceRefreshAddressCache = true and retries. The gateway address-cache refresh comes back without an entry for the requested pkRangeId.
  6. GatewayAddressCache throws PartitionKeyRangeGoneException ("doesn't exist").
  7. This 1002 exception is raised inside the address-resolution call that the retry policy made for the next iteration. The policy's exhaustion-wrap path (which would normally convert to 503) only fires once the wall-clock budget has elapsed, which is typically not the case here because the address-resolution refresh completes quickly.
  8. The PartitionKeyRangeGoneException propagates up unchanged through ClientRetryPolicy and surfaces to the caller as 410 / 1002.
  9. The routing-map cache is never refreshed anywhere on this codepath: no code in the gone policy or at the address-cache throw site sets forceCollectionRoutingMapRefresh / forcePartitionKeyRangeRefresh. The next request to the same logical pk will therefore resolve to the same (still-stale) pkRange from the cached routing map.
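
To make steps 3 through 9 concrete, here is a toy model of the retry loop. It is a sketch under heavy assumptions: the class, the attempt method, and the routingMapRefreshes counter are invented for illustration, and the real policy is reactive rather than a blocking loop. It only shows why the 1002 escapes before the exhaustion wrap can fire.

```java
import java.time.Duration;

// Toy model of the sequence above; all names here are hypothetical stand-ins.
final class EscapeSequenceSketch {
    static class GoneException extends RuntimeException {}
    static class PartitionKeyRangeGoneException extends RuntimeException {}
    static class ServiceUnavailableException extends RuntimeException {}

    // Counts routing-map refreshes; nothing on this path ever increments it.
    static int routingMapRefreshes = 0;

    /** One attempt: resolve addresses for the stale pkRange, then call a replica. */
    static String attempt(boolean forceRefreshAddressCache) {
        if (forceRefreshAddressCache) {
            // The refreshed address feed has no entry for the stale pkRangeId.
            throw new PartitionKeyRangeGoneException();
        }
        // Cached addresses still point at replicas of the old range.
        throw new GoneException();
    }

    static String executeWithGonePolicy(Duration budget) {
        long deadline = System.nanoTime() + budget.toNanos();
        boolean forceRefreshAddressCache = false;
        while (true) {
            try {
                return attempt(forceRefreshAddressCache);
            } catch (GoneException e) {
                if (System.nanoTime() >= deadline) {
                    // Exhaustion wrap: only reached when the budget has elapsed.
                    throw new ServiceUnavailableException();
                }
                forceRefreshAddressCache = true; // retry with refreshed addresses
            }
            // PartitionKeyRangeGoneException is not caught here, so it leaves the loop as-is.
        }
    }

    public static void main(String[] args) {
        try {
            executeWithGonePolicy(Duration.ofSeconds(30));
        } catch (PartitionKeyRangeGoneException e) {
            System.out.println("escaped as 410/1002, routing-map refreshes: " + routingMapRefreshes);
        }
    }
}
```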

Proposed behavior changes

1. Wrap escaping PartitionKeyRangeGoneException from address resolution as 503

When GatewayAddressCache (or the equivalent direct path) throws PartitionKeyRangeGoneException, GoneAndRetryWithRetryPolicy (or a thin wrapper) should convert it to a ServiceUnavailableException with the Server_PartitionKeyRangeGoneExceededRetryLimit (21002) substatus before surfacing. This aligns the surfaced status with the rest of the Gone-family terminal paths and lets standard 503-retry / circuit-breaker logic kick in on the customer side.
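
A minimal sketch of the conversion, assuming a single hypothetical StatusException type that carries status and substatus. The SDK has its own exception hierarchy and constants, so the names below are illustrative; only the numeric codes (410, 1002, 503, 21002) come from this issue.

```java
// Illustrative only: StatusException and the constant names are hypothetical stand-ins.
final class WrapAs503Sketch {
    static final int GONE = 410;
    static final int SERVICE_UNAVAILABLE = 503;
    static final int PARTITION_KEY_RANGE_GONE = 1002;
    static final int PARTITION_KEY_RANGE_GONE_EXCEEDED_RETRY_LIMIT = 21002;

    static class StatusException extends RuntimeException {
        final int statusCode;
        final int subStatusCode;

        StatusException(int statusCode, int subStatusCode, String message, Throwable cause) {
            super(message, cause);
            this.statusCode = statusCode;
            this.subStatusCode = subStatusCode;
        }
    }

    /** Converts an escaping 410/1002 into 503/21002; any other exception passes through. */
    static RuntimeException wrapIfPartitionKeyRangeGone(RuntimeException e) {
        if (e instanceof StatusException) {
            StatusException se = (StatusException) e;
            if (se.statusCode == GONE && se.subStatusCode == PARTITION_KEY_RANGE_GONE) {
                // Keep the original as the cause so diagnostics still show the 410/1002.
                return new StatusException(SERVICE_UNAVAILABLE,
                    PARTITION_KEY_RANGE_GONE_EXCEEDED_RETRY_LIMIT, se.getMessage(), se);
            }
        }
        return e;
    }

    public static void main(String[] args) {
        StatusException original = new StatusException(GONE, PARTITION_KEY_RANGE_GONE,
            "PartitionKeyRange doesn't exist.", null);
        StatusException wrapped = (StatusException) wrapIfPartitionKeyRangeGone(original);
        System.out.println(wrapped.statusCode + " / " + wrapped.subStatusCode);
    }
}
```

Keeping the original exception as the cause is a design choice in this sketch, so that the surfaced 503 does not hide where the failure came from.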

2. Trigger a routing-map (pk-range) refresh when a 1002 is detected anywhere in the path

When GatewayAddressCache discovers an empty address feed for the requested pkRangeId, it should also flag the routing-map cache for refresh — either by setting forceCollectionRoutingMapRefresh = true on the request before the throw, or by directly invoking RxPartitionKeyRangeCache.tryLookupAsync(..., forceRefreshCollectionRoutingMap = true, previousValue = <current map>).

Today, this refresh only happens when AddressResolverImpl cannot find a matching range in the cached map at all. In the split-recently-completed scenario, the cached routing map still contains the soon-to-be-deleted range, so this safety net never fires.
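
The first option (flag the request before the throw) could look roughly like the sketch below. The Request holder, the map-based address feed, and resolveAddresses are hypothetical simplifications; the real GatewayAddressCache works against the gateway and reactive types, and the exception message here is abbreviated.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simplification of the throw site; not the SDK's GatewayAddressCache.
final class FlagRoutingMapRefreshSketch {
    static class PartitionKeyRangeGoneException extends RuntimeException {
        PartitionKeyRangeGoneException(String message) {
            super(message);
        }
    }

    /** Stand-in for the per-request context that carries refresh flags. */
    static final class Request {
        boolean forceCollectionRoutingMapRefresh;
    }

    /** Looks up replica addresses; flags a routing-map refresh before throwing on a miss. */
    static List<String> resolveAddresses(Map<String, List<String>> addressFeed,
                                         String pkRangeId, Request request) {
        List<String> addresses = addressFeed.get(pkRangeId);
        if (addresses == null || addresses.isEmpty()) {
            // The proposed change: mark the routing map stale before the exception leaves.
            request.forceCollectionRoutingMapRefresh = true;
            throw new PartitionKeyRangeGoneException(
                "PartitionKeyRange with id '" + pkRangeId + "' doesn't exist.");
        }
        return addresses;
    }

    public static void main(String[] args) {
        Map<String, List<String>> feed = new HashMap<>();
        feed.put("2", Arrays.asList("rntbd://replica-a", "rntbd://replica-b"));
        Request request = new Request();
        try {
            resolveAddresses(feed, "1", request); // "1" was split and is gone from the feed
        } catch (PartitionKeyRangeGoneException e) {
            System.out.println("refresh flagged: " + request.forceCollectionRoutingMapRefresh);
        }
    }
}
```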

3. Optional — special-case PartitionKeyRangeGoneException higher in the client retry pipeline

ClientRetryPolicy could grow a 410/1002 branch that triggers one in-line routing-map refresh + one retry. This would let point reads actually recover within the same request instead of failing the first request and recovering on the second.
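
As a rough illustration of "one in-line refresh + one retry", the sketch below uses a blocking helper with hypothetical names. ClientRetryPolicy itself is reactive and has a different shape, so this shows the control flow only, not a proposed implementation.

```java
import java.util.function.Supplier;

// Control-flow sketch only; the helper and its parameters are hypothetical.
final class SingleRefreshRetrySketch {
    static class PartitionKeyRangeGoneException extends RuntimeException {}

    /**
     * Runs the operation; on a 410/1002 refreshes the routing map once and retries once.
     * A second failure propagates to the caller unchanged.
     */
    static <T> T executeWithOneRoutingMapRefresh(Supplier<T> operation, Runnable refreshRoutingMap) {
        try {
            return operation.get();
        } catch (PartitionKeyRangeGoneException first) {
            refreshRoutingMap.run();
            return operation.get();
        }
    }

    public static void main(String[] args) {
        final boolean[] routingMapStale = {true};
        String result = executeWithOneRoutingMapRefresh(
            () -> {
                if (routingMapStale[0]) {
                    throw new PartitionKeyRangeGoneException();
                }
                return "item";
            },
            () -> routingMapStale[0] = false);
        System.out.println(result);
    }
}
```

Bounding the recovery to a single refresh and a single retry keeps the added latency predictable and avoids looping if the gateway keeps reporting the range as missing.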

.NET comparison

The .NET v3 SDK's GoneAndRetryWithRequestRetryPolicy only wraps Gone-family exceptions as 503 in its remainingMilliseconds <= 0 branch, so the same gap exists there for PartitionKeyRangeGoneException thrown by the address-resolution call inside the loop. See Azure/azure-cosmos-dotnet-v3#5924 for the .NET-side issue, full incident trace, and code links. The proposed fix is symmetric on both SDKs.

Reproduction signal

In customer diagnostics for a Java SDK trace:

  • Direct-mode store responses with status 410 and the failure originating from address resolution (not a backend replica response)
  • A gateway address-cache refresh shortly before the failure
  • Final surfaced exception: CosmosException with statusCode = 410, subStatusCode = 1002
  • Error message: "PartitionKeyRange with id '<n>' in collection '<rid>' doesn't exist."
  • Total elapsed time well under the gone-policy wall-clock budget

Customer impact

  • Any workload that does high-volume point reads against a collection that has recently undergone a split / merge / migration.
  • Customer-side retry policies wired only for 503 (the conventional Azure "service unavailable, please retry" status) will not retry this failure.
  • Even retry-on-410 policies will see the request fail because the routing map isn't refreshed, so the retry will hit the same stale range and fail the same way until the routing map happens to be refreshed by some other path.

Acceptance criteria

  • Point operation against a pkRange that the gateway no longer lists surfaces as CosmosException with statusCode = 503 and substatus 21002 (or equivalent), not 410 / 1002.
  • After a request hits this code path, the next request observes a fresh routing map (RxPartitionKeyRangeCache reflects the post-split state).
  • New / updated integration or unit test under sdk/cosmos/azure-cosmos/src/test/ simulating an empty address feed for a requested pkRangeId.
  • CHANGELOG entry in sdk/cosmos/azure-cosmos/CHANGELOG.md (Unreleased / Bugs Fixed section).

Metadata

Assignees

No one assigned

    Labels

    Client (This issue points to a problem in the data-plane of the library.), Cosmos
