Direct: Fixes 410/1002 PartitionKeyRangeGone bubbling up to point-read callers without routing-map refresh #5924

## Summary

When a partition split / merge / migration causes a `PartitionKeyRangeGoneException` (410 / 1002) to be thrown from inside `GatewayAddressCache.GetAddressesForRangeIdAsync`, the exception escapes the SDK retry pipeline and surfaces to the caller as a raw **410 / 1002** on point operations. This is inconsistent with the SDK's own contract for the Gone family:

1. **Customer-facing surface should be 503, not 410.** Every other exhaustion path in `GoneAndRetryWithRequestRetryPolicy` wraps the terminal exception as `ServiceUnavailableException` (with `Server_PartitionKeyRangeGoneExceededRetryLimit = 21002`, `Server_GoneExceededRetryLimit = 21005`, etc.). A bare 410 leaking through to user code makes the failure unclassifiable by standard SDK consumers — most retry / circuit-breaker logic at the application layer is wired for 503, not 410.
2. **The routing-map (pk-range) cache is never refreshed.** The address cache for the (now-stale) `partitionKeyRangeId` is the only cache that gets refreshed on this codepath, leaving the routing map stale until *another* request happens to walk into `AddressResolver.HandleRangeAddressResolutionFailure`. This produces a cluster of failures right after a split until enough traffic has trickled through to converge the routing map.

The result for a customer is a hard, non-retryable-looking 410 on a point read, with a stale routing map left behind for the next caller to discover.

## Real incident trace

Diagnostics summary surfaced to the app:

```json
"Summary": {
  "DirectCalls":  { "(410, 21005)": 3 },
  "GatewayCalls": { "(200, 0)": 1 }
},
"Point Operation Statistics": {
  "StatusCode": 410,
  "SubStatusCode": 1002,
  "ErrorMessage": "PartitionKeyRange with id '4874' in collection 'some_rid' doesn't exist."
}
```

## Sequence of events (verified against `GatewayAddressCache.cs` and `GoneAndRetryWithRequestRetryPolicy.cs` on `main` @ `0ca35ae`)

1. App issues `ReadItemAsync` against pk that routes to pkRange 4874.
2. `AddressResolver.ResolveAsync` finds pkRange 4874 in its cached routing map (it's stale but still present) and hands it to `StoreClient`.
3. `ReplicatedResourceClient` → `GoneAndRetryWithRequestRetryPolicy` enters the retry loop with a 30 s wall-clock budget.
4. **Attempt 1** — RNTBD to replica `IN_27` → backend returns 410 with substatus 0 (E_REPLICA_RECONFIGURATION_PENDING). `TransportClient` stamps it 21005. Policy sets `ForceRefreshPartitionAddresses = true`, retries.
5. **Attempt 2**  — RNTBD to replica `IN_28` → same response, stamped 21005 again.
6. **Attempt 3**  — RNTBD to replica `IN_135` → same response, stamped 21005 again.
7. In parallel, the address-cache refresh hits the gateway (`GatewayCalls: (200, 0): 1`, ~17 ms). The gateway's address feed comes back **without** an entry for pkRangeId 4874 (it's been split away).
8. [`GatewayAddressCache.GetAddressesForRangeIdAsync`, lines 661-669](https://github.com/Azure/azure-cosmos-dotnet-v3/blob/0ca35ae9d4882a63180e1e927344d068bf9ba702/Microsoft.Azure.Cosmos/src/Routing/GatewayAddressCache.cs#L657-L670) throws `PartitionKeyRangeGoneException` (the "doesn't exist" message).
9. This 1002 exception escapes `TryGetAddressesAsync` *inside* the retry loop's call to fetch new addresses — it is **not** a response from a backend replica, it's thrown synchronously by the address-resolution call itself.
10. Back in `GoneAndRetryWithRequestRetryPolicy.ShouldRetryAsyncInternal`, the policy is just about to start iteration 4. It would normally evaluate `remainingMilliseconds <= 0` and convert the exception to 503 — but only **19 ms** has elapsed and the budget is 30 s, so that branch is not taken.
11. The `PartitionKeyRangeGoneException` propagates up unchanged through `ClientRetryPolicy` (no 410/1002 branch), `NamedCacheRetryHandler` (only handles 410/1000), and surfaces to the caller as **410 / 1002**.
12. **The routing-map cache is never refreshed** on this entire codepath — `ForceCollectionRoutingMapRefresh` is never set anywhere in the gone-policy or anywhere downstream of it. The next request to the same logical pk will *also* resolve to (still-stale) pkRange 4874 from the cached routing map, and will go through the same dance.

## Root cause

`GoneAndRetryWithRequestRetryPolicy` only wraps the exception as 503 in the `remainingMilliseconds <= 0` branch (see [GoneAndRetryWithRequestRetryPolicy.cs:218-310 in `bluebird`](https://github.com/Azure/Cosmos/blob/main/Product/Microsoft.Azure.Documents/SharedFiles/GoneAndRetryWithRequestRetryPolicy.cs#L218-L310)):

```csharp
if (this.attemptCount++ > 1)
{
    if (remainingMilliseconds <= 0)           // only here
    {
        if (IsBaseGone(...) || IsPartitionKeyRangeGone(...) || ...)
        {
            exceptionToThrow = ServiceUnavailableException.Create(
                exceptionSubStatus, innerException: exception);
        }
    }
}
```

When `PartitionKeyRangeGoneException` is thrown by **address resolution itself** (not by a replica response), the policy never gets a chance to run the wrap path because the exception is raised inside the call that fetches the next batch of addresses, before the time-budget check fires for the next attempt.

Additionally, neither the gone-policy retry path nor the throw site at `GatewayAddressCache.cs:669` sets `ForceCollectionRoutingMapRefresh = true`, so the (now demonstrably stale) routing map stays cached.

## Proposed behavior changes

### 1. Wrap escaping `PartitionKeyRangeGoneException` from address resolution as 503

When `TryGetAddressesAsync` (or the equivalent direct path) throws `PartitionKeyRangeGoneException`, `GoneAndRetryWithRequestRetryPolicy` (or a thin wrapper) should convert it to `ServiceUnavailableException.Create(SubStatusCodes.Server_PartitionKeyRangeGoneExceededRetryLimit /* 21002 */, innerException: …)` before surfacing. The existing `GetExceptionSubStatusForGoneRetryPolicy` mapping already gives us the right substatus.

This makes the surfaced status code align with the rest of the gone-family terminal paths and lets standard client-side 503-retry / circuit-breaker logic kick in.
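A minimal sketch of the wrap described above, for illustration only. The two exception classes are stand-ins defined in the snippet (the SDK's real types are not used here), `ResolveOrWrapAsync` is a hypothetical helper name, and the only value taken from this issue is the 21002 substatus.

```csharp
using System;
using System.Threading.Tasks;

// Stand-in for PartitionKeyRangeGoneException (410 / 1002); shape is an assumption.
public sealed class FakePartitionKeyRangeGoneException : Exception
{
    public FakePartitionKeyRangeGoneException(string message) : base(message) { }
}

// Stand-in for ServiceUnavailableException (503); shape is an assumption.
public sealed class FakeServiceUnavailableException : Exception
{
    public int StatusCode => 503;
    public int SubStatusCode { get; }

    public FakeServiceUnavailableException(int subStatusCode, Exception inner)
        : base("Service unavailable", inner)
    {
        this.SubStatusCode = subStatusCode;
    }
}

public static class AddressResolutionWrapper
{
    // Value quoted in this issue for Server_PartitionKeyRangeGoneExceededRetryLimit.
    public const int PartitionKeyRangeGoneExceededRetryLimit = 21002;

    // Hypothetical helper: runs an address-resolution delegate and converts an
    // escaping 410/1002 into a 503/21002, keeping the original as InnerException.
    public static async Task<T> ResolveOrWrapAsync<T>(Func<Task<T>> resolveAddresses)
    {
        try
        {
            return await resolveAddresses();
        }
        catch (FakePartitionKeyRangeGoneException ex)
        {
            throw new FakeServiceUnavailableException(
                PartitionKeyRangeGoneExceededRetryLimit, ex);
        }
    }
}
```

Keeping the original exception as the inner exception preserves the "doesn't exist" message for diagnostics while the caller sees a 503.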

### 2. Trigger a routing-map (pk-range) refresh when a 1002 is detected anywhere in the path

When `GatewayAddressCache.GetAddressesForRangeIdAsync` discovers an empty feed for the requested pkRangeId, it should also flag the routing-map cache for refresh — either by setting `request.ForceCollectionRoutingMapRefresh = true` before the throw, or by directly calling into `PartitionKeyRangeCache.TryLookupAsync(..., forceRefreshCollectionRoutingMap: true, previousValue: <current map>)`.

Today, `AddressResolver.HandleRangeAddressResolutionFailure` is the only place that performs this refresh, and it's only entered when the cached routing map doesn't even contain a matching range. In the split-recently-completed scenario, the cached routing map still contains the range that the gateway no longer lists, so this safety net never fires.
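The first option (set the flag before the throw) could look roughly like the sketch below. Everything here is a stand-in: `StubRequest`, `StubPkRangeGoneException`, and `AddressFeedLookup` are hypothetical types, and the address feed is modeled as a plain dictionary rather than the real gateway response. Only the `ForceCollectionRoutingMapRefresh` property name and the error message format come from this issue.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the SDK request object; only the flag named in this issue is modeled.
public sealed class StubRequest
{
    public bool ForceCollectionRoutingMapRefresh { get; set; }
}

// Stand-in for PartitionKeyRangeGoneException (410 / 1002).
public sealed class StubPkRangeGoneException : Exception
{
    public StubPkRangeGoneException(string message) : base(message) { }
}

public static class AddressFeedLookup
{
    // Hypothetical simplification of the lookup: the feed maps pkRangeId -> replica addresses.
    public static string[] GetAddressesForRangeId(
        StubRequest request,
        IReadOnlyDictionary<string, string[]> addressFeed,
        string collectionRid,
        string partitionKeyRangeId)
    {
        if (addressFeed.TryGetValue(partitionKeyRangeId, out string[] addresses))
        {
            return addresses;
        }

        // The feed has no entry for this range, so the cached routing map is stale.
        // Flag it for refresh before throwing so the next resolution rebuilds it.
        request.ForceCollectionRoutingMapRefresh = true;

        throw new StubPkRangeGoneException(
            $"PartitionKeyRange with id '{partitionKeyRangeId}' in collection '{collectionRid}' doesn't exist.");
    }
}
```

Setting the flag at the throw site keeps the refresh decision next to the evidence of staleness, so callers that catch the exception do not each need to remember to set it.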

### 3. Optional — special-case `PartitionKeyRangeGoneException` higher in the client retry pipeline

`ClientRetryPolicy` could grow a 410/1002 branch that triggers one in-line routing-map refresh + one retry (mirroring `PartitionKeyRangeGoneRetryPolicy.cs`, which is currently wired only into v2 query / change feed). This would let point reads actually recover within the same request instead of failing the first request and recovering on the second.
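As a rough illustration of "one in-line refresh + one retry", the sketch below uses a hypothetical helper, `OneShotRoutingMapRetry`, with the operation and the refresh passed in as delegates. It does not reproduce `ClientRetryPolicy` or `PartitionKeyRangeGoneRetryPolicy`; `DemoPkRangeGoneException` is a stand-in for the 410/1002 exception.

```csharp
using System;
using System.Threading.Tasks;

// Stand-in for PartitionKeyRangeGoneException (410 / 1002).
public sealed class DemoPkRangeGoneException : Exception
{
    public DemoPkRangeGoneException(string message) : base(message) { }
}

public static class OneShotRoutingMapRetry
{
    // Hypothetical helper: on the first 410/1002, refresh the routing map once
    // and retry the operation once. A second failure propagates to the caller.
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,
        Func<Task> refreshRoutingMap)
    {
        try
        {
            return await operation();
        }
        catch (DemoPkRangeGoneException)
        {
            await refreshRoutingMap();
            return await operation();
        }
    }
}
```

Capping the retry at one attempt bounds the extra latency to a single routing-map refresh and avoids looping if the refreshed map is still inconsistent with the address feed.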

## Java comparison

Java's `GoneAndRetryWithRetryPolicy.GoneRetryPolicy.isNonRetryableException` similarly excludes `PartitionKeyRangeGoneException` — i.e., 410/1002 also bubbles up at the gone-policy in the Java SDK. The Java policy only sets `forcePartitionKeyRangeRefresh = true` for typed 1007 `PartitionKeyRangeIsSplittingException`, not for bare Gone or PartitionKeyRangeGone. So this fix would be worth coordinating across SDKs (track Java separately), but the .NET-side fix is independent.

## Reproduction signal

Look in customer diagnostics for:

- `Summary.DirectCalls` containing `(410, 21005)` ≥ 1
- `Summary.GatewayCalls` containing exactly one `(200, 0)` (the address-cache refresh)
- `PointOperationStatistics.StatusCode = 410`, `SubStatusCode = 1002`
- Total elapsed time well under 30 s
- Error message: `"PartitionKeyRange with id '<n>' in collection '<rid>' doesn't exist."`

That combination uniquely identifies the address-cache-says-gone, routing-map-still-stale code path.
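For triage, the count and status-code signals listed above could be checked mechanically. The sketch below is an assumption-heavy example: it expects the excerpt from "Real incident trace" wrapped in outer braces as a single JSON object, with `Summary` and `Point Operation Statistics` as top-level properties, which may not match the nesting of real diagnostics output. `StaleRoutingMapSignal` is a hypothetical name, and the elapsed-time and error-message checks are left out.

```csharp
using System;
using System.Text.Json;

public static class StaleRoutingMapSignal
{
    // Returns true when the document shows >= 1 direct (410, 21005) call,
    // exactly one gateway (200, 0) call, and a surfaced 410 / 1002.
    // Assumes the properties are JSON objects / integers as in the trace excerpt;
    // other shapes will throw rather than return false.
    public static bool Matches(string diagnosticsJson)
    {
        using JsonDocument doc = JsonDocument.Parse(diagnosticsJson);
        JsonElement root = doc.RootElement;

        if (!root.TryGetProperty("Summary", out JsonElement summary) ||
            !root.TryGetProperty("Point Operation Statistics", out JsonElement point))
        {
            return false;
        }

        bool directGone =
            summary.TryGetProperty("DirectCalls", out JsonElement direct)
            && direct.TryGetProperty("(410, 21005)", out JsonElement goneCount)
            && goneCount.GetInt32() >= 1;

        bool oneGatewayOk =
            summary.TryGetProperty("GatewayCalls", out JsonElement gateway)
            && gateway.TryGetProperty("(200, 0)", out JsonElement okCount)
            && okCount.GetInt32() == 1;

        bool surfaced1002 =
            point.TryGetProperty("StatusCode", out JsonElement status)
            && status.GetInt32() == 410
            && point.TryGetProperty("SubStatusCode", out JsonElement subStatus)
            && subStatus.GetInt32() == 1002;

        return directGone && oneGatewayOk && surfaced1002;
    }
}
```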

## Affected versions

Verified present on `main` (commit `0ca35ae9`) as of filing. Behavior is the same in 3.60.0P (the version involved in the incident) and is unchanged in every release since. No mitigating PR has been identified since 3.60.0.


## Acceptance criteria

- [ ] Point-read against a pkRange that the gateway no longer lists surfaces as `503` (with substatus `21002` or equivalent), not `410 / 1002`.
- [ ] After a request hits this code path, the next request observes a fresh routing map (`PartitionKeyRangeCache` reflects the post-split state).
- [ ] New / updated integration test under `Microsoft.Azure.Cosmos.EmulatorTests` simulating an empty address feed for a requested pkRangeId (mockable via `IAddressResolver` or test gateway).
- [ ] Add a `### Unreleased` `Bugs Fixed` entry to `changelog.md`.
