[Internal] PPAF: Adds Hub Region Caching Per Partition Level#5648

Draft
aavasthy wants to merge 48 commits into main from users/aavasthy/hubregioncaching

@aavasthy (Contributor) commented Mar 3, 2026

Description

This PR introduces per-partition hub region caching for the Azure Cosmos DB .NET SDK v3. On single-master accounts encountering repeated 404/1002 (ReadSessionNotAvailable) errors, the SDK now discovers the hub region via a 403/3 (WriteForbidden) discovery chain and caches the result. Subsequent requests to the same partition route directly to the cached hub, eliminating redundant discovery round-trips.

Follow-up of: PR #5447
Branch: users/aavasthy/hubregioncaching
Type: New feature (non-breaking)
Lines changed: +1,966 / −363


Problem

| Issue | Impact |
| --- | --- |
| After 2× 404/1002 on a single-master account, the SDK gave up and returned the error to the caller | Read requests failed unnecessarily |
| No hub region discovery existed for non-PPAF accounts | Non-PPAF single-master accounts had no recovery mechanism |
| Every request repeated the full hub discovery chain, with no caching | Unnecessary latency on every request hitting the same partition |

Solution Summary

  1. After 2 consecutive 404/1002 errors on a single-master account, set the x-ms-cosmos-hub-region-processing-only header.
  2. Non-hub regions receiving this header return 403/3 (WriteForbidden) — the SDK retries to the next region (the 403/3 discovery chain).
  3. Once the hub region responds with 200 OK, the SDK caches the hub region URI for that partition in PartitionKeyRangeToLocationForWrite.
  4. Future requests for the same partition route directly to the cached hub, skipping the discovery chain entirely.
  5. Works for both PPAF and non-PPAF accounts by reusing the existing partition-level failover infrastructure.
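
The steps above can be sketched as a small, self-contained simulation. This is only an illustration under stated assumptions, not the SDK's implementation: `HubRegionCacheSketch`, `ResolveHub`, and the `probe` delegate are hypothetical names, and `probe` stands in for sending a request that carries the hub header and returning the resulting status code (200 from the hub, 403 from any other region).

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical stand-in for the partition-level cache described above.
// It is NOT the SDK's GlobalPartitionEndpointManagerCore.
public sealed class HubRegionCacheSketch
{
    // Partition key range id -> hub region URI discovered for that partition.
    private readonly ConcurrentDictionary<string, Uri> hubByPartition =
        new ConcurrentDictionary<string, Uri>();

    // Resolves the hub region for one partition.
    // probe(region) simulates a request sent with the hub header and returns
    // the status code: 200 when the region is the hub, 403 otherwise.
    // Returns the hub URI, or null when no region answered with 200.
    public Uri ResolveHub(
        string partitionKeyRangeId,
        IReadOnlyList<Uri> regionsInOrder,
        Func<Uri, int> probe,
        out int attempts)
    {
        attempts = 0;

        // Warm path: an earlier request already discovered the hub.
        if (this.hubByPartition.TryGetValue(partitionKeyRangeId, out Uri cached))
        {
            attempts++;
            if (probe(cached) == 200)
            {
                return cached;
            }

            // The cached region is no longer the hub: drop the stale entry
            // and fall back to the discovery chain below.
            this.hubByPartition.TryRemove(partitionKeyRangeId, out _);
        }

        // Cold path: try regions in order until one answers 200.
        foreach (Uri region in regionsInOrder)
        {
            attempts++;
            if (probe(region) == 200)
            {
                // Cache only on success, keyed by partition.
                this.hubByPartition[partitionKeyRangeId] = region;
                return region;
            }
        }

        return null;
    }
}
```

With three regions where only the last one is the hub, the first call for a partition makes three attempts and the second call makes one. A different partition starts cold again, because the cache is keyed per partition.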

Existing vs. New Behavior

| Aspect | Before (Existing) | After (New) |
| --- | --- | --- |
| Max retries on 404/1002 (single-master) | 2, then fail | 2 + hub discovery chain |
| Hub header trigger | After 1st 404/1002 | After 2nd 404/1002 |
| 403/3 on read path with hub header | Not handled | Retries to continue discovery |
| Partition-level cache on warm path | Only PPAF accounts | Both PPAF and non-PPAF |
| Non-PPAF cache routing | Not available | Available via checkHubRegionOverrideInCache |
| GatewayStoreModel PKRange resolution | Only when PPAF enabled | Also when hub header is present |

End-to-End Flow

Cold Cache — First-Time Hub Region Discovery

Read Request (Partition P1)
│
├─ Attempt 1: Route to preferred read region (e.g., East US)
│  └─ Response: 404/1002 (ReadSessionNotAvailable)
│     → sessionTokenRetryCount = 1
│     → ShouldRetryOnSessionNotAvailable retries to write region
│
├─ Attempt 2: Route to write region (e.g., West US)
│  └─ Response: 404/1002 (ReadSessionNotAvailable)
│     → sessionTokenRetryCount ≥ 1, single-master
│     → Set addHubRegionProcessingOnlyHeader = true
│     → Check cache → MISS (cold cache)
│     → Fall back to region cycling
│
├─ Attempt 3: Route to write regions in-order WITH hub header
│  └─ Response: 403/3 (WriteForbidden) — this region is not the hub
│     → TryMarkEndpointUnavailableForPartitionKeyRange → advances failover cache
│     → Retry to next region
│
├─ Attempt N: Route to hub region WITH hub header
│  └─ Response: 200 OK
│     → Cache populated: PartitionKeyRangeToLocationForWrite[P1] = hub URI

Warm Cache — Subsequent Request

Read Request (Partition P1)
│
├─ Attempt 1: Route to preferred read region (normal ResolveServiceEndpoint)
│  └─ Response: 404/1002
│     → sessionTokenRetryCount = 1
│
├─ Attempt 2: Route to write region
│  └─ Response: 404/1002
│     → addHubRegionProcessingOnlyHeader = true
│     → Check cache → HIT! (warm cache)
│     → TryAddPartitionLevelLocationOverride routes to cached hub
│
├─ Attempt 3: Route directly to cached hub region WITH hub header
│  └─ Response: 200 OK (no 403/3 chain needed)
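
The branching in the cold and warm flows above can be condensed into one decision function. The sketch below is an assumption-laden illustration, not code from `ClientRetryPolicy`: the type and parameter names are hypothetical, and `sessionNotAvailableCount` simply counts the 404/1002 responses received so far, including the current one. It does not mirror how the real `sessionTokenRetryCount` is compared.

```csharp
using System;

// Hypothetical outcomes of a 404/1002 retry decision.
public enum NextRouteSketch
{
    OriginalRetryBehavior, // e.g. retry to the write region, as before this PR
    CachedHubRegion,       // warm path: cache hit for this partition
    RegionCycling          // cold path: 403/3 discovery chain
}

public static class SessionNotAvailableDecisionSketch
{
    // sessionNotAvailableCount: 404/1002 responses seen so far for this
    // request, including the current one (an assumption of this sketch).
    public static NextRouteSketch Decide(
        int sessionNotAvailableCount,
        bool isSingleMaster,
        bool isHubProcessingEnabled,
        bool isHubCachedForPartition,
        out bool addHubHeader)
    {
        // The hub header is only added after the 2nd 404/1002, on
        // single-master accounts, with the feature enabled.
        addHubHeader = sessionNotAvailableCount >= 2
            && isSingleMaster
            && isHubProcessingEnabled;

        if (!addHubHeader)
        {
            return NextRouteSketch.OriginalRetryBehavior;
        }

        // With the header set, a cache hit skips the discovery chain.
        return isHubCachedForPartition
            ? NextRouteSketch.CachedHubRegion
            : NextRouteSketch.RegionCycling;
    }
}
```

Keeping the cache lookup behind the header flag is what makes a first attempt always start from the preferred region, even when the cache is warm.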

Sequence Diagram — Cold Cache (First Discovery)

sequenceDiagram
    participant App as Application
    participant CRP as ClientRetryPolicy
    participant GW as GatewayStoreModel
    participant GPEM as GlobalPartitionEndpoint<br/>ManagerCore
    participant R1 as Preferred Region<br/>(East US)
    participant R2 as Write Region<br/>(West US)
    participant R3 as Non-Hub Region<br/>(North EU)
    participant Hub as Hub Region<br/>(Central US)

    App->>CRP: ReadItemAsync(P1)
    CRP->>GW: OnBeforeSendRequest()
    GW->>R1: GET /P1 (no hub header)
    R1-->>GW: 404/1002 (ReadSessionNotAvailable)
    GW-->>CRP: ShouldRetryOnSessionNotAvailable()
    Note over CRP: sessionTokenRetryCount = 1<br/>Single-master → retry to write region

    CRP->>GW: OnBeforeSendRequest()
    GW->>R2: GET /P1 (no hub header)
    R2-->>GW: 404/1002 (ReadSessionNotAvailable)
    GW-->>CRP: ShouldRetryOnSessionNotAvailable()
    Note over CRP: sessionTokenRetryCount > 1<br/>Set addHubRegionProcessingOnlyHeader = true<br/>Check cache → MISS

    CRP->>GW: OnBeforeSendRequest() [hub header set]
    GW->>GPEM: IsHubRegionRoutingActive() → true
    GPEM-->>GW: Resolve PKRange, TryAddPartitionLevelLocationOverride → no cache
    GW->>R3: GET /P1 + x-ms-cosmos-hub-region-processing-only
    R3-->>GW: 403/3 (WriteForbidden)
    GW-->>CRP: ShouldRetryInternal(403/3)
    CRP->>GPEM: TryMarkEndpointUnavailableForPartitionKeyRange()
    Note over GPEM: Cache advanced:<br/>P1 → next region

    CRP->>GW: OnBeforeSendRequest() [hub header set]
    GW->>GPEM: IsHubRegionRoutingActive() → true
    GPEM-->>GW: TryAddPartitionLevelLocationOverride → route to Hub
    GW->>Hub: GET /P1 + x-ms-cosmos-hub-region-processing-only
    Hub-->>GW: 200 OK
    GW-->>CRP: Success
    Note over GPEM: Cache populated:<br/>P1 → Hub URI

    CRP-->>App: 200 OK

Sequence Diagram — Warm Cache (Subsequent Request)

sequenceDiagram
    participant App as Application
    participant CRP as ClientRetryPolicy
    participant GW as GatewayStoreModel
    participant GPEM as GlobalPartitionEndpoint<br/>ManagerCore
    participant R1 as Preferred Region<br/>(East US)
    participant R2 as Write Region<br/>(West US)
    participant Hub as Hub Region<br/>(Central US)

    App->>CRP: ReadItemAsync(P1)
    CRP->>GW: OnBeforeSendRequest()
    GW->>R1: GET /P1 (no hub header)
    R1-->>GW: 404/1002 (ReadSessionNotAvailable)
    GW-->>CRP: ShouldRetryOnSessionNotAvailable()
    Note over CRP: sessionTokenRetryCount = 1<br/>Single-master → retry to write region

    CRP->>GW: OnBeforeSendRequest()
    GW->>R2: GET /P1 (no hub header)
    R2-->>GW: 404/1002 (ReadSessionNotAvailable)
    GW-->>CRP: ShouldRetryOnSessionNotAvailable()
    Note over CRP: sessionTokenRetryCount > 1<br/>addHubRegionProcessingOnlyHeader = true<br/>TryAddPartitionLevelLocationOverride<br/>→ CACHE HIT!

    CRP->>GW: OnBeforeSendRequest() [hub header + cache override]
    GW->>GPEM: IsHubRegionRoutingActive() → true
    GPEM-->>GW: TryAddPartitionLevelLocationOverride → route to cached Hub
    GW->>Hub: GET /P1 + x-ms-cosmos-hub-region-processing-only
    Hub-->>GW: 200 OK
    GW-->>CRP: Success
    CRP-->>App: 200 OK

Component Architecture

flowchart TB
    subgraph "Retry Layer"
        CRP["ClientRetryPolicy"]
        CRP -->|"OnBeforeSendRequest()"| HDR["Set x-ms-cosmos-hub-region-<br/>processing-only header"]
        CRP -->|"ShouldRetryOnSessionNotAvailable()"| DEC{"sessionTokenRetryCount > 1<br/>+ single-master<br/>+ hub processing enabled?"}
        DEC -->|Yes| FLAG["addHubRegionProcessingOnlyHeader = true"]
        FLAG --> CACHE_CHECK{"Cache hit?<br/>TryAddPartitionLevel<br/>LocationOverride()"}
        CACHE_CHECK -->|Yes - Warm Path| DIRECT["Route to cached hub"]
        CACHE_CHECK -->|No - Cold Path| CYCLE["Region cycling<br/>(403/3 discovery)"]
        DEC -->|No| ORIG["Original retry behavior<br/>(RetryOnSessionNotAvailableRouteToWriteRegion)"]
    end

    subgraph "Gateway Layer"
        GW["GatewayStoreModel"]
        GW -->|"IsHubRegionRoutingActive()"| PKRANGE["Resolve PKRange"]
        PKRANGE --> OVERRIDE["TryAddPartitionLevel<br/>LocationOverride()"]
    end

    subgraph "Partition Cache Layer"
        GPEM["GlobalPartitionEndpoint<br/>ManagerCore"]
        GPEM --> WRITE_CACHE["PartitionKeyRangeToLocationForWrite<br/>(ConcurrentDictionary)"]
        GPEM -->|"TryMarkEndpoint<br/>Unavailable()"| ADVANCE["Advance cache to<br/>next region"]
        GPEM -->|"IsHubRegionRoutingActive()"| CHECK["Check hub header<br/>+ hub property"]
        GPEM -->|"IsRequestEligibleFor<br/>PartitionOrHubRegionFailover()"| GATE["Gate: PPAF OR circuit breaker<br/>OR checkHubRegionOverrideInCache"]
    end

    CRP --> GW
    GW --> GPEM
    DIRECT --> GW
    CYCLE --> GW

Key Files Changed

| File | Change Summary |
| --- | --- |
| ClientRetryPolicy.cs | Core retry logic: hub header trigger after 2× 404/1002, 403/3 handling on read path, cache check via TryAddPartitionLevelLocationOverride, feature flag gate |
| GlobalPartitionEndpointManagerCore.cs | Cache storage (PartitionKeyRangeToLocationForWrite), IsHubRegionRoutingActive(), IsHubRegionHeaderPresentInRequest(), IsHubRegionPropertyPresentInRequest(), eligibility gate update, IsHubRegionProcessingEnabled() |
| GlobalPartitionEndpointManager.cs | Abstract interface additions: HubRegionOverridePresentInCache constant, IsHubRegionProcessingEnabled() abstract method, checkHubRegionOverrideInCache parameter |
| GlobalPartitionEndpointManagerNoOp.cs | NoOp implementation of IsHubRegionProcessingEnabled() |
| GatewayStoreModel.cs | PKRange resolution and TryAddPartitionLevelLocationOverride call extended for IsHubRegionRoutingActive (enables non-PPAF accounts) |
| ConfigurationManager.cs | IsHubRegionProcessingEnabled(): reads environment variable once at initialization |
| ClientRetryPolicyTests.cs | Unit tests for hub header injection, full caching flow, cache isolation |
| LocationCacheTests.cs | Retry count updated from 2 to 3 attempts for hub header scenarios |
| CosmosItemIntegrationTests.cs | Integration tests: hub header injection via HTTP interception, full 403/3 flow |

Feature Flag

The entire hub region processing feature is gated by:

  • Preprocessor guard: #if !INTERNAL — internal builds retain the existing 2-retry behavior
  • Runtime flag: ConfigurationManager.IsHubRegionProcessingEnabled() — read once at GlobalPartitionEndpointManagerCore initialization from an environment variable
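
A minimal sketch of the read-once pattern behind the runtime flag follows. The PR text does not name the environment variable, so `EXAMPLE_HUB_REGION_PROCESSING_ENABLED` is a purely hypothetical placeholder, as are the class name and the default of `false` when the variable is unset or unparsable.

```csharp
using System;

// Hypothetical illustration of reading a feature flag once at construction.
public sealed class HubRegionFeatureFlagSketch
{
    // Placeholder name: the real variable name is not given in this PR text.
    public const string HypotheticalVariableName =
        "EXAMPLE_HUB_REGION_PROCESSING_ENABLED";

    private readonly bool isEnabled;

    public HubRegionFeatureFlagSketch()
    {
        // Single environment read; later calls only return the stored value.
        string raw = Environment.GetEnvironmentVariable(HypotheticalVariableName);

        // bool.TryParse returns false for null or unparsable input, so the
        // flag defaults to disabled (an assumption of this sketch).
        this.isEnabled = bool.TryParse(raw, out bool parsed) && parsed;
    }

    // No I/O per request: the value was captured at initialization.
    public bool IsHubRegionProcessingEnabled() => this.isEnabled;
}
```

A consequence of this design is that changing the variable after initialization has no effect on an existing instance.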

Review Highlights from PR Discussion

| Reviewer | Topic | Resolution |
| --- | --- | --- |
| @kundadebdatta | Caching should only happen on successful hub discovery (200), not preemptively on 403/3 | ❌ Not in scope for now. This requires broader discussion and significant design changes |
| @kundadebdatta | Subsequent reads should NOT route directly to hub on first attempt; should still start from preferred region | ✅ Fixed: warm cache routes to hub only after 2× 404/1002 (not on 1st attempt) |
| @kundadebdatta | ShouldRetryOnEndpointFailureAsync marks read endpoints unavailable, an unintended side effect | ✅ Fixed: passes overwriteEndpointDiscovery: true to skip endpoint marking |
| @kundadebdatta | Feature flag reads env var on every request (I/O per request) | ✅ Refactored: read once at initialization |
| @kundadebdatta | Code duplication between #if !INTERNAL and #else blocks | ✅ Modularized into RetryOnSessionNotAvailableRouteToWriteRegion() |
| @kirankumarkolli | Hub caching logic for PPAF vs hub region should be combined since identical code | ✅ Combined in TryAddPartitionLevelLocationOverride |
| @kirankumarkolli | Header value should be checked (not just presence) | ✅ Updated to case-insensitive "True" comparison |
| @NaluTripician | sessionTokenRetryCount not incremented in hub bypass path | ✅ Addressed: counter check is >= 1, and after hub header the code path diverges (403/3 or 200), capped by failoverRetryCount |
| @jeet1995 | End-to-end tests? | ✅ Added e2e tests in CosmosItemIntegrationTests.cs. PPAF drill will also serve as E2E validation |

Closing issues

To automatically close an issue: closes #IssueNumber

@NaluTripician (Contributor) left a comment

One question, other than that LGTM

Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs Outdated
@tvaron3 (Member) left a comment

PR Deep Review — Per Partition Hub Region Caching

Overall: The caching architecture is clean and the feature intent is solid. However, there is a correctness issue with sessionTokenRetryCount not being incremented when bypassing ShouldRetryOnSessionNotAvailable (reinforces @NaluTripician's existing comment), and the hub cache has no eviction mechanism which could lead to unbounded growth.

Existing comments: 1 comment from @NaluTripician about sessionTokenRetryCount — overlaps with finding #1.

| # | Severity | Summary |
| --- | --- | --- |
| 1 | 🔴 Blocking | sessionTokenRetryCount not incremented; potential unbounded retry |
| 2 | 🟡 | Unbounded cache growth: no eviction or TTL |
| 3 | 🟡 | Stale cache entries after partition splits |
| 4 | 🟡 | documentServiceRequest may be null |
| 5 | 🟡 | Concrete type check breaks retry handler abstraction |
| 6 | 🟢 | Cache lookup on every request (minor perf note) |

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Routing/GlobalPartitionEndpointManagerCore.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Routing/GlobalPartitionEndpointManagerCore.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Handler/AbstractRetryHandler.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Handler/AbstractRetryHandler.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs Outdated
tvaron3 previously approved these changes Mar 5, 2026
@kirankumarkolli (Member) commented

Synced up offline: write-location == Hub for SM (single-master), which is the scope right now.

Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Routing/GlobalPartitionEndpointManager.cs Outdated
/// For single master accounts, the write region is the hub region, so this
/// stores the hub URI in the existing write-location cache.
/// </summary>
public abstract void CacheDiscoveredHubRegionForPartition(PartitionKeyRange partitionKeyRange, Uri hubRegion, string collectionRid);
Member

Clarification: what is the expected behavior for non-PPAF accounts?

Contributor Author

A non-PPAF single-master account gets: 404/1002 × 2 → hub header set → region cycling via the normal retry on 403/3 (these retries already work today) → next region info taken from AccountProperties → eventually a 200 response → location cache updated.

Comment thread Microsoft.Azure.Cosmos/src/Handler/AbstractRetryHandler.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Handler/AbstractRetryHandler.cs Outdated
return;
}
else

Member

Isn't an else needed? Without it, the RouteToHub work will be overridden by the index-based flow.

Member

Probably relying on the code below to reset the index to zero. I think it's better not to take a dependency on that behavior.

Contributor Author

Updated the code logic: fixed this by adding an explicit return after RouteToLocation(hubUri) and setting this.locationEndpoint. Chose an explicit return over else because, with else, ResolveServiceEndpoint further down would still override the hub routing. The return exits the method entirely since routing is fully resolved at that point.

// a previous request already discovered the hub region for this partition.
// If cached, route directly there (skipping the 403/3 discovery chain).
// Normal first-attempt requests never enter this block (addHubRegionProcessingOnlyHeader = false).
if (!this.canUseMultipleWriteLocations && this.addHubRegionProcessingOnlyHeader)
Member

This can be based purely on addHubRegionProcessingOnlyHeader.
The decision to set addHubRegionProcessingOnlyHeader can itself take MM (multi-master) into account.

Contributor Author

But we don't have support for the HubRegion header in multi-master.

{
request.RequestContext.RouteToLocation(this.globalEndpointManager.GetHubUri());

this.locationEndpoint = request.RequestContext.LocationEndpointToRoute;
Member

this.locationEndpoint is getting updated for MM as well. What are the side effects of that?

Contributor Author

This comment has been addressed already. The updated code always sets this.locationEndpoint by ResolveServiceEndpoint at the end of the method for all paths including multimaster. The hub caching early return is in a separate block guarded by addHubRegionProcessingOnlyHeader, which is only set for single-master, so multimaster never reaches it.

/// </summary>
private static bool IsHubRegionRoutingActive(DocumentServiceRequest request)
{
return request?.Headers?[HttpConstants.HttpHeaders.ShouldProcessOnlyInHubRegion] != null;
Member

How about checking the value as well, to make the check more bounded?

Contributor Author

Updated IsHubRegionRoutingActive to check the value equals "True" (case-insensitive) instead of just checking for header presence.
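
For illustration, a value-aware check along these lines could look like the sketch below. It is not the SDK method: it works on a plain dictionary instead of `DocumentServiceRequest`, the class name is hypothetical, and the header name is the one quoted in the PR description.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified version of a header check that validates the
// value instead of only testing for presence.
public static class HubHeaderCheckSketch
{
    // Header name as quoted in the PR description.
    public const string HubHeaderName = "x-ms-cosmos-hub-region-processing-only";

    public static bool IsHubRegionRoutingActive(
        IReadOnlyDictionary<string, string> headers)
    {
        if (headers == null)
        {
            return false;
        }

        // Active only when the header exists AND its value equals "True",
        // compared case-insensitively. Header-name matching follows the
        // comparer of the dictionary that the caller passes in.
        return headers.TryGetValue(HubHeaderName, out string value)
            && string.Equals(value, bool.TrueString, StringComparison.OrdinalIgnoreCase);
    }
}
```

Compared with a presence-only check, a header carrying "False" or an empty value no longer activates hub routing.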

request,
this.PartitionKeyRangeToLocationForWrite);
}
else if (GlobalPartitionEndpointManagerCore.IsHubRegionRoutingActive(request))
Member

How about combining this if clause with the one above?

Contributor Author

Combined the PPAF and hub region branches in TryAddPartitionLevelLocationOverride, since they had identical code.

request,
this.PartitionKeyRangeToLocationForWrite);
}
else if (GlobalPartitionEndpointManagerCore.IsHubRegionRoutingActive(request))
Member

How different is it from the if block above?
If they are the same, how about just inlining the if condition into the existing one?

Contributor Author

They differ in behavior. The PPAF block calls TryAddOrUpdatePartitionFailoverInfoAndMoveToNextLocation, which caches the failed location and picks the next region from the endpoint list as the failover target, then returns true so the retry routes to that cached location. The hub block does the opposite: it calls TryRemove to delete any stale cached hub entry and returns false. This is intentional because during hub discovery, a 403/3 tells us a region is NOT the hub, but it does not tell us which region IS the hub. Caching the next region from the endpoint list would be incorrect because that region is just the next one in the ordering, not necessarily the hub.
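
The contrast described in this reply can be sketched with a plain `ConcurrentDictionary`. This is a simplified illustration with hypothetical names (`PartitionOverrideSketch`, `AdvanceToNextRegion`, `OnHubDiscoveryForbidden`), not the SDK's `TryAddOrUpdatePartitionFailoverInfoAndMoveToNextLocation` or its hub branch.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical sketch of the two behaviors on a failed region.
public sealed class PartitionOverrideSketch
{
    private readonly ConcurrentDictionary<string, Uri> overrides =
        new ConcurrentDictionary<string, Uri>();

    // PPAF-style handling: record the region after the failed one as the
    // failover target and report that an override now exists.
    public bool AdvanceToNextRegion(
        string partitionKeyRangeId,
        Uri failedRegion,
        IReadOnlyList<Uri> regionsInOrder)
    {
        for (int i = 0; i + 1 < regionsInOrder.Count; i++)
        {
            if (regionsInOrder[i].Equals(failedRegion))
            {
                this.overrides[partitionKeyRangeId] = regionsInOrder[i + 1];
                return true;
            }
        }

        return false; // failed region unknown, or no region left to try
    }

    // Hub-discovery handling: a 403/3 only proves the region is not the hub,
    // so drop any stale entry and cache nothing.
    public bool OnHubDiscoveryForbidden(string partitionKeyRangeId)
    {
        this.overrides.TryRemove(partitionKeyRangeId, out _);
        return false;
    }

    public bool TryGetOverride(string partitionKeyRangeId, out Uri region)
    {
        return this.overrides.TryGetValue(partitionKeyRangeId, out region);
    }
}
```

The asymmetry is deliberate in this sketch: failover can safely guess the next region, whereas hub discovery has no positive information until a region answers 200.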

@azure-pipelines

Azure Pipelines:
Successfully started running 1 pipeline(s).

@kundadebdatta kundadebdatta self-assigned this Apr 11, 2026
@kundadebdatta kundadebdatta changed the title Per Partition Automatic Failover: Adds Hub Region caching at per partition level. [Internal] PPAF: Adds Hub Region Caching Per Partition Level Apr 11, 2026

@kirankumarkolli (Member) left a comment

Can TryMarkEndpointUnavailableForPkRange be leveraged on getting 403.3?

@kundadebdatta kundadebdatta marked this pull request as draft April 15, 2026 00:39

// Note: This can be triggered by the read requests as well. In that case, we will set the isReadRequest to
// false to ensure that we mark the endpoint unavailable for writes only.
return await this.ShouldRetryOnEndpointFailureAsync(
Member

Note: Keeping the 403.3 mark-down logic common for both reads and writes may force a refresh of the locations (account properties) by calling await this.globalEndpointManager.RefreshLocationAsync(forceRefresh);

This can be a behavior change: reads will now be able to force-refresh locations when they get a 403.3. Since this is an edge case, the impact should be minimal.

Comment thread Microsoft.Azure.Cosmos/src/Regions.cs Outdated
Labels

improvement Change to existing functional behavior (perf, logging, etc.) PerPartitionAutomaticFailover

7 participants