
Per Partition Automatic Failover: Adds Hub Region Processing Only While Routing Requests Failed with 404/1002 for single master accounts.#5447

Merged
aavasthy merged 26 commits into master from users/aavasthy/404_1002
Jan 26, 2026

Conversation

@aavasthy
Contributor

@aavasthy aavasthy commented Oct 14, 2025

Pull Request Template

Description

This change is limited to single master accounts. During partition-level failover and failback under session consistency, a timing gap can cause read requests to fail with 404/1002 errors. When a partition temporarily fails over to a secondary region and later begins failing back to the primary region, the SDK's read circuit breaker (PPCB) may start routing reads back to the primary region before that region has caught up with the writes made in the failover region. Reads that carry session tokens from the previous write region can then fail because the primary region does not yet have the corresponding session state. Because the SDK currently does not perform cross-regional retries for 404/1002 responses, these reads keep failing until the primary region is fully synchronized.

The goal is to use the new backend header x-ms-cosmos-hub-region-processing-only to detect this condition and route retry requests to the correct write (hub) region, so that session-consistent reads succeed during the failback window.
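The timing gap can be illustrated with a deliberately simplified model. The sketch below assumes that a session token reduces to a single LSN (log sequence number) and that a region rejects a session read whenever its own LSN is behind the token. Real session tokens are more structured than this, and the function and variable names are hypothetical, not SDK APIs.

```python
from typing import Tuple


def session_read(region_lsn: int, session_lsn: int) -> Tuple[int, int]:
    """Return (status, sub_status) for a session-consistent read.

    Simplifying assumption: a region can serve the read only if it has
    replicated at least up to the LSN recorded in the session token.
    """
    if region_lsn < session_lsn:
        # Read Session Not Available
        return (404, 1002)
    return (200, 0)


# A write in the failover region produced a session token at LSN 120.
session_token_lsn = 120

# During failback the primary region has only caught up to LSN 115,
# so a read routed there fails even though the data exists elsewhere.
lagging_primary = session_read(region_lsn=115, session_lsn=session_token_lsn)

# Once the primary region has caught up, the same read succeeds.
caught_up_primary = session_read(region_lsn=120, session_lsn=session_token_lsn)
```

In this model the read keeps failing for as long as it is sent to the lagging region, which is the window the PR targets.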

As part of this change, the hub region header is added starting with the second retry of a request that failed with 404/1002 (Read Session Not Available). To avoid looping through retries in search of the hub region, the first retry for a 404/1002 does not contain the header and goes to the available write region from the location cache. If that request also fails with 404/1002, all subsequent retry requests contain the header. If a request carries the header and does not hit the primary hub region, the backend returns 403/3. The SDK retries such 403/3 responses with the hub region header added.
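The retry sequence described here can be sketched as a small state machine. This is an illustration of the described behavior only, not the code in ClientRetryPolicy.cs: the class and function names are hypothetical, the header value "true" is an assumption, and region selection is left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional

HUB_REGION_HEADER = "x-ms-cosmos-hub-region-processing-only"


@dataclass
class RetryState:
    """Per-request retry bookkeeping (hypothetical, for illustration)."""
    session_not_available_count: int = 0
    add_hub_header: bool = False


def next_retry_headers(state: RetryState, status: int,
                       sub_status: int) -> Optional[Dict[str, str]]:
    """Return the extra headers for the next retry, or None when this
    sketch does not handle the response."""
    if status == 404 and sub_status == 1002:
        state.session_not_available_count += 1
        # First 404/1002: retry on the write region from the location
        # cache without the header. Second and later: add the header.
        if state.session_not_available_count >= 2:
            state.add_hub_header = True
    elif status == 403 and sub_status == 3 and state.add_hub_header:
        # The request carried the header but missed the hub region;
        # keep the header on the next retry.
        pass
    else:
        return None
    # The value "true" is an assumption; the PR text names only the header.
    return {HUB_REGION_HEADER: "true"} if state.add_hub_header else {}


state = RetryState()
first = next_retry_headers(state, 404, 1002)   # no header yet
second = next_retry_headers(state, 404, 1002)  # header added
third = next_retry_headers(state, 403, 3)      # header kept
```

Holding the header back until the second 404/1002 gives the cached write region one attempt before the SDK starts probing for the hub region.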

Type of change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update

Closing issues

To automatically close an issue: closes #5440

@aavasthy aavasthy self-assigned this Oct 14, 2025

@github-actions github-actions Bot left a comment


All good!

@aavasthy aavasthy marked this pull request as ready for review October 14, 2025 17:18
@aavasthy aavasthy changed the title from "[Per Partition Automatic Failover] Use Hub Region Processing Only While Routing Requests Failed with 404/1002." to "Per Partition Automatic Failover: Adds Hub Region Processing Only While Routing Requests Failed with 404/1002." Oct 14, 2025
Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs Outdated
ananth7592 previously approved these changes Oct 14, 2025
Member

@ananth7592 ananth7592 left a comment


Approved with a comment

Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs Outdated
This was referenced Apr 27, 2026

Labels

auto-merge Enables automation to merge PRs

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Per Partition Automatic Failover] Use Hub Region Processing Only While Routing Requests Failed with 404/1002 for single master account.

6 participants