@aavasthy
Contributor

Pull Request Template

Description

During partition-level failover and failback under session consistency, a timing gap can cause read requests to fail with 404/1002 errors. When a partition temporarily fails over to a secondary region and later begins failing back to the primary region, the SDK’s read circuit breaker (PPCB) may start routing reads back to the primary region before it has fully caught up with the writes from the failover region. As a result, reads using session tokens from the previous write region may fail because the primary region does not yet have the corresponding session state. Since the SDK currently does not perform cross-regional retries for 404/1002 responses, these reads continue to fail until the primary region is fully synchronized. The goal is to leverage the new backend header x-ms-cosmos-hub-region-processing-only to detect such conditions and route retry requests to the correct write (hub) region, ensuring successful session-consistent reads during the failback window.
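The retry flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the SDK's actual retry policy: the class `HubRegionRetrySketch`, its method names, the `isMultiMasterAccount` flag, and the use of plain integers and a dictionary in place of SDK request types are all assumptions made for the example. Only the header name and the 404/1002 condition come from this PR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a retry policy; not the SDK's real implementation.
public sealed class HubRegionRetrySketch
{
    // Header name as given in the PR description.
    public const string HubRegionHeader = "x-ms-cosmos-hub-region-processing-only";

    private const int NotFound = 404;                 // HTTP status of a 404/1002 response
    private const int ReadSessionNotAvailable = 1002; // sub-status of a 404/1002 response

    // Assumption: the policy is told up front whether the account is multi-master.
    private readonly bool isMultiMasterAccount;

    // Set when the previous attempt failed with 404/1002.
    private bool addHubRegionProcessingOnlyHeader;

    public HubRegionRetrySketch(bool isMultiMasterAccount)
    {
        this.isMultiMasterAccount = isMultiMasterAccount;
    }

    // Inspects the outcome of an attempt. Returns true when the next attempt
    // should carry the hub-region header.
    public bool ShouldRetryWithHubHeader(int statusCode, int subStatusCode)
    {
        if (!this.isMultiMasterAccount
            && statusCode == NotFound
            && subStatusCode == ReadSessionNotAvailable)
        {
            this.addHubRegionProcessingOnlyHeader = true;
            return true;
        }

        return false;
    }

    // Called before each attempt is sent; applies the header at most once per failure.
    public void OnBeforeSendRequest(IDictionary<string, string> headers)
    {
        if (this.addHubRegionProcessingOnlyHeader)
        {
            headers[HubRegionHeader] = bool.TrueString;
            this.addHubRegionProcessingOnlyHeader = false; // reset after applying
        }
    }
}
```

In this sketch the flag is one-shot: only the attempt that immediately follows a 404/1002 response carries the header, and later attempts go out without it unless another 404/1002 is observed.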

Type of change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update

Closing issues

To automatically close an issue: closes #5440

@aavasthy aavasthy self-assigned this Oct 14, 2025

@github-actions github-actions bot left a comment

All good!

@aavasthy aavasthy marked this pull request as ready for review October 14, 2025 17:18
@aavasthy aavasthy changed the title from "[Per Partition Automatic Failover] Use Hub Region Processing Only While Routing Requests Failed with 404/1002." to "Per Partition Automatic Failover: Adds Hub Region Processing Only While Routing Requests Failed with 404/1002." Oct 14, 2025
ananth7592 previously approved these changes Oct 14, 2025
Contributor

@ananth7592 ananth7592 left a comment

Approved with a comment

}

if (statusCode == HttpStatusCode.NotFound
    && subStatusCode == SubStatusCodes.ReadSessionNotAvailable)
Member

@xinlian12 xinlian12 Oct 23, 2025

Also, please double-check: is this change targeted at multi-master (MM) accounts as well? For MM, writes can happen in any region, so enabling this for MM might cause a regression.

Contributor Author

Confirmed with the backend team: this change is not intended to be used with multi-master accounts.

if (this.addHubRegionProcessingOnlyHeader)
{
    request.Headers[HubRegionHeader] = bool.TrueString;
    this.addHubRegionProcessingOnlyHeader = false; // reset after applying
Member

What errors would be returned if the SDK tries to read from a non-hub region?

Member

Also, what about falling back to the new hub for that partition?

Contributor Author

The SDK gets a 403/3 (WriteForbidden) response if it tries to read from a non-hub region. The SDK's internal logic treats every 403/3 response as a trigger for new endpoint discovery, after which it retries the request against the hub region.
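The 403/3 handling described in this reply can be sketched as a small loop. This is an illustrative sketch under stated assumptions, not the SDK's code: `WriteForbiddenRetrySketch`, `ReadWithRediscovery`, the two delegates, and the `maxAttempts` parameter are hypothetical names introduced here, and region names are plain strings.

```csharp
using System;

// Hypothetical illustration of "403/3 leads to endpoint discovery, then retry".
public static class WriteForbiddenRetrySketch
{
    private const int Forbidden = 403;    // HTTP status of a 403/3 response
    private const int WriteForbidden = 3; // sub-status of a 403/3 response

    // sendToRegion:      hypothetical transport; returns (status, sub-status) for a region.
    // discoverHubRegion: hypothetical endpoint discovery; returns the current hub region.
    // Returns the last response together with the region that produced it.
    public static (int Status, int SubStatus, string Region) ReadWithRediscovery(
        string initialRegion,
        Func<string, (int Status, int SubStatus)> sendToRegion,
        Func<string> discoverHubRegion,
        int maxAttempts = 3)
    {
        string nextRegion = initialRegion;
        string attemptedRegion = initialRegion;
        (int Status, int SubStatus) response = (0, 0);

        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            attemptedRegion = nextRegion;
            response = sendToRegion(attemptedRegion);

            bool isWriteForbidden =
                response.Status == Forbidden && response.SubStatus == WriteForbidden;
            if (!isWriteForbidden)
            {
                break; // success, or an error this sketch does not handle
            }

            // The contacted region is not the hub: rediscover and try the hub next.
            nextRegion = discoverHubRegion();
        }

        return (response.Status, response.SubStatus, attemptedRegion);
    }
}
```

For example, if the first attempt goes to a non-hub region that answers 403/3 and discovery then reports the hub, the second attempt is sent to the hub and its response is returned.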

Member

@kirankumarkolli kirankumarkolli left a comment

Waiting for design document

private const int MaxRetryCount = 120;
private const int MaxServiceUnavailableRetryCount = 1;
private const string HubRegionHeader = "x-ms-cosmos-hub-region-processing-only";
Contributor

Do we have any central place to host these types of headers in the code base?

Contributor Author

Yes, we are actually using the central file HttpConstants.cs. I removed this line as part of code cleanup.

@aavasthy
Contributor Author

Waiting for design document


#5440

@aavasthy aavasthy force-pushed the users/aavasthy/404_1002 branch from f989f93 to d51b105 Compare December 18, 2025 22:53
@aavasthy aavasthy added the auto-merge Enables automation to merge PRs label Dec 19, 2025

Labels

auto-merge Enables automation to merge PRs

Development

Successfully merging this pull request may close these issues.

[Per Partition Automatic Failover] Use Hub Region Processing Only While Routing Requests Failed with 404/1002

6 participants