Skip to content

[Internal] PPAF: Fixes Hub region header never sent due to premature NoRetry on single-master 404/1002.#5792

Merged
kundadebdatta merged 9 commits intomainfrom
users/aavasthy/hubregioncaching-fix
May 2, 2026
Merged

[Internal] PPAF: Fixes Hub region header never sent due to premature NoRetry on single-master 404/1002.#5792
kundadebdatta merged 9 commits intomainfrom
users/aavasthy/hubregioncaching-fix

Conversation

@aavasthy
Copy link
Copy Markdown
Contributor

Pull Request Template

Description

PR Description

Summary

Fixes a bug in ClientRetryPolicy.ShouldRetryOnSessionNotAvailable where the hub region processing header (x-ms-cosmos-hub-region-processing-only) was set but never actually sent to the backend on single-master accounts. The retry threshold sessionTokenRetryCount > 1 caused NoRetry immediately after the header flag was set, preventing the retry that would carry the header.

Fix

Changed sessionTokenRetryCount > 1sessionTokenRetryCount > 2.

Corrected Flow

Attempt count before method Guard sets header? count++ Check count > 2 Result Header on request?
1st request Sent to preferred read region ❌ No
1st 404/1002 0 No (0 >= 1 = false) → 1 1 > 2 = false ✅ Retry to write region (index=0) ❌ No
2nd 404/1002 1 Yes (1 >= 1 = true) → 2 2 > 2 = false ✅ Retry to write region (index=0) Yes
3rd 404/1002 2 Yes (already true) → 3 3 > 2 = true ⛔ NoRetry — hub confirmed not available ✅ Yes

End-to-End Hub Region Discovery (Happy Path with 403/3)

Step Request Response Hub Header? What happens
1 Read to preferred region 404/1002 Session not available → retry to write region
2 Read to write region 404/1002 Write region also returns session not available → header flag set → retry
3 Read to write region 403/3 Non-hub region rejects hub-only request → PPAF marks endpoint unavailable → retry to next region
4 Read to next region 200 OK Hub region found, request succeeds, hub cached for partition

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Closing issues

To automatically close an issue: closes #IssueNumber

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

@aavasthy aavasthy changed the title Per Partition Automatic Failover: Fixes Hub region header never sent due to premature NoRetry on single-master 404/1002. [Internal] PPAF: Fixes Hub region header never sent due to premature NoRetry on single-master 404/1002. Apr 27, 2026
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

kundadebdatta
kundadebdatta previously approved these changes Apr 28, 2026
Copy link
Copy Markdown
Member

@kundadebdatta kundadebdatta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Praveen-Msft
Praveen-Msft previously approved these changes Apr 28, 2026
Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs
Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs
Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is specifically fixed for single master, I wonder if we can add some tests for multi-master account? making sure the retries are same in multi-master and are correct.
@jeet1995 can you please review this PR as well, thanks!

@aavasthy aavasthy dismissed stale reviews from Praveen-Msft and kundadebdatta via f8f93e5 April 29, 2026 17:29
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

@aavasthy
Copy link
Copy Markdown
Contributor Author

Since this is specifically fixed for single master, I wonder if we can add some tests for multi-master account? making sure the retries are same in multi-master and are correct. @jeet1995 can you please review this PR as well, thanks!

We have multi-master tests already added ensuring that this header is not there
https://github.com/Azure/azure-cosmos-dotnet-v3/blob/f8f93e55c65a435fec2de7b7a566ecdca8d80dee/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/ClientRetryPolicyTests.cs?plain=1#L415C23-L415C23

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

@aavasthy aavasthy requested a review from xinlian12 as a code owner April 30, 2026 00:25
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

1 similar comment
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

@aavasthy aavasthy force-pushed the users/aavasthy/hubregioncaching-fix branch from 3e36025 to 3172747 Compare April 30, 2026 00:33
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

@NaluTripician
Copy link
Copy Markdown
Contributor

@sdkReviewAgent-2

Copy link
Copy Markdown
Member

@kundadebdatta kundadebdatta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Now.

@kundadebdatta kundadebdatta enabled auto-merge (squash) May 1, 2026 21:49
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

@aavasthy aavasthy added the auto-merge Enables automation to merge PRs label May 1, 2026
Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs
@xinlian12
Copy link
Copy Markdown
Member

Review complete (22:21)

Posted 2 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@kundadebdatta kundadebdatta merged commit 562d2d5 into main May 2, 2026
32 checks passed
@kundadebdatta kundadebdatta deleted the users/aavasthy/hubregioncaching-fix branch May 2, 2026 01:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge Enables automation to merge PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants