Spec: hub-region-processing-only header for 404/1002 retries#4310
Spec: hub-region-processing-only header for 404/1002 retries#4310NaluTripician wants to merge 8 commits intoAzure:mainfrom
Conversation
Adds a design spec proposing parity with .NET PR Azure/azure-cosmos-dotnet-v3#5447: emit `x-ms-cosmos-hub-region-processing-only: true` on second-and-later 404/1002 retries for single-master accounts to mitigate session-token storms during region failback. Spec only — no implementation. Mirrors the in-repo convention established by FEED_OPERATIONS_SPEC.md (Azure#4261): a Markdown design document at `sdk/cosmos/<crate>/docs/<NAME>_SPEC.md` with status header, numbered TOC, and numbered sections covering goals, architecture, behavior, code changes, side effects, alternatives, testing, and open questions. Open questions are pre-resolved in §8: Day 1 ships .NET parity only (no skip-set rotation), and the CHANGELOG entry will append under the existing `## 0.32.0 (Unreleased)` block. Tracks Azure#4303. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds a design-only specification for introducing x-ms-cosmos-hub-region-processing-only: true on second-and-later 404/1002 retries (single-master) to mitigate session-token storms during region failback, mirroring the .NET behavior.
Changes:
- Introduces a new Markdown design spec describing motivation, intended retry/latch behavior, and proposed code touchpoints.
- Documents suggested implementation details (constants, retry policy latch + header injection), risks, alternatives, and test matrix.
AI Review SummaryReviewed this spec PR against Headline: The central design — latch on the 2nd 1002, header rides the next retry — is structurally incompatible with Rust's current single-master retry budget. 11 inline findings posted: 3 🔴 Blocking, 4 🟡 Recommendation, 3 🟢 Suggestion, 1 💬 Observation. Four of these reinforce existing unresolved comments by Top items to resolve before code lands:
|
Major spec revision in response to 15 review threads on PR Azure#4310. All threads accepted; zero rejections — every comment was traced to a specific code anchor and verified. Key correctness fixes: - Latch trigger moved from `session_token_retry_count >= 2` to `== 1`. The current single-master budget at line 294 returns `DoNotRetry` once the counter exceeds 1, so the previous trigger flipped the latch on the same call that ended the operation — the header would never have reached the wire. This is a deliberate parity gap with .NET; full retry-count parity now captured as ALT-5 / Future Work. - Single-master gate switched from the per-request `can_use_multiple_write_locations` field (always false for reads) to the account-level `LocationCache::can_use_multiple_write_locations()`. The previous gate would have fired the latch on multi-master reads too. - Header value changed from `"true"` to `"True"` to match .NET's `bool.TrueString` wire format. The local `ALLOW_TENTATIVE_WRITES` precedent is a different header with an independent backend parser; we cannot assume case-insensitive parsing without a contract reference. - Sub-status label corrected: 1002 in the 404 context is `READ_SESSION_NOT_AVAILABLE`, not `PARTITION_KEY_RANGE_GONE` (the latter is 410/1002). - §4.3 CHANGELOG instructions rewritten to add `### Features Added` under the existing `## 0.32.0 (Unreleased)` block, not create a new block. OQ-2 and §4.3 no longer contradict. Documentation/maintainability fixes: - Problem Statement (§1.1) rewritten to accurately describe the current single-retry-via-write-locations behavior. - Inline-comment requirement at the increment site to pin the trigger boundary against silent breakage. - AC-3 added: boundary regression test at counter == 1. - Plain-`bool` (vs .NET `volatile bool`) justification stated inline in §3.1. - AC-3/AC-6/AC-7 (post-latch 503 / 410/1022 retries) intentionally dropped — those retries don't exist under the current budget; AC-6 covers connection-failure persistence which does still exist. - SE-003 promoted from "Phase 3 implementer must verify" to a formal §7.4 Acceptance Gates block (AG-1..AG-4). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review feedback addressed — spec revision pushed (
|
| Thread | Path/Line | Fix |
|---|---|---|
| Latch has no outbound effect (Findings #1 & #3 in AI summary) | §3.1, §3.4, §4.2(c), §7.1 | Trigger moved from session_token_retry_count >= 2 to == 1. Verified client_retry_policy.rs:294 returns DoNotRetry once > 1, so the prior trigger flipped on the call that ended the operation — header never reached the wire. New trigger latches on the same call that returns Retry for the one permitted retry; header rides that retry. |
Wrong gate (can_use_multiple_write_locations per-request) |
§3.1 #3, §4.2(c) | Replaced with account-level LocationCache::can_use_multiple_write_locations() (verified: location_cache.rs:381, write_endpoints().len() > 1). Spec now explicitly forbids the two per-request variants and explains why. |
@Copilot reinforcement on the same gate |
§3.1 #3 | Same fix — single change resolves both. |
Recommendations (🟡) — fixed
| Thread | Fix |
|---|---|
| Inaccurate baseline in Problem Statement | §1.1 rewritten to describe single-retry-via-write-locations behavior, citing the actual code (!can_use_multiple_write_locations branch returning DoNotRetry on > 1). |
@Copilot reinforcement on same |
Same fix. |
Wire-format mismatch ("true" vs "True") |
§3.2, §3.3, §4.2(b), AC-2 wording: header value now "True" (capitalized) matching .NET bool.TrueString. §3.3 explains why this deliberately diverges from the adjacent ALLOW_TENTATIVE_WRITES "true" precedent (different header, different parser, no shared contract). |
| Sub-status name wrong | §4.2(d), SE-005: 404/1002 = READ_SESSION_NOT_AVAILABLE. The PARTITION_KEY_RANGE_GONE reference (which is 410/1002) is removed. |
| Trigger boundary fragile | §4.2(c): mandatory inline comment at the session_token_retry_count += 1; site cross-referencing the spec. AC-3 added: boundary-pinning regression test that pins counter==1 → header, counter==2 → no header. |
Suggestions (🟢) — fixed
| Thread | Fix |
|---|---|
| Latch placement ambiguity | Moot under the new design — latch site is now a single if immediately after the increment, no branch ambiguity. §4.2(c) shows the exact code. |
Threading model (plain bool vs volatile bool) |
One-sentence justification added in §3.1 and §4.2(a) doc comment: "plain bool is correct because ClientRetryPolicy is mutated through &mut self for the duration of an operation." Also surfaced as AG-4 in §7.4 acceptance gates. |
@Copilot reinforcement on §4.3/OQ-2 contradiction |
§4.3 rewritten: insert ### Features Added above the existing ### Breaking Changes in the existing ## 0.32.0 (Unreleased) block. No new version block. Matches OQ-2. |
Observation (💬) — fixed
| Thread | Fix |
|---|---|
| Promote SE-003 to acceptance gate | New §7.4 Acceptance Gates block (AG-1..AG-4) makes per-operation construction (AG-1), cross-operation isolation test (AG-2), within-operation persistence (AG-3), and the &mut self-only invariant (AG-4) blocking for the implementation PR. |
Architectural decision: parity vs retry-budget
The headline blocker (Finding #1 in the AI review) gave us two paths:
- (A) Extend Rust's read-retry budget to mirror .NET's
sessionTokenRetryCount > ReadEndpoints.Countand keep the latch on the second 1002. - (B) Latch on the first 1002 and accept that Rust's existing single-master read budget runs the header on the one permitted retry.
Picked (B). Rationale documented in the new §1.2 Goals (re-scoped to "header-emission parity within Rust's existing retry budget") and as a new ALT-5 in §6 with full reasoning. (A) is preserved as a Future Work item with a concrete implementation outline. Bundling a retry-budget extension with header injection couples two unrelated risks and changes the read contract for every read in the SDK, not just 404/1002 — the same scope-discipline argument that kept skip-set rotation out of OQ-1.
CI
The Build Analyze failure is the only red on this PR; all 5 build/test matrix jobs (windows_stable, ubuntu_stable, ubuntu_nightly, ubuntu_msrv, macos_stable) pass. Logs are on Azure DevOps and not fetchable from outside, but spec-only PRs with no Rust source change typically fail Build Analyze on workspace-wide compliance scans (no version bump in any crate, no Cargo.lock change). Per the original PR scope (spec only) I'm leaving this as known-noise; happy to investigate the actual log if you can paste it.
Diff stats
4ae40ee76: 1 file changed, 199 insertions(+), 64 deletions(-).
All four findings accepted; verified each against `main` source. - New-1 (blocker): §4.2(c) gate accessor rewritten. Verified `global_endpoint_manager.rs:31` declares `location_cache: Mutex<LocationCache>`; an accessor that returns `&LocationCache` cannot outlive the lock guard, so the previous `location_cache().can_use_multiple_write_locations()` chain would not compile. Spec now mandates a forwarding method `GlobalEndpointManager::account_supports_multi_write()` that encapsulates `self.location_cache.lock().unwrap().can_use_multiple_write_locations()`, matching the `.lock().unwrap()` pattern already used at lines 283 and 354. - New-2 (blocker): §4.2(c) sample replaces `... == false` with `!...`. The previous form would trip clippy's `bool_comparison` lint, and §7.3 mandates `cargo clippy -- -D warnings`. Spec example now passes the same gate it requires of implementations. - New-3 (recommendation): §4.2(c) placement instruction tightened from "before any return" to "after the discovery-enabled early return and the counter increment, before the budget/gating conditionals." Removes the contradiction with §3.1 condition Azure#2 (`enable_endpoint_discovery` early-out at the top of the function). - New-4 (suggestion): §3.2 and AC-6 row reworded — "non-counted under `MAX_RETRY_COUNT_ON_CONNECTION_FAILURE`" → "tracked under a separate budget, `MAX_RETRY_COUNT_ON_CONNECTION_FAILURE`, rather than the `session_token_retry_count` budget". Removes the misleading implication that connection-failure retries are unbounded. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Second-pass review addressed —
|
| # | Severity | Finding | Resolution |
|---|---|---|---|
| New-1 | 🔴 Correctness | location_cache: Mutex<LocationCache> so &LocationCache can't escape the lock guard |
§4.2(c) now mandates a forwarding method on GlobalEndpointManager: pub(crate) fn account_supports_multi_write(&self) -> bool { self.location_cache.lock().unwrap().can_use_multiple_write_locations() }. Pattern matches existing .lock().unwrap() callers at lines 283 and 354. The latch condition is now !self.global_endpoint_manager.account_supports_multi_write(). |
| New-2 | 🔴 Lint | ... == false would trip bool_comparison and §7.3 mandates -D warnings |
§4.2(c) sample uses !.... Spec example now passes the gate it requires of implementations. |
| New-3 | 🟡 Precision | "before any return" contradicts the enable_endpoint_discovery early-return at the top of the function |
§4.2(c) tightened to "after the discovery-enabled early return and the counter increment, before the budget/gating conditionals." |
| New-4 | 🟢 Wording | "non-counted under MAX_RETRY_COUNT_ON_CONNECTION_FAILURE" reads as if connection-failure retries are unbounded |
§3.2 and AC-6 row reworded to "tracked under a separate budget, MAX_RETRY_COUNT_ON_CONNECTION_FAILURE, rather than the session_token_retry_count budget." |
Diff: 4ae40ee76..86dc5f167 (1 file, +40/-22). §3.4, §6 ALT-5, §7.1, §7.4 unchanged.
GitHub renders Mermaid natively; the box-drawing version had alignment drift in raw view (mismatched column widths from variable-width chars in the inner labels). Same nodes, same edges, same captions; just renders cleanly now. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- SE-003 severity changed from the undefined Foundational to the legend-defined Breaking - accepted via AG-1..AG-4. Removes a reviewer distraction; the AG cross-reference makes the acceptance path explicit. - Added a Confidence line under Status: traced the ClientRetryPolicy call graph and the GlobalEndpointManager/LocationCache mutex layering against main, but explicitly call out that the backend's case-sensitivity for the new header value is assumed (matching .NET's bool.TrueString) and not verified against a contract. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rework spec to acknowledge the azure_data_cosmos / azure_data_cosmos_driver split that landed on main: - Confidence statement now scopes the driver-resident forward target - New `Crate (interim home)` and `Crate (forward target)` headers - New §1.5 Two-crate landscape table mapping concerns to crates and documenting that azure_data_cosmos does not yet depend on the driver - §2 grows a `Forward target` subsection with a second Mermaid diagram showing the driver pipeline shape and the future HubRegionHeaderPolicy - §4 leads with a crate-boundary note explaining why all file paths are under the SDK - New SE-006 (crate-boundary drift) under Side Effects & Risk - New ALT-6 enumerating three concrete prerequisites the driver lacks today (status-code retry, account topology surface, per-operation state model) and a 1:1 cross-walk table for the eventual migration - New AG-5 acceptance gate requiring a // TODO(driver-migration): anchor at the latch declaration and emission sites - New `Driver migration` bullet under §8 Future Work, listed first Wire output and Day-1 implementation are unchanged: the SDK is still the only place where the trigger's prerequisites exist on main, so landing in ClientRetryPolicy is correct today and the migration is a refactor, not a behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes from PR Deep Reviewer pass on commit 36fe8cc: Blocking Driver-crate file paths now use the canonical sdk/cosmos/azure_data_cosmos_driver/src/ prefix (driver/cosmos_driver.rs, driver/transport/pipeline.rs, driver/transport/mod.rs). Path convention noted in confidence header so reviewers can grep with confidence ALT-6 prerequisite Azure#2 reworded from No account topology surface (overstated; Region exists) to No account-metadata fetch or topology tracking, with the underscore-prefix placeholder semantic called out ALT-6 cross-walk gains a Status column. Trigger and single-master gate rows are flagged Design-pending (gated on prerequisites Azure#1 and Azure#2); emission, latch, and constant rows remain Mechanical. Removes the false impression that the migration is copy/paste Replaced the forward-target Mermaid in section 2 with an honest diagram showing the current driver pipeline plus the three prerequisite gaps. The previous diagram implied a 404/1002 feedback hook that does not exist on main Recommendations Section 1.5 dependency row split into its own logical row (SDK consumes driver / Not yet) so the - dashes stop reading as third-crate ambiguity SE-006 mitigation now explains why double-emission cannot occur (SDK to driver migration removes ClientRetryPolicy status-code retry as part of the same change set) Confidence header reconciled with section 8: header-emission relocation is straightforward; only the driver-side retry shape is design-pending Suggestions Section 4 crate-boundary callout trimmed to one sentence AC-6 wording clarifies connection failure is the SDK term Mermaid arrow consistency restored (no more vs mix; new diagram uses straight arrows only) Approach-level critique was a ratification: ship the SDK version, migrate when the driver gains prerequisites. No structural change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Driver-aligned rework (commits
|
| # | Severity | Finding | Resolution |
|---|---|---|---|
| 1 | Blocking | Driver-crate citations used bare cosmos_driver.rs instead of canonical driver/cosmos_driver.rs |
All paths fixed; convention noted in confidence header |
| 2 | Blocking | ALT-6 prerequisite #2 ("no account topology surface") was overstated since Region exists |
Reworded to "no account-metadata fetch or topology tracking"; underscore-prefix placeholder semantic called out |
| 3 | Blocking | ALT-6 cross-walk hand-waved missing subsystem (driver retry hook) as if mechanical | Added Status column; trigger and single-master-gate rows flagged Design-pending (gated on prerequisites #1/#2) |
| 4 | Blocking | Forward-target Mermaid showed a 404/1002 feedback arrow that doesn't exist on main |
Replaced with honest diagram showing current driver pipeline + three prerequisite gaps as a separate Gap subgraph |
| 5 | Recommend | section 1.5 SDK->Driver dependency row formatting was ambiguous |
Renamed to SDK consumes driver with explicit Not yet |
| 6 | Recommend | SE-006 mitigation didn't explain why double-emission can't occur | Added: SDK->driver migration removes ClientRetryPolicy's status-code retry as part of the same change set |
| 7 | Recommend | Confidence header ("migration not yet designed") contradicted section 8 (which had a design) | Split claim: header-emission relocation is straightforward; only the driver-side retry shape is design-pending |
| 8 | Suggest | section 4 crate-boundary callout was redundant with section 1.5 | Trimmed to one sentence |
| 9 | Suggest | AC-6 mixed "connection failure" / "transport failure" terminology | Clarified "connection failure" as the SDK's term for transport-layer failure |
| 10 | Suggest | Arrow style inconsistency in Mermaid labels | Resolved as part of #4 (ALT-6 diagram rewrite) |
Pass 2 deep-review verdict
Reviewable. All four blockers resolved, all six lower-severity findings resolved, no new findings, no approach-level concerns. The reviewer's approach-level note: "The spec remains correctly scoped: a surgical SDK-side implementation with explicit forward-looking documentation of the driver-side migration, gated on prerequisites that are out of scope for this work item."
Ready for another team pass.
Summary
Adds a design spec only (no implementation) proposing functional parity with .NET PR Azure/azure-cosmos-dotnet-v3#5447: emit
x-ms-cosmos-hub-region-processing-only: trueon the second-and-later 404 / sub-status 1002 retry for single-master Cosmos DB accounts. This mitigates session-token storms during region failback for session-consistency reads.The spec is published at
sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.mdso it can be reviewed in repo before any code lands. Tracks #4303.Why a spec PR first
The .NET fix is a small change with subtle semantics — a per-operation latch that only activates after a specific failure shape, plus a header that opts retries into hub-region-only processing. Mirroring it in Rust touches the retry policy, header constants, and CHANGELOG, and there are at least three judgement calls (latch threshold mapping, latch lifetime, scope vs. issue text). Reviewing the spec in isolation lets us settle those before code review.
What's in the spec
BackOffRetryHandler→before_send_request→should_retryflow with ASCII diagram.constants.rs(newcosmos_headers!entry),client_retry_policy.rs(latch field,before_send_requestinjection,should_retry_on_session_not_availableset-point), andCHANGELOG.md.Pre-resolved open questions
## 0.32.0 (Unreleased)already exists onmain; spec instructs implementer to add a new### Features Addedsubsection above the existing### Breaking Changes, no new version block.Convention compliance
Mirrors the in-repo spec convention established by
FEED_OPERATIONS_SPEC.md(#4261): Markdown doc under<crate>/docs/, status header block, numbered TOC, numbered sections. Code snippets follow repo conventions verified against existing source —cosmos_headers!macro registration, lowercase"true"literal (matchesALLOW_TENTATIVE_WRITESprecedent),test_prefix on all tests (file-local convention inclient_retry_policy.rs), CHANGELOG PR-link suffix.Out of scope (explicitly)
Tracks #4303.