Skip to content

CrossRegionalHedging: Adds Initial Design for hedging Metadata Requests#5920

Draft
kundadebdatta wants to merge 3 commits into
mainfrom
users/kundadebdatta/5917_metadata_hedging_spec
Draft

CrossRegionalHedging: Adds Initial Design for hedging Metadata Requests#5920
kundadebdatta wants to merge 3 commits into
mainfrom
users/kundadebdatta/5917_metadata_hedging_spec

Conversation

@kundadebdatta
Copy link
Copy Markdown
Member

Pull Request Template

Description

  • Adds Initial Design for hedging Metadata Requests

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Closing issues

To automatically close an issue: closes #5917

Changelog

  • I have added a changelog entry under ### Unreleased in changelog.md
    for the user-facing impact of this change.
  • [x ] No changelog entry is required because this PR is one of:
    documentation-only, test-only, CI / build-only, or a pure internal refactor
    with no observable customer impact.

If the second box is checked, briefly justify here:

kundadebdatta and others added 2 commits June 1, 2026 17:46
Addresses the findings produced by the PR Deep Reviewer on PR #5920
(excluding Finding #1, which assumed PR #5844 `MetadataDetachedExecutor`
would merge into main; per author guidance, this design proceeds without it).

Structural changes
- §5.3 `ExecuteAsync` rewritten: per-branch CancellationTokenSources
  (no shared linkedCts) so the loser`s OperationCanceledException is
  contained inside `BackgroundCleanupAsync` and cannot reach
  `MetadataRequestThrottleRetryPolicy` (Finding #2 — protects healthy
  secondary from spurious `MarkEndpointUnavailableForRead` post-PR #5780).
- §5.3: primary fault before threshold no longer bypasses hedge — adds
  `primaryTask.Status == RanToCompletion` guard so fast-fail-on-degraded
  primary triggers the hedge (Finding #3).
- §5.3: wait-for-winner is now a loop that filters transient/faulted
  completions; a fast 503 from the hedge can no longer beat a healthy
  200 from the primary (Finding #4).
- §5.3 + §5.4 + §5.5: `as` cast (not hard cast) for
  `MetadataRequestThrottleRetryPolicy`; wrapped/test-double policies
  no longer throw `InvalidCastException` (Finding #7).
- §5.3: added `BackgroundCleanupAsync` that awaits the loser, disposes
  its `DocumentServiceResponse` body (handle-leak fix), records outcome
  via volatile field, and disposes the loser CTS (Finding #11).

Correctness/factual fixes
- §5.10 + §5.6: corrected — `ClientCollectionCache` uses `AsyncCache`
  (not `AsyncCacheNonBlocking`); base-class abstract signature change
  must be defaulted for subclass compat; forbid inferring cold-start from
  `previousValue == null` inside the factory (Finding #5).
- §5.2 + §6.1: added `HasHedgedThisOperation` flag (set via
  `Interlocked.Exchange`); fixes the broken §6.1 claim that retries
  wouldn`t re-hedge because the cache had a `previousValue` (false —
  cache is only populated when the loop exits) (Finding #8).
- §5.2: `ConcurrentDictionary<Uri, byte>` replaces `HashSet<Uri>`;
  volatile `LoserOutcome` field for cross-thread updates (Finding #6).
- §5.9: added `HttpTimeoutPolicy.FirstAttemptTimeout` accessor design
  — `TimeoutsAndDelays` is private today (Finding #9).
- §5.7.4: sketched the per-index resolve loop for
  `IncrementRetryIndexOnUnavailableEndpointForMetadataRead` — today
  it`s a 1-line counter that never resolves an endpoint (Finding #10).

New sections
- §5.7 (4 subsections): coordination with PR #5780, structural invariant
  that hedge-loser OCE never reaches retry policy, shared
  `RetryUtility.IsRegionalFailure` helper, attempted-endpoints skip loop.
- §5.12: net472 stack-unwind discipline (`SendOneAsync` middle-layer
  seam + `ExceptionDispatchInfo`) — adopts the PR #5870 lesson
  (Finding #12).
- §5.13: per-auth-mode handling in `CloneForHedge`; hedge-401/403
  guard for RBAC-role-assignment-missing-in-secondary case (Finding #13).
- §7.1: wiring step for `isHedgingDisabledByGateway` from
  `DocumentClient` into the cache constructors via `Func<bool>`
  (Finding #15 bundle).
- §9.1: `EventSource`/`Meter` counters for fire-rate, win-rate,
  budget-exhaustion, late-loser, hedge-fired-elapsed-ms (Finding #15 bundle).

API/rollout
- §5.1: `EnableMetadataHedgingForColdStart` becomes tri-state `bool?`;
  `MetadataHedgingOptions` promoted to public so customers can tune
  `PerClientConcurrencyBudget` for high-container-cardinality startups
  (Finding #14).
- §12: Phase 3 no longer removes the opt-in (binary break avoided);
  only the phase default changes (Finding #14).

Smaller items (Finding #15 bundle)
- §5.3: drop `closest secondary` framing (SDK has no proximity measure);
  use `Wait(TimeSpan.Zero)` instead of `WaitAsync(TimeSpan.Zero)`
  (no Task allocation); add `EvaluateEligibility`-vs-budget-check note.
- §5.4: defaulted `isColdStart = false` on the abstract method to avoid
  breaking subclass overrides (e.g., encryption-mirrored caches).
- §6: added eligibility rules 8 (`ExcludeRegions` hard filter), 9
  (`HasHedgedThisOperation`), 10 (single-master account guard).
- §10: reconciled `Both branches fault` with §5.3 (consistent
  `ExceptionDispatchInfo` semantics).
- §11: tests added for loser-cancellation-doesn`t-poison-secondary,
  loser-disposal, no-re-hedge-across-retries, cross-policy-type,
  net472 SO regression; mirrors PR #5787 `senderCallCount` assertions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…uggestions) to metadata-hedging design

Critical:
- B1 §5.13: CloneForHedge reuses primary Authorization + x-ms-date verbatim;
  master-key auth-table row updated.
- B2 §5.13 + §5.3 + §5.2: per-branch IsAcceptableWinner(resp, branch) helper +
  HedgeBranch enum; 401/plain-403 from the hedge branch are rejected so a fast
  hedge-401 cannot beat a slow primary-200. HedgeOutcome diagnostic field added
  with Volatile read/write.
- B3 §5.4: introduce protected virtual GetByNameAsync(..., isColdStart, ct)
  overload on CollectionCache; existing protected abstract is unchanged so
  encryption-mirrored subclasses stay source-compatible (hedge-disabled).
- B4 §1/§5.7/§6.1/§10: enumerate InternalServerError (500) everywhere
  alongside 503; new §10 edge case for primary 500/503 before threshold
  documenting the §5.7.1 asymmetry.

Recommended:
- R1 §5.7.1.1: paragraph on the lost primary regional-failure signal when the
  hedge wins (the next request closes the LocationCache gap; telemetry exposes
  the window).
- R2 §5.7.1.2: race-window walkthrough (t=1.4s primary fails / t=1.5s timer
  fires); HasHedgedThisOperation Interlocked.CompareExchange is the
  serialization point.
- R3 §5.3 + §5.8: CTS allocation moved inside outer try; loserCts ownership
  transferred to BackgroundCleanupAsync via local-ref null-out; outer finally
  disposes all three CTSs with ?.Dispose() null-guards.
- R4 §5.5: thread the caller CancellationToken through
  ExecutePartitionKeyRangeReadChangeFeedAsync into
  MetadataHedgingStrategy.ExecuteAsync (was CancellationToken.None).
- R5 §11: add 4 regression tests (hedge-401/403 per-branch overlay,
  mid-flight kill-switch flip, primary-500-hedge-wins-then-next-request-marks-
  unavailable, session-token RID-change cleanup).
- R6 §9.1: full telemetry rewrite. Meter renamed to
  Azure.Cosmos.Client.MetadataHedging; instruments use the
  azure.cosmosdb.client.metadata_hedging.* prefix; hedge_wins counter replaces
  the (0..1) histogram; Metrics (§9.1.1) and EventSource (§9.1.2) cleanly
  separated.

Suggestions:
- S1 §5.7.4: pseudocode uses this.request; writes
  this.retryContext.RetryLocationIndex per return-true path; side-effect
  invariant documented.
- S2 §5.2/§5.3/§5.7.4: AttemptedEndpoints switched to
  ConcurrentDictionary<string, byte> keyed on Uri.AbsoluteUri.
- S3 §5.2: WinningEndpoint/WinningRegion converted to backing fields +
  RecordWinner using Interlocked.CompareExchange (first-publication wins).
- S4 §5.6: full walk-through table for isColdStart = !forceRefresh
  (cache-hit / cache-miss / force-refresh).

Stats: 1133 -> 1379 lines (+335 / -89). All 28 section headers intact.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Cross Regional Hedging] Add Support to Hedge Metadata Requests

1 participant