Commit 7dbe8c2
Availability Strategy: Fixes HedgeContext diagnostics to only appear when hedging occurs (#5665)
## Summary
Fixes misleading `HedgeContext` diagnostics in
`CrossRegionHedgingAvailabilityStrategy`. Previously, `HedgeContext` was
**always** populated when a request went through the hedging strategy
code path, even when no cross-region hedging actually occurred (the primary
request completed before the threshold). This caused customer confusion
because:
1. A single-element `HedgeContext` (e.g., `["East US 2"]`) could only be
told apart from a hedged request by counting its elements
2. When PPAF (Per-Partition Automatic Failover) transparently redirected
a request at the transport layer, `HedgeContext` showed the hedging
strategy's intended target rather than the actual servicing region, creating
a mismatch with `ContactedReplicas`
## Root Cause
In
`CrossRegionHedgingAvailabilityStrategy.ExecuteAvailabilityStrategyAsync`,
the for-loop exit path unconditionally set
`HedgeContext = hedgeRegions.Take(requestNumber + 1)` (lines 226-228). When
`requestNumber` was 0 (the primary request completed before the threshold
timer fired), this yielded a single-element list, making it look like hedging
had occurred when it had not.
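The effect of the unconditional `Take(requestNumber + 1)` can be illustrated with a small language-neutral model. This is only a sketch in Python, not the SDK's C# code; the function name `hedge_context_before_fix` and the sample region list are hypothetical.

```python
from itertools import islice


def hedge_context_before_fix(hedge_regions, request_number):
    """Model of the pre-fix exit path (hypothetical helper, not SDK code).

    Mirrors hedgeRegions.Take(requestNumber + 1): the first
    request_number + 1 regions are always reported.
    """
    return list(islice(hedge_regions, request_number + 1))


regions = ["Central US", "East US", "West US 2"]

# Primary finished before the threshold timer fired: request_number is 0,
# yet one region is still reported.
print(hedge_context_before_fix(regions, 0))  # ['Central US']

# Two hedged requests were dispatched after the primary.
print(hedge_context_before_fix(regions, 2))  # ['Central US', 'East US', 'West US 2']
```

In this model the no-hedging case can never produce an empty or absent value, which matches the behavior described above.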
## Fix
Added `if (requestNumber > 0)` guard around the `HedgeContext` and
`Response Region` writes in the for-loop exit path. `Hedge Config` is
always written (pre-computed string, zero overhead).
**New semantics:**
- `HedgeContext` **present** = cross-region hedging was triggered
(threshold timer fired, additional requests dispatched to other regions)
- `HedgeContext` **absent** = no hedging occurred (primary request
completed before threshold)
- `Hedge Config` always present when hedging strategy code path is used
**Impact:** One `if` statement added. Zero new fields, zero new
allocations, zero additional diagnostics overhead.
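The guarded behavior can be sketched the same way. The snippet below is a Python model under stated assumptions, not the actual C# change: `build_hedge_diagnostics` and its parameters are hypothetical names, and the dictionary keys are taken from the diagnostics samples in this description.

```python
def build_hedge_diagnostics(hedge_config, hedge_regions, request_number, response_region):
    """Model of the post-fix exit path (hypothetical helper, not SDK code)."""
    # Hedge Config is always written when the hedging code path is used.
    data = {"Hedge Config": hedge_config}

    # Only report context when at least one hedged request was dispatched.
    if request_number > 0:
        data["Hedge Context"] = hedge_regions[: request_number + 1]
        data["Response Region"] = response_region

    return data


config = "t:1000ms, s:500ms, w:False"

# Primary completed before the threshold: only the config is reported.
no_hedge = build_hedge_diagnostics(config, ["Central US", "East US"], 0, "Central US")
assert "Hedge Context" not in no_hedge and "Response Region" not in no_hedge

# Hedging triggered: context and response region are reported as before.
hedged = build_hedge_diagnostics(config, ["West US 2", "East US", "Central US"], 2, "East US")
assert hedged["Hedge Context"] == ["West US 2", "East US", "Central US"]
```

The design point is that the presence of the key itself carries the signal, so readers of the diagnostics no longer need to count list elements.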
## Customer Scenario (Investigation)
An internal customer reported two cases where hedging diagnostics were
confusing:
**Case 1 — Latency < threshold:** Request completed in ~38ms (threshold:
1000ms). Customer expected `HedgeContext` to be empty since no hedging
occurred. Actual: `HedgeContext: ["Central US"]` (single region =
primary completed first, no hedging).
**Case 2 — PPAF redirect:** Request was sent to East US 2 but PPAF
detected quorum loss and redirected to Central US at the transport layer
(not via hedging). Customer expected `HedgeContext: ["East US 2",
"Central US"]`. Actual: `HedgeContext: ["East US 2"]` because the
hedging strategy didn't know about the PPAF redirect.
Both cases showed **expected SDK behavior** but the diagnostics were
misleading. This fix addresses Case 1 by making `HedgeContext` absent
when no hedging occurs. Case 2 is inherent to the layered architecture
(PPAF operates below hedging) and would require a separate cross-layer
change if needed.
## Changes
| File | Change |
|------|--------|
| `CrossRegionHedgingAvailabilityStrategy.cs` | Added `if (requestNumber > 0)` guard around `HedgeContext`/`Response Region` writes |
| `AvailabilityStrategyUnitTests.cs` | Added 2 new tests: `PrimaryCompletesBeforeThreshold_HedgeContextContainsSingleRegion` (asserts `HedgeContext` absent) and `HedgeTriggered_HedgeContextContainsMultipleRegions` (asserts `HedgeContext` present with 2+ regions) |
| `RegionFailoverTests.cs` | Updated PPAF test assertion: `HedgeContext` is now correctly absent when PPAF handles failover internally (no cross-region hedging) |
## Testing
- **13/13** `AvailabilityStrategyUnitTests` pass
- **5/5** `RegionFailoverTests` pass (mock-based PPAF scenarios)
- **2/2** `CosmosItemIntegrationTests.ReadItemAsync_WithPPAFEnabled...`
pass (live multi-region account)
- **All** `CosmosAvailabilityStrategyTests` unaffected (they inject
faults that exceed threshold → hedging always triggers → `requestNumber
> 0`)
## Before/After Diagnostics
### Scenario 1: Primary request completes before threshold (no hedging)
**Before:**
```json
{
  "name": "ReadItemAsync",
  "duration in milliseconds": 37.95,
  "data": {
    "Hedge Config": "t:1000ms, s:500ms, w:False",
    "Hedge Context": ["Central US"],
    "Response Region": "Central US"
  }
}
```
**After:**
```json
{
  "name": "ReadItemAsync",
  "duration in milliseconds": 37.95,
  "data": {
    "Hedge Config": "t:1000ms, s:500ms, w:False"
  }
}
```
> `Hedge Context` and `Response Region` are absent — clearly indicating
no hedging occurred. `Hedge Config` remains so the configuration is
always visible.
### Scenario 2: Hedging triggered (primary slow, hedge responds first)
**Before (unchanged):**
```json
{
  "name": "ReadItemAsync",
  "duration in milliseconds": 1693.89,
  "data": {
    "Hedge Config": "t:1000ms, s:500ms, w:False",
    "Hedge Context": ["West US 2", "East US", "Central US"],
    "Response Region": "East US"
  }
}
```
**After (same — no change when hedging occurs):**
```json
{
  "name": "ReadItemAsync",
  "duration in milliseconds": 1693.89,
  "data": {
    "Hedge Config": "t:1000ms, s:500ms, w:False",
    "Hedge Context": ["West US 2", "East US", "Central US"],
    "Response Region": "East US"
  }
}
```
> When hedging is triggered, diagnostics are identical to before. `Hedge
Context` lists all regions that had requests dispatched, `Response
Region` shows which region's response was used.
### Scenario 3: PPAF redirects internally (no cross-region hedging)
**Before:**
```json
{
  "name": "ReadItemAsync",
  "duration in milliseconds": 317.51,
  "data": {
    "Hedge Config": "t:1000ms, s:500ms, w:False",
    "Hedge Context": ["East US 2"],
    "Response Region": "East US 2"
  },
  "children": [{ "...ContactedReplicas showing Central US replicas..." }]
}
```
**After:**
```json
{
  "name": "ReadItemAsync",
  "duration in milliseconds": 317.51,
  "data": {
    "Hedge Config": "t:1000ms, s:500ms, w:False"
  },
  "children": [{ "...ContactedReplicas showing Central US replicas..." }]
}
```
> The misleading `Hedge Context: ["East US 2"]` and `Response Region:
East US 2` are no longer present. The PPAF redirect is still visible in
the `ContactedReplicas` and `StoreResponseStatistics` within the
diagnostics children — which is the correct layer for this information.
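For anyone post-processing these diagnostics, the new semantics reduce to a key-presence check. The following is a minimal Python sketch, assuming the diagnostics node has the JSON shape shown in the samples above; `hedging_occurred` is a hypothetical helper and not part of the SDK.

```python
import json


def hedging_occurred(diagnostics_json):
    """Return True if the diagnostics node reports cross-region hedging.

    Assumes the node shape from the samples above: a top-level "data"
    object that contains "Hedge Context" only when hedging was triggered.
    """
    data = json.loads(diagnostics_json).get("data", {})
    return "Hedge Context" in data


no_hedge = """
{
  "name": "ReadItemAsync",
  "duration in milliseconds": 37.95,
  "data": {"Hedge Config": "t:1000ms, s:500ms, w:False"}
}
"""

hedged = """
{
  "name": "ReadItemAsync",
  "duration in milliseconds": 1693.89,
  "data": {
    "Hedge Config": "t:1000ms, s:500ms, w:False",
    "Hedge Context": ["West US 2", "East US", "Central US"],
    "Response Region": "East US"
  }
}
"""

print(hedging_occurred(no_hedge))  # False
print(hedging_occurred(hedged))    # True
```

Note that this check says nothing about PPAF redirects; as described above, those remain visible only in `ContactedReplicas` and `StoreResponseStatistics`.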
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>
1 parent 9e16469 commit 7dbe8c2
4 files changed
Lines changed: 159 additions & 18 deletions
File tree
- Microsoft.Azure.Cosmos
  - src/Routing/AvailabilityStrategy
  - tests
    - Microsoft.Azure.Cosmos.EmulatorTests
    - Microsoft.Azure.Cosmos.Tests
      - PartitionKeyRangeFailoverTests
Lines changed: 13 additions & 8 deletions
Lines changed: 4 additions & 4 deletions
Lines changed: 137 additions & 1 deletion
Lines changed: 5 additions & 5 deletions