Skip to content

Commit 7dbe8c2

Browse files
NaluTripicianCopilotkirankumarkolli
authored
Availability Strategy: Fixes HedgeContext diagnostics to only appear when hedging occurs (#5665)
## Summary Fixes misleading Hedge Context diagnostics in CrossRegionHedgingAvailabilityStrategy. Previously, HedgeContext was **always** populated when a request went through the hedging strategy code path — even when no cross-region hedging actually occurred (primary request completed before the threshold). This caused customer confusion because: 1. A single-element HedgeContext (e.g., `["East US 2"]`) was indistinguishable from "no hedging" without counting elements 2. When PPAF (Per-Partition Automatic Failover) transparently redirected a request at the transport layer, HedgeContext showed the hedging strategy's intended target — not the actual servicing region — creating a mismatch with ContactedReplicas ## Root Cause In `CrossRegionHedgingAvailabilityStrategy.ExecuteAvailabilityStrategyAsync`, the for-loop exit path unconditionally set `HedgeContext = hedgeRegions.Take(requestNumber + 1)` (line 226-228). When `requestNumber=0` (primary request completed before the threshold timer fired), this yielded a single-element list — making it look like hedging occurred when it didn't. ## Fix Added `if (requestNumber > 0)` guard around the `HedgeContext` and `Response Region` writes in the for-loop exit path. `Hedge Config` is always written (pre-computed string, zero overhead). **New semantics:** - `HedgeContext` **present** = cross-region hedging was triggered (threshold timer fired, additional requests dispatched to other regions) - `HedgeContext` **absent** = no hedging occurred (primary request completed before threshold) - `Hedge Config` always present when hedging strategy code path is used **Impact:** One `if` statement added. Zero new fields, zero new allocations, zero additional diagnostics overhead. ## Customer Scenario (Investigation) An internal customer reported two cases where hedging diagnostics were confusing: **Case 1 — Latency < threshold:** Request completed in ~38ms (threshold: 1000ms). Customer expected `HedgeContext` to be empty since no hedging occurred. Actual: `HedgeContext: ["Central US"]` (single region = primary completed first, no hedging). **Case 2 — PPAF redirect:** Request was sent to East US 2 but PPAF detected quorum loss and redirected to Central US at the transport layer (not via hedging). Customer expected `HedgeContext: ["East US 2", "Central US"]`. Actual: `HedgeContext: ["East US 2"]` because the hedging strategy didn't know about the PPAF redirect. Both cases showed **expected SDK behavior** but the diagnostics were misleading. This fix addresses Case 1 by making `HedgeContext` absent when no hedging occurs. Case 2 is inherent to the layered architecture (PPAF operates below hedging) and would require a separate cross-layer change if needed. ## Changes | File | Change | |------|--------| | `CrossRegionHedgingAvailabilityStrategy.cs` | Added `if (requestNumber > 0)` guard around `HedgeContext`/`ResponseRegion` writes | | `AvailabilityStrategyUnitTests.cs` | Added 2 new tests: `PrimaryCompletesBeforeThreshold_HedgeContextContainsSingleRegion` (asserts `HedgeContext` absent) and `HedgeTriggered_HedgeContextContainsMultipleRegions` (asserts `HedgeContext` present with 2+ regions) | | `RegionFailoverTests.cs` | Updated PPAF test assertion — `HedgeContext` is now correctly absent when PPAF handles failover internally (no cross-region hedging) | ## Testing - **13/13** `AvailabilityStrategyUnitTests` pass - **5/5** `RegionFailoverTests` pass (mock-based PPAF scenarios) - **2/2** `CosmosItemIntegrationTests.ReadItemAsync_WithPPAFEnabled...` pass (live multi-region account) - **All** `CosmosAvailabilityStrategyTests` unaffected (they inject faults that exceed threshold → hedging always triggers → `requestNumber > 0`) ## Before/After Diagnostics ### Scenario 1: Primary request completes before threshold (no hedging) **Before:** ```json { "name": "ReadItemAsync", "duration in milliseconds": 37.95, "data": { "Hedge Config": "t:1000ms, s:500ms, w:False", "Hedge Context": ["Central US"], "Response Region": "Central US" } } ``` **After:** ```json { "name": "ReadItemAsync", "duration in milliseconds": 37.95, "data": { "Hedge Config": "t:1000ms, s:500ms, w:False" } } ``` > `Hedge Context` and `Response Region` are absent — clearly indicating no hedging occurred. `Hedge Config` remains so the configuration is always visible. ### Scenario 2: Hedging triggered (primary slow, hedge responds first) **Before (unchanged):** ```json { "name": "ReadItemAsync", "duration in milliseconds": 1693.89, "data": { "Hedge Config": "t:1000ms, s:500ms, w:False", "Hedge Context": ["West US 2", "East US", "Central US"], "Response Region": "East US" } } ``` **After (same — no change when hedging occurs):** ```json { "name": "ReadItemAsync", "duration in milliseconds": 1693.89, "data": { "Hedge Config": "t:1000ms, s:500ms, w:False", "Hedge Context": ["West US 2", "East US", "Central US"], "Response Region": "East US" } } ``` > When hedging is triggered, diagnostics are identical to before. `Hedge Context` lists all regions that had requests dispatched, `Response Region` shows which region's response was used. ### Scenario 3: PPAF redirects internally (no cross-region hedging) **Before:** ```json { "name": "ReadItemAsync", "duration in milliseconds": 317.51, "data": { "Hedge Config": "t:1000ms, s:500ms, w:False", "Hedge Context": ["East US 2"], "Response Region": "East US 2" }, "children": [{ "...ContactedReplicas showing Central US replicas..." }] } ``` **After:** ```json { "name": "ReadItemAsync", "duration in milliseconds": 317.51, "data": { "Hedge Config": "t:1000ms, s:500ms, w:False" }, "children": [{ "...ContactedReplicas showing Central US replicas..." }] } ``` > The misleading `Hedge Context: ["East US 2"]` and `Response Region: East US 2` are no longer present. The PPAF redirect is still visible in the `ContactedReplicas` and `StoreResponseStatistics` within the diagnostics children — which is the correct layer for this information. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>
1 parent 9e16469 commit 7dbe8c2

4 files changed

Lines changed: 159 additions & 18 deletions

File tree

Microsoft.Azure.Cosmos/src/Routing/AvailabilityStrategy/CrossRegionHedgingAvailabilityStrategy.cs

Lines changed: 13 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -222,14 +222,19 @@ internal override async Task<ResponseMessage> ExecuteAvailabilityStrategyAsync(
222222
((CosmosTraceDiagnostics)hedgeResponse.ResponseMessage.Diagnostics).Value.AddOrUpdateDatum(
223223
HedgeConfig,
224224
this.HedgeConfigText);
225-
//Take is not inclusive, so we need to add 1 to the request number which starts at 0
226-
((CosmosTraceDiagnostics)hedgeResponse.ResponseMessage.Diagnostics).Value.AddOrUpdateDatum(
227-
HedgeContext,
228-
hedgeRegions.Take(requestNumber + 1));
229-
// Note that the target region can be seperate than the actual region that serviced the request depending on the scenario
230-
((CosmosTraceDiagnostics)hedgeResponse.ResponseMessage.Diagnostics).Value.AddOrUpdateDatum(
231-
ResponseRegion,
232-
hedgeResponse.TargetRegionName);
225+
226+
if (requestNumber > 0)
227+
{
228+
//Take is not inclusive, so we need to add 1 to the request number which starts at 0
229+
((CosmosTraceDiagnostics)hedgeResponse.ResponseMessage.Diagnostics).Value.AddOrUpdateDatum(
230+
HedgeContext,
231+
hedgeRegions.Take(requestNumber + 1));
232+
// Note that the target region can be seperate than the actual region that serviced the request depending on the scenario
233+
((CosmosTraceDiagnostics)hedgeResponse.ResponseMessage.Diagnostics).Value.AddOrUpdateDatum(
234+
ResponseRegion,
235+
hedgeResponse.TargetRegionName);
236+
}
237+
233238
return hedgeResponse.ResponseMessage;
234239
}
235240
}

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/CosmosItemIntegrationTests.cs

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1713,10 +1713,10 @@ public async Task ReadItemAsync_WithPPAFDynamicOverride_ShouldEnableOrDisablePPA
17131713

17141714
traceDiagnostic.Value.Data.TryGetValue("Hedge Context", out object hedgeContext);
17151715

1716-
Assert.IsNotNull(hedgeContext);
1717-
List<string> hedgedRegions = ((IEnumerable<string>)hedgeContext).ToList();
1718-
1719-
Assert.IsTrue(hedgedRegions.Count >= 1, "Since the first region is not available, the request should atleast hedge to the next region.");
1716+
// When PPAF is enabled, the primary request handles failover internally
1717+
// (retrying to another region at the transport layer). No cross-region
1718+
// hedging occurs, so HedgeContext should be absent.
1719+
Assert.IsNull(hedgeContext);
17201720
Assert.IsTrue(cosmosClient.DocumentClient.PartitionKeyRangeLocation.IsPartitionLevelAutomaticFailoverEnabled());
17211721

17221722
// Disable PPAF At the Gateway Layer.

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/AvailabilityStrategyUnitTests.cs

Lines changed: 137 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,11 +4,14 @@
44
using System.Collections.Generic;
55
using System.Collections.ObjectModel;
66
using System.IO;
7+
using System.Linq;
78
using System.Net;
89
using System.Net.Http;
910
using System.Threading;
1011
using System.Threading.Tasks;
1112
using Microsoft.Azure.Cosmos;
13+
using Microsoft.Azure.Cosmos.Diagnostics;
14+
using Microsoft.Azure.Cosmos.Tracing;
1215
using Microsoft.Azure.Documents;
1316
using Microsoft.VisualStudio.TestTools.UnitTesting;
1417

@@ -632,6 +635,139 @@ public async Task FaultedHedgeTask_DoesNotAbortWhenOtherRegionSucceeds()
632635
Assert.IsTrue(senderCallCount >= 2, "Expected a second hedge request to complete successfully.");
633636
}
634637

635-
638+
/// <summary>
639+
/// Verifies that when a request completes before the hedge threshold, HedgeContext
640+
/// contains exactly 1 region (the primary). This confirms no hedging occurred even
641+
/// though HedgeContext is non-empty. A single-element HedgeContext is the expected
642+
/// indicator that the primary request completed without triggering any hedge.
643+
/// </summary>
644+
[TestMethod]
645+
public async Task PrimaryCompletesBeforeThreshold_HedgeContextContainsSingleRegion()
646+
{
647+
// Arrange: high threshold ensures no hedging fires
648+
CrossRegionHedgingAvailabilityStrategy availabilityStrategy = new CrossRegionHedgingAvailabilityStrategy(
649+
threshold: TimeSpan.FromMilliseconds(5000),
650+
thresholdStep: TimeSpan.FromMilliseconds(5000));
651+
652+
// Use a real trace so AddOrUpdateDatum actually persists data (NoOpTrace discards it)
653+
using ITrace rootTrace = Trace.GetRootTrace("HedgeContextTest");
654+
using RequestMessage request = new RequestMessage(
655+
HttpMethod.Get,
656+
"/dbs/testdb/colls/testcontainer/docs/testId",
657+
rootTrace)
658+
{
659+
ResourceType = ResourceType.Document,
660+
OperationType = OperationType.Read
661+
};
662+
using CosmosClient mockCosmosClient = CreateMockClientWithRegions(3);
663+
664+
int senderCallCount = 0;
665+
666+
Func<RequestMessage, CancellationToken, Task<ResponseMessage>> sender = (req, ct) =>
667+
{
668+
Interlocked.Increment(ref senderCallCount);
669+
ResponseMessage response = new ResponseMessage(HttpStatusCode.OK)
670+
{
671+
Trace = req.Trace
672+
};
673+
return Task.FromResult(response);
674+
};
675+
676+
// Act
677+
ResponseMessage response = await availabilityStrategy.ExecuteAvailabilityStrategyAsync(
678+
sender, mockCosmosClient, request, CancellationToken.None);
679+
680+
// Assert
681+
Assert.AreEqual(HttpStatusCode.OK, response.StatusCode);
682+
Assert.AreEqual(1, senderCallCount,
683+
"Only the primary request should be sent when it returns before the hedge timer fires.");
684+
685+
CosmosTraceDiagnostics traceDiagnostic = response.Diagnostics as CosmosTraceDiagnostics;
686+
Assert.IsNotNull(traceDiagnostic);
687+
688+
if (traceDiagnostic.Value is Trace concreteTrace)
689+
{
690+
concreteTrace.SetWalkingStateRecursively();
691+
}
692+
693+
Assert.IsFalse(traceDiagnostic.Value.Data.TryGetValue("Hedge Context", out _),
694+
"HedgeContext should be absent when the primary request completes before the threshold (no hedging occurred).");
695+
696+
Assert.IsTrue(traceDiagnostic.Value.Data.TryGetValue("Hedge Config", out _),
697+
"Hedge Config should always be present when the hedging strategy code path is used.");
698+
}
699+
700+
/// <summary>
701+
/// Verifies that when hedging IS triggered (primary is slow, hedge returns first),
702+
/// HedgeContext contains 2 regions — confirming the semantics that HedgeContext count > 1
703+
/// means hedging occurred.
704+
/// </summary>
705+
[TestMethod]
706+
public async Task HedgeTriggered_HedgeContextContainsMultipleRegions()
707+
{
708+
// Arrange: low threshold ensures hedging fires quickly
709+
CrossRegionHedgingAvailabilityStrategy availabilityStrategy = new CrossRegionHedgingAvailabilityStrategy(
710+
threshold: TimeSpan.FromMilliseconds(10),
711+
thresholdStep: TimeSpan.FromMilliseconds(10));
712+
713+
// Use a real trace so AddOrUpdateDatum actually persists data
714+
using ITrace rootTrace = Trace.GetRootTrace("HedgeContextTest");
715+
using RequestMessage request = new RequestMessage(
716+
HttpMethod.Get,
717+
"/dbs/testdb/colls/testcontainer/docs/testId",
718+
rootTrace)
719+
{
720+
ResourceType = ResourceType.Document,
721+
OperationType = OperationType.Read
722+
};
723+
using CosmosClient mockCosmosClient = CreateMockClientWithRegions(3);
724+
725+
int senderCallCount = 0;
726+
727+
Func<RequestMessage, CancellationToken, Task<ResponseMessage>> sender = async (req, ct) =>
728+
{
729+
int callNumber = Interlocked.Increment(ref senderCallCount);
730+
731+
if (callNumber == 1)
732+
{
733+
// Primary: slow enough to trigger hedging
734+
await Task.Delay(TimeSpan.FromSeconds(5), ct).ContinueWith(_ => { });
735+
return new ResponseMessage(HttpStatusCode.ServiceUnavailable);
736+
}
737+
738+
// Hedge request: returns immediately with success, wired to request trace
739+
return new ResponseMessage(HttpStatusCode.OK)
740+
{
741+
Trace = req.Trace
742+
};
743+
};
744+
745+
// Act
746+
ResponseMessage response = await availabilityStrategy.ExecuteAvailabilityStrategyAsync(
747+
sender, mockCosmosClient, request, CancellationToken.None);
748+
749+
// Assert
750+
Assert.AreEqual(HttpStatusCode.OK, response.StatusCode);
751+
Assert.IsTrue(senderCallCount >= 2,
752+
"At least 2 sender calls expected (primary + hedge).");
753+
754+
CosmosTraceDiagnostics traceDiagnostic = response.Diagnostics as CosmosTraceDiagnostics;
755+
Assert.IsNotNull(traceDiagnostic);
756+
757+
if (traceDiagnostic.Value is Trace concreteTrace)
758+
{
759+
concreteTrace.SetWalkingStateRecursively();
760+
}
761+
762+
Assert.IsTrue(traceDiagnostic.Value.Data.TryGetValue("Hedge Context", out object hedgeContext),
763+
"HedgeContext should be present when hedging occurred.");
764+
765+
IEnumerable<string> hedgeRegions = (IEnumerable<string>)hedgeContext;
766+
List<string> hedgeRegionsList = new List<string>(hedgeRegions);
767+
768+
Assert.IsTrue(hedgeRegionsList.Count >= 2,
769+
$"HedgeContext should contain 2+ regions when hedging occurred, but got {hedgeRegionsList.Count}. " +
770+
"Multiple regions in HedgeContext confirms hedging was triggered.");
771+
}
636772
}
637773
}

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/PartitionKeyRangeFailoverTests/RegionFailoverTests.cs

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -355,11 +355,11 @@ public async Task ReadItemAsync_WithThinClientEnabledAndServiceUnavailableReceiv
355355

356356
if (enablePartitionLevelFailover)
357357
{
358-
Assert.IsNotNull(hedgeContext);
359-
List<string> hedgedRegions = ((IEnumerable<string>)hedgeContext).ToList();
360-
361-
Assert.IsTrue(hedgedRegions.Count >= 1, "Since the first region is not available, the request should atleast hedge to the next region.");
362-
Assert.IsTrue(hedgedRegions.Contains(Regions.EastUS));
358+
// When PPAF is enabled, the primary request handles failover internally
359+
// (retrying to another region). No cross-region hedging occurs, so
360+
// HedgeContext should be absent. The failover is visible in the
361+
// diagnostics children (multiple RouterHandler/TransportHandler nodes).
362+
Assert.IsNull(hedgeContext);
363363
}
364364
else
365365
{

0 commit comments

Comments
 (0)