diff --git a/openspec/changes/ppaf-dynamic-hedging-control/.openspec.yaml b/openspec/changes/ppaf-dynamic-hedging-control/.openspec.yaml new file mode 100644 index 0000000000..2ca4bc8517 --- /dev/null +++ b/openspec/changes/ppaf-dynamic-hedging-control/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-03-24 diff --git a/openspec/changes/ppaf-dynamic-hedging-control/design.md b/openspec/changes/ppaf-dynamic-hedging-control/design.md new file mode 100644 index 0000000000..885541a653 --- /dev/null +++ b/openspec/changes/ppaf-dynamic-hedging-control/design.md @@ -0,0 +1,88 @@ +## Context + +PPAF (Partition-Level Failover) is an account-level feature that, when enabled, triggers automatic hedging of read requests with a default threshold of `Min(1000ms, RequestTimeout/2)` and a step of 500ms. The hedging is implemented via `CrossRegionHedgingAvailabilityStrategy`, which is either explicitly configured by the customer on `CosmosClientOptions.AvailabilityStrategy` or automatically created by the SDK in `DocumentClient.InitializePartitionLevelFailoverWithDefaultHedging()`. + +**Current hedging decision flow:** + +1. `RequestInvokerHandler` resolves the active `AvailabilityStrategy`: request-level override → client-level → null. +2. If a strategy is present and `Enabled()`, requests are dispatched through `CrossRegionHedgingAvailabilityStrategy.ExecuteAvailabilityStrategyAsync()`. +3. The strategy decides per-request whether to hedge based on resource type (Document only), operation type (reads always, writes only if multi-write enabled), and available regions. + +**Account properties refresh:** + +- `GlobalEndpointManager.RefreshDatabaseAccountInternalAsync()` periodically fetches `AccountProperties` from the Gateway. +- The `OnEnablePartitionLevelFailoverConfigChanged` event fires when PPAF status changes, triggering `DocumentClient.UpdatePartitionLevelFailoverConfigWithAccountRefresh()` to dynamically enable or disable default hedging. + +**Problem:** There is no mechanism for on-call engineers to temporarily disable hedging for a PPAF account without rolling back PPAF entirely. Rolling back PPAF is expensive and disrupts other benefits. A Gateway-controlled flag is needed as a targeted escape hatch. + +## Goals / Non-Goals + +**Goals:** + +- Add a new boolean account property (`disableCrossRegionalHedging`) read from Gateway responses. +- When the flag is `true`, disable all hedging (both SDK-default PPAF hedging and explicit customer-configured hedging) for that account. +- When the flag is `false` or absent, preserve existing hedging behavior. +- Support dynamic toggling: as clients observe refreshed account properties, hedging state updates accordingly. +- Keep the feature entirely internal — no new public API surface. + +**Non-Goals:** + +- Changing customer-authored hedging strategies or their configuration shape. +- Modifying PPAF enablement or onboarding flows. +- Supporting non-PPAF accounts with this flag (for the immediate term). +- Exposing the flag to end users or making it configurable from the SDK. + +## Decisions + +### 1. Property location: `AccountProperties` with `JsonExtensionData` fallback + +**Decision:** Add a strongly-typed `bool?` property `DisableCrossRegionalHedging` to `AccountProperties` with a `[JsonProperty]` attribute mapped to the Gateway JSON key `"disableCrossRegionalHedging"`. + +**Rationale:** This is consistent with how `EnablePartitionLevelFailover` is already modeled (Line 249 of `AccountProperties.cs`). A strongly-typed property provides compile-time safety and discoverability. The `[JsonExtensionData] AdditionalProperties` dictionary exists as a fallback for unknown fields, but relying on it would lose type safety and require manual parsing. + +**Alternative considered:** Reading from `AdditionalProperties` at evaluation time. Rejected because it introduces fragile string-keyed lookups and inconsistency with the existing pattern for account-level flags. + +### 2. Evaluation point: `GlobalEndpointManager` account-refresh callback + +**Decision:** Evaluate the `disableCrossRegionalHedging` flag in `DocumentClient.UpdatePartitionLevelFailoverConfigWithAccountRefresh()` — the same method that already handles dynamic PPAF enable/disable based on account-property changes. + +**Rationale:** This method is invoked whenever `GlobalEndpointManager` detects a change in PPAF-related account properties. Adding the hedging-disable check here ensures the flag is evaluated on every account-properties refresh, supporting dynamic toggling. It also consolidates all PPAF-related hedging logic in one place. + +**Alternative considered:** Evaluating the flag per-request in `RequestInvokerHandler`. Rejected because it would require propagating account properties into the hot path and adds unnecessary per-request overhead. The account-refresh callback runs infrequently and already handles strategy assignment. + +### 3. Enforcement mechanism: Strategy nullification / replacement + +**Decision:** When the flag is `true`, set `ConnectionPolicy.AvailabilityStrategy` to `null` (or a `DisabledAvailabilityStrategy` sentinel) to disable hedging. When the flag is toggled back to `false`, re-evaluate and restore the appropriate strategy (explicit customer config or PPAF default). + +**Rationale:** `RequestInvokerHandler` already treats a null strategy as "no hedging" (Line 97-98: `strategy != null && strategy.Enabled()`). Using the existing null-check path avoids introducing new conditional logic in the request hot path. The `DisabledAvailabilityStrategy` subclass already exists for explicit opt-out scenarios, though null assignment is simpler. + +**Alternative considered:** Adding an `IsDisabledByGateway` flag to `CrossRegionHedgingAvailabilityStrategy` and checking it in `Enabled()`. Rejected because it would couple the strategy to Gateway concerns and requires changes to the strategy hierarchy. + +### 4. Precedence: Gateway flag overrides all hedging configuration + +**Decision:** When the Gateway flag is `true`, it overrides both SDK-default PPAF hedging AND any explicit customer-configured `AvailabilityStrategy`. The precedence order becomes: + +1. Gateway `disableCrossRegionalHedging = true` → hedging OFF (highest priority) +2. Request-level `AvailabilityStrategy` override +3. Client-level `AvailabilityStrategy` (explicit customer config) +4. PPAF default hedging (if PPAF enabled, no explicit config) + +**Rationale:** The flag is an operational safety valve. If on-call determines hedging is causing issues, it must be a hard kill switch regardless of what the customer configured. This prevents scenarios where explicit customer hedging bypasses the mitigation. + +### 5. State tracking: Store flag and original strategy reference + +**Decision:** Introduce internal fields in `DocumentClient`: +- `bool disableCrossRegionalHedgingFlag` — cached value of the Gateway flag. +- `AvailabilityStrategy customerConfiguredStrategy` — the customer's original explicit strategy (if any), stored before nullification so it can be restored when the flag is toggled back. + +**Rationale:** The SDK must be able to restore the correct hedging behavior when the flag is turned off. Without storing the original strategy, the SDK cannot distinguish between "customer never set a strategy" and "strategy was removed by the flag." + +## Risks / Trade-offs + +- **[Risk] Flag latency** — The flag takes effect only after the next account-properties refresh (default interval is ~5 minutes). → **Mitigation:** On-call can force a faster refresh by restarting the client or waiting for the next refresh cycle. The existing `GlobalEndpointManager` refresh interval is sufficient for non-emergency toggles. + +- **[Risk] Customer confusion if explicit hedging silently disabled** — If a customer configured explicit hedging and on-call disables it via the flag, the customer may observe unexpected behavior. → **Mitigation:** The flag is intended as a temporary operational tool. On-call should communicate with the customer. SDK diagnostics/traces should log when hedging is disabled by the Gateway flag. + +- **[Risk] Strategy restoration correctness** — When restoring hedging after the flag is toggled off, the SDK must correctly reconstruct the PPAF default strategy or restore the customer's explicit strategy. → **Mitigation:** Store the original strategy reference before nullification. Unit-test the toggle cycle (enable → disable → re-enable). + +- **[Trade-off] Non-PPAF accounts ignored** — The flag is only evaluated for PPAF accounts. A future extension could support non-PPAF accounts, but this adds complexity without current demand. diff --git a/openspec/changes/ppaf-dynamic-hedging-control/proposal.md b/openspec/changes/ppaf-dynamic-hedging-control/proposal.md new file mode 100644 index 0000000000..7a92547c87 --- /dev/null +++ b/openspec/changes/ppaf-dynamic-hedging-control/proposal.md @@ -0,0 +1,28 @@ +## Why + +PPAF-enabled Cosmos DB accounts automatically enable hedging with a 1-second threshold to fast-track read-region failover. However, production incidents have shown that implicit hedging of long-running queries can cause unexpected exceptions (e.g., ArgumentException in CallStore). Rolling back PPAF entirely to disable hedging is operationally expensive and disrupts other PPAF benefits, so a targeted, service-side escape hatch is needed to let on-call engineers dynamically disable hedging without customer intervention. + +## What Changes + +- Introduce a new Gateway account property (`disableCrossRegionalHedging`) that the SDK reads from account-property responses. +- When the flag is `true`, hedging is disabled for the PPAF account regardless of any explicit or implicit hedging configuration. +- When the flag is `false` or absent, existing hedging behavior is preserved (explicit customer config honored; PPAF defaults applied if no explicit config). +- The SDK evaluates the flag dynamically on every account-properties refresh, enabling on-call toggle without customer code changes. +- Non-PPAF accounts ignore the flag. + +## Capabilities + +### New Capabilities +- `gateway-hedging-override`: Reads a new Gateway account property flag and enforces it as the highest-precedence control over PPAF hedging behavior, supporting dynamic enable/disable at the SDK layer. + +### Modified Capabilities + + +## Impact + +- **SDK Client layer** (`DocumentClient` / `CosmosClient` internals): hedging-decision logic must incorporate a new precedence check against the Gateway flag before evaluating explicit or default hedging configuration. +- **Account properties model**: new property deserialized from the Gateway response (`AccountProperties` or equivalent DTO). +- **Gateway / service dependency**: the flag is surfaced by the Cosmos DB Gateway; the SDK consumes it read-only. +- **No public API surface changes**: the feature is invisible to end users; no new `CosmosClientOptions` or request-options properties are exposed. +- **Testing**: unit tests for precedence rules; integration tests validating dynamic toggle via mocked account-property responses. diff --git a/openspec/changes/ppaf-dynamic-hedging-control/specs/gateway-hedging-override/spec.md b/openspec/changes/ppaf-dynamic-hedging-control/specs/gateway-hedging-override/spec.md new file mode 100644 index 0000000000..ea102b011e --- /dev/null +++ b/openspec/changes/ppaf-dynamic-hedging-control/specs/gateway-hedging-override/spec.md @@ -0,0 +1,141 @@ +## ADDED Requirements + +### Requirement: Gateway account property for hedging control +The `AccountProperties` model SHALL include a nullable boolean property `DisableCrossRegionalHedging` deserialized from the Gateway JSON key `"disableCrossRegionalHedging"`. The property SHALL default to `null` when absent from the Gateway response. + +#### Scenario: Gateway response includes the flag set to true +- **WHEN** the Gateway account-properties response contains `"disableCrossRegionalHedging": true` +- **THEN** the `AccountProperties.DisableCrossRegionalHedging` property SHALL be `true` + +#### Scenario: Gateway response includes the flag set to false +- **WHEN** the Gateway account-properties response contains `"disableCrossRegionalHedging": false` +- **THEN** the `AccountProperties.DisableCrossRegionalHedging` property SHALL be `false` + +#### Scenario: Gateway response does not include the flag +- **WHEN** the Gateway account-properties response does not contain the `"disableCrossRegionalHedging"` key +- **THEN** the `AccountProperties.DisableCrossRegionalHedging` property SHALL be `null` + +--- + +### Requirement: Gateway flag disables all hedging when true +When the Gateway flag `disableCrossRegionalHedging` is `true`, the SDK SHALL disable all hedging for PPAF-enabled accounts regardless of any explicit or implicit hedging configuration. + +#### Scenario: PPAF account with default hedging and flag set to true +- **WHEN** the account has PPAF enabled with SDK-default hedging active +- **AND** the Gateway flag `disableCrossRegionalHedging` is `true` +- **THEN** the SDK SHALL disable hedging +- **AND** requests SHALL NOT be hedged across regions + +#### Scenario: PPAF account with explicit customer hedging and flag set to true +- **WHEN** the account has PPAF enabled +- **AND** the customer has configured an explicit `AvailabilityStrategy` via `CosmosClientOptions` +- **AND** the Gateway flag `disableCrossRegionalHedging` is `true` +- **THEN** the SDK SHALL disable hedging +- **AND** the explicit customer strategy SHALL NOT be executed + +#### Scenario: PPAF account with request-level hedging override and flag set to true +- **WHEN** the account has PPAF enabled +- **AND** a request has a per-request `AvailabilityStrategy` override set in `RequestOptions` +- **AND** the Gateway flag `disableCrossRegionalHedging` is `true` +- **THEN** the SDK SHALL disable hedging for that request +- **AND** the request-level strategy SHALL NOT be executed + +--- + +### Requirement: Existing behavior preserved when flag is false or absent +When the Gateway flag `disableCrossRegionalHedging` is `false` or absent from the account-properties response, the SDK SHALL preserve existing hedging behavior without any change. + +#### Scenario: PPAF account with flag set to false and no explicit hedging +- **WHEN** the account has PPAF enabled +- **AND** the Gateway flag `disableCrossRegionalHedging` is `false` +- **AND** no explicit customer hedging configuration is set +- **THEN** the SDK SHALL enable the default PPAF hedging strategy with threshold `Min(1000ms, RequestTimeout/2)` and step `500ms` + +#### Scenario: PPAF account with flag absent and explicit hedging configured +- **WHEN** the account has PPAF enabled +- **AND** the Gateway flag `disableCrossRegionalHedging` is absent from the response +- **AND** the customer has configured an explicit `AvailabilityStrategy` +- **THEN** the SDK SHALL honor the customer's explicit hedging configuration + +#### Scenario: PPAF account with flag set to false and explicit hedging configured +- **WHEN** the account has PPAF enabled +- **AND** the Gateway flag `disableCrossRegionalHedging` is `false` +- **AND** the customer has configured an explicit `AvailabilityStrategy` +- **THEN** the SDK SHALL honor the customer's explicit hedging configuration + +--- + +### Requirement: Dynamic toggling via account-properties refresh +The SDK SHALL evaluate the Gateway flag on each account-properties refresh cycle and dynamically enable or disable hedging as the flag value changes, without requiring client restart. + +#### Scenario: Flag toggled from false to true during runtime +- **WHEN** the Gateway flag `disableCrossRegionalHedging` was `false` (or absent) at client initialization +- **AND** hedging was active (default or explicit) +- **AND** the Gateway flag is changed to `true` +- **AND** the SDK observes the updated account properties via the next refresh cycle +- **THEN** the SDK SHALL disable hedging + +#### Scenario: Flag toggled from true to false during runtime +- **WHEN** the Gateway flag `disableCrossRegionalHedging` was `true` and hedging was disabled +- **AND** the Gateway flag is changed to `false` +- **AND** the SDK observes the updated account properties via the next refresh cycle +- **THEN** the SDK SHALL re-enable hedging using the appropriate strategy +- **AND** if the customer had configured an explicit strategy, that strategy SHALL be restored +- **AND** if no explicit strategy was configured, the SDK-default PPAF hedging strategy SHALL be applied + +#### Scenario: Flag toggled from true to false with no prior explicit strategy +- **WHEN** the Gateway flag `disableCrossRegionalHedging` transitions from `true` to `false` +- **AND** the customer did not configure an explicit `AvailabilityStrategy` +- **AND** the account has PPAF enabled +- **THEN** the SDK SHALL re-enable the default PPAF hedging strategy + +--- + +### Requirement: Non-PPAF accounts ignore the flag +The SDK SHALL NOT evaluate or act on the `disableCrossRegionalHedging` flag for accounts that do not have PPAF enabled. + +#### Scenario: Non-PPAF account with flag set to true +- **WHEN** the account does NOT have PPAF enabled (`EnablePartitionLevelFailover` is `false` or absent) +- **AND** the Gateway flag `disableCrossRegionalHedging` is `true` +- **THEN** the SDK SHALL ignore the flag +- **AND** any explicit customer hedging configuration SHALL continue to function normally + +#### Scenario: Non-PPAF account with explicit hedging and flag set to true +- **WHEN** the account does NOT have PPAF enabled +- **AND** the customer has configured an explicit `AvailabilityStrategy` +- **AND** the Gateway flag `disableCrossRegionalHedging` is `true` +- **THEN** the SDK SHALL NOT disable the customer's explicit hedging strategy + +--- + +### Requirement: Feature is invisible to end users +The Gateway hedging override flag SHALL NOT be exposed through any public SDK API surface. There SHALL be no new public properties on `CosmosClientOptions`, `RequestOptions`, or any other user-facing type related to this flag. + +#### Scenario: No public API surface for the flag +- **WHEN** a developer inspects the public API of `CosmosClientOptions`, `ItemRequestOptions`, `QueryRequestOptions`, or `ChangeFeedRequestOptions` +- **THEN** there SHALL be no property or method related to `disableCrossRegionalHedging` + +#### Scenario: Diagnostics logging when hedging is disabled by flag +- **WHEN** hedging is disabled due to the Gateway flag being `true` +- **THEN** the SDK SHALL include a trace or diagnostic entry indicating that hedging was disabled by a Gateway account property +- **AND** this information SHALL be available in SDK diagnostics for supportability + +--- + +### Requirement: Precedence rules for hedging evaluation +The SDK SHALL evaluate hedging configuration using the following strict precedence order: +1. Gateway `disableCrossRegionalHedging = true` → hedging OFF (highest priority) +2. If Gateway flag is `false` or absent → evaluate existing rules (request-level override → client-level strategy → PPAF default) + +#### Scenario: Gateway flag true takes precedence over all other configuration +- **WHEN** the Gateway flag `disableCrossRegionalHedging` is `true` +- **AND** the customer has configured an explicit `AvailabilityStrategy` at the client level +- **AND** a request has a per-request `AvailabilityStrategy` override +- **THEN** the SDK SHALL disable hedging for that request +- **AND** neither the client-level nor request-level strategy SHALL be executed + +#### Scenario: Gateway flag false defers to existing precedence +- **WHEN** the Gateway flag `disableCrossRegionalHedging` is `false` +- **AND** the customer has configured an explicit `AvailabilityStrategy` at the client level +- **AND** a request has a per-request `AvailabilityStrategy` override +- **THEN** the request-level strategy SHALL be used (existing precedence preserved) diff --git a/openspec/changes/ppaf-dynamic-hedging-control/tasks.md b/openspec/changes/ppaf-dynamic-hedging-control/tasks.md new file mode 100644 index 0000000000..60c452b896 --- /dev/null +++ b/openspec/changes/ppaf-dynamic-hedging-control/tasks.md @@ -0,0 +1,38 @@ +## 1. Account Properties Model + +- [ ] 1.1 Add `DisableCrossRegionalHedging` nullable bool property to `AccountProperties.cs` with `[JsonProperty("disableCrossRegionalHedging")]` attribute, following the same pattern as `EnablePartitionLevelFailover` +- [ ] 1.2 Add unit tests for `AccountProperties` deserialization: flag present as `true`, flag present as `false`, and flag absent from JSON + +## 2. DocumentClient State Tracking + +- [ ] 2.1 Add internal field `disableCrossRegionalHedgingFlag` (bool, default false) to `DocumentClient` to cache the current Gateway flag value +- [ ] 2.2 Add internal field to store the customer's original explicit `AvailabilityStrategy` reference (if any) so it can be restored when the flag is toggled back to false + +## 3. Hedging Evaluation in Account-Refresh Callback + +- [ ] 3.1 Modify `GlobalEndpointManager.RefreshDatabaseAccountInternalAsync()` (or the existing PPAF-change callback) to propagate the `DisableCrossRegionalHedging` value to `DocumentClient` +- [ ] 3.2 Modify `DocumentClient.UpdatePartitionLevelFailoverConfigWithAccountRefresh()` to evaluate the Gateway flag: when `true`, store the current strategy and set `ConnectionPolicy.AvailabilityStrategy` to null/disabled; when `false` or absent, restore the appropriate strategy +- [ ] 3.3 Ensure the flag is also evaluated during initial client setup in `DocumentClient.InitializePartitionLevelFailoverWithDefaultHedging()` — if the flag is `true` at initialization time, do not enable default hedging + +## 4. Request-Level Enforcement + +- [ ] 4.1 Update `RequestInvokerHandler` hedging resolution logic to check the Gateway disable flag before evaluating request-level or client-level `AvailabilityStrategy` overrides, ensuring the flag takes absolute precedence when `true` + +## 5. Diagnostics and Tracing + +- [ ] 5.1 Add a trace/diagnostic log entry when hedging is disabled due to the Gateway flag, including the flag value and the action taken (e.g., "Hedging disabled by Gateway account property disableCrossRegionalHedging=true") +- [ ] 5.2 Add a trace/diagnostic log entry when hedging is re-enabled after the flag is toggled back to false + +## 6. Unit Tests + +- [ ] 6.1 Test: PPAF account with default hedging — flag `true` disables hedging, flag toggled to `false` re-enables default hedging +- [ ] 6.2 Test: PPAF account with explicit customer hedging — flag `true` disables hedging, flag toggled to `false` restores customer strategy +- [ ] 6.3 Test: PPAF account with request-level hedging override — flag `true` prevents request-level strategy execution +- [ ] 6.4 Test: Non-PPAF account — flag `true` does not affect explicit customer hedging +- [ ] 6.5 Test: Flag absent from account properties — existing behavior unchanged +- [ ] 6.6 Test: Dynamic toggle cycle — enable → disable → re-enable with correct strategy restoration + +## 7. Integration Verification + +- [ ] 7.1 Validate end-to-end with mocked Gateway responses containing the `disableCrossRegionalHedging` flag in integration/emulator tests +- [ ] 7.2 Verify no public API surface changes — confirm `DisableCrossRegionalHedging` is internal only and not exposed on `CosmosClientOptions`, `RequestOptions`, or related types