@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-24
88 changes: 88 additions & 0 deletions openspec/changes/ppaf-dynamic-hedging-control/design.md
@@ -0,0 +1,88 @@
## Context

PPAF (Per-Partition Automatic Failover, referred to as partition-level failover in SDK identifiers) is an account-level feature that, when enabled, triggers automatic hedging of read requests with a default threshold of `Min(1000ms, RequestTimeout/2)` and a threshold step of 500ms. The hedging is implemented via `CrossRegionHedgingAvailabilityStrategy`, which is either explicitly configured by the customer on `CosmosClientOptions.AvailabilityStrategy` or automatically created by the SDK in `DocumentClient.InitializePartitionLevelFailoverWithDefaultHedging()`.
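
As a worked example of the default threshold arithmetic, here is a minimal Python sketch (a model, not SDK code). The function names are hypothetical, and the schedule of later hedges (one additional hedge per step after the threshold) is an assumption about how the step is applied.

```python
from datetime import timedelta
from typing import List

# Values taken from the design text above.
DEFAULT_THRESHOLD_CAP = timedelta(milliseconds=1000)
DEFAULT_THRESHOLD_STEP = timedelta(milliseconds=500)


def default_hedging_threshold(request_timeout: timedelta) -> timedelta:
    """Return Min(1000ms, RequestTimeout/2)."""
    return min(DEFAULT_THRESHOLD_CAP, request_timeout / 2)


def hedge_start_offsets(request_timeout: timedelta, hedge_count: int) -> List[timedelta]:
    """Assumed schedule: first hedge at the threshold, then one per step."""
    threshold = default_hedging_threshold(request_timeout)
    return [threshold + i * DEFAULT_THRESHOLD_STEP for i in range(hedge_count)]
```

With a 6-second request timeout the threshold is capped at 1000ms; with an 800ms timeout it drops to 400ms.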

**Current hedging decision flow:**

1. `RequestInvokerHandler` resolves the active `AvailabilityStrategy`: request-level override → client-level → null.
2. If a strategy is present and `Enabled()`, requests are dispatched through `CrossRegionHedgingAvailabilityStrategy.ExecuteAvailabilityStrategyAsync()`.
3. The strategy decides per-request whether to hedge based on resource type (Document only), operation type (reads always, writes only if multi-write enabled), and available regions.
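
The three steps above can be modeled roughly as follows. This is a simplified Python sketch, not the SDK implementation: `Strategy`, `resolve_strategy`, and `should_hedge` are hypothetical names, and the "more than one available region" condition is an assumption about how available regions factor into the decision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    """Stand-in for an availability strategy (hypothetical type)."""
    name: str
    enabled: bool = True


def resolve_strategy(request_level: Optional[Strategy],
                     client_level: Optional[Strategy]) -> Optional[Strategy]:
    # Step 1: request-level override, then client-level, then None.
    return request_level if request_level is not None else client_level


def should_hedge(strategy: Optional[Strategy], resource_type: str, is_read: bool,
                 multi_write_enabled: bool, available_region_count: int) -> bool:
    # Step 2: a strategy must be present and enabled.
    if strategy is None or not strategy.enabled:
        return False
    # Step 3: Document resources only.
    if resource_type != "Document":
        return False
    # Reads always qualify; writes only on multi-write accounts.
    if not is_read and not multi_write_enabled:
        return False
    # Assumption: hedging needs at least one other region to target.
    return available_region_count > 1
```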

**Account properties refresh:**

- `GlobalEndpointManager.RefreshDatabaseAccountInternalAsync()` periodically fetches `AccountProperties` from the Gateway.
- The `OnEnablePartitionLevelFailoverConfigChanged` event fires when PPAF status changes, triggering `DocumentClient.UpdatePartitionLevelFailoverConfigWithAccountRefresh()` to dynamically enable or disable default hedging.

**Problem:** There is no mechanism for on-call engineers to temporarily disable hedging for a PPAF account without rolling back PPAF entirely. Rolling back PPAF is expensive and disrupts other benefits. A Gateway-controlled flag is needed as a targeted escape hatch.

## Goals / Non-Goals

**Goals:**

- Add a new boolean account property (`disableCrossRegionalHedging`) read from Gateway responses.
- When the flag is `true`, disable all hedging (both SDK-default PPAF hedging and explicit customer-configured hedging) for that account.
- When the flag is `false` or absent, preserve existing hedging behavior.
- Support dynamic toggling: as clients observe refreshed account properties, hedging state updates accordingly.
- Keep the feature entirely internal — no new public API surface.

**Non-Goals:**

- Changing customer-authored hedging strategies or their configuration shape.
- Modifying PPAF enablement or onboarding flows.
- Supporting non-PPAF accounts with this flag (for the immediate term).
- Exposing the flag to end users or making it configurable from the SDK.

## Decisions

### 1. Property location: `AccountProperties` with `JsonExtensionData` fallback

**Decision:** Add a strongly-typed `bool?` property `DisableCrossRegionalHedging` to `AccountProperties` with a `[JsonProperty]` attribute mapped to the Gateway JSON key `"disableCrossRegionalHedging"`.

**Rationale:** This is consistent with how `EnablePartitionLevelFailover` is already modeled (Line 249 of `AccountProperties.cs`). A strongly-typed property provides compile-time safety and discoverability. The `[JsonExtensionData] AdditionalProperties` dictionary exists as a fallback for unknown fields, but relying on it would lose type safety and require manual parsing.

**Alternative considered:** Reading from `AdditionalProperties` at evaluation time. Rejected because it introduces fragile string-keyed lookups and inconsistency with the existing pattern for account-level flags.
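
To illustrate the difference between a typed nullable property and the extension-data fallback, here is a small Python model. It is a sketch only: `AccountPropertiesModel` and `parse_account_properties` are hypothetical names, and the only JSON key taken from the source is `"disableCrossRegionalHedging"`.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

FLAG_KEY = "disableCrossRegionalHedging"  # Gateway JSON key named in the design


@dataclass
class AccountPropertiesModel:
    # Typed, nullable flag: None means the key was absent from the response.
    disable_cross_regional_hedging: Optional[bool] = None
    # Catch-all for keys without a typed property (mirrors extension data).
    additional_properties: Dict[str, Any] = field(default_factory=dict)


def parse_account_properties(payload: str) -> AccountPropertiesModel:
    raw = json.loads(payload)
    model = AccountPropertiesModel()
    for key, value in raw.items():
        if key == FLAG_KEY:
            model.disable_cross_regional_hedging = value
        else:
            model.additional_properties[key] = value
    return model
```

Callers read `disable_cross_regional_hedging` directly instead of performing a string-keyed lookup in `additional_properties`, which is the fragility the rejected alternative would introduce.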

### 2. Evaluation point: `GlobalEndpointManager` account-refresh callback

**Decision:** Evaluate the `disableCrossRegionalHedging` flag in `DocumentClient.UpdatePartitionLevelFailoverConfigWithAccountRefresh()` — the same method that already handles dynamic PPAF enable/disable based on account-property changes.

**Rationale:** This method is invoked whenever `GlobalEndpointManager` detects a change in PPAF-related account properties. Adding the hedging-disable check here supports dynamic toggling, provided that a change to `disableCrossRegionalHedging` is itself treated as a PPAF-related change that raises the event; otherwise a flag-only toggle would go unnoticed. It also consolidates all PPAF-related hedging logic in one place.

**Alternative considered:** Evaluating the flag per-request in `RequestInvokerHandler`. Rejected because it would require propagating account properties into the hot path and adds unnecessary per-request overhead. The account-refresh callback runs infrequently and already handles strategy assignment.
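
One way to picture the refresh-side evaluation is a watcher that compares the previous and current values and invokes the callback only when something changed. This Python sketch is an assumption about the shape of that logic, not the SDK's code. `RefreshWatcher` is a hypothetical name, and including the hedging flag in the change detection is an assumption beyond what the source states about the existing event.

```python
from typing import Callable, Optional, Tuple

Snapshot = Tuple[Optional[bool], Optional[bool]]


class RefreshWatcher:
    """Invokes on_changed only when the observed values differ from the last refresh."""

    def __init__(self, on_changed: Callable[[Optional[bool], Optional[bool]], None]) -> None:
        self._last: Optional[Snapshot] = None
        self._on_changed = on_changed

    def observe(self, ppaf_enabled: Optional[bool],
                disable_hedging: Optional[bool]) -> None:
        snapshot = (ppaf_enabled, disable_hedging)
        # Assumption: a change in either value counts as a config change.
        if snapshot != self._last:
            self._last = snapshot
            self._on_changed(ppaf_enabled, disable_hedging)
```

Because the comparison runs once per refresh rather than once per request, the request hot path stays untouched.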

### 3. Enforcement mechanism: Strategy nullification / replacement

**Decision:** When the flag is `true`, set `ConnectionPolicy.AvailabilityStrategy` to `null` (or a `DisabledAvailabilityStrategy` sentinel) to disable hedging. When the flag is toggled back to `false`, re-evaluate and restore the appropriate strategy (explicit customer config or PPAF default).

**Rationale:** `RequestInvokerHandler` already treats a null strategy as "no hedging" (Lines 97-98: `strategy != null && strategy.Enabled()`). Using the existing null-check path avoids introducing new conditional logic in the request hot path. The `DisabledAvailabilityStrategy` subclass already exists for explicit opt-out scenarios, though null assignment is simpler.

**Alternative considered:** Adding an `IsDisabledByGateway` flag to `CrossRegionHedgingAvailabilityStrategy` and checking it in `Enabled()`. Rejected because it would couple the strategy to Gateway concerns and requires changes to the strategy hierarchy.

### 4. Precedence: Gateway flag overrides all hedging configuration

**Decision:** When the Gateway flag is `true`, it overrides both SDK-default PPAF hedging AND any explicit customer-configured `AvailabilityStrategy`. The precedence order becomes:

1. Gateway `disableCrossRegionalHedging = true` → hedging OFF (highest priority)
2. Request-level `AvailabilityStrategy` override
3. Client-level `AvailabilityStrategy` (explicit customer config)
4. PPAF default hedging (if PPAF enabled, no explicit config)

**Rationale:** The flag is an operational safety valve. If on-call determines hedging is causing issues, it must be a hard kill switch regardless of what the customer configured. This prevents scenarios where explicit customer hedging bypasses the mitigation.
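
The precedence order can be expressed as a single resolution function. The Python sketch below is illustrative only: `resolve_effective_strategy` is a hypothetical name, strategies are represented as plain strings, and `gateway_disabled` is assumed to be a value cached from the last account refresh, which is one way for the kill switch to also cover request-level overrides.

```python
from typing import Optional


def resolve_effective_strategy(gateway_disabled: bool,
                               request_level: Optional[str],
                               client_level: Optional[str],
                               ppaf_default: Optional[str]) -> Optional[str]:
    """Return the strategy to execute, or None when hedging is off."""
    # 1. Gateway kill switch wins over everything else.
    if gateway_disabled:
        return None
    # 2. Request-level override.
    if request_level is not None:
        return request_level
    # 3. Client-level explicit customer configuration.
    if client_level is not None:
        return client_level
    # 4. PPAF default (None when PPAF is not enabled).
    return ppaf_default
```

For example, `resolve_effective_strategy(True, "req", "client", "ppaf")` yields `None`, while the same call with `False` yields `"req"`.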

### 5. State tracking: Store flag and original strategy reference

**Decision:** Introduce internal fields in `DocumentClient`:
- `bool disableCrossRegionalHedgingFlag` — cached value of the Gateway flag.
- `AvailabilityStrategy customerConfiguredStrategy` — the customer's original explicit strategy (if any), stored before nullification so it can be restored when the flag is toggled back.

**Rationale:** The SDK must be able to restore the correct hedging behavior when the flag is turned off. Without storing the original strategy, the SDK cannot distinguish between "customer never set a strategy" and "strategy was removed by the flag."
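
A minimal model of the two fields and the toggle cycle might look like this. It is a Python sketch under stated assumptions, not the `DocumentClient` implementation: `HedgingState` and `on_account_refresh` are hypothetical names, strategies are plain strings, and the PPAF default is produced by a caller-supplied factory.

```python
from typing import Callable, Optional


class HedgingState:
    """Tracks the cached Gateway flag and the customer's original strategy."""

    def __init__(self, customer_strategy: Optional[str],
                 ppaf_default_factory: Callable[[], str]) -> None:
        # Kept separately so nullifying the active strategy never loses it.
        self.customer_configured_strategy = customer_strategy
        self.disable_cross_regional_hedging_flag = False
        self._ppaf_default_factory = ppaf_default_factory
        self.active_strategy: Optional[str] = customer_strategy

    def on_account_refresh(self, ppaf_enabled: bool,
                           disable_flag: Optional[bool]) -> None:
        # The flag only applies to PPAF accounts; absent (None) means False.
        self.disable_cross_regional_hedging_flag = ppaf_enabled and disable_flag is True
        if self.disable_cross_regional_hedging_flag:
            self.active_strategy = None
        elif self.customer_configured_strategy is not None:
            self.active_strategy = self.customer_configured_strategy
        elif ppaf_enabled:
            self.active_strategy = self._ppaf_default_factory()
        else:
            self.active_strategy = None
```

Because the customer's strategy lives in its own field, a `None` active strategy is unambiguous: it means either the flag is set or nothing was ever configured, and the cached flag says which.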

## Risks / Trade-offs

- **[Risk] Flag latency** — The flag takes effect only after the next account-properties refresh (default interval is ~5 minutes). → **Mitigation:** A client restart picks up the flag immediately; otherwise the change applies at the next refresh cycle. The existing `GlobalEndpointManager` refresh interval is sufficient for non-emergency toggles.

- **[Risk] Customer confusion if explicit hedging silently disabled** — If a customer configured explicit hedging and on-call disables it via the flag, the customer may observe unexpected behavior. → **Mitigation:** The flag is intended as a temporary operational tool. On-call should communicate with the customer. SDK diagnostics/traces should log when hedging is disabled by the Gateway flag.

- **[Risk] Strategy restoration correctness** — When restoring hedging after the flag is toggled off, the SDK must correctly reconstruct the PPAF default strategy or restore the customer's explicit strategy. → **Mitigation:** Store the original strategy reference before nullification. Unit-test the toggle cycle (enable → disable → re-enable).

- **[Trade-off] Non-PPAF accounts ignored** — The flag is only evaluated for PPAF accounts. A future extension could support non-PPAF accounts, but this adds complexity without current demand.
28 changes: 28 additions & 0 deletions openspec/changes/ppaf-dynamic-hedging-control/proposal.md
@@ -0,0 +1,28 @@
## Why

PPAF-enabled Cosmos DB accounts automatically enable hedging with a default threshold of up to 1 second to fast-track read-region failover. However, production incidents have shown that implicit hedging of long-running queries can cause unexpected exceptions (e.g., `ArgumentException` in `CallStore`). Rolling back PPAF entirely to disable hedging is operationally expensive and disrupts other PPAF benefits, so a targeted, service-side escape hatch is needed to let on-call engineers dynamically disable hedging without customer intervention.

## What Changes

- Introduce a new Gateway account property (`disableCrossRegionalHedging`) that the SDK reads from account-property responses.
- When the flag is `true`, hedging is disabled for the PPAF account regardless of any explicit or implicit hedging configuration.
- When the flag is `false` or absent, existing hedging behavior is preserved (explicit customer config honored; PPAF defaults applied if no explicit config).
- The SDK evaluates the flag dynamically on every account-properties refresh, enabling on-call toggle without customer code changes.
- Non-PPAF accounts ignore the flag.

## Capabilities

### New Capabilities
- `gateway-hedging-override`: Reads a new Gateway account property flag and enforces it as the highest-precedence control over PPAF hedging behavior, supporting dynamic enable/disable at the SDK layer.

### Modified Capabilities
<!-- No existing spec-level capabilities are being modified. The underlying hedging and PPAF plumbing remain unchanged;
only the precedence evaluation gains a new top-level check. -->

## Impact

- **SDK Client layer** (`DocumentClient` / `CosmosClient` internals): hedging-decision logic must incorporate a new precedence check against the Gateway flag before evaluating explicit or default hedging configuration.
- **Account properties model**: new property deserialized from the Gateway response (`AccountProperties` or equivalent DTO).
- **Gateway / service dependency**: the flag is surfaced by the Cosmos DB Gateway; the SDK consumes it read-only.
- **No public API surface changes**: the feature is invisible to end users; no new `CosmosClientOptions` or request-options properties are exposed.
- **Testing**: unit tests for precedence rules; integration tests validating dynamic toggle via mocked account-property responses.
@@ -0,0 +1,141 @@
## ADDED Requirements

### Requirement: Gateway account property for hedging control
The `AccountProperties` model SHALL include a nullable boolean property `DisableCrossRegionalHedging` deserialized from the Gateway JSON key `"disableCrossRegionalHedging"`. The property SHALL default to `null` when absent from the Gateway response.

#### Scenario: Gateway response includes the flag set to true
- **WHEN** the Gateway account-properties response contains `"disableCrossRegionalHedging": true`
- **THEN** the `AccountProperties.DisableCrossRegionalHedging` property SHALL be `true`

#### Scenario: Gateway response includes the flag set to false
- **WHEN** the Gateway account-properties response contains `"disableCrossRegionalHedging": false`
- **THEN** the `AccountProperties.DisableCrossRegionalHedging` property SHALL be `false`

#### Scenario: Gateway response does not include the flag
- **WHEN** the Gateway account-properties response does not contain the `"disableCrossRegionalHedging"` key
- **THEN** the `AccountProperties.DisableCrossRegionalHedging` property SHALL be `null`

---

### Requirement: Gateway flag disables all hedging when true
When the Gateway flag `disableCrossRegionalHedging` is `true`, the SDK SHALL disable all hedging for PPAF-enabled accounts regardless of any explicit or implicit hedging configuration.

#### Scenario: PPAF account with default hedging and flag set to true
- **WHEN** the account has PPAF enabled with SDK-default hedging active
- **AND** the Gateway flag `disableCrossRegionalHedging` is `true`
- **THEN** the SDK SHALL disable hedging
- **AND** requests SHALL NOT be hedged across regions

#### Scenario: PPAF account with explicit customer hedging and flag set to true
- **WHEN** the account has PPAF enabled
- **AND** the customer has configured an explicit `AvailabilityStrategy` via `CosmosClientOptions`
- **AND** the Gateway flag `disableCrossRegionalHedging` is `true`
- **THEN** the SDK SHALL disable hedging
- **AND** the explicit customer strategy SHALL NOT be executed

#### Scenario: PPAF account with request-level hedging override and flag set to true
- **WHEN** the account has PPAF enabled
- **AND** a request has a per-request `AvailabilityStrategy` override set in `RequestOptions`
- **AND** the Gateway flag `disableCrossRegionalHedging` is `true`
- **THEN** the SDK SHALL disable hedging for that request
- **AND** the request-level strategy SHALL NOT be executed

---

### Requirement: Existing behavior preserved when flag is false or absent
When the Gateway flag `disableCrossRegionalHedging` is `false` or absent from the account-properties response, the SDK SHALL preserve existing hedging behavior without any change.

#### Scenario: PPAF account with flag set to false and no explicit hedging
- **WHEN** the account has PPAF enabled
- **AND** the Gateway flag `disableCrossRegionalHedging` is `false`
- **AND** no explicit customer hedging configuration is set
- **THEN** the SDK SHALL enable the default PPAF hedging strategy with threshold `Min(1000ms, RequestTimeout/2)` and step `500ms`

#### Scenario: PPAF account with flag absent and explicit hedging configured
- **WHEN** the account has PPAF enabled
- **AND** the Gateway flag `disableCrossRegionalHedging` is absent from the response
- **AND** the customer has configured an explicit `AvailabilityStrategy`
- **THEN** the SDK SHALL honor the customer's explicit hedging configuration

#### Scenario: PPAF account with flag set to false and explicit hedging configured
- **WHEN** the account has PPAF enabled
- **AND** the Gateway flag `disableCrossRegionalHedging` is `false`
- **AND** the customer has configured an explicit `AvailabilityStrategy`
- **THEN** the SDK SHALL honor the customer's explicit hedging configuration

---

### Requirement: Dynamic toggling via account-properties refresh
The SDK SHALL evaluate the Gateway flag on each account-properties refresh cycle and dynamically enable or disable hedging as the flag value changes, without requiring client restart.

#### Scenario: Flag toggled from false to true during runtime
- **WHEN** the Gateway flag `disableCrossRegionalHedging` was `false` (or absent) at client initialization
- **AND** hedging was active (default or explicit)
- **AND** the Gateway flag is changed to `true`
- **AND** the SDK observes the updated account properties via the next refresh cycle
- **THEN** the SDK SHALL disable hedging

#### Scenario: Flag toggled from true to false during runtime
- **WHEN** the Gateway flag `disableCrossRegionalHedging` was `true` and hedging was disabled
- **AND** the Gateway flag is changed to `false`
- **AND** the SDK observes the updated account properties via the next refresh cycle
- **THEN** the SDK SHALL re-enable hedging using the appropriate strategy
- **AND** if the customer had configured an explicit strategy, that strategy SHALL be restored
- **AND** if no explicit strategy was configured, the SDK-default PPAF hedging strategy SHALL be applied

#### Scenario: Flag toggled from true to false with no prior explicit strategy
- **WHEN** the Gateway flag `disableCrossRegionalHedging` transitions from `true` to `false`
- **AND** the customer did not configure an explicit `AvailabilityStrategy`
- **AND** the account has PPAF enabled
- **THEN** the SDK SHALL re-enable the default PPAF hedging strategy

---

### Requirement: Non-PPAF accounts ignore the flag
The SDK SHALL NOT evaluate or act on the `disableCrossRegionalHedging` flag for accounts that do not have PPAF enabled.

#### Scenario: Non-PPAF account with flag set to true
- **WHEN** the account does NOT have PPAF enabled (`EnablePartitionLevelFailover` is `false` or absent)
- **AND** the Gateway flag `disableCrossRegionalHedging` is `true`
- **THEN** the SDK SHALL ignore the flag
- **AND** any explicit customer hedging configuration SHALL continue to function normally

#### Scenario: Non-PPAF account with explicit hedging and flag set to true
- **WHEN** the account does NOT have PPAF enabled
- **AND** the customer has configured an explicit `AvailabilityStrategy`
- **AND** the Gateway flag `disableCrossRegionalHedging` is `true`
- **THEN** the SDK SHALL NOT disable the customer's explicit hedging strategy
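
The gating rule in this requirement reduces to a small predicate over two nullable values. The Python sketch below is illustrative: `gateway_disable_applies` is a hypothetical name, and treating an absent value (`None`) the same as `false` follows the scenarios above.

```python
from typing import Optional


def gateway_disable_applies(ppaf_enabled: Optional[bool],
                            disable_flag: Optional[bool]) -> bool:
    """True only when the account has PPAF enabled AND the flag is true."""
    # "is True" treats both False and None (absent) as "not set".
    return ppaf_enabled is True and disable_flag is True
```

Only the combination of PPAF enabled and the flag set to `true` disables hedging; every other combination leaves existing behavior in place.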

---

### Requirement: Feature is invisible to end users
The Gateway hedging override flag SHALL NOT be exposed through any public SDK API surface. There SHALL be no new public properties on `CosmosClientOptions`, `RequestOptions`, or any other user-facing type related to this flag.

#### Scenario: No public API surface for the flag
- **WHEN** a developer inspects the public API of `CosmosClientOptions`, `ItemRequestOptions`, `QueryRequestOptions`, or `ChangeFeedRequestOptions`
- **THEN** there SHALL be no property or method related to `disableCrossRegionalHedging`

#### Scenario: Diagnostics logging when hedging is disabled by flag
- **WHEN** hedging is disabled due to the Gateway flag being `true`
- **THEN** the SDK SHALL include a trace or diagnostic entry indicating that hedging was disabled by a Gateway account property
- **AND** this information SHALL be available in SDK diagnostics for supportability

---

### Requirement: Precedence rules for hedging evaluation
The SDK SHALL evaluate hedging configuration using the following strict precedence order:
1. Gateway `disableCrossRegionalHedging = true` → hedging OFF (highest priority)
2. If Gateway flag is `false` or absent → evaluate existing rules (request-level override → client-level strategy → PPAF default)

#### Scenario: Gateway flag true takes precedence over all other configuration
- **WHEN** the Gateway flag `disableCrossRegionalHedging` is `true`
- **AND** the customer has configured an explicit `AvailabilityStrategy` at the client level
- **AND** a request has a per-request `AvailabilityStrategy` override
- **THEN** the SDK SHALL disable hedging for that request
- **AND** neither the client-level nor request-level strategy SHALL be executed

#### Scenario: Gateway flag false defers to existing precedence
- **WHEN** the Gateway flag `disableCrossRegionalHedging` is `false`
- **AND** the customer has configured an explicit `AvailabilityStrategy` at the client level
- **AND** a request has a per-request `AvailabilityStrategy` override
- **THEN** the request-level strategy SHALL be used (existing precedence preserved)