[Internal] PPAF: Adds Dynamic Enablement of PPAF#5310

Merged
FabianMeiswinkel merged 29 commits into master from users/nalutripician/ppafDynamicEnable on Aug 13, 2025
Contributor

@NaluTripician NaluTripician commented Jul 22, 2025

Pull Request Template

Description

This pull request enhances the per-partition automatic failover (PPAF) functionality in the Azure Cosmos SDK. It adds a default cross-region hedging strategy, enables PPAF dynamically based on the database account configuration without restarting the SDK client, and adds tests that validate these behaviors. The most important changes, grouped by theme:

Enhancements to Availability Strategy:

  • Added a new method SDKDefaultCrossRegionHedgingStrategy in AvailabilityStrategy.cs to provide a default hedging strategy for cross-region failover, including support for write requests on multi-region accounts.
  • Introduced an internal flag IsSDKDefaultStrategy in CrossRegionHedgingAvailabilityStrategy to differentiate SDK default strategies from custom ones. Updated the constructor to accept this flag.
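
The relationship between a default hedging strategy and the IsSDKDefaultStrategy flag can be sketched as follows. This is a minimal, self-contained illustration and not the SDK's code: HedgingStrategySketch, SdkDefault, and DelayBeforeAttempt are hypothetical names, the threshold values are supplied by the caller rather than taken from the SDK, and the assumption that the first hedged request is sent after the threshold and later ones at threshold-step intervals is stated here only for illustration.

```csharp
using System;

// Hypothetical stand-in for the SDK's hedging strategy type; not the real class.
public sealed class HedgingStrategySketch
{
    public TimeSpan Threshold { get; }
    public TimeSpan ThresholdStep { get; }
    public bool EnableMultiWriteRegionHedge { get; }

    // Mirrors the idea of the internal flag: true only for the strategy the SDK creates itself.
    public bool IsSdkDefaultStrategy { get; }

    public HedgingStrategySketch(
        TimeSpan threshold,
        TimeSpan thresholdStep,
        bool enableMultiWriteRegionHedge = false,
        bool isSdkDefaultStrategy = false)
    {
        if (threshold <= TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(threshold));
        if (thresholdStep <= TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(thresholdStep));

        this.Threshold = threshold;
        this.ThresholdStep = thresholdStep;
        this.EnableMultiWriteRegionHedge = enableMultiWriteRegionHedge;
        this.IsSdkDefaultStrategy = isSdkDefaultStrategy;
    }

    // Assumed factory shape: marks the instance as SDK-created and hedges writes as well.
    public static HedgingStrategySketch SdkDefault(int thresholdMs, int thresholdStepMs)
    {
        return new HedgingStrategySketch(
            TimeSpan.FromMilliseconds(thresholdMs),
            TimeSpan.FromMilliseconds(thresholdStepMs),
            enableMultiWriteRegionHedge: true,
            isSdkDefaultStrategy: true);
    }

    // Delay before attempt N is sent: attempt 0 is the primary request (no delay),
    // attempt 1 waits for Threshold, and each later attempt adds one ThresholdStep.
    public TimeSpan DelayBeforeAttempt(int attempt)
    {
        if (attempt < 0) throw new ArgumentOutOfRangeException(nameof(attempt));
        if (attempt == 0) return TimeSpan.Zero;
        return TimeSpan.FromTicks(this.Threshold.Ticks + ((attempt - 1) * this.ThresholdStep.Ticks));
    }
}
```

Keeping the flag on the strategy object lets later code tell an SDK-created strategy apart from one the application configured, without tracking that state separately.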

Dynamic PPAF Enablement:

  • Updated GlobalEndpointManager.cs to dynamically enable or disable PPAF based on the enablePartitionLevelFailover flag retrieved from the database account properties. Added logic to reset the availability strategy to null if PPAF is disabled and no custom strategy is set.
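
The enable/disable rule described above can be sketched in isolation. This is an illustrative sketch under stated assumptions, not the code in GlobalEndpointManager.cs: PpafConfigSketch, Strategy, and OnAccountPropertiesRefreshed are hypothetical names, and a missing flag is assumed to mean "disabled".

```csharp
using System;

public sealed class PpafConfigSketch
{
    // Hypothetical marker for a hedging strategy; IsSdkDefault plays the role of the
    // flag that separates an SDK-created strategy from a user-supplied one.
    public sealed class Strategy
    {
        public Strategy(bool isSdkDefault) { this.IsSdkDefault = isSdkDefault; }
        public bool IsSdkDefault { get; }
    }

    private readonly object gate = new object();

    public PpafConfigSketch(Strategy customStrategy = null)
    {
        this.AvailabilityStrategy = customStrategy;
    }

    public bool IsPpafEnabled { get; private set; }

    // null means "no hedging configured".
    public Strategy AvailabilityStrategy { get; private set; }

    // Called whenever a background refresh returns fresh account properties.
    public void OnAccountPropertiesRefreshed(bool? enablePartitionLevelFailover)
    {
        bool enable = enablePartitionLevelFailover ?? false; // assumption: absent flag == disabled

        lock (this.gate)
        {
            this.IsPpafEnabled = enable;

            if (enable && this.AvailabilityStrategy == null)
            {
                // PPAF turned on and nothing configured: install the SDK default.
                this.AvailabilityStrategy = new Strategy(isSdkDefault: true);
            }
            else if (!enable && this.AvailabilityStrategy != null && this.AvailabilityStrategy.IsSdkDefault)
            {
                // PPAF turned off: drop only the strategy the SDK itself installed.
                this.AvailabilityStrategy = null;
            }
        }
    }
}
```

A strategy supplied by the application is never touched in either direction, which is the property the "no custom strategy is set" condition protects.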

Default Hedging Thresholds:

  • Changed the visibility of DefaultHedgingThresholdInMilliseconds and DefaultHedgingThresholdStepInMilliseconds in DocumentClient.cs from private to internal for broader accessibility within the SDK.
  • Updated the initialization logic in InitializePartitionLevelFailoverWithDefaultHedging to use the new SDKDefaultCrossRegionHedgingStrategy.

Tests for PPAF Functionality:

  • Added a new integration test ReadItemAsync_WithPPAFDynamicOverride in CosmosItemIntegrationTests.cs to validate dynamic PPAF enablement, hedging behavior, and fallback when PPAF is disabled. This includes fault injection and diagnostics validation.

End to End Validation:

CreateItemAsync: Strong Consistency Account with Direct Mode:

[Screenshot 2025-08-07 162428]

CreateItemAsync: Strong Consistency Account with Gateway Mode:

[image]

CreateItemAsync: Session Consistency Account with Direct Mode:

[image]

CreateItemAsync: Session Consistency Account with Gateway Mode:

[image]

[Note: The orange graph indicates the number of requests processed in the North Central US region, whereas the blue graph indicates the number of requests processed in the Central US region.]

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Closing issues

To automatically close an issue: closes #5304

@kundadebdatta kundadebdatta marked this pull request as draft July 30, 2025 01:53
Contributor

Copilot AI left a comment

Pull Request Overview

This pull request introduces dynamic enablement of Per-Partition Automatic Failover (PPAF) in the Azure Cosmos SDK, allowing the SDK to enable/disable PPAF at runtime based on the database account configuration without requiring a client restart. The changes include a default cross-region hedging strategy for PPAF, thread-safe dynamic configuration updates, and comprehensive test coverage.

  • Adds dynamic PPAF enablement/disablement based on database account properties retrieved during background refresh
  • Introduces SDK default cross-region hedging strategy specifically for PPAF scenarios
  • Updates constructor signatures across multiple components to support the new GlobalPartitionEndpointManager architecture

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| CosmosItemIntegrationTests.cs | Adds comprehensive integration test for dynamic PPAF override behavior with fault injection |
| UserAgentContainer.cs | Updates feature appending logic to handle dynamic feature changes |
| DocumentClient.cs | Implements dynamic PPAF configuration updates and event handling for account property changes |
| GlobalEndpointManager.cs | Adds event for PPAF configuration changes and detection logic during account refresh |
| AvailabilityStrategy.cs | Introduces SDK default cross-region hedging strategy method for PPAF |
| CrossRegionHedgingAvailabilityStrategy.cs | Adds internal flag to identify SDK default strategies |
| GlobalPartitionEndpointManagerCore.cs | Implements thread-safe PPAF/PPCB enablement with atomic operations |
| GlobalPartitionEndpointManager.cs | Defines abstract methods for dynamic PPAF/PPCB configuration |
| GlobalPartitionEndpointManagerNoOp.cs | Implements no-op versions of new PPAF/PPCB methods |
| GatewayStoreModel.cs, GatewayStoreClient.cs, ThinClientStoreClient.cs | Updates constructors to use GlobalPartitionEndpointManager instead of boolean flags |
| Multiple test files | Updates test constructors to accommodate new GlobalPartitionEndpointManager parameter requirements |
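
One common way to get the "thread-safe enablement with atomic operations" mentioned for GlobalPartitionEndpointManagerCore.cs is to store the flag as an int and update it with Interlocked. The sketch below only illustrates that general technique; FailoverFlagSketch and its members are hypothetical and are not taken from the SDK.

```csharp
using System.Threading;

// Hypothetical holder for a runtime-toggled feature flag.
public sealed class FailoverFlagSketch
{
    // 0 = disabled, 1 = enabled. An int is used because Interlocked has no bool overloads.
    private int ppafEnabled;

    public FailoverFlagSketch(bool initiallyEnabled)
    {
        this.ppafEnabled = initiallyEnabled ? 1 : 0;
    }

    // Volatile.Read ensures request threads observe the latest value written by a refresh thread.
    public bool IsPpafEnabled => Volatile.Read(ref this.ppafEnabled) == 1;

    // Atomically stores the new value; returns true only when the value actually changed,
    // so a caller can decide whether to raise a "configuration changed" notification.
    public bool SetPpafEnabled(bool enabled)
    {
        int newValue = enabled ? 1 : 0;
        int oldValue = Interlocked.Exchange(ref this.ppafEnabled, newValue);
        return oldValue != newValue;
    }
}
```

Because the exchange returns the previous value, two concurrent refreshes that both try to enable the feature cannot both report a change.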
Comments suppressed due to low confidence (1)

Microsoft.Azure.Cosmos/src/UserAgentContainer.cs:56

  • IndexOf can return -1 if the character is not found, which would cause Substring to throw an exception. The Contains check above should protect against this, but the logic could be more robust.
                        ? this.Suffix.Substring(this.Suffix.IndexOf('|') + 1)
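
A more explicit form of the quoted expression can be sketched as below. SuffixSketch and AfterFirstPipe are hypothetical names, and the assumption that the text after the first '|' is the part of interest comes only from the quoted line. Note that with the "+ 1" in the quoted code, an IndexOf result of -1 becomes Substring(0), which returns the whole string rather than throwing; the explicit check simply makes that fallback intentional.

```csharp
using System;

public static class SuffixSketch
{
    // Returns the text after the first '|', the whole input when no '|' is present,
    // and an empty string for null or empty input.
    public static string AfterFirstPipe(string suffix)
    {
        if (string.IsNullOrEmpty(suffix))
        {
            return string.Empty;
        }

        int index = suffix.IndexOf('|');
        return index < 0 ? suffix : suffix.Substring(index + 1);
    }
}
```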

@kundadebdatta kundadebdatta self-assigned this Aug 7, 2025
ananth7592 previously approved these changes Aug 8, 2025
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

LGTM except for one small question

Member

@FabianMeiswinkel FabianMeiswinkel left a comment

LGTM except for the regex compilation comment

Member

@FabianMeiswinkel FabianMeiswinkel left a comment

LGTM - Thanks!

@kundadebdatta kundadebdatta changed the title PPAF: Adds Dynamic Enablement of PPAF [Internal] PPAF: Adds Dynamic Enablement of PPAF Aug 12, 2025
@NaluTripician NaluTripician added and then removed the auto-merge label (Enables automation to merge PRs) Aug 13, 2025
@FabianMeiswinkel FabianMeiswinkel merged commit 9fc8b0a into master Aug 13, 2025
28 checks passed
@FabianMeiswinkel FabianMeiswinkel deleted the users/nalutripician/ppafDynamicEnable branch August 13, 2025 13:04
ananth7592 added a commit that referenced this pull request May 1, 2026
…ind PPAF

When ExcludeRegions filters out all preferred read regions and PPAF
(Partition Level Failover) is enabled, GetApplicableEndpoints now falls back
to WriteEndpoints[0] (dynamic, tracks current write region) instead of
this.defaultEndpoint (static, region-agnostic URI set once at init).

The fix is gated behind isPartitionLevelFailoverEnabled (Func<bool>) wired
from ConnectionPolicy.EnablePartitionLevelFailover through GlobalEndpointManager,
supporting dynamic enablement per PR #5310.

When PPAF is disabled, original behavior (defaultEndpoint fallback) is preserved.

Fixes #5821

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
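
The fallback rule described in the commit message above can be sketched as follows. This is a simplified illustration and not the SDK's GetApplicableEndpoints: EndpointSelectorSketch, its constructor, and the tuple-based region list are hypothetical, and only the names defaultEndpoint, WriteEndpoints, and isPartitionLevelFailoverEnabled are borrowed from the message.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class EndpointSelectorSketch
{
    private readonly Uri defaultEndpoint;                        // static, set once at init
    private readonly Func<bool> isPartitionLevelFailoverEnabled; // evaluated on every call

    public EndpointSelectorSketch(Uri defaultEndpoint, Func<bool> isPartitionLevelFailoverEnabled)
    {
        this.defaultEndpoint = defaultEndpoint ?? throw new ArgumentNullException(nameof(defaultEndpoint));
        this.isPartitionLevelFailoverEnabled = isPartitionLevelFailoverEnabled
            ?? throw new ArgumentNullException(nameof(isPartitionLevelFailoverEnabled));
    }

    // Current write endpoints; index 0 is assumed to track the current write region.
    public IReadOnlyList<Uri> WriteEndpoints { get; set; } = new List<Uri>();

    public IReadOnlyList<Uri> GetApplicableEndpoints(
        IReadOnlyList<(string Region, Uri Endpoint)> preferredReadRegions,
        IEnumerable<string> excludeRegions)
    {
        List<Uri> applicable = preferredReadRegions
            .Where(r => !excludeRegions.Contains(r.Region, StringComparer.OrdinalIgnoreCase))
            .Select(r => r.Endpoint)
            .ToList();

        if (applicable.Count > 0)
        {
            return applicable;
        }

        // Every preferred region was excluded: choose the fallback based on the PPAF gate.
        Uri fallback = this.isPartitionLevelFailoverEnabled() && this.WriteEndpoints.Count > 0
            ? this.WriteEndpoints[0]
            : this.defaultEndpoint;

        return new List<Uri> { fallback };
    }
}
```

Passing the gate as a Func<bool> rather than a bool means a change made by dynamic enablement is picked up on the next call without rebuilding the selector.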

Labels

auto-merge (Enables automation to merge PRs), PerPartitionAutomaticFailover

Development

Successfully merging this pull request may close these issues.

[Per Partition Automatic Failover] - Enable PPAF Dynamically upon change on Account Properties Metadata Response

5 participants