[Per Partition Automatic Failover] Enable PPAF Dynamically Using Targeted Event-Based Updates with Thread-Safe Operations #5326

Closed
Copilot wants to merge 12 commits into master from copilot/fix-5304

Contributor

Copilot AI commented Jul 30, 2025

This PR implements dynamic Per Partition Automatic Failover (PPAF) refresh, which lets the Azure Cosmos DB .NET SDK update PPAF settings at runtime, without an SDK restart, when account properties change.

Problem

Currently, the SDK fetches the EnablePartitionLevelFailover flag only during initialization via the GET ACCOUNT metadata call. If the account updates this flag after SDK initialization, customers need to restart the SDK for the change to take effect, creating operational challenges.

Solution

Leveraging Existing Background Refresh Infrastructure

Instead of creating a separate background task, this implementation enhances the existing GlobalEndpointManager.StartLocationBackgroundRefreshLoop() which already refreshes account properties every 5 minutes. This approach:

  • Avoids duplicate background tasks - Uses the well-tested existing refresh mechanism
  • Reduces resource consumption - Single centralized background refresh task
  • Maintains consistency - All account property updates flow through the same pathway

Targeted Event-Based Architecture

GlobalEndpointManager Enhancement:

  • Added OnEnablePartitionLevelFailoverConfigChanged event that fires only when PPAF status changes
  • Enhanced RefreshDatabaseAccountInternalAsync() to track previous PPAF values and emit targeted events
  • Eliminates unnecessary event invocations when PPAF settings haven't changed

DocumentClient Direct Integration:

  • Subscribes directly to GlobalEndpointManager's targeted OnEnablePartitionLevelFailoverConfigChanged events
  • Simplified HandleEnablePartitionLevelFailoverConfigChanged() method processes only actual PPAF changes
  • Handles dynamic PPAF configuration updates efficiently when changes are detected
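
The change-detection pattern described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: EndpointManagerSketch and OnAccountRefreshed are hypothetical names, and the sketch assumes the refresh loop calls into it from a single thread.

```csharp
using System;

// Hypothetical stand-in for the endpoint manager; not the SDK's actual class.
public sealed class EndpointManagerSketch
{
    // Last PPAF value seen by the refresh loop (null until the account reports one).
    private bool? lastKnownPpafEnabled;

    // Raised only when the PPAF flag differs from the previously observed value.
    public event Action<bool?> OnEnablePartitionLevelFailoverConfigChanged;

    // Assumed to be called by the periodic account refresh with the freshly read flag.
    public void OnAccountRefreshed(bool? enablePartitionLevelFailover)
    {
        if (this.lastKnownPpafEnabled == enablePartitionLevelFailover)
        {
            return; // No change: skip the event entirely.
        }

        this.lastKnownPpafEnabled = enablePartitionLevelFailover;
        this.OnEnablePartitionLevelFailoverConfigChanged?.Invoke(enablePartitionLevelFailover);
    }
}
```

A subscriber attached with += would then run only on actual transitions (for example true to false), not on every refresh cycle.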

Dynamic PPAF Configuration Updates

When EnablePartitionLevelFailover changes in account properties, the SDK automatically:

  1. Updates Connection Policy: Sets EnablePartitionLevelFailover and EnablePartitionLevelCircuitBreaker flags
  2. Configures Read Hedging: Enables default hedging strategy when PPAF is enabled (respects existing user configurations)
  3. Updates Circuit Breaker: Enables per-partition circuit breaker when PPAF is enabled
  4. Recreates Endpoint Manager: Creates new GlobalPartitionEndpointManagerCore instance with updated settings using thread-safe atomic operations
  5. Updates User Agent Features: Refreshes user agent to reflect new PPAF configuration for proper telemetry
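
As a rough sketch of steps 1 to 3, the handler could look something like the code below. PolicySketch, PpafHandlerSketch, and the string-valued AvailabilityStrategy are hypothetical simplifications rather than SDK types. The sketch also assumes that a missing (null) flag is treated as disabled, that both flags follow the account value, and that an existing strategy is left untouched when PPAF is turned off; the PR description does not spell out those cases.

```csharp
using System;

// Hypothetical, simplified stand-in for the connection policy.
public sealed class PolicySketch
{
    public bool EnablePartitionLevelFailover { get; set; }
    public bool EnablePartitionLevelCircuitBreaker { get; set; }

    // null means the user did not configure an availability strategy.
    public string AvailabilityStrategy { get; set; }
}

public static class PpafHandlerSketch
{
    public const string DefaultHedging = "default-hedging";

    // Applies a changed PPAF flag to the policy.
    public static void Apply(PolicySketch policy, bool? ppafEnabled)
    {
        bool enabled = ppafEnabled ?? false; // Assumption: null is treated as disabled.

        policy.EnablePartitionLevelFailover = enabled;
        policy.EnablePartitionLevelCircuitBreaker = enabled;

        // Only fall back to default hedging when the user has not chosen a strategy.
        if (enabled && policy.AvailabilityStrategy == null)
        {
            policy.AvailabilityStrategy = DefaultHedging;
        }
    }
}
```

The null check on AvailabilityStrategy is what preserves a user-configured strategy when PPAF is switched on.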

Thread Safety

Atomic Operations: Used Interlocked.Exchange for atomic updates to PartitionKeyRangeLocation to prevent thread contention during dynamic updates. The implementation:

  • Converts auto-property to field with property accessor for atomic reference updates
  • Uses Interlocked.Exchange(ref this.partitionKeyRangeLocation, newValue) to atomically swap references
  • Ensures thread-safe disposal of old instances while preventing race conditions during concurrent request processing
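
The field-plus-accessor swap can be illustrated with a small sketch. LocationCacheSketch and ClientSketch are hypothetical types used only for illustration, and disposing the previous instance immediately after the swap is an assumption of this sketch.

```csharp
using System;
using System.Threading;

// Hypothetical disposable resource standing in for the partition endpoint manager.
public sealed class LocationCacheSketch : IDisposable
{
    public bool IsDisposed { get; private set; }

    public void Dispose() => this.IsDisposed = true;
}

public sealed class ClientSketch
{
    // Backing field instead of an auto-property, so it can be passed by ref.
    private LocationCacheSketch partitionKeyRangeLocation = new LocationCacheSketch();

    // Readers always observe a fully published reference.
    public LocationCacheSketch PartitionKeyRangeLocation =>
        Volatile.Read(ref this.partitionKeyRangeLocation);

    // Atomically publishes the new instance and disposes the one it replaced.
    public void Replace(LocationCacheSketch newValue)
    {
        LocationCacheSketch old = Interlocked.Exchange(ref this.partitionKeyRangeLocation, newValue);
        old?.Dispose();
    }
}
```

Interlocked.Exchange returns the previous reference, so exactly one caller ends up responsible for disposing each old instance. A request that read the old reference just before the swap may still be using it, so a real implementation would need the old instance to tolerate that.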

Usage

// No code changes required - everything happens automatically
CosmosClient client = new CosmosClient(connectionString);

// SDK now automatically:
// 1. Uses existing GlobalEndpointManager background refresh (every 5 minutes)
// 2. Detects PPAF enablement changes with targeted event handling
// 3. Updates read hedging and circuit breaker settings only when needed
// 4. Recreates partition endpoint manager with thread-safe atomic operations
// 5. Updates user agent features to reflect dynamic changes

Implementation Details

  • Efficient Event Handling: Events are only fired when PPAF status actually changes, reducing unnecessary processing
  • Thread Safety: All operations properly synchronized with Interlocked.Exchange for atomic reference updates, preventing thread contention during dynamic updates
  • Backward Compatibility: Zero breaking changes, respects client-level override settings
  • Resource Efficiency: Single background refresh task with targeted event handling
  • Observability: Comprehensive trace logging for monitoring dynamic updates
  • User Agent Telemetry: Automatically updates user agent features when PPAF configuration changes for accurate service telemetry

Testing

Updated unit tests to reflect the targeted event architecture:

  • Direct property update validation with change detection
  • Proper handling of null values and edge cases
  • Validation of thread-safe atomic operations

The implementation satisfies all acceptance criteria:

  • ✅ Read hedging automatically enabled/configured when PPAF is enabled
  • ✅ Per-partition circuit breaker enabled when PPAF is enabled
  • ✅ Dynamic updates without SDK restart required
  • ✅ Preserves user-configured availability strategies
  • ✅ Thread-safe with proper resource management using atomic operations
  • ✅ Efficient resource utilization with targeted event architecture
  • ✅ User agent features updated to reflect dynamic PPAF changes for accurate telemetry

Fixes #5304.



Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
Copilot AI changed the title from "[WIP] [Per Partition Automatic Failover] - Enable PPAF Dynamically upon change on Account Properties Metadata Response" to "[Per Partition Automatic Failover] Enable PPAF Dynamically upon Account Properties Metadata Response Changes" on Jul 30, 2025
Copilot AI requested a review from kundadebdatta July 30, 2025 16:49
/// <summary>
/// Initializes and starts the background account properties refresh task
/// </summary>
private void InitializeBackgroundRefresh()
Member

There is already a background task in GlobalEndpointManager.InitializeAccountPropertiesAndStartBackgroundRefresh.

Let's try to leverage that instead of creating a new one.

Contributor Author

You're absolutely right! I've refactored the implementation to leverage the existing background task in GlobalEndpointManager.StartLocationBackgroundRefreshLoop() instead of creating a duplicate one.

Changes made:

  • Removed separate background refresh task from CosmosAccountServiceConfiguration
  • Enhanced existing GlobalEndpointManager.RefreshDatabaseAccountInternalAsync() to raise OnAccountPropertiesRefreshed events
  • Updated CosmosAccountServiceConfiguration to receive updates via UpdateAccountProperties() method
  • Modified DocumentClient to subscribe to GlobalEndpointManager events for PPAF changes

This approach is much cleaner as it:

  • ✅ Avoids duplicate background tasks
  • ✅ Reduces resource consumption
  • ✅ Leverages the existing, well-tested refresh mechanism
  • ✅ Keeps responsibilities clear

The functionality remains the same - dynamic PPAF updates when account properties change - but now uses a single, centralized background refresh task. (commit: 6941697)

…ground task

Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
Copilot AI changed the title from "[Per Partition Automatic Failover] Enable PPAF Dynamically upon Account Properties Metadata Response Changes" to "[Per Partition Automatic Failover] Enable PPAF Dynamically Using Existing GlobalEndpointManager Background Refresh" on Jul 30, 2025
Copilot AI requested a review from kundadebdatta July 30, 2025 17:28
@github-actions github-actions Bot left a comment

Please follow the required format: "[Internal] Category: (Adds|Fixes|Refactors|Removes) Description"

Internal should be used for PRs that have no customer impact. This flag is used to help generate the changelog to know which PRs should be included. Examples:
Diagnostics: Adds GetElapsedClientLatency to CosmosDiagnostics
PartitionKey: Fixes null reference when using default(PartitionKey)
[v4] Client Encryption: Refactors code to external project
[Internal] Query: Adds code generator for CosmosNumbers for easy additions in the future.

/// <summary>
/// Event that is raised when account properties are refreshed and PPAF enablement status changes
/// </summary>
internal event Action<bool?> OnEnablePartitionLevelFailoverChanged;
Member

Do we really need this event anymore? Can we not leverage the OnAccountPropertiesRefreshed event from GlobalEndpointManager directly?

Contributor Author

You're absolutely right! I've removed the redundant OnEnablePartitionLevelFailoverChanged event and simplified the architecture to leverage the existing OnAccountPropertiesRefreshed event from GlobalEndpointManager directly.

Changes made:

  • Removed OnEnablePartitionLevelFailoverChanged event from CosmosAccountServiceConfiguration
  • Enhanced DocumentClient.HandleAccountPropertiesRefreshed() to directly compare previous and new PPAF values and handle changes
  • Simplified CosmosAccountServiceConfiguration.UpdateAccountProperties() to only update properties without event firing
  • Updated unit tests to reflect the new direct architecture

This eliminates the intermediate event layer and provides a cleaner, more direct flow:
GlobalEndpointManager.OnAccountPropertiesRefreshed → DocumentClient.HandleAccountPropertiesRefreshed() → direct PPAF change handling

The functionality remains the same but with reduced complexity and fewer event subscriptions. (commit: 395c4bd)

@kundadebdatta
Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…rage GlobalEndpointManager event directly

Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
Copilot AI changed the title from "[Per Partition Automatic Failover] Enable PPAF Dynamically Using Existing GlobalEndpointManager Background Refresh" to "[Per Partition Automatic Failover] Enable PPAF Dynamically Using GlobalEndpointManager Event Directly" on Jul 30, 2025
Copilot AI requested a review from kundadebdatta July 30, 2025 18:36
/// </summary>
/// <param name="defaultValue">An integer containing the default value for the refresh interval in seconds.</param>
/// <returns>An integer representing the refresh interval in seconds.</returns>
public static int GetAccountPropertiesRefreshIntervalInSeconds(
Member

There is no usage as we got rid of the redundant code for account refresh. Please remove this.

Contributor Author

You're correct! I've removed the unused GetAccountPropertiesRefreshIntervalInSeconds method and the associated AccountPropertiesRefreshIntervalInSeconds constant since they're no longer needed after refactoring to use the existing GlobalEndpointManager background task. (commit: 47dcb90)

GlobalEndpointManager.ParseThinClientLocationsFromAdditionalProperties(accountProperties);

GlobalEndpointManager.ParseThinClientLocationsFromAdditionalProperties(accountProperties);
Member

Is this really a change? Please fix the code diff.

Contributor Author

Fixed the code diff by reverting the file to original state and reapplying only the necessary changes. Removed the unnecessary whitespace-only change that was appearing in the diff. (commit: 526d0d8)

/// A read-only string containing the environment variable name for capturing the account properties refresh interval
/// in seconds. The default value for this interval is 300 seconds (5 minutes).
/// </summary>
internal static readonly string AccountPropertiesRefreshIntervalInSeconds = "AZURE_COSMOS_ACCOUNT_PROPERTIES_REFRESH_INTERVAL_IN_SECONDS";
Member

Remove this as well.

Contributor Author

Removed the empty line that was left after removing the AccountPropertiesRefreshIntervalInSeconds constant. (commit: 526d0d8)

… constant

Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
/// Updates the account properties
/// This method is called by GlobalEndpointManager when account properties are refreshed
/// </summary>
internal void UpdateAccountProperties(AccountProperties newProperties)
Member

This is not needed. Let's get this removed.

Contributor Author

Removed the UpdateAccountProperties method and updated the event handler to focus only on PPAF handling without updating accountServiceConfiguration. Also simplified the tests accordingly. (commit: e610a29)

…e only on PPAF changes

Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
Copilot AI changed the title from "[Per Partition Automatic Failover] Enable PPAF Dynamically Using GlobalEndpointManager Event Directly" to "[Per Partition Automatic Failover] Enable PPAF Dynamically Using Targeted Event-Based Updates" on Jul 30, 2025
Copilot AI requested a review from kundadebdatta July 30, 2025 19:33
…erviceConfiguration

Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
@kundadebdatta
Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


// Update the GlobalPartitionEndpointManager
this.UpdateGlobalPartitionEndpointManager();

Member

Please update the user agent features here as well, using this.ConnectionPolicy.UserAgentContainer.AppendFeatures.

Contributor Author

Added the user agent features update using this.ConnectionPolicy.UserAgentContainer.AppendFeatures(this.GetUserAgentFeatures()) when PPAF configuration changes dynamically. This ensures the user agent reflects the current PPAF state after dynamic updates. (commit: 49b3135)

Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
@kundadebdatta
Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

}

// Update the partition key range location
this.PartitionKeyRangeLocation = newPartitionKeyRangeLocation;
Member

Could this cause any thread contention? Can we use Interlocked.CompareExchange to swap the object?

Contributor Author

Fixed thread contention issue by using Interlocked.Exchange for atomic updates to PartitionKeyRangeLocation. Changed from auto-property to field with property accessor and used Interlocked.Exchange(ref this.partitionKeyRangeLocation, newPartitionKeyRangeLocation) to atomically update the reference while avoiding race conditions during concurrent access. (commit: ec9fd26)

…dates

Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
Copilot AI changed the title from "[Per Partition Automatic Failover] Enable PPAF Dynamically Using Targeted Event-Based Updates" to "[Per Partition Automatic Failover] Enable PPAF Dynamically Using Targeted Event-Based Updates with Thread-Safe Operations" on Jul 31, 2025
Copilot AI requested a review from kundadebdatta July 31, 2025 01:39
@kundadebdatta
Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@kundadebdatta
Member

This is no longer needed. The issue was closed with PR-5310.


Development

Successfully merging this pull request may close these issues.

[Per Partition Automatic Failover] - Enable PPAF Dynamically upon change on Account Properties Metadata Response

2 participants