[Per Partition Automatic Failover] Enable PPAF Dynamically Using Targeted Event-Based Updates with Thread-Safe Operations#5326
[Per Partition Automatic Failover] Enable PPAF Dynamically Using Targeted Event-Based Updates with Thread-Safe Operations#5326
Conversation
Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
| /// <summary> | ||
| /// Initializes and starts the background account properties refresh task | ||
| /// </summary> | ||
| private void InitializeBackgroundRefresh() |
There was a problem hiding this comment.
There is already a background task in GlobalEndpointManager.InitializeAccountPropertiesAndStartBackgroundRefresh.
Let's try to leverage that instead of creating a new one.
There was a problem hiding this comment.
You're absolutely right! I've refactored the implementation to leverage the existing background task in GlobalEndpointManager.StartLocationBackgroundRefreshLoop() instead of creating a duplicate one.
Changes made:
- Removed separate background refresh task from
CosmosAccountServiceConfiguration - Enhanced existing
GlobalEndpointManager.RefreshDatabaseAccountInternalAsync()to raiseOnAccountPropertiesRefreshedevents - Updated
CosmosAccountServiceConfigurationto receive updates viaUpdateAccountProperties()method - Modified
DocumentClientto subscribe to GlobalEndpointManager events for PPAF changes
This approach is much cleaner as it:
- ✅ Avoids duplicate background tasks
- ✅ Reduces resource consumption
- ✅ Leverages the existing, well-tested refresh mechanism
- ✅ Keeps responsibilities clear
The functionality remains the same - dynamic PPAF updates when account properties change - but now uses a single, centralized background refresh task. (commit: 6941697)
…ground task Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
There was a problem hiding this comment.
Please follow the required format: "[Internal] Category: (Adds|Fixes|Refactors|Removes) Description"
Internal should be used for PRs that have no customer impact. This flag is used to help generate the changelog to know which PRs should be included. Examples:
Diagnostics: Adds GetElapsedClientLatency to CosmosDiagnostics
PartitionKey: Fixes null reference when using default(PartitionKey)
[v4] Client Encryption: Refactors code to external project
[Internal] Query: Adds code generator for CosmosNumbers for easy additions in the future.
| /// <summary> | ||
| /// Event that is raised when account properties are refreshed and PPAF enablement status changes | ||
| /// </summary> | ||
| internal event Action<bool?> OnEnablePartitionLevelFailoverChanged; |
There was a problem hiding this comment.
Do we really need this event anymore ? Can we not leverage the event OnAccountPropertiesRefreshed from GlobalEndpointManager directly ?
There was a problem hiding this comment.
You're absolutely right! I've removed the redundant OnEnablePartitionLevelFailoverChanged event and simplified the architecture to leverage the existing OnAccountPropertiesRefreshed event from GlobalEndpointManager directly.
Changes made:
- Removed
OnEnablePartitionLevelFailoverChangedevent fromCosmosAccountServiceConfiguration - Enhanced
DocumentClient.HandleAccountPropertiesRefreshed()to directly compare previous and new PPAF values and handle changes - Simplified
CosmosAccountServiceConfiguration.UpdateAccountProperties()to only update properties without event firing - Updated unit tests to reflect the new direct architecture
This eliminates the intermediate event layer and provides a cleaner, more direct flow:
GlobalEndpointManager.OnAccountPropertiesRefreshed → DocumentClient.HandleAccountPropertiesRefreshed() → direct PPAF change handling
The functionality remains the same but with reduced complexity and fewer event subscriptions. (commit: 395c4bd)
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…rage GlobalEndpointManager event directly Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
| /// </summary> | ||
| /// <param name="defaultValue">An integer containing the default value for the refresh interval in seconds.</param> | ||
| /// <returns>An integer representing the refresh interval in seconds.</returns> | ||
| public static int GetAccountPropertiesRefreshIntervalInSeconds( |
There was a problem hiding this comment.
There is no usage as we got rid of the redundant code for account refresh. Please remove this.
There was a problem hiding this comment.
You're correct! I've removed the unused GetAccountPropertiesRefreshIntervalInSeconds method and the associated AccountPropertiesRefreshIntervalInSeconds constant since they're no longer needed after refactoring to use the existing GlobalEndpointManager background task. (commit: 47dcb90)
| GlobalEndpointManager.ParseThinClientLocationsFromAdditionalProperties(accountProperties); | ||
|
|
||
| GlobalEndpointManager.ParseThinClientLocationsFromAdditionalProperties(accountProperties); |
There was a problem hiding this comment.
Is this really a change ? Please fix the code diff.
There was a problem hiding this comment.
Fixed the code diff by reverting the file to original state and reapplying only the necessary changes. Removed the unnecessary whitespace-only change that was appearing in the diff. (commit: 526d0d8)
| /// A read-only string containing the environment variable name for capturing the account properties refresh interval | ||
| /// in seconds. The default value for this interval is 300 seconds (5 minutes). | ||
| /// </summary> | ||
| internal static readonly string AccountPropertiesRefreshIntervalInSeconds = "AZURE_COSMOS_ACCOUNT_PROPERTIES_REFRESH_INTERVAL_IN_SECONDS"; |
There was a problem hiding this comment.
Removed the empty line that was left after removing the AccountPropertiesRefreshIntervalInSeconds constant. (commit: 526d0d8)
… constant Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
| /// Updates the account properties | ||
| /// This method is called by GlobalEndpointManager when account properties are refreshed | ||
| /// </summary> | ||
| internal void UpdateAccountProperties(AccountProperties newProperties) |
There was a problem hiding this comment.
This is not needed. Let's get this removed.
There was a problem hiding this comment.
Removed the UpdateAccountProperties method and updated the event handler to focus only on PPAF handling without updating accountServiceConfiguration. Also simplified the tests accordingly. (commit: e610a29)
…e only on PPAF changes Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
…erviceConfiguration Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
|
||
| // Update the GlobalPartitionEndpointManager | ||
| this.UpdateGlobalPartitionEndpointManager(); | ||
|
|
There was a problem hiding this comment.
Please update the user agent features here as well using this.ConnectionPolicy.UserAgentContainer.AppendFeatures
There was a problem hiding this comment.
Added the user agent features update using this.ConnectionPolicy.UserAgentContainer.AppendFeatures(this.GetUserAgentFeatures()) when PPAF configuration changes dynamically. This ensures the user agent reflects the current PPAF state after dynamic updates. (commit: 49b3135)
Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| } | ||
|
|
||
| // Update the partition key range location | ||
| this.PartitionKeyRangeLocation = newPartitionKeyRangeLocation; |
There was a problem hiding this comment.
Could it cause any thread contention ? Can we use Interlocked.CompareExchange to alter the object ?
There was a problem hiding this comment.
Fixed thread contention issue by using Interlocked.Exchange for atomic updates to PartitionKeyRangeLocation. Changed from auto-property to field with property accessor and used Interlocked.Exchange(ref this.partitionKeyRangeLocation, newPartitionKeyRangeLocation) to atomically update the reference while avoiding race conditions during concurrent access. (commit: ec9fd26)
…dates Co-authored-by: kundadebdatta <87335885+kundadebdatta@users.noreply.github.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
This is no longer needed. The issue was closed with PR-5310. |
This PR implements dynamic Per Partition Automatic Failover (PPAF) refresh capability that enables the Azure Cosmos DB .NET SDK to update PPAF settings at runtime without requiring SDK restart when account properties change.
Problem
Currently, the SDK fetches the
EnablePartitionLevelFailoverflag only during initialization via theGET ACCOUNTmetadata call. If the account updates this flag after SDK initialization, customers need to restart the SDK to take the change into effect, creating operational challenges.Solution
Leveraging Existing Background Refresh Infrastructure
Instead of creating a separate background task, this implementation enhances the existing
GlobalEndpointManager.StartLocationBackgroundRefreshLoop()which already refreshes account properties every 5 minutes. This approach:Targeted Event-Based Architecture
GlobalEndpointManager Enhancement:
OnEnablePartitionLevelFailoverConfigChangedevent that fires only when PPAF status changesRefreshDatabaseAccountInternalAsync()to track previous PPAF values and emit targeted eventsDocumentClient Direct Integration:
OnEnablePartitionLevelFailoverConfigChangedeventsHandleEnablePartitionLevelFailoverConfigChanged()method processes only actual PPAF changesDynamic PPAF Configuration Updates
When
EnablePartitionLevelFailoverchanges in account properties, the SDK automatically:EnablePartitionLevelFailoverandEnablePartitionLevelCircuitBreakerflagsGlobalPartitionEndpointManagerCoreinstance with updated settings using thread-safe atomic operationsThread Safety
Atomic Operations: Used
Interlocked.Exchangefor atomic updates toPartitionKeyRangeLocationto prevent thread contention during dynamic updates. The implementation:Interlocked.Exchange(ref this.partitionKeyRangeLocation, newValue)to atomically swap referencesUsage
Implementation Details
Interlocked.Exchangefor atomic reference updates, preventing thread contention during dynamic updatesTesting
Updated unit tests to reflect the targeted event architecture:
The implementation satisfies all acceptance criteria:
Fixes #5304.
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.