CosmosDiagnostics: Fixes InvalidOperationException in CosmosDiagnostics with permanent walking state management #5331
…concurrent access Co-authored-by: kirankumarkolli <[email protected]>
…tion points Co-authored-by: kirankumarkolli <[email protected]>
This PR has a better approach: #5209
Please follow the required format: "[Internal] Category: (Adds|Fixes|Refactors|Removes) Description"
Internal should be used for PRs that have no customer impact. This flag is used to help generate the changelog to know which PRs should be included. Examples:
Diagnostics: Adds GetElapsedClientLatency to CosmosDiagnostics
PartitionKey: Fixes null reference when using default(PartitionKey)
[v4] Client Encryption: Refactors code to external project
[Internal] Query: Adds code generator for CosmosNumbers for easy additions in the future.
kirankumarkolli left a comment
Update the logic to disallow any changes to the Trace once WalkTrace has started.
…tions Co-authored-by: kirankumarkolli <[email protected]>
@copilot look into the comments
…move parameter Co-authored-by: kirankumarkolli <[email protected]>
Addressed all reviewer feedback:
The implementation now uses centralized walking state management with permanent state setting and direct enumeration protected by locks. Commit: 824d33d
… adjust assignment order Co-authored-by: kirankumarkolli <[email protected]>
Decided to go with ConcurrentDictionary for now. See #5209
Closing this - opened #5361 instead
This PR fixes a race condition that causes "InvalidOperationException: Collection was modified; enumeration operation may not execute." when calling CosmosDiagnostics.ToString() under concurrent access scenarios.

Problem
The issue occurs when multiple threads work with the same diagnostics simultaneously: one calls CosmosDiagnostics.ToString(), which enumerates trace collections, while the trace is still being modified. This leads to "Collection was modified" exceptions during enumeration, particularly when operations are cancelled and diagnostics are read from CosmosOperationCanceledException.

Solution
The fix implements a permanent walking state management approach for thread-safe enumeration:
Centralized control: Walking state is managed from the public API entry points (ToString(), GetContactedRegions(), GetQueryMetrics(), and IsGoneExceptionHit()).
Permanent state: Once a trace tree starts being walked, the walking state is set permanently. It is never reset to false, which simplifies the logic.
Enhanced modification prevention: All modification methods (AddChild(), AddDatum(), AddOrUpdateDatum(), and both StartChild() overloads) check the walking state inside locks and ignore changes once enumeration has started.
NoOpTrace returns: When a trace is being walked, the StartChild() methods return NoOpTrace.Singleton instead of creating new traces.
Atomic operations: All modification checks are performed inside locks to prevent race conditions between checking the walking state and performing the modification.
Direct enumeration: Uses direct enumeration protected by the permanent walking state instead of snapshot-based approaches.
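The points above can be illustrated with a small standalone sketch. It is not the SDK's code: SketchTrace, its NoOp field, and Walk() are hypothetical names invented here, and the real Trace type carries more state (data, timings) than this shows. The sketch only assumes what the description states: a flag that is set once and never reset, checked under the same lock that guards the child list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a trace node; not the SDK's Trace class.
public sealed class SketchTrace
{
    // Shared no-op instance returned once walking has started (assumed analogue of NoOpTrace.Singleton).
    public static readonly SketchTrace NoOp = new SketchTrace("NoOp", isNoOp: true);

    private readonly object gate = new object();
    private readonly List<SketchTrace> children = new List<SketchTrace>();
    private readonly bool isNoOp;
    private bool isBeingWalked; // set to true on first walk, never reset

    public SketchTrace(string name) : this(name, isNoOp: false) { }

    private SketchTrace(string name, bool isNoOp)
    {
        this.Name = name;
        this.isNoOp = isNoOp;
    }

    public string Name { get; }

    public SketchTrace StartChild(string name)
    {
        if (this.isNoOp) return this; // the no-op node never records children

        lock (this.gate)
        {
            // Check and modification happen under one lock, so they are atomic.
            if (this.isBeingWalked) return NoOp;

            SketchTrace child = new SketchTrace(name);
            this.children.Add(child);
            return child;
        }
    }

    // Entry point: marks every visited node as walked and enumerates directly.
    public List<string> Walk()
    {
        List<string> names = new List<string>();
        this.WalkInto(names);
        return names;
    }

    private void WalkInto(List<string> names)
    {
        lock (this.gate)
        {
            this.isBeingWalked = true; // permanent
            names.Add(this.Name);
            foreach (SketchTrace child in this.children)
            {
                child.WalkInto(names); // locks are always taken parent first, then child
            }
        }
    }
}

public static class SketchTraceDemo
{
    public static void Main()
    {
        SketchTrace root = new SketchTrace("root");
        SketchTrace a = root.StartChild("a");
        a.StartChild("b");

        Console.WriteLine(string.Join(",", root.Walk())); // prints root,a,b

        // After the walk, further children are dropped and the no-op node is returned.
        SketchTrace late = root.StartChild("late");
        Console.WriteLine(object.ReferenceEquals(late, SketchTrace.NoOp)); // prints True
        Console.WriteLine(root.Walk().Count); // prints 3
    }
}
```

The trade-off this sketch makes visible is that anything recorded after the first walk is silently lost, which is the price of never resetting the flag.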
Key Benefits
Example
The following scenario would previously throw InvalidOperationException but now works correctly:

The permanent walking state approach ensures that once enumeration begins on a trace tree, all further modifications are safely ignored, providing comprehensive protection against concurrent modifications.
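For background on the exception itself, it can be reproduced without the SDK. The single-threaded sketch below mutates a List<T> inside a foreach, which raises the same InvalidOperationException deterministically. CollectionModifiedDemo is a hypothetical name, and treating the trace collections as List-like is an assumption. In the reported bug the modification came from another thread rather than from the loop body.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class; it does not use the Cosmos SDK.
public static class CollectionModifiedDemo
{
    // Returns true if mutating the list mid-enumeration raised InvalidOperationException.
    public static bool ThrowsWhenModifiedDuringEnumeration()
    {
        List<string> data = new List<string> { "a", "b" };
        try
        {
            foreach (string item in data)
            {
                // Add changes the list's internal version; the enumerator
                // detects the mismatch on its next MoveNext call.
                data.Add(item + "'");
            }
        }
        catch (InvalidOperationException)
        {
            return true;
        }

        return false;
    }

    public static void Main()
    {
        Console.WriteLine(ThrowsWhenModifiedDuringEnumeration()); // prints True
    }
}
```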
Fixes #5112.