Skip to content

CosmosDiagnosticsL Fixes InvalidOperationException in CosmosDiagnostics with permanent walking state management#5331

Closed
Copilot wants to merge 12 commits intomasterfrom
copilot/fix-5112
Closed

CosmosDiagnosticsL Fixes InvalidOperationException in CosmosDiagnostics with permanent walking state management#5331
Copilot wants to merge 12 commits intomasterfrom
copilot/fix-5112

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Aug 1, 2025

This PR fixes a race condition that causes InvalidOperationException: Collection was modified; enumeration operation may not execute. when calling CosmosDiagnostics.ToString() under concurrent access scenarios.

Problem

The issue occurs when multiple threads are simultaneously:

  1. Adding children or data to trace objects (via background operations like CrossRegionalHedging)
  2. Calling CosmosDiagnostics.ToString() which enumerates trace collections

This led to "Collection was modified" exceptions during enumeration, particularly when operations are cancelled and diagnostics are being read from CosmosOperationCanceledException.

Solution

The fix implements a permanent walking state management approach for thread-safe enumeration:

  1. Centralized control: Walking state is managed from public API entry points (ToString(), GetContactedRegions(), GetQueryMetrics(), and IsGoneExceptionHit())

  2. Permanent state: Once a trace tree starts being walked, the walking state is set permanently - no reset to false, simplifying the logic

  3. Enhanced modification prevention: All modification methods (AddChild(), AddDatum(), AddOrUpdateDatum(), and both StartChild() overloads) check the walking state inside locks and ignore changes when enumeration is in progress

  4. NoOpTrace returns: When trace is being walked, StartChild() methods return NoOpTrace.Singleton instead of creating new traces

  5. Atomic operations: All modification checks are performed inside locks to prevent race conditions between checking walking state and performing modifications

  6. Direct enumeration: Uses direct enumeration protected by the permanent walking state instead of snapshot-based approaches

Key Benefits

  • Eliminates race conditions: Prevents concurrent modifications during enumeration operations
  • Simple architecture: No complex try/finally blocks or state reset logic
  • Comprehensive coverage: All modification and enumeration points protected
  • Performance optimized: Direct enumeration instead of creating snapshots
  • No breaking changes: Public APIs remain unchanged
  • Clean separation: Walking state management separated from enumeration logic

Example

The following scenario would previously throw InvalidOperationException but now works correctly:

// Multiple threads adding data to trace
trace.AddDatum("key", "value");
trace.StartChild("operation");

// Concurrent thread calling ToString() - now safe
string diagnostics = cosmosDiagnostics.ToString(); // No longer throws

The permanent walking state approach ensures that once enumeration begins on a trace tree, all further modifications are safely ignored, providing comprehensive protection against concurrent modifications.

Fixes #5112.


💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

Copilot AI changed the title [WIP] InvalidOperationException when calling CosmosDiagnostics.ToString() Fix InvalidOperationException in CosmosDiagnostics.ToString() due to concurrent access Aug 1, 2025
Copilot AI requested a review from kirankumarkolli August 1, 2025 19:50
@FabianMeiswinkel
Copy link
Copy Markdown
Member

This PR has better approach #5209

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the required format: "[Internal] Category: (Adds|Fixes|Refactors|Removes) Description"

Internal should be used for PRs that have no customer impact. This flag is used to help generate the changelog to know which PRs should be included. Examples:
Diagnostics: Adds GetElapsedClientLatency to CosmosDiagnostics
PartitionKey: Fixes null reference when using default(PartitionKey)
[v4] Client Encryption: Refactors code to external project
[Internal] Query: Adds code generator for CosmosNumbers for easy additions in the future.

Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTraceDiagnostics.cs
Copy link
Copy Markdown
Member

@kirankumarkolli kirankumarkolli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update logic to not allow any changes to the Trace once WalkTrace started.

Copilot AI changed the title Fix InvalidOperationException in CosmosDiagnostics.ToString() due to concurrent access Fix InvalidOperationException in CosmosDiagnostics.ToString() by preventing modifications during trace walks Aug 14, 2025
Copilot AI requested a review from kirankumarkolli August 14, 2025 21:40
Comment thread Microsoft.Azure.Cosmos/src/Tracing/Trace.cs Outdated
Copilot AI requested a review from kirankumarkolli August 14, 2025 23:31
Comment thread Microsoft.Azure.Cosmos/src/Tracing/Trace.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Tracing/Trace.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTraceDiagnostics.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Tracing/Trace.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Tracing/Trace.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTraceDiagnostics.cs
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTraceDiagnostics.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTraceDiagnostics.cs
@kirankumarkolli
Copy link
Copy Markdown
Member

@copilot look into the comments

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Aug 15, 2025

@copilot look into the comments

Addressed all reviewer feedback:

  1. ✅ Moved Summary.UpdateRegionContacted() inside lock
  2. ✅ Moved isBeingWalked assignment inside lock
  3. ✅ Removed try/finally and false reset from ToString()
  4. ✅ Updated SetWalkingStateRecursively() to take no arguments
  5. ✅ Added early return if already set
  6. ✅ Added SetWalkingStateRecursively() to GetContactedRegions, GetQueryMetrics, IsGoneExceptionHit
  7. ✅ Reverted WalkTraceTreeForGoneException to original implementation
  8. ✅ Reverted PopulateServerSideCumulativeMetrics to original implementation

The implementation now uses centralized walking state management with permanent state setting and direct enumeration protected by locks. Commit: 824d33d

Copilot AI changed the title Fix InvalidOperationException in CosmosDiagnostics.ToString() with centralized walking state management Fix InvalidOperationException in CosmosDiagnostics with permanent walking state management Aug 15, 2025
Copilot AI requested a review from kirankumarkolli August 15, 2025 00:31
Comment thread Microsoft.Azure.Cosmos/src/Tracing/Trace.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Tracing/Trace.cs Outdated
… adjust assignment order

Co-authored-by: kirankumarkolli <[email protected]>
@kirankumarkolli kirankumarkolli changed the title Fix InvalidOperationException in CosmosDiagnostics with permanent walking state management CosmosDiagnosticsL Fixes InvalidOperationException in CosmosDiagnostics with permanent walking state management Aug 15, 2025
@FabianMeiswinkel
Copy link
Copy Markdown
Member

FabianMeiswinkel commented Aug 15, 2025

Decided to go with ConcurrentDictionary for now. - See #5209

@FabianMeiswinkel
Copy link
Copy Markdown
Member

Closing this - opened #5361 instead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

InvalidOperationException when calling CosmosDiagnostics.ToString()

3 participants