Diagnostics: Adds DiagnosticsVerbosity Summary mode for compacted diagnostics output #5695

Open
NaluTripician wants to merge 23 commits into main from users/nalutripician/diagnostics-compaction

NaluTripician commented Mar 17, 2026

Summary

⚠️ This is a v1 implementation and is subject to change based on feedback from the spec review (#5644).

Implements the core diagnostics compaction feature described in the diagnostics compaction spec. When opted into, Summary mode reduces unbounded CosmosDiagnostics.ToString() output (100+ KB in high-retry scenarios) down to ~2-4 KB by grouping requests by region and aggregating retry statistics.

Problem

CosmosDiagnostics.ToString() grows unboundedly with retries. Each retry adds a full StoreResponseStatistics entry to the trace tree. In pathological scenarios (sustained 429 throttling, transient failures, cross-region failovers), a single operation's diagnostics can grow to hundreds of KB, causing:

  • Log truncation — monitoring systems silently drop oversized entries
  • Memory pressure — large diagnostic strings increase GC overhead
  • Readability — operators cannot extract signal from noise

New Public API

| API | Description |
| --- | --- |
| DiagnosticsVerbosity enum | Detailed (default, current behavior) and Summary (compacted) |
| CosmosClientOptions.DiagnosticsVerbosity | Client-level preferred verbosity |
| CosmosClientOptions.MaxDiagnosticsSummarySizeBytes | Max bytes for summary output (default: 8 KB, min: 4 KB) |
| CosmosDiagnostics.ToString(DiagnosticsVerbosity) | New overload for explicit verbosity control |
| CosmosClientBuilder.WithDiagnosticsVerbosity() | Fluent builder method |
| CosmosClientBuilder.WithMaxDiagnosticsSummarySizeBytes() | Fluent builder method |
| AZURE_COSMOS_DIAGNOSTICS_VERBOSITY | Environment variable override (accepts Summary or Detailed) |
| AZURE_COSMOS_DIAGNOSTICS_MAX_SUMMARY_SIZE | Environment variable override (integer, minimum 4096) |

How Summary Mode Works

  1. Walks the ITrace tree to collect all StoreResponseStatistics and HttpResponseStatistics
  2. Groups entries by region (preserving chronological order)
  3. Per region: keeps First and Last request in full detail
  4. Aggregates middle entries by (StatusCode, SubStatusCode) with: count, total RU, min/max/P50/avg latency
  5. Enforces size limit: falls back to truncated JSON if output exceeds MaxDiagnosticsSummarySizeBytes
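
The grouping and aggregation steps above can be sketched as follows. This is a minimal Python illustration, not the SDK's C# implementation: `RequestEntry`, `summarize_by_region`, and every output field other than `StatusCode`, `SubStatusCode`, `Count`, and `P50DurationMs` are hypothetical names chosen for the sketch. It also assumes P50 is the ordinary median (mean of the two middle values for an even count), which the description does not state.

```python
from collections import OrderedDict
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestEntry:
    # Hypothetical flattened view of one store/HTTP response statistic.
    region: str
    status_code: int
    sub_status_code: int
    duration_ms: float
    request_charge: float


def summarize_by_region(entries):
    """Group entries by region in first-seen order, then compact each group."""
    by_region = OrderedDict()
    for entry in entries:  # entries are assumed to arrive in chronological order
        by_region.setdefault(entry.region, []).append(entry)

    regions = []
    for region, items in by_region.items():
        summary = {"Region": region, "RequestCount": len(items), "First": items[0]}
        if len(items) > 1:
            summary["Last"] = items[-1]
        # Everything between the first and last request is aggregated
        # by its (StatusCode, SubStatusCode) pair.
        groups = OrderedDict()
        for entry in items[1:-1]:
            key = (entry.status_code, entry.sub_status_code)
            groups.setdefault(key, []).append(entry)
        summary["AggregatedGroups"] = [
            {
                "StatusCode": status,
                "SubStatusCode": sub_status,
                "Count": len(group),
                "TotalRequestCharge": sum(e.request_charge for e in group),
                "MinDurationMs": min(e.duration_ms for e in group),
                "MaxDurationMs": max(e.duration_ms for e in group),
                "P50DurationMs": median(e.duration_ms for e in group),
                "AvgDurationMs": sum(e.duration_ms for e in group) / len(group),
            }
            for (status, sub_status), group in groups.items()
        ]
        regions.append(summary)
    return regions
```

With 49 throttled requests followed by one success in a single region, the sketch keeps the first 429 and the final 200 in full and folds the remaining 48 requests into one aggregated group.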

Key Design Decisions

  • Default is Detailed — zero behavioral change for existing users
  • Parameterless ToString() always returns Detailed — backward compatibility guaranteed
  • In-memory ITrace tree unchanged — compaction only happens at serialization time
  • GA (non-preview) — additive, backward-compatible feature
  • Lazy<string> caching — summary is computed once and reused
  • MaxDiagnosticsSummarySizeBytes wired through pipeline — propagated from CosmosClientOptions to CosmosTraceDiagnostics at call sites with ClientContext access
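
The compute-once caching decision can be illustrated with a small thread-safe lazy wrapper. This is a Python sketch of the general pattern only; `LazyString` is a hypothetical class and is not how .NET's `Lazy<string>` is implemented.

```python
import threading


class LazyString:
    """Compute a string once on first access, then return the cached instance."""

    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._computed = False
        self._value = None

    @property
    def value(self):
        if not self._computed:            # fast path: skip the lock once cached
            with self._lock:
                if not self._computed:    # re-check after acquiring the lock
                    self._value = self._factory()
                    self._computed = True
        return self._value
```

Every caller receives the same string instance, and the factory runs at most once even when several threads request the value at the same time.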

Example Summary Output

{
  "Summary": {
    "DiagnosticsVerbosity": "Summary",
    "TotalDurationMs": 1234.5,
    "TotalRequestCharge": 245.5,
    "TotalRequestCount": 60,
    "RegionsSummary": [
      {
        "Region": "West US 2",
        "RequestCount": 50,
        "First": { "StatusCode": 429, "SubStatusCode": 3200, "DurationMs": 5, ... },
        "Last": { "StatusCode": 200, "SubStatusCode": 0, "DurationMs": 12, ... },
        "AggregatedGroups": [
          { "StatusCode": 429, "SubStatusCode": 3200, "Count": 48, "P50DurationMs": 12, ... }
        ]
      }
    ]
  }
}

Files Changed

| File | Change |
| --- | --- |
| DiagnosticsVerbosity.cs | New: public enum |
| DiagnosticsSummaryWriter.cs | New: core compaction engine |
| DiagnosticsSummaryWriterTests.cs | New: unit tests |
| DiagnosticsSummaryBaselineTests.cs | New: schema baseline tests |
| DiagnosticsVerbosityEmulatorTests.cs | New: emulator integration tests |
| CosmosClientOptions.cs | Added DiagnosticsVerbosity, MaxDiagnosticsSummarySizeBytes, env var support |
| CosmosClientBuilder.cs | Added WithDiagnosticsVerbosity, WithMaxDiagnosticsSummarySizeBytes |
| ConfigurationManager.cs | Added env var constants |
| CosmosDiagnostics.cs | Added abstract ToString(DiagnosticsVerbosity) overload |
| CosmosTraceDiagnostics.cs | Implemented overload with Lazy<string> caching |
| EncryptionCosmosDiagnostics.cs | Implemented overload (SDKPROJECTREF-gated) |
| ContainerCore.cs | Wired MaxDiagnosticsSummarySizeBytes from options |
| ReadManyQueryHelper.cs | Wired MaxDiagnosticsSummarySizeBytes from options |
| CosmosLinqQuery.cs | Wired MaxDiagnosticsSummarySizeBytes from options |
| ChangeFeedEstimatorIterator.cs | Wired MaxDiagnosticsSummarySizeBytes from options |
| DotNetSDKAPI.net6.json | Updated contract baseline |
| CosmosClientOptionsUnitTests.cs | Added tests for new properties, env vars, builder |

Test Coverage (56 tests)

Unit Tests (DiagnosticsSummaryWriterTests — 27 tests)

  • Enum/options defaults and validation
  • Single region: 1 request, 2 requests, many retries (429)
  • Multi-region failover with separate region summaries
  • Mixed status codes producing multiple aggregated groups
  • P50 computation: odd count, even count, single item
  • Size enforcement: under limit, over limit (truncation)
  • Empty trace, null region, null trace, invalid enum fallback
  • Backward compatibility: parameterless ToString() unchanged
  • Caching: Lazy<string> returns same instance

Schema Baseline Tests (DiagnosticsSummaryBaselineTests — 9 tests)

  • Exact field set validation for each JSON level (summary, region, request entry, aggregated group, truncated)
  • Field type consistency (numeric, string, array, object)
  • DiagnosticsVerbosity field always "Summary"
  • Truncation message content

Options + Builder Tests (CosmosClientOptionsUnitTests — 12 new tests)

  • Default values, validation, large values
  • Env var parsing (Summary, case-insensitive, invalid ignored, below-min ignored)
  • Explicit property overrides env var
  • Builder methods with validation
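
The precedence these tests describe (explicit property, then a valid environment variable, then the default) can be sketched as below. This is a Python illustration under stated assumptions: the helper names are hypothetical, 8 KB is taken to mean 8192 bytes, and raising on an explicit below-minimum value is assumed from the "validation" bullet rather than confirmed by the description.

```python
import os

VERBOSITY_ENV = "AZURE_COSMOS_DIAGNOSTICS_VERBOSITY"
MAX_SIZE_ENV = "AZURE_COSMOS_DIAGNOSTICS_MAX_SUMMARY_SIZE"
DEFAULT_MAX_SIZE = 8 * 1024   # assumed byte value of the documented 8 KB default
MIN_MAX_SIZE = 4096           # documented minimum


def resolve_verbosity(explicit=None, env=None):
    """Explicit value wins, then a valid env var, then the Detailed default."""
    env = os.environ if env is None else env
    if explicit is not None:
        return explicit
    raw = env.get(VERBOSITY_ENV, "").strip().lower()   # case-insensitive match
    if raw in ("summary", "detailed"):
        return raw.capitalize()
    return "Detailed"   # missing or invalid values are ignored


def resolve_max_size(explicit=None, env=None):
    """Explicit value wins, then a valid env var, then the default size."""
    env = os.environ if env is None else env
    if explicit is not None:
        if explicit < MIN_MAX_SIZE:   # assumed behavior for explicit values
            raise ValueError(f"max size must be at least {MIN_MAX_SIZE} bytes")
        return explicit
    try:
        value = int(env.get(MAX_SIZE_ENV, ""))
    except ValueError:
        return DEFAULT_MAX_SIZE       # unset or non-integer values are ignored
    return value if value >= MIN_MAX_SIZE else DEFAULT_MAX_SIZE
```

Passing the environment as a plain dictionary keeps the sketch easy to exercise without mutating the real process environment.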

Emulator Integration Tests (DiagnosticsVerbosityEmulatorTests — 8 tests)

  • CRUD operations (Create, Read, Replace, Delete) with Summary mode
  • Query with Summary mode
  • Summary output smaller than Detailed
  • Parameterless ToString unchanged by Summary option
  • Summary caching returns same instance

Related

…gnostics output

Implements v1 of the diagnostics compaction feature per the spec at
users/nalutripician/diagnostics-compaction-spec.

New public API:
- DiagnosticsVerbosity enum (Detailed=0, Summary=1)
- CosmosClientOptions.DiagnosticsVerbosity property (default: Detailed)
- CosmosClientOptions.MaxDiagnosticsSummarySizeBytes property (default: 8KB, min: 4KB)
- CosmosDiagnostics.ToString(DiagnosticsVerbosity) abstract overload

Summary mode groups requests by region, keeps first/last in full detail,
and aggregates middle entries by (StatusCode, SubStatusCode) with count,
total RU, min/max/P50/avg latency statistics. Size enforcement truncates
output if it exceeds MaxDiagnosticsSummarySizeBytes.

The parameterless ToString() always returns Detailed output for backward
compatibility. In-memory ITrace tree is unchanged -- compaction only
happens at serialization time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NaluTripician and others added 2 commits March 17, 2026 13:58
- Wire MaxDiagnosticsSummarySizeBytes from CosmosClientOptions to
  CosmosTraceDiagnostics callers that have access to ClientContext
  (ContainerCore, ReadManyQueryHelper, CosmosLinqQuery,
  ChangeFeedEstimatorIterator)
- Fix DiagnosticsVerbosity.Summary XML doc to include avg in
  aggregate statistics list
- Clarify CosmosClientOptions.DiagnosticsVerbosity XML doc to
  indicate it is a preference property, not auto-applied
- Add edge case tests (null trace, invalid enum value)
- Add DiagnosticsVerbosity and MaxDiagnosticsSummarySizeBytes tests
  to CosmosClientOptionsUnitTests
- Add changelog entry for new public API surface

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… tests

- Add AZURE_COSMOS_DIAGNOSTICS_VERBOSITY and
  AZURE_COSMOS_DIAGNOSTICS_MAX_SUMMARY_SIZE environment variable
  support via ConfigurationManager
- Read env vars as fallback defaults in CosmosClientOptions constructor
- Add WithDiagnosticsVerbosity and WithMaxDiagnosticsSummarySizeBytes
  fluent builder methods to CosmosClientBuilder
- Update DotNetSDKAPI.net6.json contract with new builder methods
- Add 9 env var and 3 builder tests to CosmosClientOptionsUnitTests
- Add 9 baseline schema validation tests
  (DiagnosticsSummaryBaselineTests)
- Add 8 emulator integration tests
  (DiagnosticsVerbosityEmulatorTests)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NaluTripician self-assigned this Mar 17, 2026
NaluTripician and others added 9 commits March 18, 2026 10:25
…r changes

Update reflection-based constructor lookups in OpenTelemetryRecorderTests
to match new signatures that include maxDiagnosticsSummarySizeBytes parameter.

Add ClientOptions mock setup in ChangeFeedEstimatorIteratorTests for strict
mocks that now access ClientContext.ClientOptions.MaxDiagnosticsSummarySizeBytes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sively before trace traversal

WriteSummary now calls SetWalkingStateRecursively() on the concrete Trace
before accessing Data/Children properties, which have Debug.Assert guards
requiring isBeingWalked to be true. Previously, only the CosmosTraceDiagnostics
caller set this state, but direct callers (including unit tests) would crash
the test host in Debug builds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Brings in the OpenSpec change spec (design, proposal, tasks) for the
diagnostics compaction feature alongside the implementation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician

PR Review Summary

PR: Diagnostics: Adds DiagnosticsVerbosity Summary mode for compacted diagnostics output

Overall Assessment: Well-designed v1 implementation of a genuinely useful feature. The approach (compaction at serialization time, in-memory trace unchanged) is architecturally sound. Test coverage is comprehensive (56 tests). Default-to-Detailed ensures zero impact on existing users.

Existing PR comments: 0 found

Findings: 12 total

| Severity | Count |
| --- | --- |
| Blocking | 1 |
| Recommendation | 5 |
| Suggestion | 4 |
| Observation | 2 |

Key risks:

  • Redundant SetWalkingStateRecursively() double-call on every summary materialization
  • MaxDiagnosticsSummarySizeBytes not propagated through ResponseMessage.Diagnostics (main CRUD pipeline)
  • Truncated summary missing TotalRequestCharge (the most operationally useful field)
  • No test coverage for Gateway-mode (HttpResponseStatistics) code path

See inline comments for details.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTraceDiagnostics.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTraceDiagnostics.cs
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/DiagnosticsSummaryWriter.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/DiagnosticsSummaryWriter.cs
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/DiagnosticsSummaryWriter.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/DiagnosticsSummaryWriter.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTraceDiagnostics.cs
Comment thread openspec/changes/diagnostics-compaction/design.md Outdated
NaluTripician and others added 2 commits April 1, 2026 09:46
- Remove redundant SetWalkingStateRecursively() call from Lazy lambda (blocking)
- Use switch expression for DiagnosticsVerbosity in CosmosTraceDiagnostics
- Add cycle guard (HashSet<ITrace> visited) to CollectRequestEntriesRecursive
- Add ActivityId to summary JSON output from PointOperationStatisticsTraceDatum
- Add TotalRequestCharge to truncated summary output (BuildTruncatedJson)
- Document v1 tradeoff of full summary computation before size check
- Add caching for Summary path in EncryptionCosmosDiagnostics.ToString(verbosity)
- Replace env var string literals with ConfigurationManager constants in tests
- Add HttpResponseStatistics (Gateway mode) tests: single request, sub-status
  code extraction, mixed Direct+Gateway
- Update truncated baseline test to expect TotalRequestCharge field
- Update design.md spec to reflect actual implementation files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
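
The cycle guard mentioned in this commit follows a common traversal pattern: track the nodes already visited and skip any node seen before. The sketch below is a Python illustration only; `TraceNode` and `collect_entries` are hypothetical stand-ins and do not mirror the signature of `CollectRequestEntriesRecursive`.

```python
class TraceNode:
    """Hypothetical stand-in for a trace node carrying stats and children."""

    def __init__(self, name, stats=None):
        self.name = name
        self.stats = list(stats or [])
        self.children = []


def collect_entries(node, visited=None, out=None):
    """Depth-first collection that visits each node at most once."""
    visited = set() if visited is None else visited
    out = [] if out is None else out
    if id(node) in visited:      # already walked: a cycle or a shared child
        return out
    visited.add(id(node))
    out.extend(node.stats)
    for child in node.children:
        collect_entries(child, visited, out)
    return out
```

Without the visited set, a malformed tree in which a child points back to an ancestor would recurse until the stack overflowed; with it, each node contributes its entries exactly once.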
NaluTripician left a comment

PR Review: Diagnostics Summary Mode

Overall: Well-designed feature with solid architecture. The standalone DiagnosticsSummaryWriter cleanly separates compaction from the trace tree, backward compatibility is properly preserved, and test coverage is substantial (56 tests). Several issues from a prior review round have been addressed.

Remaining Findings Summary

| # | Severity | Finding |
| --- | --- | --- |
| 1 | 🔴 High | Thread safety in EncryptionCosmosDiagnostics caching (see inline) |
| 2 | 🟡 Medium | Missing MaxDiagnosticsSummarySizeBytes wiring in core CRUD paths (see inline) |
| 3 | 🟡 Medium | No upper bound on MaxDiagnosticsSummarySizeBytes (see inline) |
| 4 | 🟡 Medium | No version field in Summary JSON format (see inline) |
| 5 | 🟡 Medium | Test coverage gaps (see below) |
| 6 | 🟢 Low | Environment variable validation errors are silent (see inline) |

Test Coverage Gaps (Finding #5)

The following scenarios lack test coverage:

  1. Concurrent access — All tests are single-threaded. The Lazy<string> caching is never stress-tested with parallel callers. A Parallel.For test verifying ReferenceEquals across threads would validate the thread-safety guarantee.

  2. Size boundary conditions — The truncation test uses 512 bytes with 200 entries (massive overshoot). No test validates the exact boundary: output at exactly maxSizeBytes (should pass) vs. maxSizeBytes + 1 (should truncate).

  3. ActivityId extraction paths: FindActivityId was added per prior feedback, but no unit test creates a trace with PointOperationStatisticsTraceDatum to verify that ActivityId appears in the output or is correctly omitted when absent.

  4. Multi-byte UTF-8 region names — Size enforcement uses Encoding.UTF8.GetByteCount (correct), but no test validates that non-ASCII region names are counted by byte size, not char count.
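
The boundary and multi-byte cases in items 2 and 4 can be made concrete with a small size check. This is a Python sketch, not the SDK's code: `enforce_size_limit`, the `Truncated` marker field, and the region name used in the usage note are hypothetical. It measures the encoded UTF-8 byte length, treats output of exactly the limit as fitting, and keeps `TotalRequestCharge` in the truncated form.

```python
import json


def enforce_size_limit(summary, max_size_bytes):
    """Return the full JSON if it fits in max_size_bytes, else a truncated stub."""
    full = json.dumps(summary, ensure_ascii=False)
    # Measure encoded bytes, not characters: non-ASCII text takes more bytes.
    if len(full.encode("utf-8")) <= max_size_bytes:
        return full
    truncated = {
        "Summary": {
            "DiagnosticsVerbosity": "Summary",
            "TotalRequestCharge": summary["Summary"].get("TotalRequestCharge"),
            "Truncated": True,   # hypothetical marker field
        }
    }
    return json.dumps(truncated, ensure_ascii=False)
```

A document containing a made-up region name such as "Région Ouest" is one byte longer than its character count, so a limit equal to the character count triggers truncation while a limit equal to the byte count does not.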

Items Already Addressed from Prior Review ✅

Redundant SetWalkingStateRecursively in Lazy lambda, TotalRequestCharge in truncated output, cycle guard (HashSet<ITrace>), ActivityId in summary, switch expression for future enum values, HttpResponseStatistics test coverage, and spec alignment.

Comment thread Microsoft.Azure.Cosmos.Encryption/src/EncryptionCosmosDiagnostics.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTraceDiagnostics.cs
Comment thread Microsoft.Azure.Cosmos/src/CosmosClientOptions.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/DiagnosticsSummaryWriter.cs
Comment thread Microsoft.Azure.Cosmos/src/CosmosClientOptions.cs
…/diagnostics-compaction

# Conflicts:
#	Microsoft.Azure.Cosmos/src/Util/ConfigurationManager.cs
@azure-pipelines

Azure Pipelines:
Successfully started running 1 pipeline(s).

- Fix thread safety in EncryptionCosmosDiagnostics by replacing
  manual check-and-set cache with Lazy<string>
- Add upper bound (10 MB) on MaxDiagnosticsSummarySizeBytes with
  ArgumentOutOfRangeException validation
- Add SummaryFormatVersion field (value: 1) to Summary JSON output
  in both full and truncated formats
- Add DefaultTrace.TraceWarning for invalid env var values
- Document v1 limitation that standard CRUD paths use default
  MaxDiagnosticsSummarySizeBytes
- Update baseline tests for new SummaryFormatVersion field
- Add tests for upper bound validation and env var above-max

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NaluTripician and others added 2 commits April 14, 2026 10:55
Move public override methods (GetStartTimeUtc, GetFailedRequestCount) before
private BuildSummaryDiagnostics to satisfy StyleCop SA1202 rule requiring
public members before private members.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>