Diagnostics: Adds DiagnosticsVerbosity Summary mode for compacted diagnostics output#5695
Diagnostics: Adds DiagnosticsVerbosity Summary mode for compacted diagnostics output#5695NaluTripician wants to merge 23 commits intomainfrom
Conversation
…gnostics output Implements v1 of the diagnostics compaction feature per the spec at users/nalutripician/diagnostics-compaction-spec. New public API: - DiagnosticsVerbosity enum (Detailed=0, Summary=1) - CosmosClientOptions.DiagnosticsVerbosity property (default: Detailed) - CosmosClientOptions.MaxDiagnosticsSummarySizeBytes property (default: 8KB, min: 4KB) - CosmosDiagnostics.ToString(DiagnosticsVerbosity) abstract overload Summary mode groups requests by region, keeps first/last in full detail, and aggregates middle entries by (StatusCode, SubStatusCode) with count, total RU, min/max/P50/avg latency statistics. Size enforcement truncates output if it exceeds MaxDiagnosticsSummarySizeBytes. The parameterless ToString() always returns Detailed output for backward compatibility. In-memory ITrace tree is unchanged -- compaction only happens at serialization time. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Wire MaxDiagnosticsSummarySizeBytes from CosmosClientOptions to CosmosTraceDiagnostics callers that have access to ClientContext (ContainerCore, ReadManyQueryHelper, CosmosLinqQuery, ChangeFeedEstimatorIterator) - Fix DiagnosticsVerbosity.Summary XML doc to include avg in aggregate statistics list - Clarify CosmosClientOptions.DiagnosticsVerbosity XML doc to indicate it is a preference property, not auto-applied - Add edge case tests (null trace, invalid enum value) - Add DiagnosticsVerbosity and MaxDiagnosticsSummarySizeBytes tests to CosmosClientOptionsUnitTests - Add changelog entry for new public API surface Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… tests - Add AZURE_COSMOS_DIAGNOSTICS_VERBOSITY and AZURE_COSMOS_DIAGNOSTICS_MAX_SUMMARY_SIZE environment variable support via ConfigurationManager - Read env vars as fallback defaults in CosmosClientOptions constructor - Add WithDiagnosticsVerbosity and WithMaxDiagnosticsSummarySizeBytes fluent builder methods to CosmosClientBuilder - Update DotNetSDKAPI.net6.json contract with new builder methods - Add 9 env var and 3 builder tests to CosmosClientOptionsUnitTests - Add 9 baseline schema validation tests (DiagnosticsSummaryBaselineTests) - Add 8 emulator integration tests (DiagnosticsVerbosityEmulatorTests) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r changes Update reflection-based constructor lookups in OpenTelemetryRecorderTests to match new signatures that include maxDiagnosticsSummarySizeBytes parameter. Add ClientOptions mock setup in ChangeFeedEstimatorIteratorTests for strict mocks that now access ClientContext.ClientOptions.MaxDiagnosticsSummarySizeBytes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sively before trace traversal WriteSummary now calls SetWalkingStateRecursively() on the concrete Trace before accessing Data/Children properties, which have Debug.Assert guards requiring isBeingWalked to be true. Previously, only the CosmosTraceDiagnostics caller set this state, but direct callers (including unit tests) would crash the test host in Debug builds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Brings in the OpenSpec change spec (design, proposal, tasks) for the diagnostics compaction feature alongside the implementation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR Review SummaryPR: Diagnostics: Adds DiagnosticsVerbosity Summary mode for compacted diagnostics output Overall Assessment: Well-designed v1 implementation of a genuinely useful feature. The approach (compaction at serialization time, in-memory trace unchanged) is architecturally sound. Test coverage is comprehensive (56 tests). Default-to-Detailed ensures zero impact on existing users. Existing PR comments: 0 found Findings: 12 total
Key risks:
See inline comments for details. |
- Remove redundant SetWalkingStateRecursively() call from Lazy lambda (blocking) - Use switch expression for DiagnosticsVerbosity in CosmosTraceDiagnostics - Add cycle guard (HashSet<ITrace> visited) to CollectRequestEntriesRecursive - Add ActivityId to summary JSON output from PointOperationStatisticsTraceDatum - Add TotalRequestCharge to truncated summary output (BuildTruncatedJson) - Document v1 tradeoff of full summary computation before size check - Add caching for Summary path in EncryptionCosmosDiagnostics.ToString(verbosity) - Replace env var string literals with ConfigurationManager constants in tests - Add HttpResponseStatistics (Gateway mode) tests: single request, sub-status code extraction, mixed Direct+Gateway - Update truncated baseline test to expect TotalRequestCharge field - Update design.md spec to reflect actual implementation files Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NaluTripician
left a comment
There was a problem hiding this comment.
PR Review: Diagnostics Summary Mode
Overall: Well-designed feature with solid architecture. The standalone DiagnosticsSummaryWriter cleanly separates compaction from the trace tree, backward compatibility is properly preserved, and test coverage is substantial (56 tests). Several issues from a prior review round have been addressed.
Remaining Findings Summary
| # | Severity | Finding |
|---|---|---|
| 1 | 🔴 High | Thread safety in EncryptionCosmosDiagnostics caching (see inline) |
| 2 | 🟡 Medium | Missing MaxDiagnosticsSummarySizeBytes wiring in core CRUD paths (see inline) |
| 3 | 🟡 Medium | No upper bound on MaxDiagnosticsSummarySizeBytes (see inline) |
| 4 | 🟡 Medium | No version field in Summary JSON format (see inline) |
| 5 | 🟡 Medium | Test coverage gaps (see below) |
| 6 | 🟢 Low | Environment variable validation errors are silent (see inline) |
Test Coverage Gaps (Finding #5)
The following scenarios lack test coverage:
-
Concurrent access — All tests are single-threaded. The
Lazy<string>caching is never stress-tested with parallel callers. AParallel.Fortest verifyingReferenceEqualsacross threads would validate the thread-safety guarantee. -
Size boundary conditions — The truncation test uses 512 bytes with 200 entries (massive overshoot). No test validates the exact boundary: output at exactly
maxSizeBytes(should pass) vs.maxSizeBytes + 1(should truncate). -
ActivityIdextraction paths —FindActivityIdwas added per prior feedback, but no unit test creates a trace withPointOperationStatisticsTraceDatumto verify ActivityId appears in output or is correctly omitted when absent. -
Multi-byte UTF-8 region names — Size enforcement uses
Encoding.UTF8.GetByteCount(correct), but no test validates that non-ASCII region names are counted by byte size, not char count.
Items Already Addressed from Prior Review ✅
Redundant SetWalkingStateRecursively in Lazy lambda, TotalRequestCharge in truncated output, cycle guard (HashSet<ITrace>), ActivityId in summary, switch expression for future enum values, HttpResponseStatistics test coverage, and spec alignment.
…/diagnostics-compaction # Conflicts: # Microsoft.Azure.Cosmos/src/Util/ConfigurationManager.cs
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
- Fix thread safety in EncryptionCosmosDiagnostics by replacing manual check-and-set cache with Lazy<string> - Add upper bound (10 MB) on MaxDiagnosticsSummarySizeBytes with ArgumentOutOfRangeException validation - Add SummaryFormatVersion field (value: 1) to Summary JSON output in both full and truncated formats - Add DefaultTrace.TraceWarning for invalid env var values - Document v1 limitation that standard CRUD paths use default MaxDiagnosticsSummarySizeBytes - Update baseline tests for new SummaryFormatVersion field - Add tests for upper bound validation and env var above-max Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
Move public override methods (GetStartTimeUtc, GetFailedRequestCount) before private BuildSummaryDiagnostics to satisfy StyleCop SA1202 rule requiring public members before private members. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
1 similar comment
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
Summary
Implements the core diagnostics compaction feature described in the diagnostics compaction spec. When opted into,
Summarymode reduces unboundedCosmosDiagnostics.ToString()output (100+ KB in high-retry scenarios) down to ~2-4 KB by grouping requests by region and aggregating retry statistics.Problem
CosmosDiagnostics.ToString()grows unboundedly with retries. Each retry adds a fullStoreResponseStatisticsentry to the trace tree. In pathological scenarios (sustained 429 throttling, transient failures, cross-region failovers), a single operation's diagnostics can grow to hundreds of KB, causing:New Public API
DiagnosticsVerbosityenumDetailed(default, current behavior) andSummary(compacted)CosmosClientOptions.DiagnosticsVerbosityCosmosClientOptions.MaxDiagnosticsSummarySizeBytesCosmosDiagnostics.ToString(DiagnosticsVerbosity)CosmosClientBuilder.WithDiagnosticsVerbosity()CosmosClientBuilder.WithMaxDiagnosticsSummarySizeBytes()AZURE_COSMOS_DIAGNOSTICS_VERBOSITYSummaryorDetailed)AZURE_COSMOS_DIAGNOSTICS_MAX_SUMMARY_SIZEHow Summary Mode Works
ITracetree to collect allStoreResponseStatisticsandHttpResponseStatistics(StatusCode, SubStatusCode)with: count, total RU, min/max/P50/avg latencyMaxDiagnosticsSummarySizeBytesKey Design Decisions
Detailed— zero behavioral change for existing usersToString()always returnsDetailed— backward compatibility guaranteedITracetree unchanged — compaction only happens at serialization timeLazy<string>caching — summary is computed once and reusedMaxDiagnosticsSummarySizeByteswired through pipeline — propagated fromCosmosClientOptionstoCosmosTraceDiagnosticsat call sites withClientContextaccessExample Summary Output
{ "Summary": { "DiagnosticsVerbosity": "Summary", "TotalDurationMs": 1234.5, "TotalRequestCharge": 245.5, "TotalRequestCount": 60, "RegionsSummary": [ { "Region": "West US 2", "RequestCount": 50, "First": { "StatusCode": 429, "SubStatusCode": 3200, "DurationMs": 5, ... }, "Last": { "StatusCode": 200, "SubStatusCode": 0, "DurationMs": 12, ... }, "AggregatedGroups": [ { "StatusCode": 429, "SubStatusCode": 3200, "Count": 48, "P50DurationMs": 12, ... } ] } ] } }Files Changed
DiagnosticsVerbosity.csDiagnosticsSummaryWriter.csDiagnosticsSummaryWriterTests.csDiagnosticsSummaryBaselineTests.csDiagnosticsVerbosityEmulatorTests.csCosmosClientOptions.csDiagnosticsVerbosity,MaxDiagnosticsSummarySizeBytes, env var supportCosmosClientBuilder.csWithDiagnosticsVerbosity,WithMaxDiagnosticsSummarySizeBytesConfigurationManager.csCosmosDiagnostics.csToString(DiagnosticsVerbosity)overloadCosmosTraceDiagnostics.csLazy<string>cachingEncryptionCosmosDiagnostics.csSDKPROJECTREF-gated)ContainerCore.csMaxDiagnosticsSummarySizeBytesfrom optionsReadManyQueryHelper.csMaxDiagnosticsSummarySizeBytesfrom optionsCosmosLinqQuery.csMaxDiagnosticsSummarySizeBytesfrom optionsChangeFeedEstimatorIterator.csMaxDiagnosticsSummarySizeBytesfrom optionsDotNetSDKAPI.net6.jsonCosmosClientOptionsUnitTests.csTest Coverage (56 tests)
Unit Tests (DiagnosticsSummaryWriterTests — 27 tests)
ToString()unchangedLazy<string>returns same instanceSchema Baseline Tests (DiagnosticsSummaryBaselineTests — 9 tests)
Options + Builder Tests (CosmosClientOptionsUnitTests — 12 new tests)
Emulator Integration Tests (DiagnosticsVerbosityEmulatorTests — 8 tests)
Related