[Internal] Diagnostics: Adds spec for CosmosDiagnostics compaction summary mode#5644
[Internal] Diagnostics: Adds spec for CosmosDiagnostics compaction summary mode#5644NaluTripician wants to merge 16 commits intomainfrom
Conversation
|
Addressed in latest push — good call that since compaction is serialization-only, it doesn't belong on
|
- Run openspec init to create openspec/ directory structure - Configure openspec/config.yaml with Cosmos SDK project context and artifact rules - Create openspec/README.md with comprehensive developer guide: - Workflow overview (propose → apply → archive) - Slash command reference - Writing good specs guide with examples - Best practices and anti-patterns - FAQ - Update .github/copilot-instructions.md with OpenSpec section - Update CONTRIBUTING.md with spec-driven development guidance Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Create behavioral specifications for all major SDK feature areas: P0 - SDK Fundamentals + Critical Infrastructure: - crud-operations: Core Create/Read/Replace/Upsert/Delete operations - query-and-linq: SQL query execution, LINQ, FeedIterator, pagination - partition-keys: Single, hierarchical/multi-hash, routing, extraction - handler-pipeline: Handler chain ordering and responsibilities - retry-and-failover: Cross-region retries, throttle retries, PPAF/PPCB - cross-region-hedging: Availability strategies, thresholds, cancellation P1 - SDK Design + Active Features: - client-and-configuration: CosmosClient lifecycle, options, connection modes - change-feed: Modes, processor, leases, partition distribution - batch-and-transactional: TransactionalBatch, bulk execution - diagnostics-and-observability: OpenTelemetry, diagnostics, tracing - serialization: Serializer contracts, STJ vs Newtonsoft P2 - Supplementary: - container-and-database-management: DB/Container CRUD, indexing, throughput - patch-operations: JSON patch, partial updates - distributed-transactions: DTS (evolving, under active development) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ranches content - Enhances config.yaml with prescriptive artifact rules (EARS notation, Mermaid diagrams, task separation, baseline tests, contract enforcement) - Adds 3 new specs from feature-specs branch: client-side-encryption, consistency-and-session, transport-and-connectivity - Adds specs/README.md catalog index organized by area - Enhances all 14 existing specs with richer detail (tables, code blocks, edge cases, cross-references) - Streamlines openspec/README.md for clarity Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| /// overridden by the caller before calling ToString(), or by using the | ||
| /// ToString(DiagnosticsVerbosity) overload. | ||
| /// </summary> | ||
| public DiagnosticsVerbosity Verbosity { get; set; } = DiagnosticsVerbosity.Detailed; |
There was a problem hiding this comment.
Why even have this property when there is a ToString overload accepting it? IMO this is not needed and introduces weird race conditions - just drop it - ToString() defaults to Detailed (backwards compatibility) and anyone else can use the new overloads.
FabianMeiswinkel
left a comment
There was a problem hiding this comment.
LGTM except for the DiagnosticsVerbosity on CosmsoDiagnostics
| /// Default: 8192 (8 KB). Minimum: 4096 (4 KB). | ||
| /// Can also be set via the AZURE_COSMOS_DIAGNOSTICS_MAX_SUMMARY_SIZE environment variable. | ||
| /// </summary> | ||
| public int MaxDiagnosticsSummarySizeBytes { get; set; } = 8192; |
There was a problem hiding this comment.
What's the impact of it on the fidelity of the troubleshooting context?
| - **Memory pressure** — large diagnostic strings increase GC overhead, especially at high throughput | ||
| - **Readability** — operators cannot quickly extract signal from noise when hundreds of identical retry entries are listed | ||
|
|
||
| **Example scenario:** A point read that encounters 50 retries due to 429 throttling in West US 2, then fails over to East US 2 with 10 more retries, produces ~60 full `StoreResponseStatistics` entries in the trace tree. With summary mode, this compacts to: first request + last request + 1 aggregated group per region. |
There was a problem hiding this comment.
Cross partition queries are other bigger source of issues.
| /// aggregate statistics (count, total RU, min/max/P50 latency). | ||
| /// Respects MaxDiagnosticsSummarySizeBytes limit. | ||
| /// </summary> | ||
| Summary = 1, |
There was a problem hiding this comment.
What's the guidance for customers on when to use what?
| `CosmosDiagnostics.ToString()` produces a JSON trace that grows **unboundedly** with retries. Each retry attempt creates a new child `ITrace` node containing a full `ClientSideRequestStatisticsTraceDatum` with complete `StoreResponseStatistics` and `HttpResponseStatistics` entries. In pathological scenarios (sustained 429 throttling, transient failures, cross-region failovers), a single operation's diagnostics can grow to hundreds of KB. | ||
|
|
||
| **Impact:** | ||
| - **Log truncation** — monitoring systems (Application Insights, Azure Monitor, etc.) silently drop oversized log entries |
There was a problem hiding this comment.
JSON is verbose, thouhgts on encoding which might help/
There was a problem hiding this comment.
Thoughts from offline discussion:
- Consider a hybrid format with JSON text readable summary + encoded information that contains more details for debugging. Have an easy way to decode the information for IcMs + general debugging. Decoding would have to also be available for customers. Possibly add a tool in the repo to do this. Could be standardized across SDKs.
There was a problem hiding this comment.
Also important, how can we preserve important information such as which replicas are contacted/failing?
Adds a spec sheet for the diagnostics compaction feature that introduces a DiagnosticsVerbosity option (Detailed vs Summary) to reduce unbounded diagnostics output size. Summary mode groups requests by region, keeps first/last request in full detail, and deduplicates middle requests with aggregate statistics (count, RU, min/max/P50/avg latency). Reference: Azure/azure-sdk-for-rust#3592 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tring/ToJson overloads Address review feedback: DiagnosticsVerbosity only impacts serialization, so it belongs on CosmosDiagnostics.ToString()/ToJsonString() methods rather than on RequestOptions. Changes: - Remove DiagnosticsVerbosity from RequestOptions (section 3.3) - Add ToString(DiagnosticsVerbosity) and ToJsonString(DiagnosticsVerbosity) overloads to CosmosDiagnostics (section 3.3) - Simplify precedence chain (remove RequestOptions level) - Update implementation plan, work items, and test plan Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…5688) # Pull Request Template ## Description Adds contracts and changelog updates to master. No contract changes. ## Type of change Please delete options that are not relevant. - [] Bug fix (non-breaking change which fixes an issue) ## Closing issues To automatically close an issue: closes #IssueNumber --------- Co-authored-by: Fabian Meiswinkel <fabianm@microsoft.com>
## Description Fixes #5620 Replaces the unsafe `(T)(object)stream` cast pattern with safe `is` pattern matching in all `FromStream<T>` serializer implementations across the SDK. ### Problem The `FromStream<T>` method in multiple serializer implementations uses the following pattern: ```csharp if (typeof(Stream).IsAssignableFrom(typeof(T))) { return (T)(object)stream; } ``` `typeof(Stream).IsAssignableFrom(typeof(T))` returns `true` when `T` is `Stream` **or any subclass** (e.g., `MemoryStream`, `FileStream`). If `T` is a specific `Stream` subclass but the runtime `stream` parameter is a different `Stream` type, the cast `(T)(object)stream` throws a raw `InvalidCastException` with no context about what went wrong. **Example that throws:** ```csharp // T = MemoryStream, but stream is actually a FileStream at runtime serializer.FromStream<MemoryStream>(someFileStream); // InvalidCastException! ``` ### Fix Replaced the unsafe cast with safe `is` pattern matching: ```csharp if (typeof(Stream).IsAssignableFrom(typeof(T))) { if (stream is T typedStream) { return typedStream; } throw new InvalidCastException( $"Stream of type '{stream.GetType().FullName}' is not compatible " + $"with the requested type '{typeof(T).FullName}'."); } ``` This provides: - ✅ Safe runtime type checking (no unexpected `InvalidCastException`) - ✅ A descriptive error message identifying both the actual and expected types - ✅ No behavioral change for the common case (`T = Stream`) ### Files Changed **Core SDK Serializers (2):** - `Microsoft.Azure.Cosmos/src/Serializer/CosmosJsonDotNetSerializer.cs` - `Microsoft.Azure.Cosmos/src/Serializer/CosmosSystemTextJsonSerializer.cs` **Sample Code (3):** - `Microsoft.Azure.Cosmos.Samples/Usage/SystemTextJson/CosmosSystemTextJsonSerializer.cs` - `Microsoft.Azure.Cosmos.Samples/Usage/ReEncryption/ReEncryptionSupport/ReEncryptionJsonSerializer.cs` - `Microsoft.Azure.Cosmos.Samples/Usage/ItemManagement/Program.cs` **Encryption Modules (2):** - `Microsoft.Azure.Cosmos.Encryption/src/CosmosJsonDotNetSerializer.cs` - `Microsoft.Azure.Cosmos.Encryption.Custom/src/Common/CosmosJsonDotNetSerializer.cs` **Unit Tests (2):** - `Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/CosmosJsonSerializerUnitTests.cs` — 2 new tests - `Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/Json/CosmosSystemTextJsonSerializerTest.cs` — 2 new tests ### Not Changed - **`PatchOperationCore{T}.cs`** — Has a similar-looking pattern but casts FROM `T` TO `Stream` (upcast), which always succeeds. No fix needed. - **Test utility serializers** — Internal test code, not customer-facing. ### Testing - 4 new unit tests added (2 per serializer): - `ValidateFromStreamWithBaseStreamType` / `TestFromStreamWithBaseStreamType` — Confirms `FromStream<Stream>(memoryStream)` succeeds (regression test) - `ValidateFromStreamWithIncompatibleStreamTypeThrowsDescriptiveError` / `TestFromStreamWithIncompatibleStreamTypeThrowsDescriptiveError` — Confirms `FromStream<FileStream>(memoryStream)` throws `InvalidCastException` with a descriptive message containing both type names - All existing tests continue to pass --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
#5679) ## Summary This PR adds comprehensive unit test coverage for the FaultInjection library, increasing the unit test count from 1 to 49. > **Note:** This PR should be merged **after** PR #5675 (bug fixes) and PR #5678 (code quality). 8 tests use `Assert.Inconclusive` for validations that depend on fixes in those PRs — they will become fully passing once those PRs are merged. ### Changes **New Test File: `FaultInjectionBuilderValidationTests.cs`** Contains 4 test classes with 49 test methods: 1. **`FaultInjectionBuilderValidationTests`** (21 tests) — #5670 - `FaultInjectionRuleBuilder`: null/empty ID, null condition/result, hit limit validation, Gateway+Gone rejection - `FaultInjectionServerErrorResultBuilder`: injection rate boundaries (0, 0.5, 1.0, 1.1, -0.5), delay validation for all delay types, WithDelay on non-delay type - `FaultInjectionEndpointBuilder`: null database/container/feedRange, negative replica count, valid build - `FaultInjectionConnectionErrorResultBuilder`: threshold boundaries, negative interval - `FaultInjectionCustomServerErrorResultBuilder`: basic build, injection rate validation 2. **`FaultInjectorUnitTests`** (5 tests) — #5671 - Null rules constructor, empty rules, GetApplicationContext before init, unknown activity ID, GetClientOptions 3. **`FaultInjectionApplicationContextUnitTests`** (7 tests) — #5672 - Add/get by rule ID, get by activity ID, non-existent lookups, multiple executions, GetAllRuleExecutions, concurrent access 4. **`FaultInjectionRuleLifecycleTests`** (16 tests) — #5673 - Enable/disable toggle, uninitialized hit count, default condition (All types), all operation types enumeration, ToString formatting **Bug Fix: `FaultInjectionServerErrorResult.ToString()`** - Fixed `FormatException` caused by unbalanced braces in the format string (`{3}}` → `{3}}}`). ### Test Results (on master) - ✅ 41 passed - ⏭️ 8 skipped (Inconclusive — depend on PRs #5675 and #5678) - ❌ 0 failed Parent tracking issue: #5652 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cketsHttpHandler. (#5693) # Pull Request Template ## Description Enables EnableMultipleHttp2Connections on SocketsHttpHandler to allow the SDK to open additional HTTP/2 TCP connections when the concurrent streams limit on an existing connection is reached, improving throughput in Gateway (thin client) mode. ## Type of change Please delete options that are not relevant. - [X] Bug fix (non-breaking change which fixes an issue) ## Closing issues To automatically close an issue: closes #IssueNumber
…read requests (#5685) # Pull Request Template ## Description Currently, Cosmos DB accounts are configured with a single default consistency level, and the existing per-request override only allows weakening it — there's no way to strengthen reads or choose a fundamentally different read strategy without changing the account-level setting. This becomes a limitation for workloads where certain reads need the latest committed data via quorum reads. ReadConsistencyStrategy introduces a new dimension of control by allowing applications to specify their desired read behavior (Eventual, Session, LatestCommitted, GlobalStrong) per-request or per-client, completely independent of the account's default consistency. Unlike the existing ConsistencyLevel override, these strategies map directly to how the Direct layer reads from replicas — single replica reads for Eventual, session-token-aware reads for Session, quorum reads with GLSN barrier for LatestCommitted, and quorum reads with GCLSN barrier for GlobalStrong. This PR adds the ReadConsistencyStrategy property is exposed on all read-path request options (ItemRequestOptions, QueryRequestOptions, ChangeFeedRequestOptions, ReadManyRequestOptions), CosmosClientOptions. The SDK sets the x-ms-cosmos-read-consistency-strategy header and propagates the strategy to DocumentServiceRequestContext, where the Direct package's ConsistencyReader and QuorumReader handle the actual read mode selection. ## Type of change Please delete options that are not relevant. - [X] New feature (non-breaking change which adds functionality) ## Closing issues To automatically close an issue: closes #IssueNumber
…ew feedback Address Fabian's review comment: drop the mutable Verbosity property from CosmosDiagnostics to avoid thread-safety issues. Callers use the explicit ToString(DiagnosticsVerbosity) / ToJsonString(DiagnosticsVerbosity) overloads instead. Parameterless ToString() always returns Detailed (backward compat). Changes: - Remove Verbosity get/set property from CosmosDiagnostics API surface - Update CosmosClientOptions doc to clarify it is a config value, not auto-flowed - Simplify precedence order from 4 levels to 3 - Remove ResponseMessage.cs from modified files list (no verbosity propagation needed) - Update WI-1 and WI-3 scope/acceptance criteria - Replace ToString_UsesSummary_WhenVerbosityPropertySet test with ToString_Parameterless_AlwaysDetailed test Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ToString(DiagnosticsVerbosity) already covers this use case, making ToJsonString(DiagnosticsVerbosity) redundant. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move behavioral requirements into openspec/specs/diagnostics-and-observability/spec.md using EARS notation (WHEN/THEN/SHALL) - Create openspec/changes/diagnostics-compaction/ with proposal, design, and tasks - Remove old specs/diagnostics-compaction.md - Update spec index README Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ffbab35 to
43f6d9e
Compare
…/diagnostics-compaction-spec # Conflicts: # openspec/config.yaml
Brings in the OpenSpec change spec (design, proposal, tasks) for the diagnostics compaction feature alongside the implementation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review SummaryOverall: The specs are impressively thorough — particularly retry-and-failover (324 lines) and diagnostics-and-observability (399 lines). The OpenSpec infrastructure is well-structured. However, there are several accuracy and consistency issues to address before merge. Findings: 6 Recommendations, 3 Suggestions, 3 Observations (0 Blocking) Key concerns:
Existing comments cross-reference: Found 9 prior comments from @FabianMeiswinkel and @kirankumarkolli on the old Additional observations:
|
| ## References | ||
|
|
||
| - Source: `Microsoft.Azure.Cosmos/src/Diagnostics/CosmosDiagnostics.cs` | ||
| - Source: `Microsoft.Azure.Cosmos/src/Diagnostics/DiagnosticsVerbosity.cs` |
There was a problem hiding this comment.
🟡 Recommendation · Maintainability: Spec Accuracy
Spec references files that don't exist yet
The References section lists DiagnosticsVerbosity.cs and DiagnosticsSummaryWriter.cs as source files, but these are files to be created (per design.md). I verified via the GitHub API that neither file exists in the repository today.
Since this spec catalog is positioned as 'authoritative context for implementation' and 'living documentation,' AI agents reading this spec will attempt to reference or call APIs from these non-existent files. More broadly, the entire DiagnosticsVerbosity section (~200 lines of the 399-line spec) describes proposed behavior mixed in with existing behavior, with no visual distinction.
Suggestion: Either (a) add a clear > :construction: PROPOSED — Not yet implemented callout to all DiagnosticsVerbosity sections, or (b) defer adding them to the main spec until the implementation PR merges — the openspec/changes/diagnostics-compaction/ artifacts already capture the full design.
| | Parameter | Default | | ||
| |-----------|---------| | ||
| | Max retry attempts | 9 | | ||
| | Max cumulative wait time | 60 seconds | |
There was a problem hiding this comment.
🟡 Recommendation · Correctness: Inconsistent Default
MaxRetryWaitTime default disagrees with client-and-configuration spec
This table says 'Max cumulative wait time' defaults to 60 seconds. However, openspec/specs/client-and-configuration/spec.md says MaxRetryWaitTimeOnRateLimitedRequests defaults to 30 seconds.
I verified both values in the source:
RetryOptions.DefaultMaxRetryWaitTimeInSeconds = 30(the public-facing default)ResourceThrottleRetryPolicy.DefaultMaxWaitTimeInSeconds = 60(internal policy fallback)
The effective customer-facing default is 30 seconds since CosmosClientOptions always passes RetryOptions values to the policy constructor. The 60s value here cites the internal policy fallback, not what customers experience.
Suggestion: Update this table to say 30 seconds (matching the effective default), or add a note: '60s is the internal policy fallback; effective default is 30s per CosmosClientOptions.'
| When `CosmosClientOptions.MaxDiagnosticsSummarySizeBytes` is set below 4096, | ||
| the system shall reject the value (minimum: 4096 bytes). | ||
|
|
||
| **Scenario: Verbosity precedence** |
There was a problem hiding this comment.
🟡 Recommendation · Correctness: Contradictory Requirements
Verbosity precedence contradicts 'Parameterless ToString always returns Detailed'
This section defines a 3-tier precedence: (1) explicit ToString(DiagnosticsVerbosity) parameter → (2) CosmosClientOptions.DiagnosticsVerbosity → (3) default Detailed.
But line 169 says: 'Parameterless ToString() always returns Detailed, regardless of CosmosClientOptions.DiagnosticsVerbosity setting.'
These two requirements contradict each other. If parameterless ToString() always returns Detailed regardless of the client option, then CosmosClientOptions.DiagnosticsVerbosity has no effect — there's no call path that consults it (since ToString(verbosity) already has an explicit parameter, and ToString() ignores it).
Suggestion: Clarify the intent: either (a) parameterless ToString() respects CosmosClientOptions.DiagnosticsVerbosity as its default (remove the 'regardless' clause on line 169), or (b) CosmosClientOptions.DiagnosticsVerbosity only provides a convenience default for callers who construct the verbosity parameter from config (document this explicitly). As written, the option appears useless.
| |------|--------| | ||
| | `ContractEnforcementTests.cs` baseline | Update public API contract for new enum and properties | | ||
|
|
||
| ## Alternatives Considered |
There was a problem hiding this comment.
🟡 Recommendation · Completeness: Missing Alternative
Hybrid encoding alternative from review discussion not captured
@kirankumarkolli raised 'JSON is verbose, thoughts on encoding which might help?' and @NaluTripician replied with thoughts from offline discussion: 'Consider a hybrid format with JSON text readable summary + encoded information that contains more details for debugging.'
This is a substantive alternative approach that emerged from review, but it's not listed in the Alternatives Considered section. Future readers won't know this option was discussed and evaluated.
Suggestion: Add 'Alternative 4: Hybrid format (readable summary + encoded details)' with pros/cons and decision rationale, even if the decision is to defer. This preserves the design context from the review discussion.
| | **Detailed** (default) | Current behavior — full trace tree output | Debugging, development | | ||
| | **Summary** | Region-grouped compaction with first/last + aggregated middle | Production logging, size-constrained environments | | ||
|
|
||
| **Key design principle:** The in-memory representation (`ITrace` tree, `ClientSideRequestStatisticsTraceDatum`) stays **unchanged**. Compaction only happens at **serialization time** in the `TraceJsonWriter` path. This preserves full programmatic access to diagnostics data while reducing serialized output size. |
There was a problem hiding this comment.
🟡 Recommendation · Completeness: Scope Gap
Cross-partition query diagnostics growth not addressed
@kirankumarkolli noted 'Cross partition queries are other bigger source of issues.' The proposal focuses on retry-heavy scenarios (429 throttling, failover), but cross-partition queries can produce large diagnostics due to fan-out across many physical partitions — regardless of retries.
A query fanning out to 50 partitions produces 50 separate ClientSideRequestStatisticsTraceDatum entries even without any retries. Summary mode's region-grouping helps somewhat, but the compaction algorithm (first/last + aggregated middle) is designed for the retry pattern, not the fan-out pattern.
Suggestion: Add a 'Scope Limitations' note or expand 'Non-Goals' to acknowledge that cross-partition query fan-out diagnostics are not addressed in this phase, and note whether a future phase could address it.
| - Changing the in-memory `ITrace` tree structure | ||
| - Modifying the `Detailed` mode output format | ||
| - Adding new programmatic APIs beyond `ToString(DiagnosticsVerbosity)` overload | ||
| - Per-request verbosity override via `RequestOptions` (can be added later) |
There was a problem hiding this comment.
🟢 Suggestion · Completeness: Replica Information
Consider preserving replica/endpoint information in summary mode
@NaluTripician noted in PR comments: 'Also important, how can we preserve important information such as which replicas are contacted/failing?' This concern isn't addressed in the current spec.
Summary mode compacts middle requests into aggregated stats, but replica-level detail (which specific replicas were contacted, which failed) is potentially lost. This information is critical for IcM debugging — knowing that replica X in region Y is the one failing is often the key diagnostic signal.
Suggestion: Add a requirement or open question about preserving replica/endpoint information in summary mode output. At minimum, the First and Last entries preserve this, but consider whether aggregated groups should include a distinct endpoint count or list.
Summary
Adds a spec sheet for the diagnostics compaction feature that introduces a
DiagnosticsVerbosityoption (DetailedvsSummary) to reduce unbounded diagnostics output size during high-retry scenarios (429 throttling, transient failures, cross-region failovers).Spec location:
specs/diagnostics-compaction.mdKey Design Decisions
Detailed— no behavioral change for existing usersCosmosClientOptions.DiagnosticsVerbosity+ per-requestRequestOptions.DiagnosticsVerbosityoverride + env varsITracetree is unchangedReference
Modeled after the Rust SDK's approach: Azure/azure-sdk-for-rust#3592
Work Items
DiagnosticsVerbosityenum, properties onCosmosClientOptions/RequestOptions, env var supportToString()on verbosity, size enforcement, truncated fallbackContractEnforcementTestsbaselinesPlease review the spec and leave feedback before implementation begins.
#5325