Skip to content

[Internal] Diagnostics: Adds spec for CosmosDiagnostics compaction summary mode#5644

Open
NaluTripician wants to merge 16 commits intomainfrom
users/nalutripician/diagnostics-compaction-spec
Open

[Internal] Diagnostics: Adds spec for CosmosDiagnostics compaction summary mode#5644
NaluTripician wants to merge 16 commits intomainfrom
users/nalutripician/diagnostics-compaction-spec

Conversation

@NaluTripician
Copy link
Copy Markdown
Contributor

@NaluTripician NaluTripician commented Feb 26, 2026

Summary

Adds a spec sheet for the diagnostics compaction feature that introduces a DiagnosticsVerbosity option (Detailed vs Summary) to reduce unbounded diagnostics output size during high-retry scenarios (429 throttling, transient failures, cross-region failovers).

Spec location: specs/diagnostics-compaction.md

Key Design Decisions

  • Summary mode groups requests by region, keeps first/last in full detail, deduplicates middle requests with aggregate stats (count, RU, min/max/P50/avg latency)
  • Default is Detailed — no behavioral change for existing users
  • Configuration: CosmosClientOptions.DiagnosticsVerbosity + per-request RequestOptions.DiagnosticsVerbosity override + env vars
  • Max summary size: 8KB default (4KB min), with truncated fallback
  • Compaction is serialization-only — in-memory ITrace tree is unchanged

Reference

Modeled after the Rust SDK's approach: Azure/azure-sdk-for-rust#3592

Work Items

# Work Item Description
1 Enum & Options Plumbing DiagnosticsVerbosity enum, properties on CosmosClientOptions/RequestOptions, env var support
2 Summary Computation Engine Trace tree walk, region grouping, first/last/aggregated groups
3 Serialization Integration Branch ToString() on verbosity, size enforcement, truncated fallback
4 Contract Updates Update ContractEnforcementTests baselines
5 Unit Tests 21 tests covering summary engine, verbosity flow, edge cases
6 Integration Tests Emulator-based tests for real operations with summary mode
7 Baseline Tests Golden-file JSONs for summary mode serialization stability

Please review the spec and leave feedback before implementation begins.

#5325

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good!

@NaluTripician NaluTripician changed the title [Internal] Spec: CosmosDiagnostics Compaction — Summary Mode [Internal] Diagnostics: Adds spec for CosmosDiagnostics compaction summary mode Feb 26, 2026
@NaluTripician NaluTripician self-assigned this Mar 3, 2026
Comment thread specs/diagnostics-compaction.md Outdated
@NaluTripician
Copy link
Copy Markdown
Contributor Author

Addressed in latest push — good call that since compaction is serialization-only, it doesn't belong on RequestOptions. Changes:

  • Removed DiagnosticsVerbosity? from RequestOptions entirely
  • Added ToString(DiagnosticsVerbosity) and ToJsonString(DiagnosticsVerbosity) overloads on CosmosDiagnostics so callers control verbosity at serialization time
  • Kept CosmosClientOptions.DiagnosticsVerbosity as a client-wide default that flows to CosmosDiagnostics.Verbosity (used by parameterless ToString()) — useful for centralized logging pipelines
  • Simplified precedence: ToString(param) > Verbosity property (from client options) > env var > default

NaluTripician and others added 4 commits March 5, 2026 12:28
- Run openspec init to create openspec/ directory structure
- Configure openspec/config.yaml with Cosmos SDK project context and artifact rules
- Create openspec/README.md with comprehensive developer guide:
  - Workflow overview (propose → apply → archive)
  - Slash command reference
  - Writing good specs guide with examples
  - Best practices and anti-patterns
  - FAQ
- Update .github/copilot-instructions.md with OpenSpec section
- Update CONTRIBUTING.md with spec-driven development guidance

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Create behavioral specifications for all major SDK feature areas:

P0 - SDK Fundamentals + Critical Infrastructure:
- crud-operations: Core Create/Read/Replace/Upsert/Delete operations
- query-and-linq: SQL query execution, LINQ, FeedIterator, pagination
- partition-keys: Single, hierarchical/multi-hash, routing, extraction
- handler-pipeline: Handler chain ordering and responsibilities
- retry-and-failover: Cross-region retries, throttle retries, PPAF/PPCB
- cross-region-hedging: Availability strategies, thresholds, cancellation

P1 - SDK Design + Active Features:
- client-and-configuration: CosmosClient lifecycle, options, connection modes
- change-feed: Modes, processor, leases, partition distribution
- batch-and-transactional: TransactionalBatch, bulk execution
- diagnostics-and-observability: OpenTelemetry, diagnostics, tracing
- serialization: Serializer contracts, STJ vs Newtonsoft

P2 - Supplementary:
- container-and-database-management: DB/Container CRUD, indexing, throughput
- patch-operations: JSON patch, partial updates
- distributed-transactions: DTS (evolving, under active development)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ranches content

- Enhances config.yaml with prescriptive artifact rules (EARS notation, Mermaid diagrams, task separation, baseline tests, contract enforcement)
- Adds 3 new specs from feature-specs branch: client-side-encryption, consistency-and-session, transport-and-connectivity
- Adds specs/README.md catalog index organized by area
- Enhances all 14 existing specs with richer detail (tables, code blocks, edge cases, cross-references)
- Streamlines openspec/README.md for clarity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread specs/diagnostics-compaction.md Outdated
/// overridden by the caller before calling ToString(), or by using the
/// ToString(DiagnosticsVerbosity) overload.
/// </summary>
public DiagnosticsVerbosity Verbosity { get; set; } = DiagnosticsVerbosity.Detailed;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why even have this property when there is a ToString overload accepting it? IMO this is not needed and introduces weird race conditions - just drop it - ToString() defaults to Detailed (backwards compatibility) and anyone else can use the new overloads.

Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for the DiagnosticsVerbosity on CosmsoDiagnostics

Comment thread specs/diagnostics-compaction.md Outdated
Comment thread specs/diagnostics-compaction.md Outdated
/// Default: 8192 (8 KB). Minimum: 4096 (4 KB).
/// Can also be set via the AZURE_COSMOS_DIAGNOSTICS_MAX_SUMMARY_SIZE environment variable.
/// </summary>
public int MaxDiagnosticsSummarySizeBytes { get; set; } = 8192;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the impact of it on the fidelity of the troubleshooting context?

Comment thread specs/diagnostics-compaction.md Outdated
- **Memory pressure** — large diagnostic strings increase GC overhead, especially at high throughput
- **Readability** — operators cannot quickly extract signal from noise when hundreds of identical retry entries are listed

**Example scenario:** A point read that encounters 50 retries due to 429 throttling in West US 2, then fails over to East US 2 with 10 more retries, produces ~60 full `StoreResponseStatistics` entries in the trace tree. With summary mode, this compacts to: first request + last request + 1 aggregated group per region.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cross partition queries are other bigger source of issues.

Comment thread specs/diagnostics-compaction.md Outdated
/// aggregate statistics (count, total RU, min/max/P50 latency).
/// Respects MaxDiagnosticsSummarySizeBytes limit.
/// </summary>
Summary = 1,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the guidance for customers on when to use what?

Comment thread specs/diagnostics-compaction.md Outdated
`CosmosDiagnostics.ToString()` produces a JSON trace that grows **unboundedly** with retries. Each retry attempt creates a new child `ITrace` node containing a full `ClientSideRequestStatisticsTraceDatum` with complete `StoreResponseStatistics` and `HttpResponseStatistics` entries. In pathological scenarios (sustained 429 throttling, transient failures, cross-region failovers), a single operation's diagnostics can grow to hundreds of KB.

**Impact:**
- **Log truncation** — monitoring systems (Application Insights, Azure Monitor, etc.) silently drop oversized log entries
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON is verbose, thouhgts on encoding which might help/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts from offline discussion:

  • Consider a hybrid format with JSON text readable summary + encoded information that contains more details for debugging. Have an easy way to decode the information for IcMs + general debugging. Decoding would have to also be available for customers. Possibly add a tool in the repo to do this. Could be standardized across SDKs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also important, how can we preserve important information such as which replicas are contacted/failing?

NaluTripician and others added 8 commits March 24, 2026 11:24
Adds a spec sheet for the diagnostics compaction feature that introduces
a DiagnosticsVerbosity option (Detailed vs Summary) to reduce unbounded
diagnostics output size. Summary mode groups requests by region, keeps
first/last request in full detail, and deduplicates middle requests with
aggregate statistics (count, RU, min/max/P50/avg latency).

Reference: Azure/azure-sdk-for-rust#3592

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tring/ToJson overloads

Address review feedback: DiagnosticsVerbosity only impacts serialization,
so it belongs on CosmosDiagnostics.ToString()/ToJsonString() methods
rather than on RequestOptions.

Changes:
- Remove DiagnosticsVerbosity from RequestOptions (section 3.3)
- Add ToString(DiagnosticsVerbosity) and ToJsonString(DiagnosticsVerbosity)
  overloads to CosmosDiagnostics (section 3.3)
- Simplify precedence chain (remove RequestOptions level)
- Update implementation plan, work items, and test plan

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…5688)

# Pull Request Template

## Description

Adds contracts and changelog updates to master. No contract changes.

## Type of change

Please delete options that are not relevant.

- [] Bug fix (non-breaking change which fixes an issue)

## Closing issues

To automatically close an issue: closes #IssueNumber

---------

Co-authored-by: Fabian Meiswinkel <fabianm@microsoft.com>
## Description

Fixes #5620

Replaces the unsafe `(T)(object)stream` cast pattern with safe `is`
pattern matching in all `FromStream<T>` serializer implementations
across the SDK.

### Problem

The `FromStream<T>` method in multiple serializer implementations uses
the following pattern:

```csharp
if (typeof(Stream).IsAssignableFrom(typeof(T)))
{
    return (T)(object)stream;
}
```

`typeof(Stream).IsAssignableFrom(typeof(T))` returns `true` when `T` is
`Stream` **or any subclass** (e.g., `MemoryStream`, `FileStream`). If
`T` is a specific `Stream` subclass but the runtime `stream` parameter
is a different `Stream` type, the cast `(T)(object)stream` throws a raw
`InvalidCastException` with no context about what went wrong.

**Example that throws:**
```csharp
// T = MemoryStream, but stream is actually a FileStream at runtime
serializer.FromStream<MemoryStream>(someFileStream); // InvalidCastException!
```

### Fix

Replaced the unsafe cast with safe `is` pattern matching:

```csharp
if (typeof(Stream).IsAssignableFrom(typeof(T)))
{
    if (stream is T typedStream)
    {
        return typedStream;
    }

    throw new InvalidCastException(
        $"Stream of type '{stream.GetType().FullName}' is not compatible "
        + $"with the requested type '{typeof(T).FullName}'.");
}
```

This provides:
- ✅ Safe runtime type checking (no unexpected `InvalidCastException`)
- ✅ A descriptive error message identifying both the actual and expected
types
- ✅ No behavioral change for the common case (`T = Stream`)

### Files Changed

**Core SDK Serializers (2):**
- `Microsoft.Azure.Cosmos/src/Serializer/CosmosJsonDotNetSerializer.cs`
-
`Microsoft.Azure.Cosmos/src/Serializer/CosmosSystemTextJsonSerializer.cs`

**Sample Code (3):**
-
`Microsoft.Azure.Cosmos.Samples/Usage/SystemTextJson/CosmosSystemTextJsonSerializer.cs`
-
`Microsoft.Azure.Cosmos.Samples/Usage/ReEncryption/ReEncryptionSupport/ReEncryptionJsonSerializer.cs`
- `Microsoft.Azure.Cosmos.Samples/Usage/ItemManagement/Program.cs`

**Encryption Modules (2):**
- `Microsoft.Azure.Cosmos.Encryption/src/CosmosJsonDotNetSerializer.cs`
-
`Microsoft.Azure.Cosmos.Encryption.Custom/src/Common/CosmosJsonDotNetSerializer.cs`

**Unit Tests (2):**
-
`Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/CosmosJsonSerializerUnitTests.cs`
— 2 new tests
-
`Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/Json/CosmosSystemTextJsonSerializerTest.cs`
— 2 new tests

### Not Changed

- **`PatchOperationCore{T}.cs`** — Has a similar-looking pattern but
casts FROM `T` TO `Stream` (upcast), which always succeeds. No fix
needed.
- **Test utility serializers** — Internal test code, not
customer-facing.

### Testing

- 4 new unit tests added (2 per serializer):
- `ValidateFromStreamWithBaseStreamType` /
`TestFromStreamWithBaseStreamType` — Confirms
`FromStream<Stream>(memoryStream)` succeeds (regression test)
- `ValidateFromStreamWithIncompatibleStreamTypeThrowsDescriptiveError` /
`TestFromStreamWithIncompatibleStreamTypeThrowsDescriptiveError` —
Confirms `FromStream<FileStream>(memoryStream)` throws
`InvalidCastException` with a descriptive message containing both type
names
- All existing tests continue to pass

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
#5679)

## Summary

This PR adds comprehensive unit test coverage for the FaultInjection
library, increasing the unit test count from 1 to 49.

> **Note:** This PR should be merged **after** PR #5675 (bug fixes) and
PR #5678 (code quality). 8 tests use `Assert.Inconclusive` for
validations that depend on fixes in those PRs — they will become fully
passing once those PRs are merged.

### Changes

**New Test File: `FaultInjectionBuilderValidationTests.cs`**

Contains 4 test classes with 49 test methods:

1. **`FaultInjectionBuilderValidationTests`** (21 tests) — #5670
- `FaultInjectionRuleBuilder`: null/empty ID, null condition/result, hit
limit validation, Gateway+Gone rejection
- `FaultInjectionServerErrorResultBuilder`: injection rate boundaries
(0, 0.5, 1.0, 1.1, -0.5), delay validation for all delay types,
WithDelay on non-delay type
- `FaultInjectionEndpointBuilder`: null database/container/feedRange,
negative replica count, valid build
- `FaultInjectionConnectionErrorResultBuilder`: threshold boundaries,
negative interval
- `FaultInjectionCustomServerErrorResultBuilder`: basic build, injection
rate validation

2. **`FaultInjectorUnitTests`** (5 tests) — #5671
- Null rules constructor, empty rules, GetApplicationContext before
init, unknown activity ID, GetClientOptions

3. **`FaultInjectionApplicationContextUnitTests`** (7 tests) — #5672
- Add/get by rule ID, get by activity ID, non-existent lookups, multiple
executions, GetAllRuleExecutions, concurrent access

4. **`FaultInjectionRuleLifecycleTests`** (16 tests) — #5673
- Enable/disable toggle, uninitialized hit count, default condition (All
types), all operation types enumeration, ToString formatting

**Bug Fix: `FaultInjectionServerErrorResult.ToString()`**
- Fixed `FormatException` caused by unbalanced braces in the format
string (`{3}}` → `{3}}}`).

### Test Results (on master)
- ✅ 41 passed
- ⏭️ 8 skipped (Inconclusive — depend on PRs #5675 and #5678)
- ❌ 0 failed

Parent tracking issue: #5652

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cketsHttpHandler. (#5693)

# Pull Request Template

## Description

Enables EnableMultipleHttp2Connections on SocketsHttpHandler to allow
the SDK to open additional HTTP/2 TCP connections when the concurrent
streams limit on an existing connection is reached, improving throughput
in Gateway (thin client) mode.

## Type of change

Please delete options that are not relevant.

- [X] Bug fix (non-breaking change which fixes an issue)


## Closing issues

To automatically close an issue: closes #IssueNumber
…read requests (#5685)

# Pull Request Template

## Description
Currently, Cosmos DB accounts are configured with a single default
consistency level, and the existing per-request override only allows
weakening it — there's no way to strengthen reads or choose a
fundamentally different read strategy without changing the account-level
setting. This becomes a limitation for workloads where certain reads
need the latest committed data via quorum reads.

ReadConsistencyStrategy introduces a new dimension of control by
allowing applications to specify their desired read behavior (Eventual,
Session, LatestCommitted, GlobalStrong) per-request or per-client,
completely independent of the
account's default consistency. Unlike the existing ConsistencyLevel
override, these strategies map directly to how the Direct layer reads
from replicas — single replica reads for Eventual, session-token-aware
reads for Session, quorum reads with GLSN barrier for LatestCommitted,
and quorum reads with GCLSN barrier for GlobalStrong.

This PR adds the ReadConsistencyStrategy property is exposed on all
read-path request options (ItemRequestOptions, QueryRequestOptions,
ChangeFeedRequestOptions, ReadManyRequestOptions), CosmosClientOptions.
The SDK sets the x-ms-cosmos-read-consistency-strategy header and
propagates the strategy to DocumentServiceRequestContext, where the
Direct package's ConsistencyReader and QuorumReader handle the actual
read mode selection.

## Type of change

Please delete options that are not relevant.

- [X] New feature (non-breaking change which adds functionality)

## Closing issues

To automatically close an issue: closes #IssueNumber
…ew feedback

Address Fabian's review comment: drop the mutable Verbosity property from
CosmosDiagnostics to avoid thread-safety issues. Callers use the explicit
ToString(DiagnosticsVerbosity) / ToJsonString(DiagnosticsVerbosity) overloads
instead. Parameterless ToString() always returns Detailed (backward compat).

Changes:
- Remove Verbosity get/set property from CosmosDiagnostics API surface
- Update CosmosClientOptions doc to clarify it is a config value, not auto-flowed
- Simplify precedence order from 4 levels to 3
- Remove ResponseMessage.cs from modified files list (no verbosity propagation needed)
- Update WI-1 and WI-3 scope/acceptance criteria
- Replace ToString_UsesSummary_WhenVerbosityPropertySet test with
  ToString_Parameterless_AlwaysDetailed test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NaluTripician and others added 2 commits March 24, 2026 11:24
ToString(DiagnosticsVerbosity) already covers this use case, making
ToJsonString(DiagnosticsVerbosity) redundant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move behavioral requirements into openspec/specs/diagnostics-and-observability/spec.md
  using EARS notation (WHEN/THEN/SHALL)
- Create openspec/changes/diagnostics-compaction/ with proposal, design, and tasks
- Remove old specs/diagnostics-compaction.md
- Update spec index README

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician NaluTripician force-pushed the users/nalutripician/diagnostics-compaction-spec branch from ffbab35 to 43f6d9e Compare March 24, 2026 18:34
…/diagnostics-compaction-spec

# Conflicts:
#	openspec/config.yaml
NaluTripician added a commit that referenced this pull request Mar 31, 2026
Brings in the OpenSpec change spec (design, proposal, tasks) for the
diagnostics compaction feature alongside the implementation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician
Copy link
Copy Markdown
Contributor Author

Review Summary

Overall: The specs are impressively thorough — particularly retry-and-failover (324 lines) and diagnostics-and-observability (399 lines). The OpenSpec infrastructure is well-structured. However, there are several accuracy and consistency issues to address before merge.

Findings: 6 Recommendations, 3 Suggestions, 3 Observations (0 Blocking)

Key concerns:

  1. Diagnostics spec includes proposed DiagnosticsVerbosity API as if it already exists (files referenced don't exist yet)
  2. MaxRetryWaitTime default inconsistent across specs (30s vs 60s)
  3. Contradictory requirements around ToString() and verbosity precedence
  4. Prior review feedback (hybrid encoding, cross-partition queries) not captured in alternatives
  5. PR description is stale — still references RequestOptions.DiagnosticsVerbosity which was deferred

Existing comments cross-reference: Found 9 prior comments from @FabianMeiswinkel and @kirankumarkolli on the old specs/diagnostics-compaction.md. Key feedback was addressed (RequestOptions.DiagnosticsVerbosity deferred, CosmosDiagnostics.Verbosity property removed), but some discussion points weren't captured in the new artifacts.

Additional observations:

  • Missing trailing newlines on all 22 new .md files
  • No enforcement mechanism for spec maintenance (no CI check, no PR template checkbox)
  • External dependency on openspec-dev/openspec tool not documented as a risk
  • Distributed transactions spec is very thin (56 lines, mostly open questions) — consider keeping in changes/ until it matures
  • Commit history includes unrelated merge-through commits — consider squashing on merge

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

## References

- Source: `Microsoft.Azure.Cosmos/src/Diagnostics/CosmosDiagnostics.cs`
- Source: `Microsoft.Azure.Cosmos/src/Diagnostics/DiagnosticsVerbosity.cs`
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Maintainability: Spec Accuracy

Spec references files that don't exist yet

The References section lists DiagnosticsVerbosity.cs and DiagnosticsSummaryWriter.cs as source files, but these are files to be created (per design.md). I verified via the GitHub API that neither file exists in the repository today.

Since this spec catalog is positioned as 'authoritative context for implementation' and 'living documentation,' AI agents reading this spec will attempt to reference or call APIs from these non-existent files. More broadly, the entire DiagnosticsVerbosity section (~200 lines of the 399-line spec) describes proposed behavior mixed in with existing behavior, with no visual distinction.

Suggestion: Either (a) add a clear > :construction: PROPOSED — Not yet implemented callout to all DiagnosticsVerbosity sections, or (b) defer adding them to the main spec until the implementation PR merges — the openspec/changes/diagnostics-compaction/ artifacts already capture the full design.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

| Parameter | Default |
|-----------|---------|
| Max retry attempts | 9 |
| Max cumulative wait time | 60 seconds |
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Correctness: Inconsistent Default

MaxRetryWaitTime default disagrees with client-and-configuration spec

This table says 'Max cumulative wait time' defaults to 60 seconds. However, openspec/specs/client-and-configuration/spec.md says MaxRetryWaitTimeOnRateLimitedRequests defaults to 30 seconds.

I verified both values in the source:

  • RetryOptions.DefaultMaxRetryWaitTimeInSeconds = 30 (the public-facing default)
  • ResourceThrottleRetryPolicy.DefaultMaxWaitTimeInSeconds = 60 (internal policy fallback)

The effective customer-facing default is 30 seconds since CosmosClientOptions always passes RetryOptions values to the policy constructor. The 60s value here cites the internal policy fallback, not what customers experience.

Suggestion: Update this table to say 30 seconds (matching the effective default), or add a note: '60s is the internal policy fallback; effective default is 30s per CosmosClientOptions.'


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

When `CosmosClientOptions.MaxDiagnosticsSummarySizeBytes` is set below 4096,
the system shall reject the value (minimum: 4096 bytes).

**Scenario: Verbosity precedence**
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Correctness: Contradictory Requirements

Verbosity precedence contradicts 'Parameterless ToString always returns Detailed'

This section defines a 3-tier precedence: (1) explicit ToString(DiagnosticsVerbosity) parameter → (2) CosmosClientOptions.DiagnosticsVerbosity → (3) default Detailed.

But line 169 says: 'Parameterless ToString() always returns Detailed, regardless of CosmosClientOptions.DiagnosticsVerbosity setting.'

These two requirements contradict each other. If parameterless ToString() always returns Detailed regardless of the client option, then CosmosClientOptions.DiagnosticsVerbosity has no effect — there's no call path that consults it (since ToString(verbosity) already has an explicit parameter, and ToString() ignores it).

Suggestion: Clarify the intent: either (a) parameterless ToString() respects CosmosClientOptions.DiagnosticsVerbosity as its default (remove the 'regardless' clause on line 169), or (b) CosmosClientOptions.DiagnosticsVerbosity only provides a convenience default for callers who construct the verbosity parameter from config (document this explicitly). As written, the option appears useless.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

|------|--------|
| `ContractEnforcementTests.cs` baseline | Update public API contract for new enum and properties |

## Alternatives Considered
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Completeness: Missing Alternative

Hybrid encoding alternative from review discussion not captured

@kirankumarkolli raised 'JSON is verbose, thoughts on encoding which might help?' and @NaluTripician replied with thoughts from offline discussion: 'Consider a hybrid format with JSON text readable summary + encoded information that contains more details for debugging.'

This is a substantive alternative approach that emerged from review, but it's not listed in the Alternatives Considered section. Future readers won't know this option was discussed and evaluated.

Suggestion: Add 'Alternative 4: Hybrid format (readable summary + encoded details)' with pros/cons and decision rationale, even if the decision is to defer. This preserves the design context from the review discussion.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

| **Detailed** (default) | Current behavior — full trace tree output | Debugging, development |
| **Summary** | Region-grouped compaction with first/last + aggregated middle | Production logging, size-constrained environments |

**Key design principle:** The in-memory representation (`ITrace` tree, `ClientSideRequestStatisticsTraceDatum`) stays **unchanged**. Compaction only happens at **serialization time** in the `TraceJsonWriter` path. This preserves full programmatic access to diagnostics data while reducing serialized output size.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Completeness: Scope Gap

Cross-partition query diagnostics growth not addressed

@kirankumarkolli noted 'Cross partition queries are other bigger source of issues.' The proposal focuses on retry-heavy scenarios (429 throttling, failover), but cross-partition queries can produce large diagnostics due to fan-out across many physical partitions — regardless of retries.

A query fanning out to 50 partitions produces 50 separate ClientSideRequestStatisticsTraceDatum entries even without any retries. Summary mode's region-grouping helps somewhat, but the compaction algorithm (first/last + aggregated middle) is designed for the retry pattern, not the fan-out pattern.

Suggestion: Add a 'Scope Limitations' note or expand 'Non-Goals' to acknowledge that cross-partition query fan-out diagnostics are not addressed in this phase, and note whether a future phase could address it.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

- Changing the in-memory `ITrace` tree structure
- Modifying the `Detailed` mode output format
- Adding new programmatic APIs beyond `ToString(DiagnosticsVerbosity)` overload
- Per-request verbosity override via `RequestOptions` (can be added later)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Suggestion · Completeness: Replica Information

Consider preserving replica/endpoint information in summary mode

@NaluTripician noted in PR comments: 'Also important, how can we preserve important information such as which replicas are contacted/failing?' This concern isn't addressed in the current spec.

Summary mode compacts middle requests into aggregated stats, but replica-level detail (which specific replicas were contacted, which failed) is potentially lost. This information is critical for IcM debugging — knowing that replica X in region Y is the one failing is often the key diagnostic signal.

Suggestion: Add a requirement or open question about preserving replica/endpoint information in summary mode output. At minimum, the First and Last entries preserve this, but consider whether aggregated groups should include a distinct endpoint count or list.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants