16 commits
f5f433a
[Internal] OpenSpec: Initialize spec-driven development infrastructure
NaluTripician Mar 5, 2026
982cd98
[Internal] OpenSpec: Create initial spec catalog (14 specs)
NaluTripician Mar 5, 2026
ebbc233
Merge branch 'master' into users/nalutripician/openspec-adoption
NaluTripician Mar 5, 2026
a19ffa6
[Internal] OpenSpec: Enhances spec catalog with merged best-of-both-b…
NaluTripician Mar 10, 2026
dca524b
[Internal] Spec: CosmosDiagnostics compaction — Summary mode
NaluTripician Feb 26, 2026
619b328
[Internal] Spec: Move DiagnosticsVerbosity from RequestOptions to ToS…
NaluTripician Mar 5, 2026
8dce13c
[Internal] Adds: Contracts and changelog update to master for hotfix …
NaluTripician Mar 10, 2026
3556eef
Serializer: Fixes unsafe stream cast in FromStream<T> (#5651)
NaluTripician Mar 11, 2026
56c449a
[FaultInjection] FaultInjection: Adds comprehensive unit test coverag…
NaluTripician Mar 11, 2026
5f7c315
[ThinClient Integration]: Adds Enable Multiple Http2 connection on So…
aavasthy Mar 14, 2026
ccd17b5
Read Consistency Strategy: Adds Read Consistency Strategy option for …
aavasthy Mar 16, 2026
f03c76c
[Internal] Spec: Remove CosmosDiagnostics.Verbosity property per revi…
NaluTripician Mar 17, 2026
4e4a030
Remove ToJsonString(DiagnosticsVerbosity) overload from spec
NaluTripician Mar 17, 2026
43f6d9e
[Internal] Spec: Migrate diagnostics-compaction spec to OpenSpec format
NaluTripician Mar 24, 2026
c2c356f
Merge remote-tracking branch 'origin/master' into users/nalutripician…
NaluTripician Mar 24, 2026
b24564a
Merge branch 'master' into users/nalutripician/diagnostics-compaction…
NaluTripician Apr 7, 2026
9 changes: 9 additions & 0 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -62,4 +62,13 @@ Purpose: quick, actionable context so an AI coding assistant can be immediately
- In VS Code Copilot Chat: `@MsdataDirectSyncAgent sync msdata/direct`.
- In the Copilot CLI: describe the task naturally (e.g., "sync the msdata/direct branch with master").

- **OpenSpec — Spec-Driven Development**:
  - The SDK uses [OpenSpec](https://github.com/openspec-dev/openspec) for spec-driven development. Specs live in `openspec/specs/` and capture behavioral contracts for major feature areas.
  - **Read `openspec/README.md`** for the full developer guide, workflow instructions, and best practices.
  - Active changes (in-progress work) live in `openspec/changes/`. Archived changes live in `openspec/changes/archive/`.
  - Configuration and project context: `openspec/config.yaml`.
  - Slash commands: `/opsx:propose` (create change), `/opsx:apply` (implement), `/opsx:explore` (investigate), `/opsx:archive` (complete).
  - When making changes, check whether an existing spec in `openspec/specs/` covers the affected behavior; if so, update the spec alongside the code change.
  - When proposing a new feature or significant behavioral change, use `/opsx:propose` to create structured artifacts (proposal, design, tasks) before implementing.

If anything here is unclear or you want the file to include additional examples (specific files, common refactor targets, or typical PR reviewers), tell me what to add and I will iterate.
9 changes: 9 additions & 0 deletions CONTRIBUTING.md
@@ -93,6 +93,7 @@ When evaluating adding new tests, please search in the existing test files if th

1. Create a branch for your contribution (if you are an external contributor, on your own fork).
1. Make sure your work is adding [tests](#tests) as required (either unit and/or emulator tests depending on the scope of the work).
1. If your change affects behavior covered by an [OpenSpec spec](#spec-driven-development), update the relevant spec alongside your code changes.
1. Send a Pull Request to the master branch once your work is ready to be reviewed.
1. The CI pipeline will start any required tests. If you are an external contributor, a team member will start the verification once we confirm the nature of the contribution through a `/azp run` comment in your Pull Request.
1. Look for review comments and attempt to answer/address them to the best of your ability.
@@ -136,3 +137,11 @@ Or all through `Re-run failed checks` on the top right corner:
- [General .NET SDK Troubleshooting](https://docs.microsoft.com/azure/cosmos-db/sql/troubleshoot-dot-net-sdk)
- [Timeout troubleshooting](https://docs.microsoft.com/azure/cosmos-db/sql/troubleshoot-dot-net-sdk-request-timeout?tabs=cpu-new)
- [Service unavailable troubleshooting](https://docs.microsoft.com/azure/cosmos-db/sql/troubleshoot-service-unavailable)

## Spec-Driven Development

This repository uses [OpenSpec](https://github.com/openspec-dev/openspec) for spec-driven development. Behavioral specifications for major SDK feature areas live in `openspec/specs/`.

When contributing changes that affect documented behavior, check if an existing spec covers the area and update it as part of your PR. For new features or significant changes, consider using the OpenSpec workflow to propose, design, and implement changes with AI assistance.

See [`openspec/README.md`](openspec/README.md) for the full developer guide, workflow instructions, and best practices.
99 changes: 99 additions & 0 deletions openspec/README.md
@@ -0,0 +1,99 @@
# OpenSpec — Spec-Driven Development for the Azure Cosmos DB .NET SDK

This directory contains [OpenSpec](https://github.com/openspec-dev/openspec) artifacts for the Azure Cosmos DB .NET v3 SDK. OpenSpec provides a structured, AI-assisted workflow for proposing, specifying, designing, and implementing changes.

## Why OpenSpec?

The Cosmos DB .NET SDK is a large, complex codebase (more than 1,400 source files) with many interdependent subsystems: retry policies, handler pipelines, cross-region routing, change feed processing, query execution, and more. OpenSpec helps by:

1. **Capturing behavioral contracts** — Specs define *what* a feature should do (invariants, edge cases, error handling), not *how* it's implemented. This makes them durable even as implementation evolves.
2. **Guiding AI-assisted development** — AI agents use specs as context when proposing and implementing changes, leading to more accurate code generation.
3. **Reducing tribal knowledge** — Complex features like PPAF, cross-region hedging, and the handler pipeline have subtle invariants that are easy to break. Specs make these invariants explicit and reviewable.

## Directory Structure

```
openspec/
├── config.yaml                  # Project context and artifact rules
├── README.md                    # This file
├── specs/                       # Main spec catalog (living documentation)
│   ├── README.md                # Spec index organized by area
│   ├── retry-and-failover/
│   │   └── spec.md
│   └── ...
└── changes/                     # Active changes (in-progress work)
    └── archive/                 # Completed changes
```

| Concept | Location | Purpose |
|---------|----------|---------|
| **Specs** | `openspec/specs/<feature>/spec.md` | Living behavioral contracts for major feature areas. |
| **Changes** | `openspec/changes/<name>/` | In-progress work with proposal, design, and task artifacts. |
| **Archive** | `openspec/changes/archive/` | Completed changes with full context preserved. |
| **Config** | `openspec/config.yaml` | Project context and per-artifact rules that guide AI. |

## Workflow

```
Propose ──▶ Specs ──▶ Design ──▶ Tasks ──▶ Apply ──▶ Archive
```

| Command | Purpose |
|---------|---------|
| `/opsx:propose <name>` | Create a new change with proposal, design, and task artifacts |
| `/opsx:apply [name]` | Implement tasks from a change |
| `/opsx:explore [topic]` | Investigate ideas or problems without making code changes |
| `/opsx:archive [name]` | Archive a completed change |

## Writing Good Specs

Specs capture **behavioral contracts** using [EARS notation](https://en.wikipedia.org/wiki/Easy_Approach_to_Requirements_Syntax) (WHEN/THEN/SHALL). They should answer: "What do I need to know to safely modify this feature?"

### What a spec should include

1. **Purpose** — One-paragraph summary of what the feature does
2. **Public API surface** — Key types, methods, and their contracts (C# code blocks)
3. **Requirements** — Behavioral requirements using EARS notation (`WHEN <condition>, THEN the SDK SHALL <behavior>`)
4. **Reference tables** — Status code tables, configuration defaults, parameter matrices for dense reference data
5. **Edge cases** — Non-obvious behaviors, race conditions, failure modes
6. **Interactions** — How this feature relates to other SDK components (cross-spec links)
7. **References** — Links to source files and existing design docs
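
As a rough illustration of the requirement template in item 3, the sketch below checks that a line follows the `WHEN <condition>, THEN the SDK SHALL <behavior>` shape. It is written in Python purely for illustration; the regular expression and the `parse_requirement` helper are hypothetical and are not part of OpenSpec or the SDK. The sample sentence restates the region-grouping rule from the diagnostics-compaction design.

```python
import re

# Hypothetical pattern for the WHEN/THEN/SHALL template described above.
REQUIREMENT = re.compile(
    r"^WHEN (?P<condition>.+?), THEN the SDK SHALL (?P<behavior>.+)$"
)


def parse_requirement(line):
    """Return the condition and behavior of a requirement, or None if malformed."""
    match = REQUIREMENT.match(line.strip())
    return match.groupdict() if match else None


good = parse_requirement(
    'WHEN an entry has a null or empty region, '
    'THEN the SDK SHALL group it under "Unknown"'
)
bad = parse_requirement("The SDK groups entries by region.")
```

A check like this could run in review tooling to flag requirements that drift away from the template, although whether to enforce the format mechanically is a team decision.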

### What a spec should NOT include

- Implementation details (specific variable names, internal algorithms)
- Performance benchmarks (these change; use test projects instead)
- Step-by-step code walkthroughs (that's what `docs/SdkDesign.md` is for)

## When to Create or Update Specs

**Create a new spec when:**
- Adding a new major feature to the SDK
- An area has complex invariants that are easy to break
- The same behavioral rules are explained in multiple PR reviews

**Update an existing spec when:**
- Your PR changes behavior covered by a spec
- A bug fix reveals an invariant that wasn't captured
- A design doc in `docs/` gets updated

**No spec is needed for:**
- Pure refactoring with no behavioral change
- Test-only changes, documentation updates, dependency bumps

## Best Practices

| ✅ Do | ❌ Don't |
|-------|---------|
| Be specific about invariants (status codes, timeouts) | Copy implementation details into specs |
| Use EARS notation for requirements | Create a spec per class (group by feature area) |
| Include cross-spec "Interactions" sections | Let specs go stale |
| Reference source files by path | Duplicate content from `docs/` |
| Update specs as part of behavioral change PRs | Skip `/opsx:explore` for complex changes |
| Review spec diffs in PRs like code | Archive before the PR is merged |

## Related Documentation

- [Spec Index](specs/README.md) — All specs organized by area
- [SdkDesignGuidelines.md](../SdkDesignGuidelines.md) — Public API contract rules
- [docs/SdkDesign.md](../docs/SdkDesign.md) — SDK architecture overview
105 changes: 105 additions & 0 deletions openspec/changes/diagnostics-compaction/design.md
@@ -0,0 +1,105 @@
# Diagnostics Compaction — Design

## Summary Compaction Algorithm

### Data Collection

Walk the `ITrace` tree (same traversal as `SummaryDiagnostics.CollectSummaryFromTraceTree()`) to collect all `StoreResponseStatistics` and `HttpResponseStatistics` entries from every `ClientSideRequestStatisticsTraceDatum` in the trace hierarchy.
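
The traversal can be sketched as follows. This is a language-neutral Python illustration, not the SDK's C# code: `TraceNode`, its `data` and `children` fields, and the dictionary keys are hypothetical stand-ins for `ITrace` and `ClientSideRequestStatisticsTraceDatum`.

```python
from dataclasses import dataclass, field


@dataclass
class TraceNode:
    """Hypothetical stand-in for an ITrace node."""
    name: str
    # Each datum mimics a ClientSideRequestStatisticsTraceDatum holding two lists.
    data: list = field(default_factory=list)
    children: list = field(default_factory=list)


def collect_stats(root):
    """Depth-first walk that gathers every store and HTTP statistics entry."""
    collected = []
    stack = [root]
    while stack:
        node = stack.pop()
        for datum in node.data:
            collected.extend(datum.get("store_response_statistics", []))
            collected.extend(datum.get("http_response_statistics", []))
        # Push children reversed so they are visited in their original order.
        stack.extend(reversed(node.children))
    return collected


leaf = TraceNode("retry-1", data=[{"store_response_statistics": [{"status": 429}]}])
gateway = TraceNode("gateway", data=[{"http_response_statistics": [{"status": 200}]}])
root = TraceNode("ReadItem", children=[leaf, gateway])
stats = collect_stats(root)
```

An explicit stack is used instead of recursion so that deep trace trees do not hit the interpreter's recursion limit; a recursive walk would behave the same way on small trees.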

### Region Grouping

Group collected entries by `Region` (string). Entries with a null/empty region are grouped under `"Unknown"`.
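
A minimal sketch of this grouping rule, in Python for illustration only. The entry shape (a dictionary with a `region` key) is an assumption made for the example, not the SDK's data model.

```python
from collections import defaultdict


def group_by_region(entries):
    """Group entries by region; null or empty regions fall under "Unknown"."""
    groups = defaultdict(list)
    for entry in entries:
        region = entry.get("region") or "Unknown"
        groups[region].append(entry)
    return dict(groups)


entries = [
    {"region": "West US 2", "status": 429},
    {"region": None, "status": 200},
    {"region": "", "status": 503},
    {"region": "West US 2", "status": 200},
]
groups = group_by_region(entries)
```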

### Per-Region Summary

For each region group (ordered chronologically by request start time):

1. **First**: Full details of the chronologically first request
2. **Last**: Full details of the chronologically last request (omitted if only 1 request)
3. **Middle entries** (all except first and last): Group by `(StatusCode, SubStatusCode)`:
   - **Count**: Number of requests in this group
   - **TotalRequestCharge**: Sum of RU charges
   - **MinDurationMs / MaxDurationMs / P50DurationMs / AvgDurationMs**: Latency statistics
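
The per-region steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the SDK's implementation: the request dictionaries and their keys (`start`, `status`, `sub_status`, `charge`, `duration_ms`) are hypothetical, the sample values are made up, and P50 is taken to be the median.

```python
import statistics
from collections import defaultdict


def summarize_region(requests):
    """Summarize one non-empty region group: First, Last, aggregated middle."""
    ordered = sorted(requests, key=lambda r: r["start"])
    summary = {"First": ordered[0]}
    if len(ordered) > 1:
        summary["Last"] = ordered[-1]  # Omitted when there is only one request.

    buckets = defaultdict(list)
    for req in ordered[1:-1]:  # Everything except the first and last request.
        buckets[(req["status"], req["sub_status"])].append(req)

    groups = []
    for (status, sub_status), group in buckets.items():
        durations = [r["duration_ms"] for r in group]
        groups.append({
            "StatusCode": status,
            "SubStatusCode": sub_status,
            "Count": len(group),
            "TotalRequestCharge": sum(r["charge"] for r in group),
            "MinDurationMs": min(durations),
            "MaxDurationMs": max(durations),
            "P50DurationMs": statistics.median(durations),  # Assumption: P50 = median.
            "AvgDurationMs": statistics.fmean(durations),
        })
    summary["AggregatedGroups"] = groups
    return summary


requests = [
    {"start": i, "status": 429, "sub_status": 3200, "charge": 1.0,
     "duration_ms": 10 * (i + 1)}
    for i in range(5)
]
summary = summarize_region(requests)
```

With five requests, the three middle ones collapse into a single aggregated group, so the output size depends on the number of distinct status pairs rather than on the number of retries.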

### Size Enforcement

1. Serialize the summary JSON
2. If `serializedBytes <= MaxDiagnosticsSummarySizeBytes` → return as-is
3. If `serializedBytes > MaxDiagnosticsSummarySizeBytes` → return truncated output
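
A rough sketch of the size check, in Python for illustration. The limit value and the shape of the truncated output are assumptions: the design states only that truncated output is returned, not what it looks like.

```python
import json

MAX_SUMMARY_BYTES = 256  # Hypothetical value for MaxDiagnosticsSummarySizeBytes.


def enforce_size(summary, max_bytes=MAX_SUMMARY_BYTES):
    """Return the summary JSON, or a small truncation marker if it is too large."""
    payload = json.dumps(summary, separators=(",", ":"))
    size = len(payload.encode("utf-8"))  # Measure bytes, not characters.
    if size <= max_bytes:
        return payload
    # Assumption: this fallback shape is illustrative; the design does not define it.
    return json.dumps(
        {"Truncated": True, "OriginalSizeBytes": size}, separators=(",", ":")
    )


small = enforce_size({"a": 1})
large = enforce_size({"a": "x" * 1000})
```

The check measures encoded bytes because a byte limit and a character count diverge as soon as the payload contains non-ASCII text.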

### Handling Both Direct and Gateway Requests

Both `StoreResponseStatistics` (direct mode) and `HttpResponseStatistics` (gateway mode) are collected and treated uniformly in the summary, so an aggregated group can contain entries from both transport paths. A `"TransportType"` field (`"Direct"` / `"Gateway"`) that would distinguish them is deferred; see Alternative 3 under Alternatives Considered.

## Request Flow

```mermaid
flowchart TD
A["ToString(DiagnosticsVerbosity)"] --> B{Verbosity?}
B -->|Detailed| C["Existing TraceJsonWriter path"]
B -->|Summary| D["DiagnosticsSummaryWriter"]
D --> E["Walk ITrace tree"]
E --> F["Collect StoreResponseStatistics\n+ HttpResponseStatistics"]
F --> G["Group by Region"]
G --> H["Per region:\nFirst + Last + Aggregated Middle"]
H --> I["Serialize to JSON"]
I --> J{Size <= Max?}
J -->|Yes| K["Return summary JSON"]
J -->|No| L["Return truncated JSON"]
C --> M["Return full trace JSON"]
```

## Files to Create

| File | Description |
|------|-------------|
| `Microsoft.Azure.Cosmos/src/Diagnostics/DiagnosticsVerbosity.cs` | `DiagnosticsVerbosity` enum |
| `Microsoft.Azure.Cosmos/src/Diagnostics/DiagnosticsSummaryWriter.cs` | Summary computation and JSON serialization logic |

## Files to Modify

| File | Change |
|------|--------|
| `CosmosClientOptions.cs` | Add `DiagnosticsVerbosity` and `MaxDiagnosticsSummarySizeBytes` properties with validation |
| `CosmosDiagnostics.cs` | Add `ToString(DiagnosticsVerbosity)` abstract overload |
| `CosmosTraceDiagnostics.cs` | Implement `ToString(DiagnosticsVerbosity)` overload; delegate to `DiagnosticsSummaryWriter` when verbosity is `Summary` |
| `TraceWriter.TraceJsonWriter.cs` | Add summary serialization path that delegates to `DiagnosticsSummaryWriter` when verbosity is `Summary` |
| `SummaryDiagnostics.cs` | Extend `CollectSummaryFromTraceTree()` to support region-grouped collection with ordering |
| `ClientSideRequestStatisticsTraceDatum.cs` | Ensure `StoreResponseStatistics` and `HttpResponseStatistics` lists are accessible for summary computation |

## Contract/Baseline Updates

| File | Change |
|------|--------|
| `ContractEnforcementTests.cs` baseline | Update public API contract for new enum and properties |

## Alternatives Considered
Review comment (Contributor, Author):
🟡 Recommendation · Completeness: Missing Alternative

Hybrid encoding alternative from review discussion not captured

@kirankumarkolli raised 'JSON is verbose, thoughts on encoding which might help?' and @NaluTripician replied with thoughts from offline discussion: 'Consider a hybrid format with JSON text readable summary + encoded information that contains more details for debugging.'

This is a substantive alternative approach that emerged from review, but it's not listed in the Alternatives Considered section. Future readers won't know this option was discussed and evaluated.

Suggestion: Add 'Alternative 4: Hybrid format (readable summary + encoded details)' with pros/cons and decision rationale, even if the decision is to defer. This preserves the design context from the review discussion.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.


### Alternative 1: Emit summary alongside truncated trace tree
Instead of replacing the full trace, emit the summary _alongside_ the first + last children of the trace tree.

**Pros:** Preserves some trace structure for tooling that parses it.
**Cons:** Larger output size; complex to implement; defeats the purpose of compaction.
**Decision:** Rejected — summary replaces the full trace. The `First` and `Last` entries in each region summary provide the detailed bookends.

### Alternative 2: Per-request verbosity via RequestOptions
Add a `DiagnosticsVerbosity` property to `RequestOptions` for per-request control.

**Pros:** More granular control.
**Cons:** Verbosity is a serialization concern, not a request concern. The `ToString(DiagnosticsVerbosity)` overload provides the same flexibility without complicating `RequestOptions`.
**Decision:** Deferred. Can be added later if needed.

### Alternative 3: Transport type distinction in aggregated groups
Include a `TransportType` field (`"Direct"` / `"Gateway"`) in each aggregated group.

**Pros:** Helps distinguish transport-specific issues.
**Cons:** Increases output size; `StatusCode/SubStatusCode` is usually sufficient.
**Decision:** Deferred. Can add later if customer feedback warrants it.

## Key References

- `Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTraceDiagnostics.cs` — concrete diagnostics implementation
- `Microsoft.Azure.Cosmos/src/Tracing/TraceWriter.TraceJsonWriter.cs` — current trace serialization
- `Microsoft.Azure.Cosmos/src/Diagnostics/SummaryDiagnostics.cs` — existing summary aggregation (foundation)
- `Microsoft.Azure.Cosmos/src/Tracing/TraceData/ClientSideRequestStatisticsTraceDatum.cs` — stats data
- `docs/SdkDesign.md` — SDK architecture overview
71 changes: 71 additions & 0 deletions openspec/changes/diagnostics-compaction/proposal.md
@@ -0,0 +1,71 @@
# Diagnostics Compaction — Proposal

## Problem

`CosmosDiagnostics.ToString()` produces a JSON trace that grows **unboundedly** with retries. Each retry attempt creates a new child `ITrace` node containing a full `ClientSideRequestStatisticsTraceDatum` with complete `StoreResponseStatistics` and `HttpResponseStatistics` entries. In pathological scenarios (sustained 429 throttling, transient failures, cross-region failovers), a single operation's diagnostics can grow to hundreds of KB.

**Impact:**
- **Log truncation** — monitoring systems (Application Insights, Azure Monitor, etc.) silently drop oversized log entries
- **Memory pressure** — large diagnostic strings increase GC overhead, especially at high throughput
- **Readability** — operators cannot quickly extract signal from noise when hundreds of identical retry entries are listed

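**Example scenario:** A point read that encounters 50 retries due to 429 throttling in West US 2, then fails over to East US 2 with 10 more retries, produces ~60 full `StoreResponseStatistics` entries in the trace tree. With summary mode, each region compacts to its first request, its last request, and one aggregated group per distinct `(StatusCode, SubStatusCode)` pair among the middle requests (a single group per region when every middle request is a 429 with the same sub-status).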
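
The arithmetic behind this scenario can be worked through in a few lines of Python. The attempt counts are a simplification of the scenario (50 entries in one region, 10 in the other, all sharing one status pair), chosen to match the ~60 figure.

```python
# Simplified scenario: 50 throttled attempts in West US 2, then 10 in East US 2.
attempts = {"West US 2": 50, "East US 2": 10}

detailed_entries = sum(attempts.values())

summary_items = 0
for region, count in attempts.items():
    first_and_last = min(count, 2)
    # All middle requests share one (StatusCode, SubStatusCode) pair here.
    aggregated_groups = 1 if count > 2 else 0
    summary_items += first_and_last + aggregated_groups
```

Under these assumptions the detailed output carries 60 entries while the summary carries 6 items, and the summary count stays at 6 however many additional retries occur in either region.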

## Proposed Approach

Introduce a **`DiagnosticsVerbosity`** concept (modeled after [Azure/azure-sdk-for-rust#3592](https://github.com/Azure/azure-sdk-for-rust/pull/3592)) that controls how `CosmosDiagnostics.ToString()` serializes trace data:

| Mode | Behavior | Use Case |
|------|----------|----------|
| **Detailed** (default) | Current behavior — full trace tree output | Debugging, development |
| **Summary** | Region-grouped compaction with first/last + aggregated middle | Production logging, size-constrained environments |

**Key design principle:** The in-memory representation (`ITrace` tree, `ClientSideRequestStatisticsTraceDatum`) stays **unchanged**. Compaction only happens at **serialization time** in the `TraceJsonWriter` path. This preserves full programmatic access to diagnostics data while reducing serialized output size.
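
The principle can be illustrated with a small Python sketch: one in-memory trace, two serializations, and no mutation. The `to_string` function and the placeholder summary shape are hypothetical; the real summary is the region-grouped structure described in the design document.

```python
import copy
import json
from enum import Enum


class DiagnosticsVerbosity(Enum):
    DETAILED = "Detailed"
    SUMMARY = "Summary"


def to_string(trace, verbosity=DiagnosticsVerbosity.DETAILED):
    """Serialize the same in-memory trace differently depending on verbosity."""
    if verbosity is DiagnosticsVerbosity.SUMMARY:
        # Placeholder compaction: stands in for the region-grouped summary.
        compact = {"name": trace["name"], "RequestCount": len(trace["children"])}
        return json.dumps(compact)
    return json.dumps(trace)


trace = {"name": "ReadItem", "children": [{"status": 429}] * 50}
before = copy.deepcopy(trace)
detailed = to_string(trace)
summary = to_string(trace, DiagnosticsVerbosity.SUMMARY)
```

Because the trace is only read, programmatic accessors keep seeing the full data regardless of which verbosity was last serialized.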
Review comment (Contributor, Author):
🟡 Recommendation · Completeness: Scope Gap

Cross-partition query diagnostics growth not addressed

@kirankumarkolli noted 'Cross partition queries are other bigger source of issues.' The proposal focuses on retry-heavy scenarios (429 throttling, failover), but cross-partition queries can produce large diagnostics due to fan-out across many physical partitions — regardless of retries.

A query fanning out to 50 partitions produces 50 separate ClientSideRequestStatisticsTraceDatum entries even without any retries. Summary mode's region-grouping helps somewhat, but the compaction algorithm (first/last + aggregated middle) is designed for the retry pattern, not the fan-out pattern.

Suggestion: Add a 'Scope Limitations' note or expand 'Non-Goals' to acknowledge that cross-partition query fan-out diagnostics are not addressed in this phase, and note whether a future phase could address it.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.


## SDK Area

- **Primary:** Diagnostics
- **Secondary:** Client-config (new options properties)

## Preview vs GA

The `DiagnosticsVerbosity` enum and related options should ship as **GA** (non-preview) because they are additive and backward-compatible, with no impact unless opted into.

## Backward Compatibility

- **Default is `Detailed`** — no behavioral change for existing users
- **No breaking changes** — `ToString()` output format only changes when `Summary` is explicitly opted into
- **Programmatic API unchanged** — `GetContactedRegions()`, `GetFailedRequestCount()`, etc. continue to work from the full in-memory trace regardless of verbosity

## Rollout Strategy

1. Ship with `Detailed` as default in initial release
2. Document `Summary` mode in SDK documentation and changelog
3. Consider making `Summary` the default in a future major version after customer feedback

## Non-Goals

- Changing the in-memory `ITrace` tree structure
- Modifying the `Detailed` mode output format
- Adding new programmatic APIs beyond the `ToString(DiagnosticsVerbosity)` overload
- Per-request verbosity override via `RequestOptions` (can be added later)
Review comment (Contributor, Author):
🟢 Suggestion · Completeness: Replica Information

Consider preserving replica/endpoint information in summary mode

@NaluTripician noted in PR comments: 'Also important, how can we preserve important information such as which replicas are contacted/failing?' This concern isn't addressed in the current spec.

Summary mode compacts middle requests into aggregated stats, but replica-level detail (which specific replicas were contacted, which failed) is potentially lost. This information is critical for IcM debugging — knowing that replica X in region Y is the one failing is often the key diagnostic signal.

Suggestion: Add a requirement or open question about preserving replica/endpoint information in summary mode output. At minimum, the First and Last entries preserve this, but consider whether aggregated groups should include a distinct endpoint count or list.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.


## Resolved Questions

1. **Should `AggregatedGroups` include an `AvgDurationMs` field?** The Rust SDK only includes min/max/P50. Adding avg is cheap to compute but adds to the output size. _Decision: Include avg. It's a single field and provides useful signal._

2. **Should the summary include the `children` trace tree at all?** Currently proposed as replacing the entire trace output. An alternative is to emit the summary _alongside_ a truncated trace tree (e.g., first + last children only). _Decision: Summary replaces the full trace. The `First` and `Last` entries in each region summary provide the detailed bookends._

3. **Gateway vs Direct distinction in aggregated groups.** Should each `AggregatedGroup` indicate whether it's from Direct or Gateway transport? _Decision: Defer. The `StatusCode/SubStatusCode` combination is usually sufficient. Can add a `TransportType` field later if needed._

4. **Caching.** The Rust SDK caches serialized JSON per verbosity level via `OnceLock`. Should the .NET SDK cache the summary JSON? _Decision: Yes, use `Lazy<string>` or similar. `ToString()` may be called multiple times (logging, telemetry, etc.)._

5. **Thread safety.** `CosmosDiagnostics.Verbosity` as a settable property on a potentially shared object needs consideration. _Decision: Use the `ToString(DiagnosticsVerbosity)` overload, which avoids mutating state entirely. The property is set once from `CosmosClientOptions` during response creation and read during serialization._
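
The caching decision in question 4 can be sketched as per-verbosity lazy serialization. This is a Python illustration only: the `Diagnostics` class, its properties, and the counter are hypothetical, and `functools.cached_property` does not by itself guarantee single execution under concurrent access, so a thread-safe lazy wrapper would need its own consideration.

```python
import json
from functools import cached_property


class Diagnostics:
    """Hypothetical holder that serializes each verbosity at most once."""

    def __init__(self, trace):
        self._trace = trace
        self.serialization_count = 0  # Only here to make the caching observable.

    @cached_property
    def detailed_json(self):
        self.serialization_count += 1
        return json.dumps(self._trace)

    @cached_property
    def summary_json(self):
        self.serialization_count += 1
        # Placeholder summary shape, not the real region-grouped output.
        return json.dumps({"RequestCount": len(self._trace["children"])})


diagnostics = Diagnostics({"children": [1, 2, 3]})
first = diagnostics.summary_json
second = diagnostics.summary_json  # Served from the cache.
```

Each verbosity is cached independently, so logging the summary repeatedly never pays for the detailed serialization.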

## References

- **Rust SDK PR:** [Azure/azure-sdk-for-rust#3592](https://github.com/Azure/azure-sdk-for-rust/pull/3592) — `DiagnosticsContext` with `Summary` and `Detailed` modes
- **Current .NET diagnostics:** `Microsoft.Azure.Cosmos/src/Diagnostics/` and `Microsoft.Azure.Cosmos/src/Tracing/`
- **Existing summary:** `SummaryDiagnostics.cs` — aggregates `(StatusCode, SubStatusCode)` counts (foundation to build on)
- **Trace tree:** `ITrace` → `Trace` with recursive children and `ClientSideRequestStatisticsTraceDatum` data
- **Related spec:** `openspec/specs/diagnostics-and-observability/spec.md`