Azure · NaluTripician · Mar 5, 2026 · Mar 5, 2026 · Mar 5, 2026 · Mar 10, 2026
@@ -62,4 +62,13 @@ Purpose: quick, actionable context so an AI coding assistant can be immediately
   - In VS Code Copilot Chat: `@MsdataDirectSyncAgent sync msdata/direct`.
   - In the Copilot CLI: describe the task naturally (e.g., "sync the msdata/direct branch with master").
 
+- **OpenSpec — Spec-Driven Development**:
+  - The SDK uses [OpenSpec](https://github.com/openspec-dev/openspec) for spec-driven development. Specs live in `openspec/specs/` and capture behavioral contracts for major feature areas.
+  - **Read `openspec/README.md`** for the full developer guide, workflow instructions, and best practices.
+  - Active changes (in-progress work) live in `openspec/changes/`. Archived changes in `openspec/changes/archive/`.
+  - Configuration and project context: `openspec/config.yaml`.
+  - Slash commands: `/opsx:propose` (create change), `/opsx:apply` (implement), `/opsx:explore` (investigate), `/opsx:archive` (complete).
+  - When making changes, check if an existing spec in `openspec/specs/` covers the affected behavior — if so, update the spec alongside the code change.
+  - When proposing a new feature or significant behavioral change, use `/opsx:propose` to create structured artifacts (proposal, design, tasks) before implementing.
+
 If anything here is unclear or you want the file to include additional examples (specific files, common refactor targets, or typical PR reviewers), tell me what to add and I will iterate.
@@ -93,6 +93,7 @@ When evaluating adding new tests, please search in the existing test files if th
 
 1. Create a branch for your contribution (if you are an external contributor, on your own fork).
 1. Make sure your work is adding [tests](#tests) as required (either unit and/or emulator tests depending on the scope of the work).
+1. If your change affects behavior covered by an [OpenSpec spec](#spec-driven-development), update the relevant spec alongside your code changes.
 1. Send a Pull Request to the master branch once your work is ready to be reviewed.
 1. The CI pipeline will start any required tests. If you are an external contributor, a team member will start the verification once we confirm the nature of the contribution through a `/azp run` comment in your Pull Request.
 1. Look for review comments and attempt to answer/address them to the best of your ability.
@@ -136,3 +137,11 @@ Or all through `Re-run failed checks` on the top right corner:
 - [General .NET SDK Troubleshooting](https://docs.microsoft.com/azure/cosmos-db/sql/troubleshoot-dot-net-sdk)
 - [Timeout troubleshooting](https://docs.microsoft.com/azure/cosmos-db/sql/troubleshoot-dot-net-sdk-request-timeout?tabs=cpu-new)
 - [Service unavailable troubleshooting](https://docs.microsoft.com/azure/cosmos-db/sql/troubleshoot-service-unavailable)
+
+## Spec-Driven Development
+
+This repository uses [OpenSpec](https://github.com/openspec-dev/openspec) for spec-driven development. Behavioral specifications for major SDK feature areas live in `openspec/specs/`.
+
+When contributing changes that affect documented behavior, check if an existing spec covers the area and update it as part of your PR. For new features or significant changes, consider using the OpenSpec workflow to propose, design, and implement changes with AI assistance.
+
+See [`openspec/README.md`](openspec/README.md) for the full developer guide, workflow instructions, and best practices.
@@ -0,0 +1,99 @@
+# OpenSpec — Spec-Driven Development for the Azure Cosmos DB .NET SDK
+
+This directory contains [OpenSpec](https://github.com/openspec-dev/openspec) artifacts for the Azure Cosmos DB .NET v3 SDK. OpenSpec provides a structured, AI-assisted workflow for proposing, specifying, designing, and implementing changes.
+
+## Why OpenSpec?
+
+The Cosmos DB .NET SDK is a large, complex codebase (~1,400+ source files) with many interdependent subsystems — retry policies, handler pipelines, cross-region routing, change feed processing, query execution, and more. OpenSpec helps by:
+
+1. **Capturing behavioral contracts** — Specs define *what* a feature should do (invariants, edge cases, error handling), not *how* it's implemented. This makes them durable even as implementation evolves.
+2. **Guiding AI-assisted development** — AI agents use specs as context when proposing and implementing changes, leading to more accurate code generation.
+3. **Reducing tribal knowledge** — Complex features like PPAF, cross-region hedging, and the handler pipeline have subtle invariants that are easy to break. Specs make these invariants explicit and reviewable.
+
+## Directory Structure
+
+```
+openspec/
+├── config.yaml              # Project context and artifact rules
+├── README.md                # This file
+├── specs/                   # Main spec catalog (living documentation)
+│   ├── README.md            # Spec index organized by area
+│   ├── retry-and-failover/
+│   │   └── spec.md
+│   └── ...
+├── changes/                 # Active changes (in-progress work)
+│   └── archive/             # Completed changes
+```
+
+| Concept | Location | Purpose |
+|---------|----------|---------|
+| **Specs** | `openspec/specs/<feature>/spec.md` | Living behavioral contracts for major feature areas. |
+| **Changes** | `openspec/changes/<name>/` | In-progress work with proposal, design, and task artifacts. |
+| **Archive** | `openspec/changes/archive/` | Completed changes with full context preserved. |
+| **Config** | `openspec/config.yaml` | Project context and per-artifact rules that guide AI. |
+
+## Workflow
+
+```
+Propose ──▶ Specs ──▶ Design ──▶ Tasks ──▶ Apply ──▶ Archive
+```
+
+| Command | Purpose |
+|---------|---------|
+| `/opsx:propose <name>` | Create a new change with proposal, design, and task artifacts |
+| `/opsx:apply [name]` | Implement tasks from a change |
+| `/opsx:explore [topic]` | Investigate ideas or problems without making code changes |
+| `/opsx:archive [name]` | Archive a completed change |
+
+## Writing Good Specs
+
+Specs capture **behavioral contracts** using [EARS notation](https://en.wikipedia.org/wiki/Easy_Approach_to_Requirements_Syntax) (WHEN/THEN/SHALL). They should answer: "What do I need to know to safely modify this feature?"
+
+### What a spec should include
+
+1. **Purpose** — One-paragraph summary of what the feature does
+2. **Public API surface** — Key types, methods, and their contracts (C# code blocks)
+3. **Requirements** — Behavioral requirements using EARS notation (`WHEN <condition>, THEN the SDK SHALL <behavior>`)
+4. **Reference tables** — Status code tables, configuration defaults, parameter matrices for dense reference data
+5. **Edge cases** — Non-obvious behaviors, race conditions, failure modes
+6. **Interactions** — How this feature relates to other SDK components (cross-spec links)
+7. **References** — Links to source files and existing design docs
+
+### What a spec should NOT include
+
+- Implementation details (specific variable names, internal algorithms)
+- Performance benchmarks (these change; use test projects instead)
+- Step-by-step code walkthroughs (that's what `docs/SdkDesign.md` is for)
+
+## When to Create or Update Specs
+
+**Create a new spec when:**
+- Adding a new major feature to the SDK
+- An area has complex invariants that are easy to break
+- The same behavioral rules are explained in multiple PR reviews
+
+**Update an existing spec when:**
+- Your PR changes behavior covered by a spec
+- A bug fix reveals an invariant that wasn't captured
+- A design doc in `docs/` gets updated
+
+**Don't need a spec for:**
+- Pure refactoring with no behavioral change
+- Test-only changes, documentation updates, dependency bumps
+
+## Best Practices
+
+| ✅ Do | ❌ Don't |
+|-------|---------|
+| Be specific about invariants (status codes, timeouts) | Copy implementation details into specs |
+| Use EARS notation for requirements | Create a spec per class (group by feature area) |
+| Include cross-spec "Interactions" sections | Let specs go stale |
+| Reference source files by path | Duplicate content from `docs/` |
+| Update specs as part of behavioral change PRs | Skip `/opsx:explore` for complex changes |
+| Review spec diffs in PRs like code | Archive before the PR is merged |
+
+## Related Documentation
+
+- [Spec Index](specs/README.md) — All specs organized by area
+- [SdkDesignGuidelines.md](../SdkDesignGuidelines.md) — Public API contract rules
+- [docs/SdkDesign.md](../docs/SdkDesign.md) — SDK architecture overview
@@ -0,0 +1,105 @@
+# Diagnostics Compaction — Design
+
+## Summary Compaction Algorithm
+
+### Data Collection
+
+Walk the `ITrace` tree (same traversal as `SummaryDiagnostics.CollectSummaryFromTraceTree()`) to collect all `StoreResponseStatistics` and `HttpResponseStatistics` entries from every `ClientSideRequestStatisticsTraceDatum` in the trace hierarchy.
+
+### Region Grouping
+
+Group collected entries by `Region` (string). Entries with a null/empty region are grouped under `"Unknown"`.
+
+### Per-Region Summary
+
+For each region group (ordered chronologically by request start time):
+
+1. **First**: Full details of the chronologically first request
+2. **Last**: Full details of the chronologically last request (omitted if only 1 request)
+3. **Middle entries** (all except first and last): Group by `(StatusCode, SubStatusCode)`:
+   - **Count**: Number of requests in this group
+   - **TotalRequestCharge**: Sum of RU charges
+   - **MinDurationMs / MaxDurationMs / P50DurationMs / AvgDurationMs**: Latency statistics
+
+### Size Enforcement
+
+1. Serialize the summary JSON
+2. If `serializedBytes <= MaxDiagnosticsSummarySizeBytes` → return as-is
+3. If `serializedBytes > MaxDiagnosticsSummarySizeBytes` → return truncated output
+
+### Handling Both Direct and Gateway Requests
+
+Both `StoreResponseStatistics` (direct mode) and `HttpResponseStatistics` (gateway mode) are collected and treated uniformly in the summary. The aggregated groups include entries from both transport paths. An optional `"TransportType"` field (`"Direct"` / `"Gateway"`) can be included in aggregated groups if needed to distinguish.
+
+## Request Flow
+
+```mermaid
+flowchart TD
+    A["ToString(DiagnosticsVerbosity)"] --> B{Verbosity?}
+    B -->|Detailed| C["Existing TraceJsonWriter path"]
+    B -->|Summary| D["DiagnosticsSummaryWriter"]
+    D --> E["Walk ITrace tree"]
+    E --> F["Collect StoreResponseStatistics\n+ HttpResponseStatistics"]
+    F --> G["Group by Region"]
+    G --> H["Per region:\nFirst + Last + Aggregated Middle"]
+    H --> I["Serialize to JSON"]
+    I --> J{Size <= Max?}
+    J -->|Yes| K["Return summary JSON"]
+    J -->|No| L["Return truncated JSON"]
+    C --> M["Return full trace JSON"]
+```
+
+## Files to Create
+
+| File | Description |
+|------|-------------|
+| `Microsoft.Azure.Cosmos/src/Diagnostics/DiagnosticsVerbosity.cs` | `DiagnosticsVerbosity` enum |
+| `Microsoft.Azure.Cosmos/src/Diagnostics/DiagnosticsSummaryWriter.cs` | Summary computation and JSON serialization logic |
+
+## Files to Modify
+
+| File | Change |
+|------|--------|
+| `CosmosClientOptions.cs` | Add `DiagnosticsVerbosity` and `MaxDiagnosticsSummarySizeBytes` properties with validation |
+| `CosmosDiagnostics.cs` | Add `ToString(DiagnosticsVerbosity)` abstract overload |
+| `CosmosTraceDiagnostics.cs` | Implement `ToString(DiagnosticsVerbosity)` overload; delegate to `DiagnosticsSummaryWriter` when verbosity is `Summary` |
+| `TraceWriter.TraceJsonWriter.cs` | Add summary serialization path that delegates to `DiagnosticsSummaryWriter` when verbosity is `Summary` |
+| `SummaryDiagnostics.cs` | Extend `CollectSummaryFromTraceTree()` to support region-grouped collection with ordering |
+| `ClientSideRequestStatisticsTraceDatum.cs` | Ensure `StoreResponseStatistics` and `HttpResponseStatistics` lists are accessible for summary computation |
+
+## Contract/Baseline Updates
+
+| File | Change |
+|------|--------|
+| `ContractEnforcementTests.cs` baseline | Update public API contract for new enum and properties |
+
+## Alternatives Considered
+
+### Alternative 1: Emit summary alongside truncated trace tree
+Instead of replacing the full trace, emit the summary _alongside_ the first + last children of the trace tree.
+
+**Pros:** Preserves some trace structure for tooling that parses it.
+**Cons:** Larger output size; complex to implement; defeats the purpose of compaction.
+**Decision:** Rejected — summary replaces the full trace. The `First` and `Last` entries in each region summary provide the detailed bookends.
+
+### Alternative 2: Per-request verbosity via RequestOptions
+Add a `DiagnosticsVerbosity` property to `RequestOptions` for per-request control.
+
+**Pros:** More granular control.
+**Cons:** Verbosity is a serialization concern, not a request concern. The `ToString(DiagnosticsVerbosity)` overload provides the same flexibility without complicating `RequestOptions`.
+**Decision:** Deferred. Can be added later if needed.
+
+### Alternative 3: Transport type distinction in aggregated groups
+Include a `TransportType` field (`"Direct"` / `"Gateway"`) in each aggregated group.
+
+**Pros:** Helps distinguish transport-specific issues.
+**Cons:** Increases output size; `StatusCode/SubStatusCode` is usually sufficient.
+**Decision:** Deferred. Can add later if customer feedback warrants it.
+
+## Key References
+
+- `Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTraceDiagnostics.cs` — concrete diagnostics implementation
+- `Microsoft.Azure.Cosmos/src/Tracing/TraceWriter.TraceJsonWriter.cs` — current trace serialization
+- `Microsoft.Azure.Cosmos/src/Diagnostics/SummaryDiagnostics.cs` — existing summary aggregation (foundation)
+- `Microsoft.Azure.Cosmos/src/Tracing/TraceData/ClientSideRequestStatisticsTraceDatum.cs` — stats data
+- `docs/SdkDesign.md` — SDK architecture overview
@@ -0,0 +1,71 @@
+# Diagnostics Compaction — Proposal
+
+## Problem
+
+`CosmosDiagnostics.ToString()` produces a JSON trace that grows **unboundedly** with retries. Each retry attempt creates a new child `ITrace` node containing a full `ClientSideRequestStatisticsTraceDatum` with complete `StoreResponseStatistics` and `HttpResponseStatistics` entries. In pathological scenarios (sustained 429 throttling, transient failures, cross-region failovers), a single operation's diagnostics can grow to hundreds of KB.
+
+**Impact:**
+- **Log truncation** — monitoring systems (Application Insights, Azure Monitor, etc.) silently drop oversized log entries
+- **Memory pressure** — large diagnostic strings increase GC overhead, especially at high throughput
+- **Readability** — operators cannot quickly extract signal from noise when hundreds of identical retry entries are listed
+
+**Example scenario:** A point read that encounters 50 retries due to 429 throttling in West US 2, then fails over to East US 2 with 10 more retries, produces ~60 full `StoreResponseStatistics` entries in the trace tree. With summary mode, this compacts to: first request + last request + 1 aggregated group per region.
+
+## Proposed Approach
+
+Introduce a **`DiagnosticsVerbosity`** concept (modeled after [Azure/azure-sdk-for-rust#3592](https://github.com/Azure/azure-sdk-for-rust/pull/3592)) that controls how `CosmosDiagnostics.ToString()` serializes trace data:
+
+| Mode | Behavior | Use Case |
+|------|----------|----------|
+| **Detailed** (default) | Current behavior — full trace tree output | Debugging, development |
+| **Summary** | Region-grouped compaction with first/last + aggregated middle | Production logging, size-constrained environments |
+
+**Key design principle:** The in-memory representation (`ITrace` tree, `ClientSideRequestStatisticsTraceDatum`) stays **unchanged**. Compaction only happens at **serialization time** in the `TraceJsonWriter` path. This preserves full programmatic access to diagnostics data while reducing serialized output size.
+
+## SDK Area
+
+- **Primary:** Diagnostics
+- **Secondary:** Client-config (new options properties)
+
+## Preview vs GA
+
+The `DiagnosticsVerbosity` enum and related options should ship as **GA** (non-preview) since it's an additive, backward-compatible feature with no impact when not opted into.
+
+## Backward Compatibility
+
+- **Default is `Detailed`** — no behavioral change for existing users
+- **No breaking changes** — `ToString()` output format only changes when `Summary` is explicitly opted into
+- **Programmatic API unchanged** — `GetContactedRegions()`, `GetFailedRequestCount()`, etc. continue to work from the full in-memory trace regardless of verbosity
+
+## Rollout Strategy
+
+1. Ship with `Detailed` as default in initial release
+2. Document `Summary` mode in SDK documentation and changelog
+3. Consider making `Summary` the default in a future major version after customer feedback
+
+## Non-Goals
+
+- Changing the in-memory `ITrace` tree structure
+- Modifying the `Detailed` mode output format
+- Adding new programmatic APIs beyond `ToString(DiagnosticsVerbosity)` overload
+- Per-request verbosity override via `RequestOptions` (can be added later)
+
+## Resolved Questions
+
+1. **Should `AggregatedGroups` include an `AvgDurationMs` field?** The Rust SDK only includes min/max/P50. Adding avg is cheap to compute but adds to the output size. _Decision: Include avg. It's a single field and provides useful signal._
+
+2. **Should the summary include the `children` trace tree at all?** Currently proposed as replacing the entire trace output. An alternative is to emit the summary _alongside_ a truncated trace tree (e.g., first + last children only). _Decision: Summary replaces the full trace. The `First` and `Last` entries in each region summary provide the detailed bookends._
+
+3. **Gateway vs Direct distinction in aggregated groups.** Should each `AggregatedGroup` indicate whether it's from Direct or Gateway transport? _Decision: Defer. The `StatusCode/SubStatusCode` combination is usually sufficient. Can add a `TransportType` field later if needed._
+
+4. **Caching.** The Rust SDK caches serialized JSON per verbosity level via `OnceLock`. Should the .NET SDK cache the summary JSON? _Decision: Yes, use `Lazy<string>` or similar. `ToString()` may be called multiple times (logging, telemetry, etc.)._
+
+5. **Thread safety.** `CosmosDiagnostics.Verbosity` as a settable property on a potentially shared object needs consideration. _Decision: Use the `ToString(DiagnosticsVerbosity)` overload which avoids mutating state entirely. The property is set once from `CosmosClientOptions` during response creation and read during serialization._
+
+## References
+
+- **Rust SDK PR:** [Azure/azure-sdk-for-rust#3592](https://github.com/Azure/azure-sdk-for-rust/pull/3592) — `DiagnosticsContext` with `Summary` and `Detailed` modes
+- **Current .NET diagnostics:** `Microsoft.Azure.Cosmos/src/Diagnostics/` and `Microsoft.Azure.Cosmos/src/Tracing/`
+- **Existing summary:** `SummaryDiagnostics.cs` — aggregates `(StatusCode, SubStatusCode)` counts (foundation to build on)
+- **Trace tree:** `ITrace` → `Trace` with recursive children and `ClientSideRequestStatisticsTraceDatum` data
+- **Related spec:** `openspec/specs/diagnostics-and-observability/spec.md`