-
Notifications
You must be signed in to change notification settings - Fork 533
[Internal] Distributed Transactions: Adds openspec proposal for SDK implementation of Distributed Transact… #5781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| schema: spec-driven | ||
| created: 2026-04-10 |
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| ## Why | ||
|
|
||
| Azure Cosmos DB provides full ACID transactions within a single logical partition via `TransactionalBatch` and stored procedures. Applications that need atomicity across multiple partitions or containers must hand-roll compensating patterns (Saga, event sourcing) at significant complexity cost. Two gaps exist today within a single database account: | ||
|
|
||
| 1. **Cross-partition write atomicity** — there is no API to atomically commit mutations spanning multiple partition keys or containers. Any scenario (ledger transfers, order+inventory, multi-entity state machines) requires application-level compensating logic that is error-prone and non-portable across languages. | ||
|
|
||
| 2. **Consistent cross-partition reads** — `ReadManyItemsAsync` fans out parallel queries per partition range with no snapshot guarantee, making it vulnerable to read skew: two items returned in the same call can reflect different logical instants (e.g., one item seen before a concurrent transfer and another after it). | ||
|
|
||
| The Cosmos DB Gateway exposes a distributed transaction coordinator at `POST /operations/dtc`. The same endpoint handles both atomic writes and consistent snapshot reads — the `operationType` field in the JSON request body (`"Write"` or `"Read"`) selects the behavior. | ||
|
|
||
| ## What Changes | ||
|
|
||
| - Add `CosmosClient.CreateDistributedWriteTransaction()` returning a new `DistributedWriteTransaction` builder. | ||
| - Add `CosmosClient.CreateDistributedReadTransaction()` returning a new `DistributedReadTransaction` builder. | ||
| - `DistributedWriteTransaction` accumulates Create, Replace, Delete, Upsert, and Patch operations across any partitions and containers within the same account, then commits them atomically via `CommitTransactionAsync()`. | ||
| - `DistributedReadTransaction` accumulates ReadItem operations across any partitions and containers, then executes them as a single consistent server-side snapshot via `CommitTransactionAsync()`. | ||
| - New wire operations: `CommitDistributedTransaction` (reused for both write and read transactions; the distinction is in the request body). | ||
| - Session tokens are merged into the client's `ISessionContainer` after every commit or execute. | ||
| - Per-operation ETags are returned in both write and read responses, enabling ETag-gated follow-up writes. | ||
|
|
||
| ## Capabilities | ||
|
|
||
| ### New Capabilities | ||
|
|
||
| - `distributed-write-transaction`: Adds `CosmosClient.CreateDistributedWriteTransaction()`, `DistributedWriteTransaction`, `DistributedTransactionResponse`, and related types for cross-partition, cross-container atomic writes within the same database account. | ||
| - `distributed-read-transaction`: Adds `CosmosClient.CreateDistributedReadTransaction()`, `DistributedReadTransaction`, and related types for cross-partition, cross-container consistent snapshot reads within the same database account. Reuses `DistributedTransactionResponse` as the shared response type for both write and read transactions. | ||
|
|
||
| ### Modified Capabilities | ||
|
|
||
| - `client-and-configuration`: `CosmosClient` gains two new factory methods: `CreateDistributedWriteTransaction()` and `CreateDistributedReadTransaction()`. | ||
|
|
||
| ## Impact | ||
|
|
||
| - **Public API surface**: New abstract classes `DistributedWriteTransaction` and `DistributedReadTransaction`; a single shared `DistributedTransactionResponse` response type for both (with `IdempotencyToken` as `Guid?`, non-null for writes, null for reads); new options classes; two new factory methods on `CosmosClient`. Initially behind `#if INTERNAL`; promoted to public at GA. | ||
| - **Wire protocol**: Uses `POST /operations/dtc` for reads and writes. | ||
| - **Gateway routing**: `GatewayStoreModel`, `GatewayStoreClient`, and `RequestInvokerHandler` gain awareness of both operation types; `x-ms-session-token` is suppressed for both (per-partition tokens travel in the body). | ||
| - **Existing behavior**: No change for any existing API. Both features are entirely additive. | ||
| - **Tests**: New unit tests (serialization, response parsing, session token merge) and emulator integration tests for both write and read transactions. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,132 @@ | ||
| ## ADDED Requirements | ||
|
|
||
| ### Requirement: Supported read operations span multiple partitions and containers | ||
|
|
||
| `DistributedReadTransaction` SHALL support ReadItem operations. Each operation SHALL independently specify its target database, container, partition key, and document id, allowing reads from any combination of partitions and containers within the same database account. | ||
|
|
||
| #### Scenario: Operations target different partition keys | ||
|
|
||
| - **WHEN** a `DistributedReadTransaction` is built with ReadItem operations targeting different partition keys within the same container | ||
| - **THEN** `CommitTransactionAsync` SHALL return all items as a consistent snapshot | ||
|
|
||
| #### Scenario: Operations target different containers | ||
|
|
||
| - **WHEN** a `DistributedReadTransaction` is built with ReadItem operations targeting different containers within the same database account | ||
| - **THEN** `CommitTransactionAsync` SHALL return all items as a consistent snapshot | ||
|
|
||
| ### Requirement: Per-item ETag is returned in the response | ||
|
|
||
| The response SHALL expose an ETag for each successfully read item, reflecting the document version that was read. | ||
|
|
||
| #### Scenario: ETag returned for each read item | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` completes with `IsSuccessStatusCode = true` | ||
| - **THEN** each operation's result SHALL carry a non-null `ETag` that can be used as `IfMatchEtag` on a subsequent `DistributedWriteTransaction` for conditional writes (optimistic CAS pattern) | ||
|
|
||
| ### Requirement: Conditional reads via IfNoneMatchEtag return 304 when unchanged | ||
|
|
||
| Each ReadItem operation in a `DistributedReadTransaction` SHALL support an optional `IfNoneMatchETag` precondition. If the document's current ETag matches, the server SHALL return 304 Not Modified for that operation with no response body. | ||
|
|
||
| #### Scenario: Document is unchanged — 304 returned | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` is called and one operation specifies an `IfNoneMatchETag` that matches the document's current ETag | ||
| - **THEN** that operation's result SHALL carry `StatusCode = 304` and a null resource body | ||
|
|
||
| #### Scenario: Document has changed — full response returned | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` is called and one operation specifies an `IfNoneMatchETag` that does not match the document's current ETag | ||
| - **THEN** that operation's result SHALL carry `StatusCode = 200` and the full document body | ||
|
|
||
|
|
||
|
|
||
| `DistributedReadTransaction` SHALL execute all staged ReadItem operations against a single consistent server-side snapshot, guaranteeing that no two results reflect different logical instants (no read skew). | ||
|
|
||
| #### Scenario: All items read from the same snapshot | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` is called with multiple ReadItem operations targeting different partitions or containers | ||
| - **THEN** the service SHALL return all items as they existed at the same logical point in time | ||
| - **AND** the SDK SHALL return a `DistributedTransactionResponse` with `IsSuccessStatusCode = true` and `StatusCode = 200` | ||
|
|
||
| #### Scenario: One item not found | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` is called and one of the staged items does not exist in the container | ||
| - **THEN** the SDK SHALL return a `DistributedTransactionResponse` with `StatusCode = 207` | ||
| - **AND** the missing item's result SHALL carry `StatusCode = 404` | ||
| - **AND** all other items' results SHALL carry their actual response status codes | ||
|
|
||
| ### Requirement: Session tokens are merged into ISessionContainer after execution | ||
|
|
||
| After execution completes, the SDK SHALL merge per-operation session tokens from the response into the client's `ISessionContainer`. | ||
|
|
||
| #### Scenario: Session tokens merged after successful execution | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` completes and the response contains per-operation `sessionToken` values | ||
| - **THEN** the SDK SHALL merge each response token into `ISessionContainer` for the corresponding partition | ||
|
|
||
| ### Requirement: Read transactions are not supported in Direct mode | ||
|
|
||
| The SDK SHALL route every `CommitTransactionAsync` request through the Gateway or Thin Client endpoint. Direct mode is not supported for distributed transactions. | ||
|
|
||
| #### Scenario: Direct mode client executes a read transaction | ||
|
|
||
| - **WHEN** the `CosmosClient` is configured with Direct connectivity mode | ||
| - **THEN** the SDK SHALL override connectivity to Gateway mode and route `CommitTransactionAsync` to the Gateway read-transaction endpoint | ||
|
|
||
| ### Requirement: Read transactions are idempotent and safe to retry unconditionally | ||
|
|
||
| `CommitTransactionAsync` SHALL be safe to retry on any transient failure without risk of duplicate side effects. No idempotency token is required. | ||
|
|
||
| #### Scenario: Network failure triggers automatic retry | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` experiences a transient failure (408 Request Timeout or 503 Service Unavailable) | ||
| - **THEN** the SDK SHALL automatically retry the request | ||
|
|
||
| ### Requirement: CommitTransactionAsync is blocked on accounts with multiple write regions | ||
|
|
||
| `CommitTransactionAsync` SHALL throw `NotSupportedException` if the account is configured with multiple write regions. | ||
|
|
||
| #### Scenario: Multi-write-region account | ||
|
|
||
| - **WHEN** the Cosmos DB account has `UseMultipleWriteLocations = true` | ||
| - **THEN** `CommitTransactionAsync` SHALL throw `NotSupportedException` before sending any request | ||
|
|
||
| ### Requirement: Consistency level controls snapshot freshness | ||
|
|
||
| The `ConsistencyLevel` set in `DistributedReadTransactionRequestOptions` SHALL determine how fresh the server-side snapshot is. | ||
|
|
||
| #### Scenario: Strong consistency provides globally latest snapshot | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` is called with `ConsistencyLevel = Strong` | ||
| - **THEN** the service SHALL return all items from the globally latest committed version with no stale reads | ||
|
|
||
| #### Scenario: Session consistency provides per-partition monotonic reads | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` is called with `ConsistencyLevel = Session` and per-operation session tokens are populated from the client's `ISessionContainer` | ||
| - **THEN** the service SHALL return items that are at least as fresh as the provided session token for each partition | ||
|
|
||
| ### Requirement: Operation count is validated before the wire call | ||
|
|
||
| The SDK SHALL reject a `CommitTransactionAsync` call before dispatch if the staged operations exceed the supported limit. | ||
|
|
||
| #### Scenario: Exceeds maximum operation count | ||
|
|
||
| - **WHEN** more than 100 ReadItem operations have been staged on the transaction | ||
| - **THEN** `CommitTransactionAsync` SHALL throw `ArgumentException` before sending any request | ||
|
|
||
| ### Requirement: Throttled requests are automatically retried | ||
|
|
||
| When the Gateway returns 429 (Too Many Requests), the SDK SHALL automatically retry after the `x-ms-retry-after-ms` interval. | ||
|
|
||
| #### Scenario: 429 response triggers retry | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` receives a 429 response | ||
| - **THEN** the SDK SHALL wait for `x-ms-retry-after-ms` and retry the request | ||
|
|
||
| ### Requirement: CommitTransactionAsync is instrumented with OpenTelemetry | ||
|
|
||
| The SDK SHALL emit an OpenTelemetry span for each `CommitTransactionAsync` call. | ||
|
|
||
| #### Scenario: Successful execution span | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` completes successfully | ||
| - **THEN** the SDK SHALL emit a span with operation name `commit_distributed_read_transaction`, `db.cosmosdb.status_code = 200`, and `db.cosmosdb.request_charge` set to the total RU charge for all read operations | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,118 @@ | ||
| ## ADDED Requirements | ||
|
|
||
| ### Requirement: Supported write operations span multiple partitions and containers | ||
|
|
||
| `DistributedWriteTransaction` SHALL support Create, Replace, Delete, Upsert, and Patch operations. Each operation SHALL independently specify its target database, container, partition key, and document id, allowing operations to target any combination of partitions and containers within the same database account. | ||
|
|
||
| #### Scenario: Operations target different partition keys | ||
|
|
||
| - **WHEN** a `DistributedWriteTransaction` is built with operations targeting different partition keys within the same container | ||
| - **THEN** `CommitTransactionAsync` SHALL commit all operations atomically | ||
|
|
||
| #### Scenario: Operations target different containers | ||
|
|
||
| - **WHEN** a `DistributedWriteTransaction` is built with operations targeting different containers within the same database account | ||
| - **THEN** `CommitTransactionAsync` SHALL commit all operations atomically | ||
|
|
||
| ### Requirement: Per-operation ETag is returned on success | ||
|
|
||
| After a successful commit, the response SHALL expose an ETag for each write operation, reflecting the server-assigned version of the written document. | ||
|
|
||
| #### Scenario: ETag returned for each committed operation | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` completes with `IsSuccessStatusCode = true` | ||
| - **THEN** each operation's result SHALL carry a non-null `ETag` that can be used as `IfMatchEtag` on a subsequent `DistributedWriteTransaction` for optimistic concurrency | ||
|
|
||
| ### Requirement: Per-operation optimistic concurrency via IfMatchEtag | ||
|
|
||
| Each operation in a `DistributedWriteTransaction` SHALL support an `IfMatchEtag` precondition. If the document's current ETag does not match, the entire transaction SHALL be rolled back. | ||
|
|
||
| #### Scenario: ETag precondition fails on one operation | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🔴 Blocking · Correctness: Internal Contradiction Rolled-back operation status code: 424 here vs 453 in design.md This scenario says rolled-back operations carry 424 is a standard WebDAV status code. 453 is a custom Cosmos DB code. These are fundamentally different values — an implementer cannot satisfy both. Suggestion: Align on whichever status code the server actually returns. Update either this spec or the design doc. If the server returns 453, update this spec to use |
||
|
|
||
| - **WHEN** `CommitTransactionAsync` is called and one operation's `IfMatchEtag` does not match the document's current ETag | ||
| - **THEN** the service SHALL roll back all operations and the failed operation's result SHALL carry `StatusCode = 412 Precondition Failed` | ||
| - **AND** all other operations SHALL carry `StatusCode = 424 Failed Dependency` | ||
|
|
||
|
|
||
|
|
||
| `DistributedWriteTransaction` SHALL atomically commit all staged write operations across any partitions and containers within the same database account. If any operation fails, the service SHALL roll back all operations in the transaction. | ||
|
|
||
| #### Scenario: All operations succeed | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` is called and all operations are valid and commit successfully | ||
| - **THEN** the SDK SHALL return a `DistributedTransactionResponse` with `IsSuccessStatusCode = true` and `StatusCode = 200` | ||
|
|
||
| #### Scenario: One operation fails — all are rolled back | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` is called and at least one operation fails (e.g., ETag mismatch, conflict) | ||
| - **THEN** the service SHALL roll back all operations and the SDK SHALL return a `DistributedTransactionResponse` with `IsSuccessStatusCode = false` | ||
| - **AND** the failed operation's result SHALL carry the specific failure status code (e.g., 412 Precondition Failed, 409 Conflict) | ||
| - **AND** all other operations SHALL carry `StatusCode = 424 Failed Dependency` | ||
|
|
||
| ### Requirement: Session tokens are merged into ISessionContainer after commit | ||
|
|
||
| After a commit completes (success or failure), the SDK SHALL merge per-operation session tokens from the response into the client's `ISessionContainer`. | ||
|
|
||
| #### Scenario: Session tokens merged after commit | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` completes and the response contains per-operation `sessionToken` values | ||
| - **THEN** the SDK SHALL call `ISessionContainer.SetSessionToken` for each operation's response token so that subsequent reads on any affected partition observe the committed writes | ||
|
|
||
| ### Requirement: Commits are idempotent via a per-call idempotency token | ||
|
|
||
| Each `CommitTransactionAsync` call SHALL auto-generate a unique idempotency token and send it as `x-ms-idempotency-token`. The token SHALL be exposed on the response so callers can safely replay a commit whose outcome is unknown. | ||
|
|
||
| #### Scenario: Retry with the same idempotency token | ||
|
|
||
| - **WHEN** a caller retries `CommitTransactionAsync` using the `IdempotencyToken` from a previous response | ||
| - **THEN** the service SHALL return the same result as the original commit without applying the operations a second time | ||
|
|
||
| ### Requirement: Write transactions are not supported in Direct mode | ||
|
|
||
| The SDK SHALL route every `CommitTransactionAsync` request through the Gateway or Thin Client endpoint. Direct mode is not supported for distributed transactions. | ||
|
|
||
| #### Scenario: Direct mode client commits a write transaction | ||
|
|
||
| - **WHEN** the `CosmosClient` is configured with Direct connectivity mode | ||
| - **THEN** the SDK SHALL override connectivity to Gateway mode and route `CommitTransactionAsync` to the Gateway write-transaction endpoint | ||
|
|
||
| ### Requirement: CommitTransactionAsync is blocked on accounts with multiple write regions | ||
|
|
||
| `CommitTransactionAsync` SHALL throw `NotSupportedException` if the account is configured with multiple write regions. | ||
|
|
||
| #### Scenario: Multi-write-region account | ||
|
|
||
| - **WHEN** the Cosmos DB account has `UseMultipleWriteLocations = true` | ||
| - **THEN** `CommitTransactionAsync` SHALL throw `NotSupportedException` before sending any request | ||
|
|
||
| ### Requirement: Operation count and payload size are validated before the wire call | ||
|
|
||
| The SDK SHALL reject a `CommitTransactionAsync` call before dispatch if the staged operations exceed the supported limits. | ||
|
|
||
| #### Scenario: Exceeds maximum operation count | ||
|
|
||
| - **WHEN** more than 100 operations have been staged on the transaction | ||
| - **THEN** `CommitTransactionAsync` SHALL throw `ArgumentException` before sending any request | ||
|
|
||
| #### Scenario: Exceeds maximum payload size | ||
|
|
||
| - **WHEN** the total serialized request body would exceed 2 MB | ||
| - **THEN** `CommitTransactionAsync` SHALL throw `ArgumentException` before sending any request | ||
|
|
||
| ### Requirement: Throttled requests are automatically retried | ||
|
|
||
| When the Gateway returns 429 (Too Many Requests), the SDK SHALL automatically retry after the `x-ms-retry-after-ms` interval, up to the configured maximum retry count. | ||
|
|
||
| #### Scenario: 429 response triggers retry | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` receives a 429 response | ||
| - **THEN** the SDK SHALL wait for `x-ms-retry-after-ms` and retry the commit with the same idempotency token | ||
|
|
||
| ### Requirement: CommitTransactionAsync is instrumented with OpenTelemetry | ||
|
|
||
| The SDK SHALL emit an OpenTelemetry span for each `CommitTransactionAsync` call. | ||
|
|
||
| #### Scenario: Successful commit span | ||
|
|
||
| - **WHEN** `CommitTransactionAsync` completes successfully | ||
| - **THEN** the SDK SHALL emit a span with `db.cosmosdb.operation_type = "CommitDistributedTransaction"`, `db.cosmosdb.status_code = 200`, and `db.cosmosdb.request_charge` set to the total RU charge for the transaction | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 Blocking · Correctness: Internal Contradiction
503 Service Unavailable is not in the design.md retry table
This scenario references '408 Request Timeout or 503 Service Unavailable' as retryable failures. However, the comprehensive retry table in
design.mddoes not include 503 at all. The table lists 408, 449/5352, 429/3200, and 500 with sub-statuses 5411-5413 — but no 503.Either 503 is a valid Gateway response that's missing from the retry table, or this scenario cites a non-existent response code. An implementer following the design retry table would not implement 503 retry; an implementer following this spec would.
Suggestion: Add 503 to the design.md retry table if the Gateway can return it, or replace 503 here with the actual retryable codes from the table (408, 449, 500/5411-5413).