diff --git a/openspec/changes/distributed-transactions/.openspec.yaml b/openspec/changes/distributed-transactions/.openspec.yaml new file mode 100644 index 0000000000..e49efd11cd --- /dev/null +++ b/openspec/changes/distributed-transactions/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-04-10 diff --git a/openspec/changes/distributed-transactions/design.md b/openspec/changes/distributed-transactions/design.md new file mode 100644 index 0000000000..2a63256c7a --- /dev/null +++ b/openspec/changes/distributed-transactions/design.md @@ -0,0 +1,312 @@ +## Context + +### Current transaction boundary + +`TransactionalBatch` provides ACID atomicity within one logical partition but requires every operation to share the same `PartitionKey`. No SDK today exposes a cross-partition atomicity primitive. + +### Current multi-item read path + +`ReadManyItemsAsync` fans out parallel point-reads per partition range and collects results with `Task.WhenAll`. There is no snapshot guarantee across partitions — each read resolves independently, leaving the result set open to read skew. + +### Server-side capabilities (confirmed with service team) + +The Cosmos DB Gateway exposes a single HTTP endpoint for distributed transactions: + +`POST /operations/dtc` + +Both write and read transactions use this endpoint. The `operationType` field in the JSON request body distinguishes them — `"Write"` for write transactions and `"Read"` for read transactions. The endpoint returns `200` (committed, with per-operation results) or `452` (aborted, with per-operation results). All other outcomes (`408`, `449`, `429`, `500`, `400`) are returned with an empty body. + +### SDK routing infrastructure + +```mermaid +flowchart TD + A["CosmosClient
CreateDistributedWriteTransaction() / CreateDistributedReadTransaction()"] + A --> B["RequestInvokerHandler (handler pipeline)"] + + B --> C{Connectivity Mode} + + C -- "Direct mode" --> D["Force Gateway
UseGatewayMode = true"] + C -- "Gateway mode" --> GW["GatewayStoreModel
· Strips x-ms-session-token
· CommitDistributedTransaction → POST"] + C -- "Thin Client mode" --> TC["ThinClientStoreModel"] + + D --> GW + + GW --> EP["GatewayStoreClient
POST /operations/dtc"] + TC --> EP + + EP --> COORD["Cosmos DB Gateway
DTC Coordinator"] +``` + +The SDK dispatches all requests through the `RequestInvokerHandler` handler pipeline. For distributed transactions: + +- **Direct mode** clients have `UseGatewayMode = true` applied automatically, redirecting the request to `GatewayStoreModel`. +- **Gateway mode** and **Thin Client mode** reach the coordinator natively. +- `GatewayStoreModel` strips the outbound `x-ms-session-token` header for all DTC requests (session token management must not interfere with the coordinator) and maps `OperationType.CommitDistributedTransaction` → `HttpMethod.Post`. +- The endpoint URL (`/operations/dtc`) is set directly on the `RequestMessage` by the committer, not derived from `OperationType`. + +## Goals / Non-Goals + +**Goals:** +- Cross-partition, cross-container atomic write transactions within the same database account +- Cross-partition, cross-container consistent snapshot read transactions within the same database account +- Per-item ETag output on both write and read responses +- Session-consistency: session tokens merged into `ISessionContainer` after every `CommitTransactionAsync` call (write and read) +- Idempotent write commits (auto-generated idempotency token per `CommitTransactionAsync` call) +- 429 auto-retry; OpenTelemetry tracing; client-side op-count/size guards +- Cross-language API parity (.NET, Java, Python, JavaScript, Go) + +**Non-Goals:** +- Multi-write-region (multi-master) accounts — addressed in a separate spec +- Combined read-write in a single transaction call — use `DistributedReadTransaction` + `DistributedWriteTransaction` with ETag CAS instead +- Cross-account transactions +- Query (non-point-read) operations within a read transaction +- Direct mode connectivity — only Gateway and Thin Client modes are supported +- Analytical store + +## Decisions + +### 1. Factory on CosmosClient, not Container + +**Decision:** `CreateDistributedWriteTransaction()` and `CreateDistributedReadTransaction()` are methods on `CosmosClient`. + +**Rationale:** A distributed transaction spans multiple containers within the same database account; there is no single "home" container. The factory on `CosmosClient` matches the scope of the operation. Methods are `virtual` to support mocking. + +### 2. No Direct mode — Gateway and Thin Client modes supported + +**Decision:** Both `DistributedWriteTransactionCore` and `DistributedReadTransactionCore` force the request away from Direct mode by setting `requestMessage.UseGatewayMode = true` before dispatch. Gateway mode and Thin Client mode are both supported. + +**Rationale:** The distributed transaction coordinator lives in the Gateway tier. Direct mode bypasses the Gateway and connects to individual replica nodes, making the coordinator unreachable. Thin Client mode routes through the Gateway tier and is therefore compatible. The override is enforced unconditionally — no per-call option to bypass it. + +### 3. Idempotency token per commit (write only) + +**Decision:** A new `Guid.NewGuid()` is generated at the start of each `CommitTransactionAsync()` call and sent as `x-ms-idempotency-token`. The token is exposed on the response (`DistributedTransactionResponse.IdempotencyToken`) so callers can replay a commit whose outcome is unknown. + +**Rationale:** Distributed writes are not inherently idempotent. A unique token per call lets the server detect and deduplicate replayed commits. Pure reads are inherently idempotent; no token is needed for `CommitTransactionAsync()` on read transactions. + +### 4. Session token handling — write transactions + +**Decision:** Before dispatch, populate each operation's `sessionToken` field from `ISessionContainer` for its partition. After commit, merge per-operation `sessionToken` values from `operationResponses` into `ISessionContainer`. + +**Rationale:** Session consistency requires the client to track the latest LSN seen per partition. Populating inbound tokens ensures the commit does not proceed at a stale LSN. Merging post-commit tokens ensures subsequent reads see the committed writes. + +### 5. Session token handling — read transactions + +**Decision:** Before `CommitTransactionAsync()` dispatch, populate each operation's `sessionToken` from `ISessionContainer`. After execution, merge response session tokens back. + +**Rationale:** For reads, the inbound token ensures the snapshot is at least as fresh as the client's last seen write on each partition. Merging post-read tokens keeps `ISessionContainer` up to date for future operations. + +### 6. Single wire endpoint for both write and read transactions + +**Decision:** Both write and read transactions use `POST /operations/dtc`, `OperationType.CommitDistributedTransaction`, `ResourceType.DistributedTransactionBatch`. The `operationType` field in the JSON request body (`"Write"` or `"Read"`) is what distinguishes them on the server side. + +**Rationale — why one endpoint:** The coordinator executes the same two-phase protocol (Prepare → Commit/Abort) for both transaction types. Separating them into two endpoints would duplicate Gateway routing logic with no behavioral benefit. + +**Rationale — why POST for reads:** HTTP GET is semantically *safe* (no server-side state changes) and does not support a request body. Read transactions violate both constraints: (1) the coordinator writes a ledger record during the Prepare phase, and (2) the operation list (item IDs, partition keys, container RIDs, session tokens) is a structured payload that cannot fit in a URL. POST is also consistent with how Cosmos DB handles all other complex read operations — SQL queries, `ReadManyItemsAsync`, and `TransactionalBatch` all use POST. + +**Rationale — why reuse `OperationType.CommitDistributedTransaction`:** `OperationType` drives three routing decisions in the SDK — HTTP method, session-token stripping, and URL. All three are already correct for read transactions: +- **HTTP method**: `GatewayStoreClient` and `RequestInvokerHandler` will map `CommitDistributedTransaction` → `HttpMethod.Post`. +- **Session-token stripping**: `GatewayStoreModel` will match `CommitDistributedTransaction` + `DistributedTransactionBatch` — read transactions inherit this suppression automatically. +- **Endpoint URL**: set on the `RequestMessage` by the read committer directly, not derived from `OperationType`. + +A dedicated `CommitDistributedReadTransaction` enum value would drive zero new behavior. The distinction is expressed entirely through the request body. + +### 7. Single shared response type for both write and read transactions + +**Decision:** Both `DistributedWriteTransaction.CommitTransactionAsync()` and `DistributedReadTransaction.CommitTransactionAsync()` return `DistributedTransactionResponse`. `IdempotencyToken` is `Guid?` — set to the auto-generated GUID for write commits, `null` for read commits. `DistributedTransactionOperationResult` is likewise shared. + +**Rationale:** The per-operation result shape is identical for reads and writes (`StatusCode`, `ETag`, `SessionToken`, `RequestCharge`, `ResourceStream`). The only top-level difference is `IdempotencyToken`, which does not justify a separate type — it is simply not applicable for reads. This follows the `TransactionalBatch` precedent where `TransactionalBatchResponse` is reused for all operations including `ReadItem`. A single type reduces the public API surface, eliminates duplicated parsing logic, and makes it easier for callers who handle both response types uniformly. + +### 8. No combined read-write transaction in v1 + +**Decision:** `DistributedReadTransaction` is a separate class from `DistributedWriteTransaction`. There is no API to atomically read-then-write in a single call. + +**Rationale:** Combined read-write transactions require pessimistic locking or MVCC conflict detection at commit time — both add substantial server complexity. The ETag CAS pattern (read snapshot → compute new state → write with `IfMatchEtag`) achieves optimistic isolation with no new server machinery and handles the common case of low contention efficiently. + +### 9. Consistency level for read transactions + +**Decision:** `DistributedReadTransactionRequestOptions.ConsistencyLevel` overrides the account default. Strong or BoundedStaleness provides a true cross-partition snapshot; Session provides per-partition monotonic reads but no global snapshot. + +**Rationale:** The consistency level maps directly to the request header forwarded to the Gateway. Callers needing read-skew elimination must use Strong or BoundedStaleness; this is documented prominently in the API. + +### 10. Single write-region scope + +**Decision:** Both transaction types throw `NotSupportedException` if the account has multiple write regions. + +**Rationale:** The Gateway coordinator's behavior on multi-write-region accounts is not validated. The guard prevents silent incorrect behavior until multi-master support is specified and tested (separate spec). + +### 11. Terminal response shape — 200 (committed) and 452 (aborted) + +**Decision:** The SDK maps the two terminal envelope status codes as follows: +- **200** → `DistributedTransactionResponse.IsSuccessStatusCode = true`; `operationResponses` contains per-operation results as returned from the prepare phase (Create → 201, Replace/Upsert → 200, Delete → 204). +- **452** → `DistributedTransactionResponse.IsSuccessStatusCode = false`; `operationResponses` contains: **453 / sub-status 5415** (`DtcOperationRolledBack`) for any operation that voted Yes and was rolled back, and the **original error code** (e.g. 409, 412) for whichever operation(s) voted No and caused the abort. + +`207 Multi-Status` is never returned by the Gateway for distributed transactions. + +**Rationale:** The coordinator drives each transaction to one of exactly two terminal outcomes before responding to the SDK. The 200/452 split gives callers a single `IsSuccessStatusCode` check for the aggregate result; per-operation `StatusCode` on the 452 body identifies the specific operation(s) that caused the abort and which were collateral rollbacks. + +## Wire Contract + +Both write and read transactions POST to the same Gateway endpoint. The `operationType` field at the envelope level distinguishes them. + +``` +POST https://.documents.azure.com/operations/dtc +``` + +### Request headers + +| Header | Required | Notes | +|---|---|---| +| `Content-Type` | Yes | `application/json` | +| `x-ms-idempotency-token` | Write only | `Guid` generated once per `CommitTransactionAsync` call; reused on retries | +| `x-ms-consistency-level` | No | Overrides account default; relevant for read transactions | +| `x-ms-session-token` | **Never sent** | Stripped by `GatewayStoreModel`; per-operation session tokens are embedded in the request body instead | + +### Request body + +The envelope `operationType` field (`"Write"` or `"Read"`) selects coordinator behavior. Each element of `operations` carries its own `operationType` matching the write verb or `"Read"`. + +**Write transaction** + +```json +{ + "operationType": "Write", + "operations": [ + { + "operationType": "Create" | "Replace" | "Upsert" | "Delete" | "Patch", + "databaseRid": "", + "containerRid": "", + "partitionKey": "", + "id": "", + "resourceBody": { }, + "ifMatchEtag": "", + "sessionToken": "" + } + ] +} +``` + +| Field | Write ops | Notes | +|---|---|---| +| `operationType` | Required | `"Create"`, `"Replace"`, `"Upsert"`, `"Delete"`, `"Patch"` | +| `databaseRid` | Required | Resolved by SDK via `GetCachedContainerPropertiesAsync` | +| `containerRid` | Required | Resolved by SDK via `GetCachedContainerPropertiesAsync` | +| `partitionKey` | Required | Serialized partition key value | +| `id` | Required | Document id | +| `resourceBody` | Create / Replace / Upsert / Patch | Omitted for Delete | +| `ifMatchEtag` | Optional | Optimistic concurrency precondition; transaction aborts with 412 if mismatch | +| `sessionToken` | Optional | Populated by SDK from `ISessionContainer` for this operation's partition | + +**Read transaction** + +```json +{ + "operationType": "Read", + "operations": [ + { + "operationType": "Read", + "databaseRid": "", + "containerRid": "", + "partitionKey": "", + "id": "", + "ifNoneMatchEtag": "", + "sessionToken": "" + } + ] +} +``` + +| Field | Read ops | Notes | +|---|---|---| +| `operationType` | Required | Always `"Read"` | +| `databaseRid` | Required | Resolved by SDK | +| `containerRid` | Required | Resolved by SDK | +| `partitionKey` | Required | Serialized partition key value | +| `id` | Required | Document id | +| `ifNoneMatchEtag` | Optional | Returns 304 for that operation if ETag matches; no resource body | +| `sessionToken` | Optional | Populated by SDK from `ISessionContainer` for this operation's partition | + +### Response headers + +| Header | Present on | Notes | +|---|---|---| +| `x-ms-request-charge` | All responses | Total RU charge aggregated across all coordinator phases | +| `x-ms-activity-id` | All responses | Gateway-assigned correlation id for diagnostics | +| `Retry-After` | 449 only | Backoff in seconds before the SDK should retry | + +### Response body — 200 (committed) and 452 (aborted) + +Both terminal status codes include a 1:1 `operationResponses` array. Non-terminal responses (408, 449, 429, 400, 500) have an empty body. + +```json +{ + "operationResponses": [ + { + "index": 0, + "statusCode": 201, + "subStatusCode": 0, + "eTag": "\"00000000-0000-0000-0000-000000000000\"", + "sessionToken": "", + "requestCharge": 3.5, + "resourceBody": { } + } + ] +} +``` + +| Field | Notes | +|---|---| +| `index` | Zero-based position matching the request `operations` array | +| `statusCode` | HTTP status for this operation. On 200: per-verb result (Create → 201, Replace/Upsert → 200, Delete → 204, Read → 200). On 452: 453 (`DtcOperationRolledBack`, sub-status 5415) for rolled-back ops; original error code (e.g. 409, 412) for the operation(s) that caused the abort | +| `subStatusCode` | 0 on success; 5415 for rolled-back operations on 452 | +| `eTag` | Server-assigned version of the written or read document; null for Delete and rolled-back ops | +| `sessionToken` | Latest LSN for this operation's partition; merged into `ISessionContainer` by the SDK after commit | +| `requestCharge` | Per-operation RU charge from the prepare phase | +| `resourceBody` | Document body; present for Create/Replace/Upsert/Read; absent for Delete, 304, and error ops | + +## Retry Policy + +The coordinator exhausts its own internal retries before surfacing an error to the SDK. When the SDK does receive a non-200/non-452 status, it means the coordinator itself could not resolve the situation. The table below defines the SDK-side retry behavior for each possible envelope response. + +### Envelope status retry table + +| HTTP | Sub-Status | Meaning | SDK retry | Notes | +|---|---|---|---|---| +| **200** | 0 | Committed | No — return to caller | Success terminal state | +| **452** | 0 | Aborted | No — return to caller | App must inspect per-op results; some aborts are fatal (see below) | +| **408** | 0 | Coordinator retries exhausted (stuck) | Yes — automatic | Coordinator could not make progress; SDK retry may succeed on a fresh coordinator | +| **449** | 5352 | Coordinator race conflict | Yes — honor `Retry-After` header | Two coordinators raced on the ledger ETag; SDK retry resolves by re-submitting | +| **429** | 3200 | Ledger throttled, coordinator retries exhausted | Yes — exponential backoff | SDK retry after backoff | +| **500** | 5411 | LedgerFailure | Yes — automatic | Infrastructure transient | +| **500** | 5412 | AccountConfigFailure | Yes — automatic | Infrastructure transient | +| **500** | 5413 | DispatchFailure | Yes — automatic | Infrastructure transient | +| **400** | 5405 | ParseFailure | No | Permanent — malformed request body | +| **400** | 5406 | FeatureDisabled | No | Permanent — account/feature flag | +| **400** | 5407 | MaxOpsExceeded | No | Permanent — reduce operation count | +| **400** | 5408 | MissingIdempotencyToken | No | Permanent — SDK bug if raised | +| **400** | 5409 | InvalidAccountName | No | Permanent | +| **400** | 5410 | InvalidOperation | No | Permanent | + +### Retry key: write transactions + +When the SDK retries a `CommitTransactionAsync` call on a write transaction it **reuses the same idempotency token** generated at the start of the original call. This allows the coordinator to recognise a replayed commit and return the already-committed result rather than re-executing the transaction. The idempotency token is never regenerated mid-retry loop. + +### 452 Aborted — application-dependent handling + +452 is not automatically retried because the cause determines whether a retry is safe: + +- **Transient abort** (e.g., 409 on a row that has since been updated) — application may choose to re-read and retry with a new `DistributedWriteTransaction`. +- **Fatal abort** (e.g., 400-range semantic conflict, schema violation) — retrying would produce the same result. + +The SDK surfaces the full 452 response including per-operation codes so the application can make this determination. The SDK does NOT silently reset to Preparing on a 452. + +### Read transactions + +Read transactions are inherently idempotent. The SDK retries on 408, 449, 429, and 500 sub-statuses unconditionally, with no idempotency token. + +## Risks / Trade-offs + +- **[Risk] Server-side read transaction support** — the `operationType: "Read"` path in `POST /operations/dtc` is a new coordinator capability. Service team confirmation of snapshot semantics, wire contract, and delivery timeline is required before SDK implementation begins. +- **[Risk] Abort / rollback for write transactions** — `AbortTransactionAsync` cannot be implemented until the service team confirms and ships a corresponding abort endpoint. Network failures rely on server-side timeout rollback in the interim; callers must be documented that the abort API is not available in v1. +- **[Trade-off] No Direct mode support** — clients in Direct mode are silently upgraded to Gateway mode for DTC requests, adding one Gateway round-trip. Accepted because the coordinator must see all operations atomically; Direct mode bypasses the coordinator. Thin Client mode is unaffected. +- **[Trade-off] ETag CAS instead of combined read-write transaction** — callers make two network calls (read + conditional write). Accepted because combined read-write transactions carry high server complexity relative to the benefit for common workloads. +- **[Risk] Read transaction 304 optimization** — `ifNoneMatchETag` conditional reads returning 304 require server support in the new endpoint; this may not be available at GA and should be treated as a stretch goal. diff --git a/openspec/changes/distributed-transactions/proposal.md b/openspec/changes/distributed-transactions/proposal.md new file mode 100644 index 0000000000..14cb1ed43b --- /dev/null +++ b/openspec/changes/distributed-transactions/proposal.md @@ -0,0 +1,38 @@ +## Why + +Azure Cosmos DB provides full ACID transactions within a single logical partition via `TransactionalBatch` and stored procedures. Applications that need atomicity across multiple partitions or containers must hand-roll compensating patterns (Saga, event sourcing) at significant complexity cost. Two gaps exist today within a single database account: + +1. **Cross-partition write atomicity** — there is no API to atomically commit mutations spanning multiple partition keys or containers. Any scenario (ledger transfers, order+inventory, multi-entity state machines) requires application-level compensating logic that is error-prone and non-portable across languages. + +2. **Consistent cross-partition reads** — `ReadManyItemsAsync` fans out parallel queries per partition range with no snapshot guarantee, making it vulnerable to read skew: two items returned in the same call can reflect different logical instants (e.g., one item seen before a concurrent transfer and another after it). + +The Cosmos DB Gateway exposes a distributed transaction coordinator at `POST /operations/dtc`. The same endpoint handles both atomic writes and consistent snapshot reads — the `operationType` field in the JSON request body (`"Write"` or `"Read"`) selects the behavior. + +## What Changes + +- Add `CosmosClient.CreateDistributedWriteTransaction()` returning a new `DistributedWriteTransaction` builder. +- Add `CosmosClient.CreateDistributedReadTransaction()` returning a new `DistributedReadTransaction` builder. +- `DistributedWriteTransaction` accumulates Create, Replace, Delete, Upsert, and Patch operations across any partitions and containers within the same account, then commits them atomically via `CommitTransactionAsync()`. +- `DistributedReadTransaction` accumulates ReadItem operations across any partitions and containers, then executes them as a single consistent server-side snapshot via `CommitTransactionAsync()`. +- New wire operations: `CommitDistributedTransaction` (reused for both write and read transactions; the distinction is in the request body). +- Session tokens are merged into the client's `ISessionContainer` after every commit or execute. +- Per-operation ETags are returned in both write and read responses, enabling ETag-gated follow-up writes. + +## Capabilities + +### New Capabilities + +- `distributed-write-transaction`: Adds `CosmosClient.CreateDistributedWriteTransaction()`, `DistributedWriteTransaction`, `DistributedTransactionResponse`, and related types for cross-partition, cross-container atomic writes within the same database account. +- `distributed-read-transaction`: Adds `CosmosClient.CreateDistributedReadTransaction()`, `DistributedReadTransaction`, and related types for cross-partition, cross-container consistent snapshot reads within the same database account. Reuses `DistributedTransactionResponse` as the shared response type for both write and read transactions. + +### Modified Capabilities + +- `client-and-configuration`: `CosmosClient` gains two new factory methods: `CreateDistributedWriteTransaction()` and `CreateDistributedReadTransaction()`. + +## Impact + +- **Public API surface**: New abstract classes `DistributedWriteTransaction` and `DistributedReadTransaction`; a single shared `DistributedTransactionResponse` response type for both (with `IdempotencyToken` as `Guid?`, non-null for writes, null for reads); new options classes; two new factory methods on `CosmosClient`. Initially behind `#if INTERNAL`; promoted to public at GA. +- **Wire protocol**: Uses `POST /operations/dtc` for reads and writes. +- **Gateway routing**: `GatewayStoreModel`, `GatewayStoreClient`, and `RequestInvokerHandler` gain awareness of both operation types; `x-ms-session-token` is suppressed for both (per-partition tokens travel in the body). +- **Existing behavior**: No change for any existing API. Both features are entirely additive. +- **Tests**: New unit tests (serialization, response parsing, session token merge) and emulator integration tests for both write and read transactions. diff --git a/openspec/changes/distributed-transactions/specs/distributed-read-transaction/spec.md b/openspec/changes/distributed-transactions/specs/distributed-read-transaction/spec.md new file mode 100644 index 0000000000..c47ab17a26 --- /dev/null +++ b/openspec/changes/distributed-transactions/specs/distributed-read-transaction/spec.md @@ -0,0 +1,132 @@ +## ADDED Requirements + +### Requirement: Supported read operations span multiple partitions and containers + +`DistributedReadTransaction` SHALL support ReadItem operations. Each operation SHALL independently specify its target database, container, partition key, and document id, allowing reads from any combination of partitions and containers within the same database account. + +#### Scenario: Operations target different partition keys + +- **WHEN** a `DistributedReadTransaction` is built with ReadItem operations targeting different partition keys within the same container +- **THEN** `CommitTransactionAsync` SHALL return all items as a consistent snapshot + +#### Scenario: Operations target different containers + +- **WHEN** a `DistributedReadTransaction` is built with ReadItem operations targeting different containers within the same database account +- **THEN** `CommitTransactionAsync` SHALL return all items as a consistent snapshot + +### Requirement: Per-item ETag is returned in the response + +The response SHALL expose an ETag for each successfully read item, reflecting the document version that was read. + +#### Scenario: ETag returned for each read item + +- **WHEN** `CommitTransactionAsync` completes with `IsSuccessStatusCode = true` +- **THEN** each operation's result SHALL carry a non-null `ETag` that can be used as `IfMatchEtag` on a subsequent `DistributedWriteTransaction` for conditional writes (optimistic CAS pattern) + +### Requirement: Conditional reads via IfNoneMatchEtag return 304 when unchanged + +Each ReadItem operation in a `DistributedReadTransaction` SHALL support an optional `IfNoneMatchETag` precondition. If the document's current ETag matches, the server SHALL return 304 Not Modified for that operation with no response body. + +#### Scenario: Document is unchanged — 304 returned + +- **WHEN** `CommitTransactionAsync` is called and one operation specifies an `IfNoneMatchETag` that matches the document's current ETag +- **THEN** that operation's result SHALL carry `StatusCode = 304` and a null resource body + +#### Scenario: Document has changed — full response returned + +- **WHEN** `CommitTransactionAsync` is called and one operation specifies an `IfNoneMatchETag` that does not match the document's current ETag +- **THEN** that operation's result SHALL carry `StatusCode = 200` and the full document body + + + +`DistributedReadTransaction` SHALL execute all staged ReadItem operations against a single consistent server-side snapshot, guaranteeing that no two results reflect different logical instants (no read skew). + +#### Scenario: All items read from the same snapshot + +- **WHEN** `CommitTransactionAsync` is called with multiple ReadItem operations targeting different partitions or containers +- **THEN** the service SHALL return all items as they existed at the same logical point in time +- **AND** the SDK SHALL return a `DistributedTransactionResponse` with `IsSuccessStatusCode = true` and `StatusCode = 200` + +#### Scenario: One item not found + +- **WHEN** `CommitTransactionAsync` is called and one of the staged items does not exist in the container +- **THEN** the SDK SHALL return a `DistributedTransactionResponse` with `StatusCode = 207` +- **AND** the missing item's result SHALL carry `StatusCode = 404` +- **AND** all other items' results SHALL carry their actual response status codes + +### Requirement: Session tokens are merged into ISessionContainer after execution + +After execution completes, the SDK SHALL merge per-operation session tokens from the response into the client's `ISessionContainer`. + +#### Scenario: Session tokens merged after successful execution + +- **WHEN** `CommitTransactionAsync` completes and the response contains per-operation `sessionToken` values +- **THEN** the SDK SHALL merge each response token into `ISessionContainer` for the corresponding partition + +### Requirement: Read transactions are not supported in Direct mode + +The SDK SHALL route every `CommitTransactionAsync` request through the Gateway or Thin Client endpoint. Direct mode is not supported for distributed transactions. + +#### Scenario: Direct mode client executes a read transaction + +- **WHEN** the `CosmosClient` is configured with Direct connectivity mode +- **THEN** the SDK SHALL override connectivity to Gateway mode and route `CommitTransactionAsync` to the Gateway read-transaction endpoint + +### Requirement: Read transactions are idempotent and safe to retry unconditionally + +`CommitTransactionAsync` SHALL be safe to retry on any transient failure without risk of duplicate side effects. No idempotency token is required. + +#### Scenario: Network failure triggers automatic retry + +- **WHEN** `CommitTransactionAsync` experiences a transient failure (408 Request Timeout or 503 Service Unavailable) +- **THEN** the SDK SHALL automatically retry the request + +### Requirement: CommitTransactionAsync is blocked on accounts with multiple write regions + +`CommitTransactionAsync` SHALL throw `NotSupportedException` if the account is configured with multiple write regions. + +#### Scenario: Multi-write-region account + +- **WHEN** the Cosmos DB account has `UseMultipleWriteLocations = true` +- **THEN** `CommitTransactionAsync` SHALL throw `NotSupportedException` before sending any request + +### Requirement: Consistency level controls snapshot freshness + +The `ConsistencyLevel` set in `DistributedReadTransactionRequestOptions` SHALL determine how fresh the server-side snapshot is. + +#### Scenario: Strong consistency provides globally latest snapshot + +- **WHEN** `CommitTransactionAsync` is called with `ConsistencyLevel = Strong` +- **THEN** the service SHALL return all items from the globally latest committed version with no stale reads + +#### Scenario: Session consistency provides per-partition monotonic reads + +- **WHEN** `CommitTransactionAsync` is called with `ConsistencyLevel = Session` and per-operation session tokens are populated from the client's `ISessionContainer` +- **THEN** the service SHALL return items that are at least as fresh as the provided session token for each partition + +### Requirement: Operation count is validated before the wire call + +The SDK SHALL reject a `CommitTransactionAsync` call before dispatch if the staged operations exceed the supported limit. + +#### Scenario: Exceeds maximum operation count + +- **WHEN** more than 100 ReadItem operations have been staged on the transaction +- **THEN** `CommitTransactionAsync` SHALL throw `ArgumentException` before sending any request + +### Requirement: Throttled requests are automatically retried + +When the Gateway returns 429 (Too Many Requests), the SDK SHALL automatically retry after the `x-ms-retry-after-ms` interval. + +#### Scenario: 429 response triggers retry + +- **WHEN** `CommitTransactionAsync` receives a 429 response +- **THEN** the SDK SHALL wait for `x-ms-retry-after-ms` and retry the request + +### Requirement: CommitTransactionAsync is instrumented with OpenTelemetry + +The SDK SHALL emit an OpenTelemetry span for each `CommitTransactionAsync` call. + +#### Scenario: Successful execution span + +- **WHEN** `CommitTransactionAsync` completes successfully +- **THEN** the SDK SHALL emit a span with operation name `commit_distributed_read_transaction`, `db.cosmosdb.status_code = 200`, and `db.cosmosdb.request_charge` set to the total RU charge for all read operations diff --git a/openspec/changes/distributed-transactions/specs/distributed-write-transaction/spec.md b/openspec/changes/distributed-transactions/specs/distributed-write-transaction/spec.md new file mode 100644 index 0000000000..136e366766 --- /dev/null +++ b/openspec/changes/distributed-transactions/specs/distributed-write-transaction/spec.md @@ -0,0 +1,118 @@ +## ADDED Requirements + +### Requirement: Supported write operations span multiple partitions and containers + +`DistributedWriteTransaction` SHALL support Create, Replace, Delete, Upsert, and Patch operations. Each operation SHALL independently specify its target database, container, partition key, and document id, allowing operations to target any combination of partitions and containers within the same database account. + +#### Scenario: Operations target different partition keys + +- **WHEN** a `DistributedWriteTransaction` is built with operations targeting different partition keys within the same container +- **THEN** `CommitTransactionAsync` SHALL commit all operations atomically + +#### Scenario: Operations target different containers + +- **WHEN** a `DistributedWriteTransaction` is built with operations targeting different containers within the same database account +- **THEN** `CommitTransactionAsync` SHALL commit all operations atomically + +### Requirement: Per-operation ETag is returned on success + +After a successful commit, the response SHALL expose an ETag for each write operation, reflecting the server-assigned version of the written document. + +#### Scenario: ETag returned for each committed operation + +- **WHEN** `CommitTransactionAsync` completes with `IsSuccessStatusCode = true` +- **THEN** each operation's result SHALL carry a non-null `ETag` that can be used as `IfMatchEtag` on a subsequent `DistributedWriteTransaction` for optimistic concurrency + +### Requirement: Per-operation optimistic concurrency via IfMatchEtag + +Each operation in a `DistributedWriteTransaction` SHALL support an `IfMatchEtag` precondition. If the document's current ETag does not match, the entire transaction SHALL be rolled back. + +#### Scenario: ETag precondition fails on one operation + +- **WHEN** `CommitTransactionAsync` is called and one operation's `IfMatchEtag` does not match the document's current ETag +- **THEN** the service SHALL roll back all operations and the failed operation's result SHALL carry `StatusCode = 412 Precondition Failed` +- **AND** all other operations SHALL carry `StatusCode = 424 Failed Dependency` + + + +`DistributedWriteTransaction` SHALL atomically commit all staged write operations across any partitions and containers within the same database account. If any operation fails, the service SHALL roll back all operations in the transaction. + +#### Scenario: All operations succeed + +- **WHEN** `CommitTransactionAsync` is called and all operations are valid and commit successfully +- **THEN** the SDK SHALL return a `DistributedTransactionResponse` with `IsSuccessStatusCode = true` and `StatusCode = 200` + +#### Scenario: One operation fails — all are rolled back + +- **WHEN** `CommitTransactionAsync` is called and at least one operation fails (e.g., ETag mismatch, conflict) +- **THEN** the service SHALL roll back all operations and the SDK SHALL return a `DistributedTransactionResponse` with `IsSuccessStatusCode = false` +- **AND** the failed operation's result SHALL carry the specific failure status code (e.g., 412 Precondition Failed, 409 Conflict) +- **AND** all other operations SHALL carry `StatusCode = 424 Failed Dependency` + +### Requirement: Session tokens are merged into ISessionContainer after commit + +After a commit completes (success or failure), the SDK SHALL merge per-operation session tokens from the response into the client's `ISessionContainer`. + +#### Scenario: Session tokens merged after commit + +- **WHEN** `CommitTransactionAsync` completes and the response contains per-operation `sessionToken` values +- **THEN** the SDK SHALL call `ISessionContainer.SetSessionToken` for each operation's response token so that subsequent reads on any affected partition observe the committed writes + +### Requirement: Commits are idempotent via a per-call idempotency token + +Each `CommitTransactionAsync` call SHALL auto-generate a unique idempotency token and send it as `x-ms-idempotency-token`. The token SHALL be exposed on the response so callers can safely replay a commit whose outcome is unknown. + +#### Scenario: Retry with the same idempotency token + +- **WHEN** a caller retries `CommitTransactionAsync` using the `IdempotencyToken` from a previous response +- **THEN** the service SHALL return the same result as the original commit without applying the operations a second time + +### Requirement: Write transactions are not supported in Direct mode + +The SDK SHALL route every `CommitTransactionAsync` request through the Gateway or Thin Client endpoint. Direct mode is not supported for distributed transactions. + +#### Scenario: Direct mode client commits a write transaction + +- **WHEN** the `CosmosClient` is configured with Direct connectivity mode +- **THEN** the SDK SHALL override connectivity to Gateway mode and route `CommitTransactionAsync` to the Gateway write-transaction endpoint + +### Requirement: CommitTransactionAsync is blocked on accounts with multiple write regions + +`CommitTransactionAsync` SHALL throw `NotSupportedException` if the account is configured with multiple write regions. + +#### Scenario: Multi-write-region account + +- **WHEN** the Cosmos DB account has `UseMultipleWriteLocations = true` +- **THEN** `CommitTransactionAsync` SHALL throw `NotSupportedException` before sending any request + +### Requirement: Operation count and payload size are validated before the wire call + +The SDK SHALL reject a `CommitTransactionAsync` call before dispatch if the staged operations exceed the supported limits. + +#### Scenario: Exceeds maximum operation count + +- **WHEN** more than 100 operations have been staged on the transaction +- **THEN** `CommitTransactionAsync` SHALL throw `ArgumentException` before sending any request + +#### Scenario: Exceeds maximum payload size + +- **WHEN** the total serialized request body would exceed 2 MB +- **THEN** `CommitTransactionAsync` SHALL throw `ArgumentException` before sending any request + +### Requirement: Throttled requests are automatically retried + +When the Gateway returns 429 (Too Many Requests), the SDK SHALL automatically retry after the `x-ms-retry-after-ms` interval, up to the configured maximum retry count. + +#### Scenario: 429 response triggers retry + +- **WHEN** `CommitTransactionAsync` receives a 429 response +- **THEN** the SDK SHALL wait for `x-ms-retry-after-ms` and retry the commit with the same idempotency token + +### Requirement: CommitTransactionAsync is instrumented with OpenTelemetry + +The SDK SHALL emit an OpenTelemetry span for each `CommitTransactionAsync` call. + +#### Scenario: Successful commit span + +- **WHEN** `CommitTransactionAsync` completes successfully +- **THEN** the SDK SHALL emit a span with `db.cosmosdb.operation_type = "CommitDistributedTransaction"`, `db.cosmosdb.status_code = 200`, and `db.cosmosdb.request_charge` set to the total RU charge for the transaction diff --git a/openspec/changes/distributed-transactions/tasks.md b/openspec/changes/distributed-transactions/tasks.md new file mode 100644 index 0000000000..36c033243d --- /dev/null +++ b/openspec/changes/distributed-transactions/tasks.md @@ -0,0 +1,80 @@ +## 1. Setup + +- [ ] 1.1 Create a git worktree for the feature branch (e.g., `users//distributed-transactions`) to work in isolation from the main working directory + +## 2. Write Transaction — Core (already implemented, internal) + +- [x] 2.1 `DistributedTransaction` abstract base class (`src/DistributedTransaction/DistributedTransaction.cs`) +- [x] 2.2 `DistributedWriteTransaction` abstract class with 9 write operation methods (`src/DistributedTransaction/DistributedWriteTransaction.cs`) +- [x] 2.3 `DistributedWriteTransactionCore` sealed concrete implementation (`src/DistributedTransaction/DistributedWriteTransactionCore.cs`) +- [x] 2.4 `DistributedTransactionOperation` operation model (`src/DistributedTransaction/DistributedTransactionOperation.cs`) +- [x] 2.5 `DistributedTransactionRequestOptions` per-operation options class (`src/DistributedTransaction/DistributedTransactionRequestOptions.cs`) +- [x] 2.6 `DistributedTransactionConstants` routing helpers (`src/DistributedTransaction/DistributedTransactionConstants.cs`) +- [x] 2.7 `DistributedTransactionSerializer` JSON serializer for `POST /operations/dtc` body (`src/DistributedTransaction/DistributedTransactionSerializer.cs`) +- [x] 2.8 `DistributedTransactionServerRequest` request builder (`src/DistributedTransaction/DistributedTransactionServerRequest.cs`) +- [x] 2.9 `DistributedTransactionResponse` response parser with MultiStatus promotion and IDisposable (`src/DistributedTransaction/DistributedTransactionResponse.cs`) +- [x] 2.10 `DistributedTransactionOperationResult` per-operation result type (`src/DistributedTransaction/DistributedTransactionOperationResult.cs`) +- [x] 2.11 `DistributedTransactionCommitter` orchestrator: RID resolution → serialization → HTTP dispatch → session merge (`src/DistributedTransaction/DistributedTransactionCommitter.cs`) +- [x] 2.12 Gateway routing: `GatewayStoreModel`, `GatewayStoreClient`, `RequestInvokerHandler` updated for `CommitDistributedTransaction` +- [x] 2.13 `CosmosClient.CreateDistributedWriteTransaction()` entry point (`src/CosmosClient.cs`) +- [x] 2.14 Unit tests: serializer, response parser, committer (RID resolution, session merge) (`tests/Microsoft.Azure.Cosmos.Tests/DistributedTransaction/`) +- [x] 2.15 Emulator integration tests with mock DTC handler (`tests/Microsoft.Azure.Cosmos.EmulatorTests/DistributedTransaction/DistributedTransactionTests.cs`) + +## 3. Write Transaction — Gaps (not yet implemented) + +- [ ] 3.1 Populate outbound session tokens from `ISessionContainer` before dispatch in `DistributedTransactionCommitter` (each op's `sessionToken` field is currently always null before sending) +- [ ] 3.2 Add op-count and total-body-size guards in `DistributedWriteTransactionCore` (reject > 100 ops or > 2 MB before the wire call) +- [ ] 3.3 Wrap `CommitTransactionAsync` in 429-retry logic using the SDK's `RetryHandler` +- [ ] 3.4 Wrap `CommitTransactionAsync` in `OperationHelperAsync` with `OpenTelemetryConstants.Operations.CommitDistributedTransaction = "commit_distributed_transaction"`; add the constant to `OpenTelemetryConstants.cs` +- [ ] 3.5 Remove dead code: `DistributedTransactionRequest` class is never instantiated +- [ ] 3.6 Add emulator end-to-end tests against the real `/operations/dtc` endpoint (no mock handler) +- [ ] 3.7 Add single write-region guard: throw `NotSupportedException` if account has multiple write regions +- [ ] 3.8 Promote all write transaction types from `#if INTERNAL` to public; update API contract baseline with `UpdateContracts.ps1` +- [ ] 3.9 Public documentation and changelog entry + +## 4. Read Transaction — Core Infrastructure + +- [ ] 4.1 `DistributedReadTransaction` abstract class with `ReadItem()` and `CommitTransactionAsync()` methods (`src/DistributedTransaction/DistributedReadTransaction.cs`) +- [ ] 4.2 `DistributedReadTransactionRequestOptions` options class with `ConsistencyLevel`, `ReadConsistencyStrategy` (preview), and `SessionToken` (`src/DistributedTransaction/DistributedReadTransactionRequestOptions.cs`) +- [ ] 4.3 `DistributedReadTransactionOperation` model: `database`, `container`, `id`, `partitionKey`, `sessionToken`, `ifNoneMatchETag`, `index` (`src/DistributedTransaction/DistributedReadTransactionOperation.cs`) +- [ ] 4.4 `DistributedReadTransactionCore` sealed concrete implementation (`src/DistributedTransaction/DistributedReadTransactionCore.cs`) + +## 5. Read Transaction — Wire Protocol + +- [ ] 5.1 `DistributedReadTransactionSerializer`: serializes read ops to JSON body; omits `resourceBody`; adds `ifNoneMatchETag` (`src/DistributedTransaction/DistributedReadTransactionSerializer.cs`) +- [ ] 5.2 `DistributedReadTransactionServerRequest`: builds `RequestMessage` reusing `OperationType.CommitDistributedTransaction` and `ResourceType.DistributedTransactionBatch`; sets the read endpoint URL and `UseGatewayMode = true` to override Direct mode (Gateway and Thin Client modes are supported; Direct mode is not) (`src/DistributedTransaction/DistributedReadTransactionServerRequest.cs`) +- [ ] 5.3 Extend `DistributedTransactionResponse` for read transaction support: change `IdempotencyToken` from `Guid` to `Guid?` (null when returned from a read commit); add `GetItem(index)` typed deserialization helper on the shared response type (`src/DistributedTransaction/DistributedTransactionResponse.cs`) + +## 6. Read Transaction — Orchestration & Gateway Integration + +- [ ] 6.1 `DistributedReadTransactionCommitter`: RID resolution → outbound session token population → serialize → HTTP dispatch → parse response → merge session tokens (`src/DistributedTransaction/DistributedReadTransactionCommitter.cs`) +- [ ] 6.2 Verify no gateway routing changes are required: `GatewayStoreModel` session-token stripping and `GatewayStoreClient`/`RequestInvokerHandler` HTTP method selection already apply correctly via the reused `OperationType.CommitDistributedTransaction` +- [ ] 6.3 `CosmosClient.CreateDistributedReadTransaction()` entry point, guarded by `#if INTERNAL` (`src/CosmosClient.cs`) + +## 7. Read Transaction — Reliability & Observability + +- [ ] 7.1 Wrap `CommitTransactionAsync` in 429-retry logic (reads are safe to retry unconditionally) +- [ ] 7.2 Wrap `CommitTransactionAsync` in `OperationHelperAsync` with a new `OpenTelemetryConstants.Operations.CommitDistributedReadTransaction = "commit_distributed_read_transaction"` constant; add it to `OpenTelemetryConstants.cs` (OTel span names are independent of `OperationType`) +- [ ] 7.3 Add op-count and total-body-size guards (reject > 100 ops or > 2 MB before the wire call) +- [ ] 7.4 Add single write-region guard: throw `NotSupportedException` if account has multiple write regions + +## 8. Read Transaction — Tests + +- [ ] 8.1 Unit tests — serializer: every field name matches server model; null `sessionToken`; `ifNoneMatchETag` present/absent (`tests/Microsoft.Azure.Cosmos.Tests/DistributedTransaction/DistributedReadTransactionSerializerTests.cs`) +- [ ] 8.2 Unit tests — `DistributedTransactionResponse` with read transaction: 200 all-success; 207 mixed; malformed JSON; count mismatch; `GetItem()` deserialization; `IdempotencyToken` is null (`tests/Microsoft.Azure.Cosmos.Tests/DistributedTransaction/DistributedReadTransactionResponseTests.cs`) +- [ ] 8.3 Unit tests — session tokens: outbound populated from `ISessionContainer`; inbound merged back after execution (`tests/Microsoft.Azure.Cosmos.Tests/DistributedTransaction/DistributedReadTransactionCommitterTests.cs`) +- [ ] 8.4 Emulator integration tests: write known items; execute `DistributedReadTransaction` at Strong consistency; assert snapshot guarantee; test 304 for `ifNoneMatchETag`; test 404 for missing item; test session token round-trip (`tests/Microsoft.Azure.Cosmos.EmulatorTests/DistributedTransaction/DistributedReadTransactionTests.cs`) + +## 9. Public Promotion (both features) + +- [ ] 9.1 Remove all `#if INTERNAL` guards from write and read transaction types +- [ ] 9.2 Run `UpdateContracts.ps1` to capture new public types in API contract baseline files +- [ ] 9.3 Add both features to `changelog.md` and public documentation +- [ ] 9.4 Add code samples to `Microsoft.Azure.Cosmos.Samples/` + +## 10. Cross-SDK Implementations + +- [ ] 10.1 Java SDK: implement `DistributedWriteTransaction` and `DistributedReadTransaction` +- [ ] 10.2 Python SDK: implement async `DistributedWriteTransaction` and `DistributedReadTransaction` +- [ ] 10.3 JavaScript/Node.js SDK: implement Promise-based `DistributedWriteTransaction` and `DistributedReadTransaction` +- [ ] 10.4 Go SDK: implement context-based `DistributedWriteTransaction` and `DistributedReadTransaction`