Skip to content

[Client encryption] Adds streaming JSON processing support to feed iterators#5478

Open
MartinSarkany wants to merge 103 commits into
Azure:mainfrom
MartinSarkany:feature/encryption-feed-iterator-rework
Open

[Client encryption] Adds streaming JSON processing support to feed iterators#5478
MartinSarkany wants to merge 103 commits into
Azure:mainfrom
MartinSarkany:feature/encryption-feed-iterator-rework

Conversation

@MartinSarkany
Copy link
Copy Markdown
Contributor

@MartinSarkany MartinSarkany commented Nov 7, 2025

Pull Request Template

Description

What this PR does

Adds stream-mode JSON processing for encryption feed iterators (query, LINQ, change-feed) to Microsoft.Azure.Cosmos.Encryption.Custom. Consumers opt in per-call via RequestOptions or per-container via Container.UseStreamingJsonProcessingByDefault(). Default remains Newtonsoft — existing callers see no behavioral change.

New public API

  • EncryptionContainerExtensions.UseStreamingJsonProcessingByDefault(Container) — opt-in to stream mode per container
  • DecryptableItem.DisposeAsync()DecryptableItem is now IAsyncDisposable so StreamDecryptableItem can release its pooled buffer

No public API removals. No behavioral changes to existing surface.

Dependency changes

Package Project Change
Microsoft.IO.RecyclableMemoryStream 3.0.1 Microsoft.Azure.Cosmos.Encryption.Custom.Performance.Tests Removed. Used for benchmarks; replaced with PooledMemoryStream
System.Text.RegularExpressions 4.3.1 Microsoft.Azure.Cosmos.Encryption.Custom Removed (unused)

No additions.

#4678

Type of change

Please delete options that are not relevant.

  • [] New feature (non-breaking change which adds functionality)

Introduces JsonProcessorPropertyBag to centralize handling of JsonProcessor overrides using RequestOptions.Properties. Updates encryption and decryption workflows to support processor selection, adds related tests and helpers, and refactors code to use the new override mechanism for both production and test scenarios.
Introduces CosmosDiagnosticsContext for lightweight diagnostics and scope tracking, with ActivitySource integration for telemetry. Updates EncryptionProcessor to record diagnostic scopes and enforce streaming mode restrictions. Adds comprehensive unit tests for diagnostics context and processor, and merges stream processor end-to-end tests into MdeCustomEncryptionTests.
Moved diagnostics scope creation for MDE decryption to cover the entire decryption logic in EncryptionProcessor. Updated tests to assert scope presence/absence more robustly and removed conditional compilation for ENCRYPTION_CUSTOM_PREVIEW. Project files now always define ENCRYPTION_CUSTOM_PREVIEW constant.
Introduces direct serialization to output streams via WriteToStream to reduce intermediate memory allocations. Refactors decryption logic to streamline processor selection and diagnostics scope naming, and adds EncryptionDiagnostics constants for improved diagnostics context management. Updates MDE encryption processor to support direct stream handling and processor selection for .NET 8+ preview builds.
Introduces TestEncryptorFactory to centralize and simplify the creation of mock Encryptor and DataEncryptionKey instances in tests, reducing repetitive Moq setup code. Updates all relevant test classes to use this factory. Refactors MdeEncryptionProcessor to improve fallback handling for legacy and unencrypted streams, ensuring correct stream positioning and graceful fallback to legacy decryption paths.
Replaces separate diagnostic prefixes for JSON processor selection with a unified prefix in EncryptionDiagnostics and updates all usages and tests accordingly. Adds documentation for the JSON processor property bag override and introduces new tests for large payloads, concurrency, and cancellation scenarios with the streaming JSON processor.
Adds validation to ensure that streaming encryption only processes JSON documents with an object as the root element. Root arrays and other root types are now explicitly rejected, improving contract enforcement and error handling.
Refactored root validation logic to ensure streaming encryption only accepts JSON documents with an object root. Added tests to verify rejection of root arrays and primitive values, improving contract enforcement and error messaging.
Expanded EncryptionProcessorDiagnosticsTests to cover diagnostics scope emission for various decryption paths, including MDE payloads, provided-output, stream override, malformed JSON, duplicate scope prevention, null input, cancellation, and encryption. Added a fake MDE DataEncryptionKey implementation and a SlowCancelableStream helper for cancellation tests. Conditional assertions ensure correct scope behavior for NET8 preview builds.
Introduces IMdeJsonProcessorAdapter and concrete adapters for Newtonsoft and Stream processors to unify JSON processing logic in MdeEncryptionProcessor. Diagnostic scope tracking for encryption and decryption is improved, and related tests are updated to verify scope tracking for both processors. Legacy logic for processor selection and property inspection is moved into the respective adapters for better maintainability.
Unifies and updates diagnostics scope naming for encryption and decryption selection, removing legacy/impl scopes and ensuring only one selection scope is emitted per operation. Refactors MDE processor logic to directly handle Newtonsoft and Stream processors, improving fallback and error handling. Adds comprehensive unit tests for NewtonsoftAdapter and StreamAdapter, covering encryption, decryption, legacy/unencrypted payloads, and diagnostics scope assertions.
…rypted stream used; sync request options with processor; add legacy override guard and fallback JObject decrypt
Added a test to verify that explicitly overriding the JSON processor with Newtonsoft maintains the expected encryption/decryption behavior. Also added assertions to change feed processor tests to ensure both documents are processed and the processor does not time out.
Changed expected diagnostics.Scopes.Count from 1 to 0 in multiple StreamAdapterTests to reflect updated behavior. Ensures tests align with current diagnostics tracking implementation.
Introduces EncryptionRequestOptionsExperimental for configuring the JSON processor pipeline via request options, including helper methods and tests. Refactors JsonProcessorPropertyBag for improved property handling and updates test helpers to use the new API.
Replaces static calls to RequestOptionsPropertiesExtensions with instance extension method calls on RequestOptions throughout the codebase and tests. This improves code readability and aligns with extension method usage patterns.
Simplifies and unifies decryption logic by removing preprocessor directives and consolidating JsonProcessor selection. Refactors MdeEncryptionProcessor to use switch expressions and helper methods for both stream and Newtonsoft processors, improving maintainability and clarity. Ensures consistent handling of legacy and MDE-encrypted documents, and introduces utility methods for stream position management and result writing.
this.ResponseFactory,
this.Encryptor,
this.CosmosSerializer,
changeFeedRequestOptions?.GetJsonProcessor(this.DefaultJsonProcessor) ?? this.DefaultJsonProcessor);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the last ?? this.DefaultJsonProcessor should not be needed, the extension method already returns the fallback processor if no override is found in the request options

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Meghana-Palaparthi and others added 2 commits May 13, 2026 18:48
…imeout (Azure#5689)

## Description

 ## DTX Commit Retry — Design
 
 ### Two-loop architecture
 
DTX commits flow through two retry loops with distinct responsibilities:
 
 ```
DistributedTransactionCommitter.ExecuteCommitWithRetryAsync ← OUTER LOOP
   └─ clientContext.ProcessResourceOperationStreamAsync
        └─ AbstractRetryHandler (inner while-loop)
└─ ClientRetryPolicy.ShouldRetryInternalAsync ← INNER LOOP
 ```
 
The coordinator returns **two response shapes** for the same HTTP status
codes, which is why a single loop cannot own retries:
 
 | Shape | Status / sub-status | Body | Retriability signal | Owner |
 |---|---|---|---|---|
| **Envelope failure** | 408, 449/5352, 500/5411-5413 | `Content ==
null` | implicit (status code) | Inner (CRP) |
| **Semantic failure** | 408, 449/5352, 452 | JSON with per-op results |
`"isRetriable": true` in body | Outer (committer) |
| **Throttle** | 429/3200 | varies | `Retry-After` header |
`ResourceThrottleRetryPolicy` |
 
 ### The amplification problem
 
Without a routing rule, a 449/5352 carrying a body would be retried by
**both** loops:
inner CRP (10 attempts) × outer committer (10 attempts) = up to **100
wire requests per user call**.
 
 ### Options considered
 
 #### Option A — Status-code partition (outer-loop exclusion list)
 
CRP retries its owned codes unconditionally. The committer adds an
`IsOuterLoopRetriable()` helper that mirrors CRP's owned set and
excludes them.
 
- ❌ **Dual-maintenance hazard.** The owned-code set lives in two files.
Any drift re-introduces the 10×10 amplification.
- ❌ **Discards body data.** `AbstractRetryHandler` abandons the
`ResponseMessage` without reading it on inner-loop retries. Per-op
session tokens, ETags,
and `ResourceStream` payloads are silently dropped on every CRP retry.
- ❌ **Brittle to contract evolution.** If the coordinator starts
attaching bodies to 449 in a future revision, CRP retries blindly and
the `isRetriable`
signal never reaches the committer.
 
 #### Option B — Body-presence deferral ✅ **(chosen)**
 
Before CRP retries `{408, 449/5352, 429/3200}`, it checks
`hasResponseBody = (responseMessage?.Content != null)`. If a body is
present, CRP returns
`NoRetry()` and defers to the outer loop, which parses `isRetriable`
from that body. If no body, CRP retries immediately.
 
 - ✅ **Single source of truth** — one predicate, one file.
- ✅ **Preserves per-op data** — body-bearing responses always reach
`FromResponseMessageAsync → MergeSessionTokens`.
- ✅ **Forward compatible** — new body-bearing codes route through
`isRetriable` automatically with no SDK change.
 
 ### Retry budgets and delays (after this PR)
 
 | Code | Loop | Budget | Delay |
 |---|---|---|---|
| 408, 449/5352 (no body) | Inner (CRP) | 10 | 1 s flat (honors
`Retry-After`) |
| 500/5411-5413 | Inner (CRP) | 10 | 100 ms → 5 s exponential + jitter |
| 408, 449/5352, 452 (body + `isRetriable: true`) | Outer (committer) |
10 | 1 s → 32 s exponential + jitter |
| 429/3200 | `ResourceThrottleRetryPolicy` | 9 (default) | `Retry-After`
header |
 
 ### Other notable changes in this PR
 
- **408 endpoint-marking guard.** Added `if (!isDtxRequest)` before
`TryMarkEndpointUnavailableForPkRange`. DTX always goes through gateway;
marking a
partition endpoint unavailable on a coordinator 408 would skew routing
for unrelated requests.
- **Idempotency on retry.** `DistributedTransactionServerRequest`
pre-serializes the body to `byte[]` and vends a fresh `MemoryStream` per
attempt via
`CreateBodyStream()`. The idempotency token header is re-applied by the
`requestEnricher` closure on each attempt. Both inner and outer retries
are safe.
- **Body parser hardening.** `DistributedTransactionResponse` disposes
the original `ResponseMessage` before constructing the synthetic 500
fallback, and
emits a `DefaultTrace.TraceWarning` on `JsonException`.

---

## PR Summary

This pull request introduces significant improvements to the retry logic
for distributed transactions (DTX) in Cosmos DB, focusing on more
robust, efficient, and nuanced handling of retriable error codes and
infrastructure failures. The changes include new retry limits and
backoff strategies specifically for DTX scenarios, improved
differentiation of retry logic based on request type, and the
introduction of jitter to avoid synchronized client retries.
Additionally, the code is refactored to better support testing and
configuration of retry behavior.

**Distributed Transaction (DTX) Retry Logic Enhancements:**

* Introduced new retry counters and configurable limits for DTX retries
(`MaxDtxRetryCount`, `MaxDtxInfraFailureRetryCount`) and added
request-type detection (`isDtxRequest`) to apply DTX-specific logic.
[[1]](diffhunk://#diff-2b056512ca285b1d95e025e31f60345059fa92d958becc38f90a6fb54ce1bbb4L26-R32)
[[2]](diffhunk://#diff-2b056512ca285b1d95e025e31f60345059fa92d958becc38f90a6fb54ce1bbb4R43-R48)
[[3]](diffhunk://#diff-2b056512ca285b1d95e025e31f60345059fa92d958becc38f90a6fb54ce1bbb4R229-R231)
* Implemented DTX-specific handling in `ShouldRetryInternalAsync`,
including distinct logic for various DTX error codes (e.g., 408,
449/5352, 429/3200, 500/5411-5413), with separate retry budgets and
backoff strategies for infrastructure failures.
* Added bounded exponential backoff with ±25% jitter for DTX
infrastructure failures to prevent synchronized retries across clients.
[[1]](diffhunk://#diff-2b056512ca285b1d95e025e31f60345059fa92d958becc38f90a6fb54ce1bbb4L26-R32)
[[2]](diffhunk://#diff-2b056512ca285b1d95e025e31f60345059fa92d958becc38f90a6fb54ce1bbb4R724-R748)
* Ensured DTX retries are not mistakenly counted as endpoint failures
for non-DTX traffic, preventing negative impact on routing.

**General Retry and Committer Improvements:**

* Updated `ShouldRetryAsync` and related methods to pass additional
context (e.g., `RetryAfter`, response body presence) to internal retry
logic, allowing more nuanced retry decisions.
[[1]](diffhunk://#diff-2b056512ca285b1d95e025e31f60345059fa92d958becc38f90a6fb54ce1bbb4L121-R131)
[[2]](diffhunk://#diff-2b056512ca285b1d95e025e31f60345059fa92d958becc38f90a6fb54ce1bbb4L135-R146)
[[3]](diffhunk://#diff-2b056512ca285b1d95e025e31f60345059fa92d958becc38f90a6fb54ce1bbb4R185-R191)
[[4]](diffhunk://#diff-2b056512ca285b1d95e025e31f60345059fa92d958becc38f90a6fb54ce1bbb4L249-R269)
* Refactored `DistributedTransactionCommitter` to allow configurable
base delay and injected delay provider for improved testability and
flexibility, and added thread-local jitter for retry delays.

**Session Token Retry Policy Adjustments:**

* Updated session token retry logic to clarify and simplify when to set
the hub region processing header, and to stop retries after the correct
number of attempts.
[[1]](diffhunk://#diff-2b056512ca285b1d95e025e31f60345059fa92d958becc38f90a6fb54ce1bbb4R361-R368)
[[2]](diffhunk://#diff-2b056512ca285b1d95e025e31f60345059fa92d958becc38f90a6fb54ce1bbb4L452-L474)

These changes collectively make the retry policy for distributed
transactions more resilient, efficient, and easier to maintain and test,
especially in the face of transient infrastructure failures and complex
error scenarios.
## Type of change

Please delete options that are not relevant.

- [] Bug fix (non-breaking change which fixes an issue)
- [✓] New feature (non-breaking change which adds functionality)
- [] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [] This change requires a documentation update

## Closing issues

To automatically close an issue: closes #IssueNumber

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Nalu Tripician <27316859+NaluTripician@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
@MartinSarkany MartinSarkany changed the title [Client encryption] Add streaming JSON processing support to feed iterators [Client encryption] Adds streaming JSON processing support to feed iterators May 28, 2026
@juraj-blazek
Copy link
Copy Markdown
Contributor

Review findings:

  1. High - Microsoft.Azure.Cosmos.Encryption.Custom\src\Transformation\StreamProcessor.Decryptor.cs:452 / :467: streaming decrypt matches encrypted property names and _ei by name at any depth. Since encryption paths are root-only, nested properties like { "nested": { "Sensitive": "plain" } } can be treated as encrypted when _ei lists /Sensitive, causing decrypt failures/data corruption; nested _ei user data can also be stripped. Add reader.CurrentDepth == 1 guards for encrypted-path matching and _ei skipping, plus nested-property tests.

  2. High - Microsoft.Azure.Cosmos.Encryption.Custom\src\StreamDecryptableItem.cs:54-78: if decryption succeeds but cosmosSerializer.FromStream<T> throws, decryptedStream is never disposed because disposal only happens after successful deserialization. In stream mode this can leave a ReadOnlyBufferWriterStream/rented buffer containing plaintext undisposed. Track decryptedStream outside the inner try and dispose it in a finally/catch before wrapping the exception.

  3. Medium - Microsoft.Azure.Cosmos.Encryption.Custom\src\EncryptionProcessor.cs:414-421 + Microsoft.Azure.Cosmos.Encryption.Custom\src\DecryptableFeedResponse.cs:58-60: stream-mode feed responses return lazy StreamDecryptableItems backed by pooled streams, but the response does not own/cascade disposal and public examples do not dispose items. If callers skip items or stop early, rented buffers are not returned/cleared. Either avoid exposing pooled streams through the lazy public item path, or provide/enforce a response/item disposal pattern and update examples accordingly.

@MartinSarkany
Copy link
Copy Markdown
Contributor Author

Review findings:

  1. High - Microsoft.Azure.Cosmos.Encryption.Custom\src\Transformation\StreamProcessor.Decryptor.cs:452 / :467: streaming decrypt matches encrypted property names and _ei by name at any depth. Since encryption paths are root-only, nested properties like { "nested": { "Sensitive": "plain" } } can be treated as encrypted when _ei lists /Sensitive, causing decrypt failures/data corruption; nested _ei user data can also be stripped. Add reader.CurrentDepth == 1 guards for encrypted-path matching and _ei skipping, plus nested-property tests.
  2. High - Microsoft.Azure.Cosmos.Encryption.Custom\src\StreamDecryptableItem.cs:54-78: if decryption succeeds but cosmosSerializer.FromStream<T> throws, decryptedStream is never disposed because disposal only happens after successful deserialization. In stream mode this can leave a ReadOnlyBufferWriterStream/rented buffer containing plaintext undisposed. Track decryptedStream outside the inner try and dispose it in a finally/catch before wrapping the exception.
  3. Medium - Microsoft.Azure.Cosmos.Encryption.Custom\src\EncryptionProcessor.cs:414-421 + Microsoft.Azure.Cosmos.Encryption.Custom\src\DecryptableFeedResponse.cs:58-60: stream-mode feed responses return lazy StreamDecryptableItems backed by pooled streams, but the response does not own/cascade disposal and public examples do not dispose items. If callers skip items or stop early, rented buffers are not returned/cleared. Either avoid exposing pooled streams through the lazy public item path, or provide/enforce a response/item disposal pattern and update examples accordingly.

All should be fixed now.

@kundadebdatta
Copy link
Copy Markdown
Member

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Member

@kundadebdatta kundadebdatta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep Review — [Client encryption] Adds streaming JSON processing support to feed iterators @ e09e0bd

Overall assessment: in good shape. The major concerns raised earlier by @juraj-blazek (depth‑1 guard for _ei/encrypted-path matching, decryptedStream disposal on FromStream failure, pooled-stream cascade through DecryptableFeedResponse) all hold up at head. No correctness blockers found in the streaming decrypt path itself.

Cross-PR interaction check: PRs #5780 / #5829 / #5870 / #5844 / #5920 all touch the core Microsoft.Azure.Cosmos client; this PR is entirely contained in Microsoft.Azure.Cosmos.Encryption.Custom. No file or surface overlap.

Findings below in severity order. None are merge-blockers; 1–3 are worth addressing before merge.


🟡 1. DecryptableFeedResponse.DisposeAsync aborts cascade on first item exception (buffer leak)

File: Microsoft.Azure.Cosmos.Encryption.Custom/src/DecryptableFeedResponse.cs:87-96

if (this.Resource != null)
{
    foreach (T item in this.Resource)
    {
        if (item is IAsyncDisposable asyncDisposable)
        {
            await asyncDisposable.DisposeAsync().ConfigureAwait(false);
        }
    }
}

If any one item''s DisposeAsync throws (e.g., a StreamDecryptableItem whose SemaphoreSlim.WaitAsync observes cancellation), the foreach is aborted and the remaining items never have their pooled buffers returned. The class''s own XML doc explicitly says callers "rely on this cascade to release those buffers" — that contract is silently violated.

Suggested fix: Capture per-item exceptions, dispose all items, then aggregate/rethrow:

List<Exception> errors = null;
foreach (T item in this.Resource)
{
    if (item is IAsyncDisposable asyncDisposable)
    {
        try { await asyncDisposable.DisposeAsync().ConfigureAwait(false); }
        catch (Exception ex) { (errors ??= new()).Add(ex); }
    }
}
if (errors != null) throw new AggregateException(errors);

🟡 2. StreamDecryptableItem never disposes its SemaphoreSlim (handle leak)

File: Microsoft.Azure.Cosmos.Encryption.Custom/src/StreamDecryptableItem.cs:30, 252-282

private readonly SemaphoreSlim asyncLock = new (1, 1);
...
public override async ValueTask DisposeAsync()
{
    await this.asyncLock.WaitAsync().ConfigureAwait(false);
    try { ... }
    finally { this.asyncLock.Release(); }
    GC.SuppressFinalize(this);
}

SemaphoreSlim allocates a kernel ManualResetEvent lazily on first contended wait — that handle is only released by Dispose(). A query returning 1000 docs creates 1000 StreamDecryptableItems; under any contention these accumulate OS wait handles until GC.

Suggested fix: after the finally block, call this.asyncLock.Dispose();.


🟡 3. Over-broad NotSupportedException catch silently triggers fallback for unrelated errors

File: Microsoft.Azure.Cosmos.Encryption.Custom/src/EncryptionProcessor.cs:456-475

try
{
    return await MdeEncryptionProcessor.DecryptJsonArrayStreamInPlaceAsync(
        content, encryptor, CosmosDiagnosticsContext.Create(null), cancellationToken);
}
catch (NotSupportedException)
{
    content.Position = 0;
    return await DecryptJsonArrayNewtonsoftAsync(content, encryptor, cancellationToken);
}

The catch is intended for the seekability check at StreamProcessor.Decryptor.cs:45-48. But NotSupportedException is also thrown from DecryptStreamAsync for EncryptionFormatVersion != Mde (StreamProcessor.Decryptor.cs:333) and from MdeEncryptor.GetAdapter (MdeEncryptionProcessor.cs:178) for an unsupported processor enum. Both would now be silently retried on Newtonsoft.

Suggested fix: Pre-validate seekability up-front and let the stream path''s exceptions propagate, or introduce a dedicated StreamNotSupportedException.


🟢 4. Fallback path is unreachable for its only legitimate trigger

File: Microsoft.Azure.Cosmos.Encryption.Custom/src/EncryptionProcessor.cs:471

catch (NotSupportedException)
{
    content.Position = 0;   // <- throws NotSupportedException on non-seekable streams
    return await DecryptJsonArrayNewtonsoftAsync(content, encryptor, cancellationToken);
}

The NotSupportedException this catch was designed to handle is raised exactly when !content.CanSeek. Setting Position = 0 on a non-seekable stream throws another NotSupportedException, masking the original and breaking the fallback. In practice Cosmos returns seekable MemoryStream content, so this never trips today, but the safety net doesn''t catch the case it advertises. Combined with finding 3 if you take the pre-validate approach this disappears.


🟢 5. _ei skip is fragile if the value isn''t a JSON object

File: Microsoft.Azure.Cosmos.Encryption.Custom/src/Transformation/StreamProcessor.Decryptor.cs:469-477 (set) and :403-407 (reset)

else if (reader.ValueTextEquals(this.encryptionPropertiesNameBytes))
{
    if (!reader.TrySkip())
    {
        isIgnoredBlock = true;       // <- assumes the value is an object
    }
    break;
}
...
if (isIgnoredBlock && reader.CurrentDepth == 1 && tokenType == JsonTokenType.EndObject)
{
    isIgnoredBlock = false;          // <- only resets on EndObject
    continue;
}

The current serializer always emits _ei as a JSON object, but the reader makes no such guarantee. If a forged/corrupt response had _ei as an array or a primitive, isIgnoredBlock would never reset (no depth‑1 EndObject) and the remainder of the document would be silently dropped — including the closing }, which would then corrupt downstream JSON state.

Suggested fix: Validate that the next non-skipped token is StartObject before entering ignore mode, or also reset on EndArray/primitives at depth 1. Add a unit test covering _ei: [] and _ei: "foo".


🟢 6. StreamDecryptableItem caches the deserialized item as object — second GetItemAsync<U> cast throws opaquely

File: Microsoft.Azure.Cosmos.Encryption.Custom/src/StreamDecryptableItem.cs:34-90

private object cachedItem;
...
if (this.isDecrypted)
{
    return ((T)this.cachedItem, this.cachedDecryptionContext);
}

After a successful GetItemAsync<Foo>(), the stream is disposed and only the boxed Foo remains. A later GetItemAsync<Bar>() does (Bar)this.cachedItem — throws InvalidCastException with no context. The previous DecryptableItemCore (Newtonsoft) didn''t have this restriction since it cached the JToken. Subtle behavioural divergence between the two processors.

Suggested fix: Store the first-materialized type and throw a descriptive exception on mismatch, or store enough state to re-deserialize.


🟢 7. Missing ConfigureAwait(false) on several awaits in netstandard2.0 paths

Files:

  • Microsoft.Azure.Cosmos.Encryption.Custom/src/EncryptionFeedIterator.cs:42, 50
  • Microsoft.Azure.Cosmos.Encryption.Custom/src/EncryptionContainer.cs:1054, 1059
  • Microsoft.Azure.Cosmos.Encryption.Custom/src/EncryptionProcessor.cs:399, 401, 463, 467, 473

Several awaits lack ConfigureAwait(false) while the rest of the file/PR uses it. The library targets netstandard2.0 consumers (ASP.NET classic / WPF SynchronizationContext) where capturing the context can deadlock or cause unnecessary thread switches.


💬 8. Observation: UseStreamingJsonProcessingByDefault is a write-only, one-shot toggle with no synchronization story

File: Microsoft.Azure.Cosmos.Encryption.Custom/src/EncryptionContainer.cs:19, 1028-1031 + EncryptionContainerExtensions.cs:31-47

  • Mutable state, no opposite method, no idempotency contract documented.
  • Reads in GetItemQueryIterator etc. aren''t synchronized with the setter (enum writes are atomic in .NET so this is observably benign today, but the design is awkward).
  • The extension on Container returns the same Container reference after mutation — easy to misuse (caller may assume a wrapping pattern).

Not a blocker. Consider an options/builder shape or constructor-injected default if revisiting the public API.


💬 9. Observation: TryExtractEncryptionProperties re-parses each captured object

File: Microsoft.Azure.Cosmos.Encryption.Custom/src/Transformation/StreamProcessor.Decryptor.cs:280-294, 173

For every document the streaming path first buffers the entire object, then JsonSerializer.Deserialize<EncryptionPropertiesWrapper>(...) just to check for _ei, then re-parses with Utf8JsonReader to do the actual transform. Two full passes per document on the supposedly "stream-fast" path. A single targeted pre-scan via Utf8JsonReader for the _ei property at depth 1 would avoid the second deserialize. Performance, not correctness — but a worthwhile follow-up.


Recommendation

No blocking findings. The two highest-impact items raised in the prior review thread are clearly addressed. Recommend tightening before merge:

  • Finding 1 (cascade aborts on first failure) — small change, real leak.
  • Finding 2 (SemaphoreSlim leak) — one-line fix.
  • Finding 3 (overly-broad catch) — tighten to avoid silent fallback for unrelated errors.

Findings 4–9 are nice-to-haves and can land here or as a small follow-up. Otherwise approve-with-nits.

…litter throws mid-iteration

ConvertResponseToDecryptableItemsStreamAsync builds a List<DecryptableItem>
locally and only returns it after the JsonArrayStreamSplitter async-enumerator
completes. If the splitter threw after yielding one or more documents (mid-feed
transport error, cancellation, malformed payload past the first doc), every
StreamDecryptableItem already added to the local list owned a PooledMemoryStream
rented from ArrayPool<byte>.Shared. Because the partial list never reaches the
caller and is never wrapped in a DecryptableFeedResponse, the disposal cascade
addressed by an earlier round of review never sees those items. The rented
buffers stay out of the pool (eventual GC reclamation only) and may retain
plaintext residue.

Wrap the foreach in try/catch, drain any partial items via IAsyncDisposable
(swallowing per-item failures so we don't mask the original cause), then rethrow
the original exception unchanged. Adds a regression test in
EncryptionProcessorTests that feeds a ThrowAfterPrefixStream containing one
complete document followed by an IOException, and asserts the original
IOException identity is preserved.

Also strengthens XML doc on UseStreamingJsonProcessingByDefault and
DecryptableItem.DisposeAsync to call out the FeedResponse<DecryptableItem>
disposal contract that stream-mode callers must follow (the cascade is real but
the public type doesn't advertise IAsyncDisposable, so callers must cast). Adds
the missing changelog entry for Azure#5478 under 1.0.0-preview09.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@adamnova
Copy link
Copy Markdown
Contributor

adamnova commented Jun 3, 2026

Picking this up while @MartinSarkany is on leave. New commit 6e0d6d5 addresses one finding from an adversarial pass and strengthens the disposal-contract docs/changelog around the fixes Martin already made.

Adversarial-review verdict on the existing fixes (HEAD e09e0bd):

Prior finding File Verdict
Redundant ?? this.DefaultJsonProcessor EncryptionContainer.cs:837 ✅ Fixed. GetJsonProcessor is null-safe so the dropped ?. is fine.
Depth guard for encrypted-path and _ei matching StreamProcessor.Decryptor.cs:452 ✅ Fixed. reader.CurrentDepth == 1 gates both encrypted-path lookup and _ei skip. EncryptionPropertiesStreamReader also scans _ei only at depth 1.
decryptedStream leak when FromStream<T> throws StreamDecryptableItem.cs:56-79 ✅ Fixed. Aliasing check + inner finally dispose; StreamDecryptableItemTests.GetItemAsync_WhenSerializerThrowsAfterDecryption_DisposesDecryptedStream covers it.
Lazy StreamDecryptableItems leak if caller skips DecryptableFeedResponse.cs ✅ Cascade works. The compile-time return type is still FeedResponse<T> — callers must runtime-cast to IAsyncDisposable. Locked in by DecryptableFeedResponseTests.DisposeAsync_ResponseIsCastableToIAsyncDisposable and the updated example on DecryptableItem. Strengthened the XML docs on UseStreamingJsonProcessingByDefault / DecryptableItem.DisposeAsync to call this out at the opt-in surface (this commit).

One NEW issue this commit fixes (HIGH — orphan-leak narrower than #4 above):

EncryptionProcessor.ConvertResponseToDecryptableItemsStreamAsync accumulates StreamDecryptableItems in a local List<DecryptableItem> while consuming JsonArrayStreamSplitter.SplitIntoSubstreamsAsync. If the splitter threw after yielding one or more documents (mid-feed transport error, cancellation, malformed payload past the first doc), every already-appended StreamDecryptableItem owned a PooledMemoryStream rented from ArrayPool<byte>.Shared. Because the partial list never reaches the caller and is never wrapped in a DecryptableFeedResponse, the disposal cascade addressed by the previous review round never sees those items. The rentals stay out of the pool (eventual GC reclamation only) and may retain plaintext residue.

The fix wraps the await foreach in try/catch, drains any partial items via IAsyncDisposable (swallowing per-item failures so we don't mask the original cause), then rethrows the original exception unchanged. EncryptionProcessorTests.ConvertResponseToDecryptableItemsAsync_Stream_WhenSplitterThrowsMidFeed_PreservesOriginalException feeds a ThrowAfterPrefixStream containing one complete document followed by an IOException and asserts the original IOException identity is preserved through the catch/rethrow.

Also added the missing changelog entry for #5478 under 1.0.0-preview09.

Local validation: net8.0 unit-test suite 689 passed / 0 failed (previously 688 — adds my new regression test). Encryption.Custom build clean (0 warnings).

CI: The two failures on the previous build (EmulatorTests Release - MultiRegion / MultiMaster) are core-SDK MultiRegion/MultiMaster tests; this PR touches only files under Microsoft.Azure.Cosmos.Encryption.Custom and the Encryption.Custom EmulatorTests Release job passed. Treating those failures as flaky/infra unrelated to this PR. Re-triggering with /azp run below — if MultiRegion/MultiMaster still fail on the rerun and the failure modes are not encryption-related, they should be ignored for merge gating per the standard flaky-test handling.

/azp run

…ade, and stream-mode buffer cap

Addresses second-pass review findings (HIGH/MEDIUM) on top of the orphan-leak
fix already in 6e0d6d5.

HIGH — StreamDecryptableItem loses the DEK id in EncryptionException
  Previously hard-coded dataEncryptionKeyId: string.Empty on every decryption
  failure, dropping the diagnostic surface customers use to correlate
  key-store / DEK-revocation failures. Newtonsoft DecryptableItemCore extracts
  _ei.DataEncryptionKeyId before throwing; stream mode now matches:
   1. If decryption succeeded and serialization later threw, the DEK id is
      already in the in-flight DecryptionContext (most accurate source).
   2. Else if contentStream is still readable, parse it out of _ei via the
      existing EncryptionPropertiesStreamReader (seekable-only path; resets
      Position to 0 on return so the subsequent encrypted-content read is
      unaffected).
   3. Else fall back to string.Empty (matches the prior behavior and avoids
      re-throwing inside the catch — EncryptionException's ctor rejects null).

HIGH — DecryptableFeedResponse.DisposeAsync aborts cascade on first throw
  The XML doc promises every pooled buffer is released, but a single throwing
  item silently stranded the rented buffers of every remaining item. The
  cascade is now best-effort:
   - Per-item DisposeAsync is wrapped in try/catch.
   - Failures are collected in a local list and surfaced AFTER the cascade.
   - Single failure: rethrown via ExceptionDispatchInfo.Capture so the original
     stack/identity is preserved (existing
     DisposeAsync_PropagatesExceptionFromItemDispose test still passes).
   - Multiple failures: surfaced as AggregateException.
  The isDisposed flag is also tightened to int + Interlocked.Exchange so
  concurrent disposers can't both pass the idempotency guard.

MEDIUM — StreamProcessor per-document buffer grows unbounded
  JsonArrayStreamSplitter / JsonFeedStreamHelper already enforce a 64 MiB cap
  on growth; the per-document inner loop in StreamProcessor.Decryptor and
  StreamProcessor.Encryptor did not. A single malformed or maliciously-large
  encrypted property could drive the buffer to OOM rather than throwing a
  clean InvalidOperationException. Both inner loops now mirror the splitter
  cap and message-style.

MEDIUM — JsonFeedStreamHelper.HandleLeftOver message clarified
  Trip condition is no-progress on the chunk (document or token doesn't fit),
  not strictly a token. Reworded accordingly.

MEDIUM — UseStreamingJsonProcessingByDefault configure-once contract
  Added XML doc note on the EncryptionContainer instance method explaining the
  one-way, configure-before-use semantics and the undefined behavior of
  mutating it while iterators are in flight. (No code change — adding
  Interlocked here would be over-engineering for a configuration setter and
  could mislead callers into thinking mid-operation mutation is supported.)

Tests
  - StreamDecryptableItemTests.GetItemAsync_WhenDecryptionFails_PopulatesDataEncryptionKeyIdOnException
    asserts EncryptionException.DataEncryptionKeyId == dekId after a forced
    decryption failure (extracted via EncryptionPropertiesStreamReader from _ei).
  - DecryptableFeedResponseTests.DisposeAsync_WhenItemThrows_StillDisposesRemainingItems
    asserts items before, between, and after a throwing item all see exactly one
    DisposeAsync call, and that the original InvalidOperationException identity
    is preserved (not wrapped in AggregateException for the single-failure case).
  - DecryptableFeedResponseTests.DisposeAsync_WhenMultipleItemsThrow_AggregatesAndStillDrains
    asserts that two throwing items surface as an AggregateException with both
    inner exceptions, and that the non-throwing item between them is still
    disposed.

  Local: net8.0 unit-test suite 692 passed / 0 failed (up from 689 after the
  three new regression tests). Build clean (0 warnings, 0 errors).

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@adamnova
Copy link
Copy Markdown
Contributor

adamnova commented Jun 3, 2026

Second-pass review verdict (commit 38e6e38). Independently verified every finding before fixing, applied four code-changing fixes (one HIGH regression, one HIGH cascade-leak, two MEDIUMs), one doc-strengthening, and added three regression tests.

# Sev Finding Verdict Action
H-1 HIGH StreamDecryptableItem loses DEK id in EncryptionException (regression vs DecryptableItemCore) ✅ CONFIRM Fixed. Catch block now extracts the DEK id from DecryptionContext (if decrypt succeeded then serializer threw) or from on-stream _ei via EncryptionPropertiesStreamReader.ReadAsync. Falls back to string.Empty only when both sources are unavailable (preserves the original no-throw-in-catch behavior). Test: GetItemAsync_WhenDecryptionFails_PopulatesDataEncryptionKeyIdOnException.
H-2 HIGH DecryptableFeedResponse.DisposeAsync aborts cascade on first throw ✅ CONFIRM Fixed. Per-item try/catch; single failure rethrown via ExceptionDispatchInfo.Capture (preserves identity; existing DisposeAsync_PropagatesExceptionFromItemDispose still green); multiple failures surface as AggregateException. isDisposed tightened to int + Interlocked.Exchange for the L-1 concurrent-disposer race. Tests: DisposeAsync_WhenItemThrows_StillDisposesRemainingItems, DisposeAsync_WhenMultipleItemsThrow_AggregatesAndStillDrains.
H-3 HIGH Missing changelog entry ✅ CONFIRM Done in 6e0d6d5. Extended in 38e6e38 with three #### Fixes entries for the H-1/H-2/M-2 fixes.
H-4 HIGH No unit test exercises actual MDE stream-decryption path through StreamDecryptableItem ✅ CONFIRM Will file as a follow-up issue. Out of scope for this commit — addressing it cleanly requires either a fake MdeStreamProcessor or a coverage hook around the algorithm dispatch in SystemTextJsonStreamAdapter. The new H-1 test does exercise the MDE encrypt → stream-decrypt-attempt → MDE-failure path, but that's a partial credit, not full coverage.
M-1 MED EncryptionContainer.DefaultJsonProcessor mutation racy and one-way ✅ CONFIRM Doc-only fix. Added XML doc note on EncryptionContainer.UseStreamingJsonProcessingByDefault() calling out the configure-once-before-use contract and the undefined behavior of in-flight mutation. Skipping Interlocked.Exchange here intentionally — for a configure-once setter, adding atomics misleads callers into thinking mid-operation mutation is supported; the right contract is "set during setup, before any encryption op". Matches the PooledStreamConfiguration pattern in the same package.
M-2 MED StreamProcessor per-document buffer grows unbounded ✅ CONFIRM Fixed. Both StreamProcessor.Decryptor (line 368-378) and StreamProcessor.Encryptor (line 68-77) now enforce the same MaxBufferSize = 64 MiB cap that JsonArrayStreamSplitter / JsonFeedStreamHelper already enforce, throwing the same InvalidOperationException style. Note: confirmed via git log -L that the unbounded path was introduced by this PR (commit 0779f5130 "Replace RecyclableMemoryStream with ArrayPool-backed streams"), so this isn't a pure pre-existing concern — appropriate to fix here.
M-3 MED Misleading HandleLeftOver error message ✅ CONFIRM Fixed. Reworded to "JSON document or token does not fit within the maximum buffer size of {N} bytes" — trip condition is no-progress on the chunk, which is most often the document not fitting, not strictly a token. M-2 fixes use the same wording for consistency.
M-4 MED UseStreamingJsonProcessingByDefault return type is Container ⚠️ ACKNOWLEDGE Skipping. Reviewer correctly notes this can't be cleanly improved today (EncryptionContainer is internal sealed); raising the surface would require a wider public-API change. Worth a follow-up if the customization surface grows.
L-1 LOW isDisposed not atomic ✅ CONFIRM Fixed alongside H-2 — see above.
L-2 LOW PR description has empty checkbox - [] ✅ CONFIRM Can't fix from my account — editing the base-repo PR body needs maintainer access I don't have (my token has repo scope but not write on Azure/azure-cosmos-dotnet-v3). Cosmetic only; please flip - []- [x] when someone with the perm does the merge pass.
L-3 LOW Doc example forces customers to manually cast to IAsyncDisposable ⚠️ ACKNOWLEDGE Skipping. The reviewer's own suggestion (throw if not IAsyncDisposable) would break the inheritance contract (DecryptableItem base could legitimately not need disposal). Strengthened the XML doc in 6e0d6d5 to make the contract loud at the opt-in surface; further hardening is an SDK-level concern.
L-4 LOW DecryptContentStreamAsync drops CancellationToken ⚠️ ACKNOWLEDGE Skipping — same limitation as DecryptableItemCore, would require a new public overload on DecryptableItem (worth a follow-up issue).
L-5 LOW Two EncryptionFeedIterator ctor overloads with same arity ⚠️ ACKNOWLEDGE Skipping — small ergonomic concern, removing the unused overload is a non-trivial refactor for the value. Filed in my head for follow-up.
L-6 LOW Duplicate-Documents-property test pinning unusual semantics ⚠️ ACKNOWLEDGE Skipping — a code comment would be nice but pinning the behavior is intentional defense-in-depth for backend payloads.

Local validation: 692 passed / 0 failed on net8.0 (up from 689 after the three new regression tests). Build clean (0 warnings, 0 errors).

Status: PR HEAD is now 38e6e38. Still waiting for an authorized user (e.g. @kundadebdatta) to comment /azp run to re-trigger the dotnet-v3-ci pipeline on the new SHA — my comment alone is silently ignored by the bot, same restriction Martin's PR always had. Will file the H-4 coverage-gap and L-4 cancellation-token concerns as follow-up issues separately so they don't block this merge.

/azp run

…nts splitter test

The test pins unusual semantics (yield from every Documents array seen when
duplicated), which differs from the strict-JSON last-wins reading. The
Cosmos gateway will never emit duplicate root-property keys, so the splitter
is intentionally permissive rather than strict. Without this comment a future
maintainer could read the test as accidentally exercising a bug and 'fix' it
to last-wins, silently breaking the splitter's invariant that any object
inside a Documents array is a document.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@adamnova
Copy link
Copy Markdown
Contributor

adamnova commented Jun 4, 2026

H-4 and L-4 follow-up fixes pushed in commit b62e01e. Both deferred items now resolved in-PR.

# Status Summary
L-4 CancellationToken plumbed through stream-mode decrypt ✅ Fixed New public virtual DecryptableItem.GetItemAsync<T>(CancellationToken) overload. Default impl honors pre-call cancellation and delegates to GetItemAsync<T>() so out-of-package subclasses are binary-compat. StreamDecryptableItem overrides it to plumb the token through SemaphoreSlim.WaitAsync and EncryptionProcessor.DecryptAsync. DecryptableItemCore overrides it to forward to the existing EncryptionProcessor.DecryptAsync(JObject, ...) overload (which already took a CT param that the parameterless caller was hard-coding to default). OperationCanceledException is rethrown unwrapped on both paths. The parameterless GetItemAsync<T>() now delegates to GetItemAsync<T>(default) so the impl lives in one place. Tests: GetItemAsync_WithCancellationToken_PreCancelled_Throws + _NotCancelled_DecryptsNormally + GetItemAsync_DefaultOverload_DelegatesToCancellationTokenOverload. Public API surface contracts (net6 + net8) updated; diff is purely additive.
H-4 MDE stream-decrypt path coverage ✅ Fixed GetItemAsync_StreamMode_WithMdeAlgorithm_TakesGenuineMdeStreamPath encrypts with MdeAeadAes256CbcHmac256Randomized via TestEncryptorFactory.CreateMde (the only mock that supplies the DataEncryptionKey methods the MDE stream encryptor needs — EncryptByteCount/DecryptByteCount/EncryptData/DecryptData), then asserts via an ActivityListener on the Microsoft.Azure.Cosmos.Encryption.Custom source that the ScopeDecryptModeSelectionPrefix+JsonProcessor.Stream scope WAS created AND that the ScopeDecryptModeSelectionPrefix+JsonProcessor.Newtonsoft scope was NOT — locking in "we actually take the MDE/Stream branch", not the legacy fallback the existing tests silently exercised.

Validation: net8.0 unit-test suite 696 passed / 0 failed. Build clean (0 warnings, 0 errors).

Status: PR HEAD is now b62e01e. Still pending an authorized user to comment /azp run to re-trigger CI on the new SHA.

/azp run

Every existing Stream-variant decrypt test in StreamDecryptableItemTests
encrypts with AEAes256CbcHmacSha256Randomized (legacy), which makes
SystemTextJsonStreamAdapter.DecryptAsync throw NotSupportedException at
its algorithm check and routes the request through the Newtonsoft legacy
fallback (EncryptionProcessor.DecryptAsync with legacyFallback: true).
Net effect: the MDE stream-decrypt path was never actually exercised by
the StreamDecryptableItem suite, even though tests appeared to cover it.

GetItemAsync_StreamMode_WithMdeAlgorithm_TakesGenuineMdeStreamPath
encrypts with MdeAeadAes256CbcHmac256Randomized via
TestEncryptorFactory.CreateMde (the only mock that supplies the
DataEncryptionKey methods the MDE stream encryptor needs:
EncryptByteCount / DecryptByteCount / EncryptData / DecryptData), then
asserts via an ActivityListener on the
'Microsoft.Azure.Cosmos.Encryption.Custom' source that the
ScopeDecryptModeSelectionPrefix+JsonProcessor.Stream scope WAS created
AND the ScopeDecryptModeSelectionPrefix+JsonProcessor.Newtonsoft scope
was NOT created. Together with round-trip equality assertions, this
locks in 'we actually take the MDE/Stream branch', not the legacy
fallback the existing tests silently exercised.

Test-only; no shipped src changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@adamnova adamnova force-pushed the feature/encryption-feed-iterator-rework branch from b62e01e to 5cecb2b Compare June 4, 2026 14:03
@adamnova
Copy link
Copy Markdown
Contributor

adamnova commented Jun 4, 2026

Reverting L-4 (the GetItemAsync<T>(CancellationToken) overload) and force-pushing — your pushback on the default impl was correct, and looking at where the token could actually fire, the value is near zero:

Where CT could observe cancellation inside GetItemAsync<T>:

  • SemaphoreSlim.WaitAsync(CT) — only under contention on the same item (rare)
  • encryptor.GetEncryptionKeyAsync(CT) — only if the customer's DEK provider does remote I/O and the DEK isn't cached
  • encryptor.DecryptAsync(CT) — only if the customer's encryptor is slow

Where CT cannot fire (the bulk of the work):

  • PooledMemoryStream.ReadAsync — in-memory, ignores CT
  • StreamProcessor.TransformDecryptBuffer per-byte inner loop — purely synchronous, no ThrowIfCancellationRequested
  • Utf8JsonWriter / Utf8JsonReader — sync
  • cosmosSerializer.FromStream<T> — sync
  • Newtonsoft path's JObject traversal — sync

For the typical case (page already fetched, DEK cached, small-to-medium documents) GetItemAsync<T> is sub-millisecond sync CPU work and CT will never fire even with proper plumbing. Customers who want fail-fast can already do ct.ThrowIfCancellationRequested(); await item.GetItemAsync<T>(); from their side. The original L-4 rationale ("important for the stream path where decryption can be slow for large documents") doesn't survive scrutiny — the per-byte loop is exactly where slowness happens, and it's CT-deaf.

Cost vs benefit: API-surface expansion + breaking change for external DecryptableItem subclassers (when made abstract) for a CT that fires only in the HSM-slow-DEK-fetch corner case. Not worth it.

Reverted L-4 in commit 5cecb2b33 (force-pushed over b62e01e63). Kept the H-4 MDE-path coverage test — that one is standalone and lands real coverage on a path that was silently falling through to the Newtonsoft fallback. The H-4 commit message is now test-only.

Validation: 693 passed / 0 failed on net8.0 after the revert (was 696; -3 for the dropped L-4 regression tests, kept the +1 H-4 test). Build clean.

Final PR HEAD: 5cecb2b33. Still pending an authorized commenter for /azp run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants