Original file line number Diff line number Diff line change
@@ -31,7 +31,12 @@ public EncryptionKeyStoreProviderImpl(IKeyEncryptionKeyResolver keyEncryptionKey
         {
             this.keyEncryptionKeyResolver = keyEncryptionKeyResolver;
             this.ProviderName = providerName;
-            this.DataEncryptionKeyCacheTimeToLive = TimeSpan.Zero;
+
+            // Enable the MDE library's built-in DEK byte cache. When ProtectedDataEncryptionKey cache
+            // expires (every 1–2 hours), the DEK byte cache still holds the unwrapped key bytes, so
+            // reconstruction avoids Key Vault HTTP calls. The 2-hour TTL outlives the default
+            // ProtectedDataEncryptionKey TTL (1 hour), covering most steady-state cache misses.
+            this.DataEncryptionKeyCacheTimeToLive = TimeSpan.FromHours(2);
         }
 
         public override string ProviderName { get; }
@@ -0,0 +1,24 @@
//------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
//------------------------------------------------------------

namespace Microsoft.Azure.Cosmos.Encryption.Tests
{
    using System;
    using global::Azure.Core.Cryptography;
    using Microsoft.VisualStudio.TestTools.UnitTesting;
    using Moq;

    [TestClass]
    public class EncryptionKeyStoreProviderImplTests
    {
        [TestMethod]
        public void Constructor_DekByteCacheEnabled_TwoHours()
        {
            Mock<IKeyEncryptionKeyResolver> mockResolver = new Mock<IKeyEncryptionKeyResolver>();
            EncryptionKeyStoreProviderImpl provider = new EncryptionKeyStoreProviderImpl(mockResolver.Object, "testProvider");

            Assert.AreEqual(TimeSpan.FromHours(2), provider.DataEncryptionKeyCacheTimeToLive);
        }
    }
}
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-05
285 changes: 285 additions & 0 deletions openspec/changes/reduce-encryption-contention/design.md

Large diffs are not rendered by default.

29 changes: 29 additions & 0 deletions openspec/changes/reduce-encryption-contention/proposal.md
@@ -0,0 +1,29 @@
## Why

Client-side encryption operations throw `OperationCanceledException` under concurrent load because a global `SemaphoreSlim(1,1)` in `BuildProtectedDataEncryptionKeyAsync` serializes construction of `ProtectedDataEncryptionKey` objects, the resolved, unwrapped encryption keys needed for every encrypt/decrypt operation. The semaphore guards `KeyEncryptionKey.GetOrCreate` and `ProtectedDataEncryptionKey.GetOrCreate`, the Microsoft Data Encryption library's static get-or-create cache operations. On a cache miss, these invoke the `createItem` delegate, which triggers synchronous Key Vault HTTP calls (Resolve + UnwrapKey). The semaphore ensures that only one thread at a time performs the expensive key creation, preventing duplicate Key Vault calls for the same key. However, it also means that every encrypted leaf value in every document on every page contends on this single-permit lock, even when the cache is warm and the hold time would be microseconds. The root cause is that `ProtectedDataEncryptionKey` resolution is synchronous, happens on the hot path under the semaphore, and makes two blocking HTTP calls to Key Vault on a cache miss. An internal customer is actively impacted.

## What Changes

- **Async prefetch of `ProtectedDataEncryptionKey` resolution outside the semaphore** via `ResolveAsync()` + `UnwrapKeyAsync()` before semaphore acquisition. The sync `UnwrapKey` inside the semaphore cannot be removed — the Microsoft Data Encryption library's `ProtectedDataEncryptionKey.GetOrCreate` constructor chain calls it unconditionally. Instead, the prefetch populates a cache so that when the Microsoft Data Encryption library's sync `UnwrapKey` fires, it returns cached bytes instantly instead of making HTTP calls to Key Vault. The semaphore is still acquired, but held for microseconds (cache read) instead of 200ms–2.4s (HTTP I/O).
- **Proactive background refresh** of the prefetched unwrapped Data Encryption Key bytes (the plaintext AES-256 key material returned by Key Vault's `UnwrapKey`) approximately 5 minutes before time-to-live expiry, deduplicated to one Azure Key Vault call per key per interval. This prevents a thundering herd at the time-to-live boundary.
- **Cache the resolved `IKeyEncryptionKey` (CryptographyClient)** per Customer Master Key URL so each refresh makes one Azure Key Vault call (`UnwrapKey`) instead of two (`Resolve` + `UnwrapKey`).
- All changes gated behind an opt-in environment variable (`AZURE_COSMOS_ENCRYPTION_OPTIMISTIC_DECRYPTION_ENABLED`), off by default. No public API changes. No breaking changes.

## Capabilities

### New Capabilities
- `async-dek-prefetch`: Async prefetch of `ProtectedDataEncryptionKey` resolution outside the semaphore using `ResolveAsync()` + `UnwrapKeyAsync()`, with a `ConcurrentDictionary` prefetch cache that the sync `UnwrapKey` reads from. Includes proactive background refresh before time-to-live expiry and lifecycle management via `CancellationTokenSource`.
- `resolved-client-cache`: Cache the `IKeyEncryptionKey` (CryptographyClient) returned by `Resolve()` per Customer Master Key URL to eliminate redundant Key Vault HTTP GETs on each refresh.
- `env-var-feature-gate`: Environment variable gate (`AZURE_COSMOS_ENCRYPTION_OPTIMISTIC_DECRYPTION_ENABLED`) to opt in to all caching/prefetch layers. Off by default. Follows existing SDK `ConfigurationManager` pattern.

### Modified Capabilities
<!-- No existing spec-level requirement changes. All changes are additive and gated. -->

## Impact

- **Files**: `EncryptionKeyStoreProviderImpl.cs` (or new subclass), `EncryptionCosmosClient.cs`, `EncryptionSettingForProperty.cs` (prefetch wiring only)
- **Dependencies**: No new packages. Uses existing `IKeyEncryptionKeyResolver.ResolveAsync()` / `IKeyEncryptionKey.UnwrapKeyAsync()` from `Azure.Core.Cryptography`.
- **APIs**: No public API changes. Internal-only.
- **Security**: Plaintext Data Encryption Key bytes are already cached in-process by the Microsoft Data Encryption library's `ProtectedDataEncryptionKey`. New caches hold the same bytes with the same time-to-live in the same process — no new attack surface.
- **Risk**: New caching layers introduce lifecycle complexity (background refresh, disposal, cache coherence on key rotation). Gated by environment variable for safe rollout.
- **Testing**: Unit tests for each cache layer + concurrency. Emulator end-to-end tests with environment variable enabled.
@@ -0,0 +1,65 @@
## ADDED Requirements

### Requirement: Async prefetch of unwrapped Data Encryption Key bytes outside the semaphore
The system SHALL provide a `PrefetchUnwrapKeyAsync` method that calls `ResolveAsync()` + `UnwrapKeyAsync()` asynchronously and stores the result in a `ConcurrentDictionary<string, byte[]>` prefetch cache. This method SHALL be called before semaphore acquisition in `BuildProtectedDataEncryptionKeyAsync`.

#### Scenario: Prefetch warms cache before semaphore
- **WHEN** `BuildEncryptionAlgorithmForSettingAsync` enters the cold path and calls `PrefetchUnwrapKeyAsync` before acquiring the semaphore
- **THEN** `ResolveAsync` and `UnwrapKeyAsync` SHALL execute asynchronously (yielding the thread), and the result SHALL be stored in the prefetch cache

#### Scenario: Sync UnwrapKey reads from prefetch cache
- **WHEN** the Microsoft Data Encryption library's sync `UnwrapKey` is called inside the semaphore and the prefetch cache has a valid entry for the wrapped key
- **THEN** `UnwrapKey` SHALL return the cached bytes immediately without calling `Resolve()` or `UnwrapKey()` on Key Vault

#### Scenario: Sync fallback on prefetch cache miss
- **WHEN** the Microsoft Data Encryption library's sync `UnwrapKey` is called inside the semaphore and the prefetch cache does NOT have an entry (race condition, prefetch failed, or prefetch not called)
- **THEN** `UnwrapKey` SHALL fall through to the existing sync `Resolve()` + `UnwrapKey()` path (identical to current behavior)
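
The three scenarios above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the class name `PrefetchingUnwrapper`, the Base64 cache-key scheme, and the two delegates standing in for `ResolveAsync()` + `UnwrapKeyAsync()` and `Resolve()` + `UnwrapKey()` are all assumptions, and time-to-live handling is omitted for brevity.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of the prefetch cache; only PrefetchUnwrapKeyAsync and
// UnwrapKey are names taken from the spec.
public sealed class PrefetchingUnwrapper
{
    private readonly ConcurrentDictionary<string, byte[]> prefetchCache =
        new ConcurrentDictionary<string, byte[]>();

    // Stand-in for ResolveAsync() + UnwrapKeyAsync() against Key Vault (assumption).
    private readonly Func<byte[], CancellationToken, Task<byte[]>> unwrapAsync;

    // Stand-in for the existing sync Resolve() + UnwrapKey() path (assumption).
    private readonly Func<byte[], byte[]> unwrapSync;

    public PrefetchingUnwrapper(
        Func<byte[], CancellationToken, Task<byte[]>> unwrapAsync,
        Func<byte[], byte[]> unwrapSync)
    {
        this.unwrapAsync = unwrapAsync;
        this.unwrapSync = unwrapSync;
    }

    // Called before semaphore acquisition: yields the thread during I/O.
    public async Task PrefetchUnwrapKeyAsync(byte[] wrappedKey, CancellationToken cancellationToken)
    {
        string cacheKey = Convert.ToBase64String(wrappedKey);
        if (this.prefetchCache.ContainsKey(cacheKey))
        {
            return;
        }

        byte[] plaintext = await this.unwrapAsync(wrappedKey, cancellationToken).ConfigureAwait(false);
        this.prefetchCache[cacheKey] = plaintext;
    }

    // Called by the library inside the semaphore: cache read, or sync fallback on a miss.
    public byte[] UnwrapKey(byte[] wrappedKey)
    {
        string cacheKey = Convert.ToBase64String(wrappedKey);
        if (this.prefetchCache.TryGetValue(cacheKey, out byte[] cached))
        {
            return cached;
        }

        return this.unwrapSync(wrappedKey);
    }
}
```

In this sketch a failed or skipped prefetch simply leaves the cache empty, so the sync path behaves as it does today.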

### Requirement: Concurrent prefetch deduplication
The system SHALL deduplicate concurrent prefetch calls for the same wrapped key so that at most one async Key Vault call per key is in flight at a time.

#### Scenario: Multiple threads prefetch same key
- **WHEN** N threads simultaneously call `PrefetchUnwrapKeyAsync` for the same wrapped key (N can be any number of concurrent callers)
- **THEN** only one `ResolveAsync` + `UnwrapKeyAsync` call SHALL be made to Key Vault; all N threads SHALL await the same `Task`
- **NOTE**: The deduplication guarantee is independent of the number of concurrent callers. Test scenarios SHOULD use a representative concurrency level (e.g. 50) but the invariant holds for any N ≥ 2.
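
One common way to get this guarantee is to store a `Lazy<Task<T>>` per key so that every concurrent caller awaits the same task. The sketch below assumes that approach; the spec does not say which mechanism the implementation uses, and `PrefetchDeduplicator`, `GetOrStartAsync`, and the `unwrapAsync` delegate are hypothetical names.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: deduplicates concurrent async calls per cache key.
public sealed class PrefetchDeduplicator
{
    private readonly ConcurrentDictionary<string, Lazy<Task<byte[]>>> inFlight =
        new ConcurrentDictionary<string, Lazy<Task<byte[]>>>();

    public async Task<byte[]> GetOrStartAsync(string cacheKey, Func<Task<byte[]>> unwrapAsync)
    {
        // GetOrAdd may build several Lazy wrappers under a race, but only the one
        // stored in the dictionary ever has its factory invoked.
        Lazy<Task<byte[]>> lazy = this.inFlight.GetOrAdd(
            cacheKey,
            _ => new Lazy<Task<byte[]>>(unwrapAsync, LazyThreadSafetyMode.ExecutionAndPublication));

        try
        {
            return await lazy.Value.ConfigureAwait(false);
        }
        finally
        {
            // Remove only this exact entry so a later refresh can start a new call.
            ((ICollection<KeyValuePair<string, Lazy<Task<byte[]>>>>)this.inFlight)
                .Remove(new KeyValuePair<string, Lazy<Task<byte[]>>>(cacheKey, lazy));
        }
    }
}
```

Because the entry is removed once the task completes, a failed call is not cached; the next prefetch starts a fresh attempt.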

### Requirement: Proactive background refresh before time-to-live expiry
The system SHALL schedule a background refresh of the prefetch cache entry when the entry is within the refresh window of its time-to-live expiry (20% of the cache time-to-live, capped at 5 minutes), so that the next consumer finds a warm cache.

#### Scenario: Background refresh fires before expiry
- **WHEN** a prefetch cache entry is within the refresh window (20% of time-to-live, max 5 minutes) of expiry and is accessed
- **THEN** the system SHALL initiate a background `Task.Run` that calls `ResolveAsync` + `UnwrapKeyAsync` and updates the cache entry

#### Scenario: Background refresh failure does not crash
- **WHEN** the background refresh call fails (Key Vault down, 429 throttle, network error)
- **THEN** the failure SHALL be caught and logged; the existing cache entry SHALL remain until its time-to-live expires; the sync fallback path SHALL handle the next call
- **NOTE**: The background refresh SHALL NOT retry on failure. Retrying with backoff risks running past the cache entry's time-to-live expiry. Once the entry is gone, concurrent threads find no cache hit and all fall through to the sync `Resolve()` + `UnwrapKey()` path under the semaphore, recreating the thundering herd problem this design prevents. Instead, fail fast: log the failure, keep serving the existing entry until time-to-live expiry, and rely on the next natural prefetch call (on the next cache access) to retry organically. The prefetch path already deduplicates concurrent calls, so retry coordination is built in. No explicit cache invalidation is needed, because time-to-live expiry naturally clears stale entries.
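
The refresh-window rule (20% of time-to-live, capped at 5 minutes) works out as in the sketch below. For a 1-hour time-to-live, 20% is 12 minutes, so the cap applies and the window is 5 minutes. The `RefreshWindow` class and its method names are hypothetical, chosen only for illustration.

```csharp
using System;

// Hypothetical helper for the refresh-window arithmetic described above.
public static class RefreshWindow
{
    public static readonly TimeSpan MaxWindow = TimeSpan.FromMinutes(5);

    // Window = min(20% of time-to-live, 5 minutes).
    public static TimeSpan Compute(TimeSpan timeToLive)
    {
        TimeSpan twentyPercent = TimeSpan.FromTicks(timeToLive.Ticks / 5);
        return twentyPercent < MaxWindow ? twentyPercent : MaxWindow;
    }

    // True when the entry is still valid but close enough to expiry to refresh.
    public static bool ShouldRefresh(DateTime expiresAtUtc, TimeSpan timeToLive, DateTime nowUtc)
    {
        return nowUtc < expiresAtUtc && (expiresAtUtc - nowUtc) <= Compute(timeToLive);
    }
}
```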

### Requirement: Prefetch cache time-to-live matches `ProtectedDataEncryptionKey` cache time-to-live
The prefetch cache entry time-to-live SHALL match the `ProtectedDataEncryptionKey.TimeToLive` value.

#### Scenario: Cache entry expires with `ProtectedDataEncryptionKey`
- **WHEN** the `ProtectedDataEncryptionKey` cache time-to-live (1–2 hours) elapses
- **THEN** the prefetch cache entry for the same key SHALL also be expired, ensuring a fresh Key Vault call on the next cold path

### Requirement: Lifecycle management via CancellationTokenSource
The async prefetch layer SHALL use a `CancellationTokenSource` to cancel in-flight background refresh tasks on disposal.

#### Scenario: Disposal cancels background tasks
- **WHEN** `EncryptionCosmosClient.Dispose()` is called
- **THEN** the `CancellationTokenSource` SHALL be cancelled, all in-flight background refresh tasks SHALL observe cancellation, and the prefetch cache SHALL be cleared

#### Scenario: Double-dispose is safe
- **WHEN** `Dispose()` is called multiple times
- **THEN** the second and subsequent calls SHALL be no-ops (idempotent via `Interlocked.Exchange`)
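
A minimal sketch of the idempotent disposal described in these scenarios, assuming a wrapper class (here called `PrefetchLifecycle`, a hypothetical name) that owns both the `CancellationTokenSource` and the prefetch cache:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical sketch of lifecycle management for the prefetch layer.
public sealed class PrefetchLifecycle : IDisposable
{
    private readonly CancellationTokenSource cancellationTokenSource = new CancellationTokenSource();
    private readonly ConcurrentDictionary<string, byte[]> prefetchCache =
        new ConcurrentDictionary<string, byte[]>();
    private int disposed; // 0 = live, 1 = disposed

    public PrefetchLifecycle()
    {
        // Capture the token up front; reading Token after the source is disposed would throw.
        this.Token = this.cancellationTokenSource.Token;
    }

    // Background refresh tasks observe this token.
    public CancellationToken Token { get; }

    public int Count => this.prefetchCache.Count;

    public void Store(string cacheKey, byte[] plaintext) => this.prefetchCache[cacheKey] = plaintext;

    public void Dispose()
    {
        // Only the first caller sees 0; every later call is a no-op.
        if (Interlocked.Exchange(ref this.disposed, 1) != 0)
        {
            return;
        }

        this.cancellationTokenSource.Cancel();
        this.cancellationTokenSource.Dispose();
        this.prefetchCache.Clear();
    }
}
```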

### Requirement: Prefetch errors do not propagate to callers
The prefetch call in `BuildEncryptionAlgorithmForSettingAsync` SHALL be best-effort.

#### Scenario: Prefetch throws non-cancellation exception
- **WHEN** `PrefetchUnwrapKeyAsync` throws an exception that is not `OperationCanceledException`
- **THEN** the exception SHALL be caught and swallowed; execution SHALL continue to semaphore acquisition and the normal sync path

#### Scenario: Prefetch throws OperationCanceledException
- **WHEN** `PrefetchUnwrapKeyAsync` throws `OperationCanceledException` (caller's token fired)
- **THEN** the exception SHALL propagate to the caller (same as current behavior when `WaitAsync` is cancelled)
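
The two exception scenarios amount to a best-effort wrapper that swallows everything except cancellation. The sketch below is illustrative only: `BestEffortPrefetch.TryPrefetchAsync` is a hypothetical helper, and the `prefetch` delegate stands in for the call to `PrefetchUnwrapKeyAsync`. A real implementation would presumably log the swallowed exception.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of best-effort prefetch error handling.
public static class BestEffortPrefetch
{
    // Returns true if the prefetch completed, false if a non-cancellation error was swallowed.
    public static async Task<bool> TryPrefetchAsync(
        Func<CancellationToken, Task> prefetch,
        CancellationToken cancellationToken)
    {
        try
        {
            await prefetch(cancellationToken).ConfigureAwait(false);
            return true;
        }
        catch (OperationCanceledException)
        {
            // Cancellation propagates to the caller, matching the WaitAsync behavior.
            throw;
        }
        catch (Exception)
        {
            // Any other failure is swallowed; the caller continues to the sync path.
            return false;
        }
    }
}
```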
@@ -0,0 +1,34 @@
## ADDED Requirements

### Requirement: Single environment variable gates all optimization layers
The system SHALL use a single environment variable `AZURE_COSMOS_ENCRYPTION_OPTIMISTIC_DECRYPTION_ENABLED` to enable or disable all caching and prefetch layers for Data Encryption Key resolution (resolved-client cache, async Data Encryption Key prefetch, proactive background Data Encryption Key refresh).

#### Scenario: Environment variable not set — all layers disabled
- **WHEN** `AZURE_COSMOS_ENCRYPTION_OPTIMISTIC_DECRYPTION_ENABLED` is not set in the environment
- **THEN** all behavior SHALL be identical to the current codebase: no resolved-client cache, no async prefetch, no proactive background refresh

#### Scenario: Environment variable set to true — all layers enabled
- **WHEN** `AZURE_COSMOS_ENCRYPTION_OPTIMISTIC_DECRYPTION_ENABLED` is set to `true` (case-insensitive)
- **THEN** all caching and prefetch layers SHALL be active

#### Scenario: Environment variable set to false or invalid — all layers disabled
- **WHEN** `AZURE_COSMOS_ENCRYPTION_OPTIMISTIC_DECRYPTION_ENABLED` is set to `false`, empty, or any value that does not parse as `true`
- **THEN** all behavior SHALL be identical to the env-var-not-set case

### Requirement: Environment variable read at EncryptionCosmosClient construction time
The environment variable SHALL be read once during `EncryptionCosmosClient` construction and the result cached for the client's lifetime. Subsequent changes to the environment variable SHALL NOT affect an already-constructed client.

#### Scenario: Environment variable read once at startup
- **WHEN** `EncryptionCosmosClient` is constructed
- **THEN** the environment variable SHALL be read via the SDK's `ConfigurationManager` pattern (or `Environment.GetEnvironmentVariable`) and the boolean result stored as a readonly field

#### Scenario: Environment variable change after construction has no effect
- **WHEN** the environment variable is changed after `EncryptionCosmosClient` is constructed
- **THEN** the existing client instance SHALL continue using the value read at construction time
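
A rough sketch of the gate, assuming the plain `Environment.GetEnvironmentVariable` option rather than the SDK's `ConfigurationManager`. The `OptimizationGate` class is a hypothetical stand-in for the readonly field on `EncryptionCosmosClient`. `bool.TryParse` is case-insensitive and returns false for null, empty, or unrecognized input, which matches the scenarios above.

```csharp
using System;

// Hypothetical sketch of the environment-variable feature gate.
public sealed class OptimizationGate
{
    public const string VariableName = "AZURE_COSMOS_ENCRYPTION_OPTIMISTIC_DECRYPTION_ENABLED";

    public OptimizationGate()
    {
        // Read once at construction; later changes to the variable are ignored.
        this.Enabled = Parse(Environment.GetEnvironmentVariable(VariableName));
    }

    public bool Enabled { get; }

    // Only a value that parses as "true" (any casing) enables the layers.
    public static bool Parse(string raw)
    {
        return bool.TryParse(raw, out bool value) && value;
    }
}
```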

### Requirement: No public API changes
There SHALL be no new public classes, methods, properties, or parameters exposed. All optimization layers SHALL be entirely internal, activated only by the environment variable.

#### Scenario: Public API surface unchanged
- **WHEN** the encryption package is built with optimizations enabled
- **THEN** the public API contract (as captured in the contracts file) SHALL be identical to the current version
@@ -0,0 +1,30 @@
## ADDED Requirements

### Requirement: Resolved IKeyEncryptionKey cached per Customer Master Key URL
The system SHALL maintain a `ConcurrentDictionary<string, IKeyEncryptionKey>` keyed by `encryptionKeyId` (the Customer Master Key URL) that caches the `CryptographyClient` returned by `IKeyEncryptionKeyResolver.Resolve()`.

#### Scenario: First resolve for a Customer Master Key URL
- **WHEN** `UnwrapKey` (or prefetch) is called for a Customer Master Key URL not yet in the resolved-client cache
- **THEN** the system SHALL call `Resolve(keyId)` (or `ResolveAsync(keyId)` on the async path), store the returned `IKeyEncryptionKey` in the cache, and use it for the `UnwrapKey` call

#### Scenario: Subsequent resolve for the same Customer Master Key URL
- **WHEN** `UnwrapKey` is called for a Customer Master Key URL that is already in the resolved-client cache
- **THEN** the system SHALL skip the `Resolve()` call and use the cached `IKeyEncryptionKey` directly for the `UnwrapKey` call

#### Scenario: Key Vault calls halved on true cache miss
- **WHEN** a `ProtectedDataEncryptionKey` cache miss triggers `UnwrapKey` with a warm resolved-client cache
- **THEN** only one Key Vault HTTP call SHALL be made (`UnwrapKey` POST) instead of two (`Resolve` GET + `UnwrapKey` POST)

### Requirement: No secret material in resolved-client cache
The `IKeyEncryptionKey` object SHALL contain only the Key Vault URL, key name, key version, and HTTP pipeline configuration. It SHALL NOT contain any private key material.

#### Scenario: Cache contents are non-secret
- **WHEN** the resolved-client cache is inspected
- **THEN** each entry SHALL be a `CryptographyClient` (or equivalent) containing only the Customer Master Key URL and auth pipeline reference — no RSA private key bytes

### Requirement: Cache invalidation on Customer Master Key URL change
The resolved-client cache entry SHALL be invalidated when the Customer Master Key URL changes (key rotation where the Client Encryption Key is rewrapped to a different Customer Master Key).

#### Scenario: Customer Master Key URL changes after rewrap
- **WHEN** `ClientEncryptionKeyProperties.EncryptionKeyWrapMetadata.Value` returns a different Customer Master Key URL than the one cached
- **THEN** the system SHALL call `Resolve()` with the new URL and update the cache entry
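
The per-URL cache can be sketched with `ConcurrentDictionary.GetOrAdd`. To keep the example free of package dependencies it is generic over the client type; in the spec the value would be an `IKeyEncryptionKey`, and the `resolve` delegate stands in for `IKeyEncryptionKeyResolver.Resolve()`. `ResolvedClientCache` and `GetOrResolve` are hypothetical names. Because the dictionary is keyed by Customer Master Key URL, a rewrap to a different URL naturally misses the cache and triggers a new resolve, which is one possible reading of the invalidation requirement.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch of the resolved-client cache, keyed by Customer Master Key URL.
public sealed class ResolvedClientCache<TClient>
{
    private readonly ConcurrentDictionary<string, TClient> clients =
        new ConcurrentDictionary<string, TClient>(StringComparer.Ordinal);

    // Stand-in for IKeyEncryptionKeyResolver.Resolve(keyId) (assumption).
    private readonly Func<string, TClient> resolve;

    public ResolvedClientCache(Func<string, TClient> resolve)
    {
        this.resolve = resolve;
    }

    // Returns the cached client, resolving only on the first request for a URL.
    // Under a race GetOrAdd may invoke resolve more than once, but only one result is kept.
    public TClient GetOrResolve(string encryptionKeyId)
    {
        return this.clients.GetOrAdd(encryptionKeyId, this.resolve);
    }
}
```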