Customer-side encryption operations throw `OperationCanceledException` under concurrent load because a global `SemaphoreSlim(1,1)` in `BuildProtectedDataEncryptionKeyAsync` serializes construction of `ProtectedDataEncryptionKey` objects, the resolved, unwrapped encryption keys needed for every encrypt/decrypt operation. The semaphore guards `KeyEncryptionKey.GetOrCreate` and `ProtectedDataEncryptionKey.GetOrCreate`, the Microsoft Data Encryption library's static get-or-create cache operations. On a cache miss, these invoke the `createItem` delegate, which triggers synchronous Key Vault HTTP calls (`Resolve` + `UnwrapKey`). The semaphore ensures that only one thread at a time performs the expensive key creation, which prevents duplicate Key Vault calls for the same key. However, it also means that every encrypted leaf value in every document on every page contends on this single-permit lock, even when the cache is warm and the hold time would be microseconds. The root cause: `ProtectedDataEncryptionKey` resolution is synchronous, happens on the hot path under the semaphore, and makes two blocking HTTP calls to Key Vault on a cache miss. An internal customer is actively impacted.
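The contention described above can be reproduced in isolation. The following is a minimal sketch, not the SDK's code: it assumes each caller waits on the gate with a cancellation token that fires after a timeout, and the names `SemaphoreContentionSketch`, `BuildKeyAsync`, and `CountCanceledAsync` are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: a single-permit gate standing in for the global SemaphoreSlim(1,1).
public static class SemaphoreContentionSketch
{
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(1, 1);

    // Simulates one key build: queue on the gate, then hold it for holdTime.
    public static async Task BuildKeyAsync(TimeSpan holdTime, CancellationToken cancellationToken)
    {
        // Throws OperationCanceledException if the token fires while the caller is still queued.
        await Gate.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            // Stand-in for the work done under the lock (cache read or Key Vault round trips).
            await Task.Delay(holdTime).ConfigureAwait(false);
        }
        finally
        {
            Gate.Release();
        }
    }

    // Starts `callers` concurrent builds, each with its own timeout, and counts cancellations.
    public static async Task<int> CountCanceledAsync(int callers, TimeSpan holdTime, TimeSpan timeout)
    {
        var attempts = new Task<bool>[callers];
        for (int i = 0; i < callers; i++)
        {
            attempts[i] = WasCanceledAsync(holdTime, timeout);
        }

        int canceled = 0;
        foreach (bool wasCanceled in await Task.WhenAll(attempts).ConfigureAwait(false))
        {
            if (wasCanceled)
            {
                canceled++;
            }
        }

        return canceled;
    }

    private static async Task<bool> WasCanceledAsync(TimeSpan holdTime, TimeSpan timeout)
    {
        using (var cts = new CancellationTokenSource(timeout))
        {
            try
            {
                await BuildKeyAsync(holdTime, cts.Token).ConfigureAwait(false);
                return false;
            }
            catch (OperationCanceledException)
            {
                return true;
            }
        }
    }
}
```

With a hold time longer than the timeout, callers queued behind the first one should be canceled before they acquire the gate. With a zero hold time, which approximates a warm cache, none should be.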
- Async prefetch of `ProtectedDataEncryptionKey` resolution outside the semaphore via `ResolveAsync()` + `UnwrapKeyAsync()` before semaphore acquisition. The sync `UnwrapKey` inside the semaphore cannot be removed: the Microsoft Data Encryption library's `ProtectedDataEncryptionKey.GetOrCreate` constructor chain calls it unconditionally. Instead, the prefetch populates a cache so that when the Microsoft Data Encryption library's sync `UnwrapKey` fires, it returns cached bytes instantly instead of making HTTP calls to Key Vault. The semaphore is still acquired, but held for microseconds (cache read) instead of 200 ms to 2.4 s (HTTP I/O).
- Proactive background refresh of the prefetched unwrapped Data Encryption Key bytes (the plaintext AES-256 key material returned by Key Vault's `UnwrapKey`) approximately 5 minutes before time-to-live expiry, deduplicated to one Azure Key Vault call per key per interval. This prevents a thundering herd at the time-to-live boundary.
- Cache the resolved `IKeyEncryptionKey` (`CryptographyClient`) per Customer Master Key URL so each refresh makes one Azure Key Vault call (`UnwrapKey`) instead of two (`Resolve` + `UnwrapKey`).
- All changes gated behind an opt-in environment variable (`AZURE_COSMOS_ENCRYPTION_OPTIMISTIC_DECRYPTION_ENABLED`), off by default. No public API changes. No breaking changes.
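The prefetch idea can be sketched as a small cache with an async fill path and a sync read path. This is an assumption-laden illustration rather than the proposed implementation: `UnwrappedKeyPrefetchCache`, `PrefetchAsync`, and `GetOrUnwrap` are hypothetical names, and the `unwrapAsync` delegate stands in for the `ResolveAsync()` + `UnwrapKeyAsync()` pair.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical prefetch cache for unwrapped key bytes; not the SDK's actual type.
public sealed class UnwrappedKeyPrefetchCache
{
    private sealed class Entry
    {
        public Entry(byte[] keyBytes, DateTimeOffset expiresAt)
        {
            KeyBytes = keyBytes;
            ExpiresAt = expiresAt;
        }

        public byte[] KeyBytes { get; }

        public DateTimeOffset ExpiresAt { get; }
    }

    private readonly ConcurrentDictionary<string, Entry> entries =
        new ConcurrentDictionary<string, Entry>(StringComparer.Ordinal);
    private readonly ConcurrentDictionary<string, Lazy<Task>> inFlight =
        new ConcurrentDictionary<string, Lazy<Task>>(StringComparer.Ordinal);
    private readonly Func<string, Task<byte[]>> unwrapAsync;
    private readonly TimeSpan timeToLive;

    public UnwrappedKeyPrefetchCache(Func<string, Task<byte[]>> unwrapAsync, TimeSpan timeToLive)
    {
        this.unwrapAsync = unwrapAsync ?? throw new ArgumentNullException(nameof(unwrapAsync));
        this.timeToLive = timeToLive;
    }

    // Async path, intended to run before the semaphore is acquired.
    public Task PrefetchAsync(string cacheKey)
    {
        if (TryGet(cacheKey, out _))
        {
            return Task.CompletedTask;
        }

        // Lazy<Task> lets concurrent callers for the same key share one remote call.
        return inFlight.GetOrAdd(cacheKey, key => new Lazy<Task>(() => FetchAsync(key))).Value;
    }

    // Sync path, intended to run under the semaphore: a cache read on a hit,
    // the blocking call only on a miss.
    public byte[] GetOrUnwrap(string cacheKey, Func<byte[]> blockingUnwrap)
    {
        return TryGet(cacheKey, out byte[] keyBytes) ? keyBytes : blockingUnwrap();
    }

    private bool TryGet(string cacheKey, out byte[] keyBytes)
    {
        if (entries.TryGetValue(cacheKey, out Entry entry) && entry.ExpiresAt > DateTimeOffset.UtcNow)
        {
            keyBytes = entry.KeyBytes;
            return true;
        }

        keyBytes = null;
        return false;
    }

    private async Task FetchAsync(string cacheKey)
    {
        try
        {
            byte[] keyBytes = await unwrapAsync(cacheKey).ConfigureAwait(false);
            entries[cacheKey] = new Entry(keyBytes, DateTimeOffset.UtcNow + timeToLive);
        }
        finally
        {
            // Drop the in-flight marker so a later miss (or a failure) can retry.
            inFlight.TryRemove(cacheKey, out _);
        }
    }
}
```

In this sketch a failed prefetch surfaces as a faulted task. A caller would likely swallow that failure and let the sync path fall back to the blocking call, which keeps the existing behavior as the worst case.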
- `async-dek-prefetch`: Async prefetch of `ProtectedDataEncryptionKey` resolution outside the semaphore using `ResolveAsync()` + `UnwrapKeyAsync()`, with a `ConcurrentDictionary` prefetch cache that the sync `UnwrapKey` reads from. Includes proactive background refresh before time-to-live expiry and lifecycle management via `CancellationTokenSource`.
- `resolved-client-cache`: Cache the `IKeyEncryptionKey` (`CryptographyClient`) returned by `Resolve()` per Customer Master Key URL to eliminate redundant Key Vault HTTP GETs on each refresh.
- `env-var-feature-gate`: Environment variable gate (`AZURE_COSMOS_ENCRYPTION_OPTIMISTIC_DECRYPTION_ENABLED`) to opt in to all caching/prefetch layers. Off by default. Follows the existing SDK `ConfigurationManager` pattern.
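The feature gate and the resolved-client cache might look roughly like the sketch below. Both types are hypothetical. The gate assumes the variable holds a boolean string such as `true`, which the SDK's own `ConfigurationManager` may parse differently, and `TClient` stands in for `IKeyEncryptionKey` so the example needs no Azure packages.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical gate; the SDK's configuration helper may accept other values.
public static class OptimisticDecryptionGate
{
    public const string VariableName = "AZURE_COSMOS_ENCRYPTION_OPTIMISTIC_DECRYPTION_ENABLED";

    // Off by default: enabled only when the variable parses as boolean true.
    public static bool IsEnabled()
    {
        return bool.TryParse(Environment.GetEnvironmentVariable(VariableName), out bool enabled) && enabled;
    }
}

// Hypothetical per-URL cache of resolved key clients (TClient stands in for IKeyEncryptionKey).
public sealed class ResolvedKeyClientCache<TClient>
{
    private readonly ConcurrentDictionary<string, Lazy<Task<TClient>>> clients =
        new ConcurrentDictionary<string, Lazy<Task<TClient>>>(StringComparer.Ordinal);
    private readonly Func<string, Task<TClient>> resolveAsync;

    public ResolvedKeyClientCache(Func<string, Task<TClient>> resolveAsync)
    {
        this.resolveAsync = resolveAsync ?? throw new ArgumentNullException(nameof(resolveAsync));
    }

    // Resolves each master key URL at most once while the cached entry is healthy.
    public async Task<TClient> GetOrResolveAsync(string masterKeyUrl)
    {
        Lazy<Task<TClient>> lazy = clients.GetOrAdd(
            masterKeyUrl,
            url => new Lazy<Task<TClient>>(() => resolveAsync(url)));
        try
        {
            return await lazy.Value.ConfigureAwait(false);
        }
        catch
        {
            // Do not pin a failed resolve; let the next caller retry.
            clients.TryRemove(masterKeyUrl, out _);
            throw;
        }
    }
}
```

A caller would check `OptimisticDecryptionGate.IsEnabled()` once at startup and only construct the cache when it returns true, so the default path stays unchanged.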
- Files: `EncryptionKeyStoreProviderImpl.cs` (or new subclass), `EncryptionCosmosClient.cs`, `EncryptionSettingForProperty.cs` (prefetch wiring only)
- Dependencies: No new packages. Uses existing `IKeyEncryptionKeyResolver.ResolveAsync()` / `IKeyEncryptionKey.UnwrapKeyAsync()` from `Azure.Core.Cryptography`.
- APIs: No public API changes. Internal-only.
- Security: Plaintext Data Encryption Key bytes are already cached in-process by the Microsoft Data Encryption library's `ProtectedDataEncryptionKey`. New caches hold the same bytes with the same time-to-live in the same process, so they add no new attack surface.
- Risk: New caching layers introduce lifecycle complexity (background refresh, disposal, cache coherence on key rotation). Gated by environment variable for safe rollout.
- Testing: Unit tests for each cache layer + concurrency. Emulator end-to-end tests with environment variable enabled.