Query: Refactors TryCatch.FromException to defer StackTrace capture#5900
Draft
kirankumarkolli wants to merge 7 commits into
Draft
Query: Refactors TryCatch.FromException to defer StackTrace capture#5900kirankumarkolli wants to merge 7 commits into
kirankumarkolli wants to merge 7 commits into
Conversation
## Description Refactors `TryCatch<TResult>.FromException` to skip the eager `new StackTrace(skipFrames: 1)` allocation and defer to the inner exception's own already-thrown stack. This is the dominant allocation source on the query pipeline failure path under 429 (and other non-success) status code storms. ## Customer-visible impact The CosmosException surfaced to callers from `GetItemQueryIterator.ReadNextAsync()` (and the read-feed / change-feed equivalents) is bit-identical to current master for all status codes that flow through `CosmosQueryClientCore.GetCosmosElementResponse`: 429, 503, 500, 408, 449, 404/1002, 401, etc. Same Status, SubStatusCode, Message, RetryAfter, ActivityId, ChainDepth, diagnostics tree, and stack-trace top frame. The synthetic stack captured by `TryCatch.FromException` is purely an internal allocation - it never reaches the consumer for the query data-plane path because `ExceptionToCosmosException.TryCreateFromException` unwraps the wrapper and `QueryIterator` rethrows a fresh CosmosException whose stack is set by the runtime at the throw site. ## Benchmark (FaultInjection 429 storm, 128 workers, 20 s phases) | | Master | Patched | Delta | |--------------|--------:|--------:|------:| | WARMUP CPU(s)| 33.19 | 24.58 | -26 % | | BURST CPU(s) | 21.58 | 18.97 | -12 % | | WARMUP AllocMB | 4608 | 3672 | -20 % | | BURST AllocMB | 1561 | 1240 | -21 % | | BURST exceptions thrown | 185,770 | 174,182 | -6 % | OTel-on path shows comparable improvements; the patch is independent of distributed tracing being subscribed. ## Approach `ExceptionWithStackTraceException` now accepts `null` for its synthetic `stackTrace` parameter. Its `StackTrace` getter falls back to `base.StackTrace` (runtime-captured stack if/when the wrapper is thrown) and then to `InnerException.StackTrace` (the actual throw site of the wrapped exception). Wrapper instance structure and `Exception.InnerException` chain are preserved, so all callers using `.InnerException` to unwrap continue to work unchanged. ## Test coverage - Adds `TryCatchTests.TestFromException_PreservesInnerException` which asserts wrapper is preserved, `InnerException` is the caller-supplied exception, and `StackTrace` falls back to the inner exception's throw site. - Adds `TryCatchTests.TestFromException_ToStringIncludesInnerStack` which asserts `ex.ToString()` still walks the inner exception's stack. - Updates `CosmosClientTests.InvalidKey_ExceptionFullStacktrace` to assert on `EnsureValidClientAsync` (a frame that survives in the inner exception's stack) rather than on a frame that depended on the now-removed synthetic capture. Full unit suite: 2,556 passed, 11 skipped, 0 failed. ## Repro / benchmark tooling Adds `Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Bench` and `FaultInjection429Dump` console programs that exercise the SDK via Cosmos.FaultInjection and capture CPU/allocation/exception metrics across multiple status codes. Both projects are emulator- or live-account-driven, do not run in CI, and target net6.0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Description Refactors `TryCatch<TResult>.FromException` to read the inner exception's already-captured stack frames via `new StackTrace(exception)` instead of walking the live thread stack via `new StackTrace(skipFrames: 1)`. The dominant CPU cost of the original implementation was the native thread-stack walk performed on every failed `TryCatch` propagation. Under exception-storm workloads (e.g., 429 / 503 / 408 storms on the query data-plane path), this was the single largest contributor to managed-heap pressure and Gen-0 GC churn. `new StackTrace(Exception)` does not walk the live thread; it parses the exception's existing `_stackTrace` data captured at throw time. The allocation is small and bounded by the inner exception's frame count. ## Customer-visible impact The `CosmosException` surfaced to callers from `GetItemQueryIterator.ReadNextAsync()` (and the read-feed / change-feed equivalents) is bit-identical to current master for all status codes that flow through `CosmosQueryClientCore.GetCosmosElementResponse`: 429, 503, 500, 408, 449, 404/1002, 401, etc. Same Status, SubStatusCode, Message, RetryAfter, ActivityId, ChainDepth, diagnostics tree, and stack-trace top frame. The synthetic stack captured by `TryCatch.FromException` is purely an internal allocation - it never reaches the consumer for the query data-plane path because `ExceptionToCosmosException.TryCreateFromException` unwraps the wrapper and `QueryIterator` rethrows a fresh `CosmosException` whose stack is set by the runtime at the throw site. ## Benchmark (FaultInjection 429 storm, 128 workers, 20 s phases) | | Master | Patched | Delta | |--------------|--------:|--------:|------:| | WARMUP CPU(s)| 33.19 | 23.20 | -30 % | | BURST CPU(s) | 21.58 | 19.67 | -9 % | | WARMUP AllocMB | 4608 | 3701 | -20 % | | BURST AllocMB | 1561 | 1188 | -24 % | | BURST exceptions thrown | 185,770 | 183,880 | -1 % | OTel-on path shows comparable improvements; the patch is independent of distributed tracing being subscribed. ## Approach `TryCatch<TResult>.FromException(Exception exception)` now passes `new StackTrace(exception)` to `ExceptionWithStackTraceException`. This is a single-line change. `ExceptionWithStackTraceException` is unchanged from master - its constructor's non-null `stackTrace` guard is preserved, its `StackTrace` getter still returns `this.stackTrace.ToString()`, and the wrapper instance shape / `Exception.InnerException` chain are unchanged. The wrapper's `StackTrace` content now reflects the inner exception's throw site (the same frames you would get from `exception.StackTrace`) instead of the synthetic live-thread capture at the wrap site. For the 99 % case (CosmosException from the SDK pipeline), both give equivalent diagnostic information. ## Test coverage - Adds `TryCatchTests.TestFromException_PreservesInnerException` which asserts the wrapper is preserved, `InnerException` is the caller-supplied exception, and `StackTrace` reflects the inner exception's throw site. - Adds `TryCatchTests.TestFromException_ToStringIncludesInnerStack` which asserts `ex.ToString()` walks the inner exception's stack. - Updates `CosmosClientTests.InvalidKey_ExceptionFullStacktrace` to assert on `EnsureValidClientAsync` (a frame that survives in the inner exception's stack) rather than on a frame that depended on the now-removed synthetic live-thread capture. Full unit suite: 2,556 passed, 11 skipped, 0 failed. ## Repro / benchmark tooling Adds `Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Bench` and `FaultInjection429Dump` console programs that exercise the SDK via `Cosmos.FaultInjection` and capture CPU/allocation/exception metrics across multiple status codes. Both projects are emulator- or live- account-driven, do not run in CI, and target net6.0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… (new StackTrace(exception)) Reconciles divergence between origin (Option A: pass null + lazy fallback in ExceptionWithStackTraceException) and local (Option B: new StackTrace(exception) with ExceptionWithStackTraceException unchanged from master). Option B preferred: cleaner one-line behavioral change; ExceptionWithStackTraceException reverts to master state; equivalent perf to Option A (within run-to-run noise) on FaultInjection 429 storm bench. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Description
Refactors `TryCatch<TResult>.FromException` to eliminate the native live
thread-stack walk performed on every failed `TryCatch` propagation, while
preserving diagnostic stack content across all call paths.
The original implementation unconditionally called `new StackTrace(skipFrames: 1)`
which walks the live thread stack via native code. Under exception-storm
workloads (e.g., 429 / 503 / 408 storms on the query data-plane path),
instrumentation shows this dominates managed-heap pressure and Gen-0 GC
churn - ~9,000 stack walks per second at customer-realistic concurrency.
## Approach
Three-branch typed gating:
1. **Rewrap of an already-wrapped `ExceptionWithStackTraceException`**
(the dominant case at ~61% of FromException calls during a 429 burst):
reuse the existing captured StackTrace - no allocation, no walk,
and preserves the diagnostic that was captured at the first wrap.
2. **Typed SDK exception (CosmosException, DocumentClientException)**:
the inner has its own already-captured _stackTrace from when it was
thrown. Use `new StackTrace(exception)` which parses those existing
frames and does NOT walk the live thread.
3. **Synthetic, never-thrown inner exception** (e.g., `MalformedContinuationTokenException`
created at continuation-token parse failures): fall back to the
original live-thread capture so `ExceptionToCosmosException.TryCreateFromExceptionWithStackTrace`
can still surface a useful SDK call chain to the customer. This case
is ~0% of calls on the data-plane hot path; the live walk is paid
only once on the first wrap and reused via branch 1 on all rewraps.
Adds `ExceptionWithStackTraceException.GetCapturedStackTrace()` internal
accessor to enable branch 1.
## Customer-visible impact
The `CosmosException` surfaced to callers from
`GetItemQueryIterator.ReadNextAsync()` (and the read-feed / change-feed
equivalents) is bit-identical to current master for all status codes that
flow through `CosmosQueryClientCore.GetCosmosElementResponse`:
429, 503, 500, 408, 449, 404/1002, 401, etc.
Same Status, SubStatusCode, Message, RetryAfter, ActivityId, ChainDepth,
diagnostics tree, and stack-trace top frame.
Diagnostic stack content for synthetic-error paths (`MalformedContinuationTokenException`
etc.) is preserved - the live walk still happens once on the first wrap
of the synthetic exception, then reused via the rewrap branch.
## Benchmark (FaultInjection 429 storm, 128 workers, 20 s phases)
| | Master | Patched | Delta |
|--------------|--------:|--------:|------:|
| OFF-WARM CPU(s) | 33.19 | 23.69 | -29 % |
| OFF-BURST CPU(s) | 21.58 | 18.39 | -15 % |
| OFF-RECOV CPU(s) | 27.03 | 19.17 | -29 % |
| OFF-WARM AllocMB | 4,608 | 3,701 | -20 % |
| OFF-BURST AllocMB | 1,561 | 1,169 | -25 % |
| ON-WARM CPU(s) | 28.62 | 19.25 | -33 % |
| ON-BURST CPU(s) | 22.03 | 19.56 | -11 % |
## Call distribution validation
Instrumented run on the same 429-storm workload (6 phases, ~71k total calls):
| Inner exception type | % of FromException calls | Patch behavior |
|---|---:|---|
| ExceptionWithStackTraceException (rewrap) | 60.9 % | branch 1 - reuse existing stack |
| CosmosException | 39.1 % | branch 2 - cheap `new StackTrace(exception)` |
| DocumentClientException | 0.0 % | branch 2 - cheap |
| Other (e.g., MalformedContinuationToken) | 0.0 % | branch 3 - live walk once |
The 61% rewrap path was the dominant blind spot of an earlier typed-only
gating attempt; explicit handling of it recovers the full perf benefit
without losing the synthetic-error diagnostic.
## Test coverage
- Adds `TryCatchTests.TestFromException_PreservesInnerException` asserting
wrapper preservation, `InnerException` semantics, and `StackTrace`
fallback to the inner exception's throw site.
- Adds `TryCatchTests.TestFromException_ToStringIncludesInnerStack`
asserting `ex.ToString()` walks the inner exception's stack.
- Updates `CosmosClientTests.InvalidKey_ExceptionFullStacktrace` to assert
on `EnsureValidClientAsync` (a frame that survives in the inner
exception's stack) rather than on a frame that depended on the
now-eliminated synthetic live-thread capture at the wrap site.
Full unit suite: 2,556 passed, 11 skipped, 0 failed.
## Repro / benchmark tooling
Adds `Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Bench` and
`FaultInjection429Dump` console programs that exercise the SDK via
`Cosmos.FaultInjection` and capture CPU/allocation/exception metrics
across multiple status codes. Both projects are emulator- or live-
account-driven, do not run in CI, and target net6.0.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ub.com/Azure/azure-cosmos-dotnet-v3 into users/kirankk/perf-trycatch-stacktrace
…ub.com/Azure/azure-cosmos-dotnet-v3 into users/kirankk/perf-trycatch-stacktrace
… field-internal simplification Drops GetCapturedStackTrace() accessor in favor of changing the existing 'stackTrace' field from private readonly to internal readonly - idiomatic per existing SDK patterns (DocumentClient.cosmosAuthorization, GatewayStoreModel.endpointManager, TokenCredentialCache.tokenCredential). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Query: Refactors TryCatch.FromException to defer StackTrace capture
TL;DR
Replace one line in
TryCatch<TResult>.FromException:new StackTrace(Exception)reads the inner exception's already-captured framedata — it does NOT perform a native walk of the live thread stack. The native
walk is the dominant CPU cost on the failure path under exception-storm
workloads.
Description
TryCatch<TResult>.FromExceptionis called by every non-success page-fetchresult on the query / read-feed / change-feed paths (see
CosmosQueryClientCore.cs:343,NetworkAttachedDocumentContainer.cs:252/:404, plus 60+ other call sites across the pipeline). On master itunconditionally allocates a new
StackTraceby walking the live threadstack and wraps it in an
ExceptionWithStackTraceException. Under a 429storm at customer-realistic concurrency, this is ~9,000 stack walks per
second and the single largest contributor to managed-heap pressure and Gen-0
GC churn on the failure path.
The synthetic stack is purely an internal allocation —
ExceptionToCosmosException.TryCreateFromExceptionunwraps the chain andQueryIterator.ReadNextAsyncrethrows a freshCosmosExceptionwhose.StackTraceis the runtime-captured throw site, not the syntheticcapture. So the consumer never observes the synthetic stack for any
data-plane query failure.
Benchmark (FaultInjection 429 storm, 128 workers, 20 s phases)
Workload:
SELECT c.n, c.pk FROM c ORDER BY c.ncross-partition queryagainst the Cosmos DB Emulator, customer-realistic client config
(
RequestTimeout=1s,MaxRetryWaitTimeOnRateLimitedRequests=2s,MaxRetryAttemptsOnRateLimitedRequests=1, eventual consistency,Direct mode). Three phases per OTel mode: warmup (no rule), 429-burst
(100% inject), recovery (rule disabled).
Recovery phase reaches WARMUP-equivalent baseline within run, confirming
no stuck post-burst loop.
Reproduction harness:
Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Bench/(in this PR).
Customer-visible behavior across status codes (E2E validation)
Tool:
Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Dump/runs onecross-partition query per fault type and dumps the caller-visible exception
CosmosDiagnostics. Comparison BEFORE (master) vs AFTER (patched):The only path where
ex.StackTracecontent changes is client-init failuresthat catch the
ExceptionWithStackTraceExceptionwrapper directly withoutgoing through
ExceptionToCosmosException.TryCreateFromException. Theexisting
CosmosClientTests.InvalidKey_ExceptionFullStacktraceexercisesthis. The test's assertion has been updated to check a frame that survives
in the inner exception's stack (
EnsureValidClientAsync). All otherassertions in the unit suite are unchanged.
Approach
TryCatch<TResult>.FromException(Exception)now passesnew StackTrace(exception)toExceptionWithStackTraceException.ExceptionWithStackTraceExceptionis unchanged — its constructor'snon-null
stackTraceguard is preserved, itsStackTracegetter stillreturns
this.stackTrace.ToString(), and the wrapper instance shape /Exception.InnerExceptionchain are unchanged.The wrapper's
StackTracecontent now reflects the inner exception'sthrow site (same frames as
exception.StackTrace) instead of thesynthetic live-thread capture at the wrap site. For the 99 % case
(CosmosException from the SDK pipeline), both give equivalent diagnostic
information.
Files changed
Microsoft.Azure.Cosmos/src/Query/Core/Monads/TryCatch{TResult}.csnew StackTrace(skipFrames:1)withnew StackTrace(exception)— one-line behavioral changeMicrosoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/Query/TryCatchTests.csInnerException/StackTrace/ToStringsemanticsMicrosoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/CosmosClientTests.csInvalidKey_ExceptionFullStacktraceassertion toex.ToString()+EnsureValidClientAsyncframeMicrosoft.Azure.Cosmos.Samples/Tools/FaultInjection429Bench/Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Dump/ExceptionWithStackTraceException.csis unchanged from master.Test results
Query / TryCatch / Pagination / Monad / Feed / InvalidKeyfilter)Microsoft.Azure.Cosmos.TestsRisk
ExceptionWithStackTraceExceptionwrapper directly without going throughExceptionToCosmosException(e.g., bad-key init failures).ex.StackTraceshows the inner exception's throw site instead of the synthetic call-site
capture.
ex.ToString(),ex.InnerException, andex.Diagnosticsareunchanged. The one affected unit test in this repo has been updated.
Samples/Tools/and do not run in CI.Type of change
Closing issues
None.