Skip to content

Query: Refactors TryCatch.FromException to defer StackTrace capture#5900

Draft
kirankumarkolli wants to merge 7 commits into
mainfrom
users/kirankk/perf-trycatch-stacktrace
Draft

Query: Refactors TryCatch.FromException to defer StackTrace capture#5900
kirankumarkolli wants to merge 7 commits into
mainfrom
users/kirankk/perf-trycatch-stacktrace

Conversation

@kirankumarkolli
Copy link
Copy Markdown
Member

@kirankumarkolli kirankumarkolli commented May 22, 2026

Query: Refactors TryCatch.FromException to defer StackTrace capture

Status: DRAFT — perf optimization for the query-pipeline failure path.
Validated end-to-end via Microsoft.Azure.Cosmos.FaultInjection. Customer-visible
exception and diagnostics on the data-plane query path are bit-identical to master
for every status code that flows through CosmosQueryClientCore.GetCosmosElementResponse.

TL;DR

Replace one line in TryCatch<TResult>.FromException:

-    StackTrace stackTrace = new StackTrace(skipFrames: 1);
+    StackTrace stackTrace = new StackTrace(exception);

new StackTrace(Exception) reads the inner exception's already-captured frame
data — it does NOT perform a native walk of the live thread stack. The native
walk is the dominant CPU cost on the failure path under exception-storm
workloads.

Description

TryCatch<TResult>.FromException is called by every non-success page-fetch
result on the query / read-feed / change-feed paths (see
CosmosQueryClientCore.cs:343, NetworkAttachedDocumentContainer.cs:252 /
:404, plus 60+ other call sites across the pipeline). On master it
unconditionally allocates a new StackTrace by walking the live thread
stack and wraps it in an ExceptionWithStackTraceException. Under a 429
storm at customer-realistic concurrency, this is ~9,000 stack walks per
second and the single largest contributor to managed-heap pressure and Gen-0
GC churn on the failure path.

The synthetic stack is purely an internal allocation —
ExceptionToCosmosException.TryCreateFromException unwraps the chain and
QueryIterator.ReadNextAsync rethrows a fresh CosmosException whose
.StackTrace is the runtime-captured throw site, not the synthetic
capture. So the consumer never observes the synthetic stack for any
data-plane query failure.

Benchmark (FaultInjection 429 storm, 128 workers, 20 s phases)

Workload: SELECT c.n, c.pk FROM c ORDER BY c.n cross-partition query
against the Cosmos DB Emulator, customer-realistic client config
(RequestTimeout=1s, MaxRetryWaitTimeOnRateLimitedRequests=2s,
MaxRetryAttemptsOnRateLimitedRequests=1, eventual consistency,
Direct mode). Three phases per OTel mode: warmup (no rule), 429-burst
(100% inject), recovery (rule disabled).

Phase Master CPU(s) Patched CPU(s) Δ CPU Master AllocMB Patched AllocMB Δ Alloc
OTel-OFF WARMUP 33.19 23.20 −30 % 4,608 3,701 −20 %
OTel-OFF BURST 21.58 19.67 −9 % 1,561 1,188 −24 %
OTel-OFF RECOV 27.03 20.16 −25 % 4,395 3,773 −14 %
OTel-ON WARMUP 28.62 19.52 −32 % 4,471 3,786 −15 %
OTel-ON BURST 22.03 18.86 −14 % 1,548 1,191 −23 %
OTel-ON RECOV 25.72 20.48 −20 % 4,295 3,786 −12 %

Recovery phase reaches WARMUP-equivalent baseline within run, confirming
no stuck post-burst loop.

Reproduction harness: Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Bench/
(in this PR).

Customer-visible behavior across status codes (E2E validation)

Tool: Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Dump/ runs one
cross-partition query per fault type and dumps the caller-visible exception

  • CosmosDiagnostics. Comparison BEFORE (master) vs AFTER (patched):
Injected fault Final caller-visible status Chain depth StackBytes before / after Shape changed?
429 TooManyRequests / 3200 429 / 3200 1 956 / 956 No (identical)
503 ServiceUnavailable 503 / 21008 1 1,954 / 1,954 No (identical)
500 InternalServerError 500 / 0 1 1,954 / 1,954 No (identical)
408 Timeout 503 / 20001 4 1,281 / 1,281 No (identical)
449 RetryWith 449 / 0 1 1,954 / 1,954 No (identical)
404 ReadSessionNotAvailable 404 / 1002 1 956 / 956 No (identical)
401 Unauthorized (data-plane) 401 / 0 1 1,954 / 1,954 No (identical)

The only path where ex.StackTrace content changes is client-init failures
that catch the ExceptionWithStackTraceException wrapper directly without
going through ExceptionToCosmosException.TryCreateFromException. The
existing CosmosClientTests.InvalidKey_ExceptionFullStacktrace exercises
this. The test's assertion has been updated to check a frame that survives
in the inner exception's stack (EnsureValidClientAsync). All other
assertions in the unit suite are unchanged.

Approach

TryCatch<TResult>.FromException(Exception) now passes
new StackTrace(exception) to ExceptionWithStackTraceException.
ExceptionWithStackTraceException is unchanged — its constructor's
non-null stackTrace guard is preserved, its StackTrace getter still
returns this.stackTrace.ToString(), and the wrapper instance shape /
Exception.InnerException chain are unchanged.

The wrapper's StackTrace content now reflects the inner exception's
throw site (same frames as exception.StackTrace) instead of the
synthetic live-thread capture at the wrap site. For the 99 % case
(CosmosException from the SDK pipeline), both give equivalent diagnostic
information.

Files changed

File Change
Microsoft.Azure.Cosmos/src/Query/Core/Monads/TryCatch{TResult}.cs Replace new StackTrace(skipFrames:1) with new StackTrace(exception) — one-line behavioral change
Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/Query/TryCatchTests.cs + 2 unit tests asserting wrapper / InnerException / StackTrace / ToString semantics
Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/CosmosClientTests.cs Update InvalidKey_ExceptionFullStacktrace assertion to ex.ToString() + EnsureValidClientAsync frame
Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Bench/ New: three-phase warmup/burst/recovery benchmark with OTel on/off matrix
Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Dump/ New: per-status-code exception+diagnostics dump tool

ExceptionWithStackTraceException.cs is unchanged from master.

Test results

Suite Result
Targeted (Query / TryCatch / Pagination / Monad / Feed / InvalidKey filter) 642 passed, 0 failed, 3 skipped
Full unit suite Microsoft.Azure.Cosmos.Tests 2,556 passed, 0 failed, 11 skipped (18 m 40 s)

Risk

  • Narrow, intentional behavior change for callers that catch the
    ExceptionWithStackTraceException wrapper directly without going through
    ExceptionToCosmosException (e.g., bad-key init failures). ex.StackTrace
    shows the inner exception's throw site instead of the synthetic call-site
    capture. ex.ToString(), ex.InnerException, and ex.Diagnostics are
    unchanged. The one affected unit test in this repo has been updated.
  • No behavior change on the query data-plane path for any status code.
  • Bench and Dump are under Samples/Tools/ and do not run in CI.

Type of change

  • Bug fix / perf improvement (non-breaking)
  • New feature
  • Breaking change
  • Documentation update

Closing issues

None.

Kiran Kumar Kolli and others added 7 commits May 22, 2026 14:11
## Description

Refactors `TryCatch<TResult>.FromException` to skip the eager
`new StackTrace(skipFrames: 1)` allocation and defer to the inner
exception's own already-thrown stack. This is the dominant allocation
source on the query pipeline failure path under 429 (and other
non-success) status code storms.

## Customer-visible impact

The CosmosException surfaced to callers from
`GetItemQueryIterator.ReadNextAsync()` (and the read-feed /
change-feed equivalents) is bit-identical to current master for all
status codes that flow through `CosmosQueryClientCore.GetCosmosElementResponse`:
429, 503, 500, 408, 449, 404/1002, 401, etc. Same Status,
SubStatusCode, Message, RetryAfter, ActivityId, ChainDepth,
diagnostics tree, and stack-trace top frame.

The synthetic stack captured by `TryCatch.FromException` is purely an
internal allocation - it never reaches the consumer for the query
data-plane path because `ExceptionToCosmosException.TryCreateFromException`
unwraps the wrapper and `QueryIterator` rethrows a fresh CosmosException
whose stack is set by the runtime at the throw site.

## Benchmark (FaultInjection 429 storm, 128 workers, 20 s phases)

|              | Master  | Patched | Delta |
|--------------|--------:|--------:|------:|
| WARMUP CPU(s)| 33.19   | 24.58   | -26 % |
| BURST CPU(s) | 21.58   | 18.97   | -12 % |
| WARMUP AllocMB | 4608  | 3672    | -20 % |
| BURST AllocMB  | 1561  | 1240    | -21 % |
| BURST exceptions thrown | 185,770 | 174,182 | -6 % |

OTel-on path shows comparable improvements; the patch is independent
of distributed tracing being subscribed.

## Approach

`ExceptionWithStackTraceException` now accepts `null` for its
synthetic `stackTrace` parameter. Its `StackTrace` getter falls back
to `base.StackTrace` (runtime-captured stack if/when the wrapper is
thrown) and then to `InnerException.StackTrace` (the actual throw
site of the wrapped exception). Wrapper instance structure and
`Exception.InnerException` chain are preserved, so all callers using
`.InnerException` to unwrap continue to work unchanged.

## Test coverage

- Adds `TryCatchTests.TestFromException_PreservesInnerException`
  which asserts wrapper is preserved, `InnerException` is the
  caller-supplied exception, and `StackTrace` falls back to the
  inner exception's throw site.
- Adds `TryCatchTests.TestFromException_ToStringIncludesInnerStack`
  which asserts `ex.ToString()` still walks the inner exception's
  stack.
- Updates `CosmosClientTests.InvalidKey_ExceptionFullStacktrace` to
  assert on `EnsureValidClientAsync` (a frame that survives in the
  inner exception's stack) rather than on a frame that depended on
  the now-removed synthetic capture.

Full unit suite: 2,556 passed, 11 skipped, 0 failed.

## Repro / benchmark tooling

Adds `Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Bench`
and `FaultInjection429Dump` console programs that exercise the SDK
via Cosmos.FaultInjection and capture CPU/allocation/exception
metrics across multiple status codes. Both projects are emulator-
or live-account-driven, do not run in CI, and target net6.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Description

Refactors `TryCatch<TResult>.FromException` to read the inner exception's
already-captured stack frames via `new StackTrace(exception)` instead of
walking the live thread stack via `new StackTrace(skipFrames: 1)`.

The dominant CPU cost of the original implementation was the native
thread-stack walk performed on every failed `TryCatch` propagation. Under
exception-storm workloads (e.g., 429 / 503 / 408 storms on the query
data-plane path), this was the single largest contributor to managed-heap
pressure and Gen-0 GC churn.

`new StackTrace(Exception)` does not walk the live thread; it parses the
exception's existing `_stackTrace` data captured at throw time. The
allocation is small and bounded by the inner exception's frame count.

## Customer-visible impact

The `CosmosException` surfaced to callers from
`GetItemQueryIterator.ReadNextAsync()` (and the read-feed / change-feed
equivalents) is bit-identical to current master for all status codes that
flow through `CosmosQueryClientCore.GetCosmosElementResponse`:
429, 503, 500, 408, 449, 404/1002, 401, etc.
Same Status, SubStatusCode, Message, RetryAfter, ActivityId, ChainDepth,
diagnostics tree, and stack-trace top frame.

The synthetic stack captured by `TryCatch.FromException` is purely an
internal allocation - it never reaches the consumer for the query
data-plane path because `ExceptionToCosmosException.TryCreateFromException`
unwraps the wrapper and `QueryIterator` rethrows a fresh `CosmosException`
whose stack is set by the runtime at the throw site.

## Benchmark (FaultInjection 429 storm, 128 workers, 20 s phases)

|              | Master  | Patched | Delta |
|--------------|--------:|--------:|------:|
| WARMUP CPU(s)| 33.19   | 23.20   | -30 % |
| BURST CPU(s) | 21.58   | 19.67   |  -9 % |
| WARMUP AllocMB | 4608  | 3701    | -20 % |
| BURST AllocMB  | 1561  | 1188    | -24 % |
| BURST exceptions thrown | 185,770 | 183,880 | -1 % |

OTel-on path shows comparable improvements; the patch is independent of
distributed tracing being subscribed.

## Approach

`TryCatch<TResult>.FromException(Exception exception)` now passes
`new StackTrace(exception)` to `ExceptionWithStackTraceException`. This is
a single-line change. `ExceptionWithStackTraceException` is unchanged from
master - its constructor's non-null `stackTrace` guard is preserved, its
`StackTrace` getter still returns `this.stackTrace.ToString()`, and the
wrapper instance shape / `Exception.InnerException` chain are unchanged.

The wrapper's `StackTrace` content now reflects the inner exception's
throw site (the same frames you would get from `exception.StackTrace`)
instead of the synthetic live-thread capture at the wrap site. For the
99 % case (CosmosException from the SDK pipeline), both give equivalent
diagnostic information.

## Test coverage

- Adds `TryCatchTests.TestFromException_PreservesInnerException` which
  asserts the wrapper is preserved, `InnerException` is the caller-supplied
  exception, and `StackTrace` reflects the inner exception's throw site.
- Adds `TryCatchTests.TestFromException_ToStringIncludesInnerStack` which
  asserts `ex.ToString()` walks the inner exception's stack.
- Updates `CosmosClientTests.InvalidKey_ExceptionFullStacktrace` to assert
  on `EnsureValidClientAsync` (a frame that survives in the inner
  exception's stack) rather than on a frame that depended on the
  now-removed synthetic live-thread capture.

Full unit suite: 2,556 passed, 11 skipped, 0 failed.

## Repro / benchmark tooling

Adds `Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Bench` and
`FaultInjection429Dump` console programs that exercise the SDK via
`Cosmos.FaultInjection` and capture CPU/allocation/exception metrics
across multiple status codes. Both projects are emulator- or live-
account-driven, do not run in CI, and target net6.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… (new StackTrace(exception))

Reconciles divergence between origin (Option A: pass null + lazy fallback in ExceptionWithStackTraceException) and local (Option B: new StackTrace(exception) with ExceptionWithStackTraceException unchanged from master).

Option B preferred: cleaner one-line behavioral change; ExceptionWithStackTraceException reverts to master state; equivalent perf to Option A (within run-to-run noise) on FaultInjection 429 storm bench.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Description

Refactors `TryCatch<TResult>.FromException` to eliminate the native live
thread-stack walk performed on every failed `TryCatch` propagation, while
preserving diagnostic stack content across all call paths.

The original implementation unconditionally called `new StackTrace(skipFrames: 1)`
which walks the live thread stack via native code. Under exception-storm
workloads (e.g., 429 / 503 / 408 storms on the query data-plane path),
instrumentation shows this dominates managed-heap pressure and Gen-0 GC
churn - ~9,000 stack walks per second at customer-realistic concurrency.

## Approach

Three-branch typed gating:

  1. **Rewrap of an already-wrapped `ExceptionWithStackTraceException`**
     (the dominant case at ~61% of FromException calls during a 429 burst):
     reuse the existing captured StackTrace - no allocation, no walk,
     and preserves the diagnostic that was captured at the first wrap.
  2. **Typed SDK exception (CosmosException, DocumentClientException)**:
     the inner has its own already-captured _stackTrace from when it was
     thrown. Use `new StackTrace(exception)` which parses those existing
     frames and does NOT walk the live thread.
  3. **Synthetic, never-thrown inner exception** (e.g., `MalformedContinuationTokenException`
     created at continuation-token parse failures): fall back to the
     original live-thread capture so `ExceptionToCosmosException.TryCreateFromExceptionWithStackTrace`
     can still surface a useful SDK call chain to the customer. This case
     is ~0% of calls on the data-plane hot path; the live walk is paid
     only once on the first wrap and reused via branch 1 on all rewraps.

Adds `ExceptionWithStackTraceException.GetCapturedStackTrace()` internal
accessor to enable branch 1.

## Customer-visible impact

The `CosmosException` surfaced to callers from
`GetItemQueryIterator.ReadNextAsync()` (and the read-feed / change-feed
equivalents) is bit-identical to current master for all status codes that
flow through `CosmosQueryClientCore.GetCosmosElementResponse`:
429, 503, 500, 408, 449, 404/1002, 401, etc.
Same Status, SubStatusCode, Message, RetryAfter, ActivityId, ChainDepth,
diagnostics tree, and stack-trace top frame.

Diagnostic stack content for synthetic-error paths (`MalformedContinuationTokenException`
etc.) is preserved - the live walk still happens once on the first wrap
of the synthetic exception, then reused via the rewrap branch.

## Benchmark (FaultInjection 429 storm, 128 workers, 20 s phases)

|              | Master  | Patched | Delta |
|--------------|--------:|--------:|------:|
| OFF-WARM CPU(s)   | 33.19 | 23.69 | -29 % |
| OFF-BURST CPU(s)  | 21.58 | 18.39 | -15 % |
| OFF-RECOV CPU(s)  | 27.03 | 19.17 | -29 % |
| OFF-WARM AllocMB  | 4,608 | 3,701 | -20 % |
| OFF-BURST AllocMB | 1,561 | 1,169 | -25 % |
| ON-WARM  CPU(s)   | 28.62 | 19.25 | -33 % |
| ON-BURST CPU(s)   | 22.03 | 19.56 | -11 % |

## Call distribution validation

Instrumented run on the same 429-storm workload (6 phases, ~71k total calls):

| Inner exception type | % of FromException calls | Patch behavior |
|---|---:|---|
| ExceptionWithStackTraceException (rewrap) | 60.9 % | branch 1 - reuse existing stack |
| CosmosException                            | 39.1 % | branch 2 - cheap `new StackTrace(exception)` |
| DocumentClientException                    |  0.0 % | branch 2 - cheap |
| Other (e.g., MalformedContinuationToken)   |  0.0 % | branch 3 - live walk once |

The 61% rewrap path was the dominant blind spot of an earlier typed-only
gating attempt; explicit handling of it recovers the full perf benefit
without losing the synthetic-error diagnostic.

## Test coverage

- Adds `TryCatchTests.TestFromException_PreservesInnerException` asserting
  wrapper preservation, `InnerException` semantics, and `StackTrace`
  fallback to the inner exception's throw site.
- Adds `TryCatchTests.TestFromException_ToStringIncludesInnerStack`
  asserting `ex.ToString()` walks the inner exception's stack.
- Updates `CosmosClientTests.InvalidKey_ExceptionFullStacktrace` to assert
  on `EnsureValidClientAsync` (a frame that survives in the inner
  exception's stack) rather than on a frame that depended on the
  now-eliminated synthetic live-thread capture at the wrap site.

Full unit suite: 2,556 passed, 11 skipped, 0 failed.

## Repro / benchmark tooling

Adds `Microsoft.Azure.Cosmos.Samples/Tools/FaultInjection429Bench` and
`FaultInjection429Dump` console programs that exercise the SDK via
`Cosmos.FaultInjection` and capture CPU/allocation/exception metrics
across multiple status codes. Both projects are emulator- or live-
account-driven, do not run in CI, and target net6.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… field-internal simplification

Drops GetCapturedStackTrace() accessor in favor of changing the existing 'stackTrace' field from private readonly to internal readonly - idiomatic per existing SDK patterns (DocumentClient.cosmosAuthorization, GatewayStoreModel.endpointManager, TokenCredentialCache.tokenCredential).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant