AsyncCache: Adds support for stack trace optimization during exceptions for AsyncCache and AsyncCacheNonblocking #5069

ananth7592 · 2025-03-17T23:58:55Z

Pull Request Template

Description

This PR addresses the issue of shared exception objects in the AsyncCache class, which was recently causing significant performance degradation and memory bloat for OpenAI.

The fix involves creating shallow copies of known offenders and rethrowing the rest. This approach ensures that each AsyncCache.GetAsync call receives its own exception instance, preventing the dangerous growth of the exception object's StackTrace and reducing GC overhead.

All exceptions that implement ICloneable will benefit from this.

https://learn.microsoft.com/en-us/dotnet/api/system.icloneable?view=net-9.0

Key changes:

- Added support to post-process the exception in the two cache classes mentioned above: a shallow copy of specific exception types is created and thrown to prevent stack trace proliferation, while all other exceptions are propagated unchanged.
- Added support for any exception that implements ICloneable, plus TimeoutException and TaskCanceledException, in this drop.
- Improved exception handling to avoid sharing exception objects across threads.
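
The copy-then-throw approach described above could look roughly like the sketch below. The `TryClone` name echoes a commit message in this PR, but the class name, the signature, and the choice to carry over only `Message` and `InnerException` are assumptions made for illustration, not the code that was merged.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper; names and signature are illustrative only.
public static class ExceptionCloneSketch
{
    // Returns true and a fresh exception instance for known types,
    // or false when the caller should rethrow the original unchanged.
    public static bool TryClone(Exception original, out Exception clone)
    {
        clone = original switch
        {
            // Exceptions that opt in through ICloneable supply their own copy.
            ICloneable cloneable => cloneable.Clone() as Exception,
            // TaskCanceledException derives from OperationCanceledException,
            // so it has to be matched first.
            TaskCanceledException tce =>
                new TaskCanceledException(tce.Message, tce.InnerException),
            OperationCanceledException oce =>
                new OperationCanceledException(oce.Message, oce.InnerException, oce.CancellationToken),
            TimeoutException te =>
                new TimeoutException(te.Message, te.InnerException),
            // Unknown types are not copied.
            _ => null,
        };

        return clone != null;
    }
}
```

A caller would throw the clone when `TryClone` returns true and rethrow the original exception otherwise, so each awaiting caller gets its own instance for the supported types.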

The feature is enabled by default. To turn it off, set EnableAsyncCacheExceptionNoSharing in CosmosClientOptions to false.

Proof of testing:

Repro Steps
To reproduce the issue, I followed these high-level steps:

Create a Cosmos client with settings that force gateway calls for address refresh to fail with a specific exception.
Create a TaskCompletionSource that waits until all the threads are created.
Initiate calls to createItem that will eventually block, with one of them erroring out with the pre-set exception.
Observe the exception storm as the same exception trace is copied over to all the threads.
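
The gating pattern in these steps can be sketched as below. This is a simplified stand-in rather than the actual repro: a pre-faulted `Task` takes the place of the Cosmos client and the failing address refresh, and all names are hypothetical. It only demonstrates that every awaiter of one faulted task observes the same exception instance, which is the sharing this PR sets out to avoid.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical repro skeleton; not the test used in this PR.
public static class SharedExceptionSketch
{
    // Returns how many callers observed the very same exception object.
    public static async Task<int> CountCallersSharingExceptionAsync(int callers)
    {
        // Gate that holds every caller until all of them have been queued.
        var gate = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        // Assumption: stands in for a cached lookup that failed once.
        var shared = new TimeoutException("simulated failure");
        Task<int> cachedTask = Task.FromException<int>(shared);

        Task<bool>[] workers = Enumerable.Range(0, callers)
            .Select(_ => Task.Run(async () =>
            {
                await gate.Task;
                try
                {
                    await cachedTask;
                    return false;
                }
                catch (TimeoutException ex)
                {
                    // Awaiting a faulted task rethrows the stored instance.
                    return ReferenceEquals(ex, shared);
                }
            }))
            .ToArray();

        // Release all callers at once.
        gate.SetResult(true);
        bool[] results = await Task.WhenAll(workers);
        return results.Count(sawSharedInstance => sawSharedInstance);
    }
}
```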

After the fix

Tested with TimeoutException (2K threads) and TaskCanceledException (10K threads), gated by a TaskCompletionSource, without issues on the devbox.
Implemented the fix in AsyncCacheNonBlocking by creating shallow copies of known offenders and rethrowing the rest.
Re-ran the setup and observed a significant reduction in the number of stack trace frames.

TaskCanceledException with 10K threads (even the file size gives away how bad the stack trace proliferation problem was)

TaskCancExceptionWithFix.log

Filesize: 10 KB
Standard Output:
With cache fix and 10K threads 4923736 bytes allocated in (Gen0:0 1:0 2:0) Frames=2

TaskCancExceptionWithoutFix.log

Filesize: 1,168 KB
Standard Output:
Without cache fix and 10K threads 5612888 bytes allocated in (Gen0:0 1:0 2:0) Frames=2

TimeoutException with 2K threads

TimeoutExceptionWithFix.log
TimeoutExceptionWithoutFix.log

Type of change

Please delete options that are not relevant.

[] Bug fix (non-breaking change which fixes an issue)
[] New feature (non-breaking change which adds functionality)
[] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[] This change requires a documentation update

Closing issues

To automatically close an issue: closes #IssueNumber

NaluTripician

Is there any unit testing we can do to verify change?

ananth7592 · 2025-03-18T01:11:10Z

> Is there any unit testing we can do to verify change?

This is more of a performance optimization, and its benefit only shows up in high-scale scenarios. Hence, I performed a local repro with high concurrency and captured the performance improvements in the attached logs.

Pilchie

Looks fine to me.

Microsoft.Azure.Cosmos/src/ExceptionHandlingUtility.cs

Microsoft.Azure.Cosmos/src/Routing/AsyncCache.cs

- Added support to post-process the exception in the two cache classes mentioned above: a shallow copy of specific exception types is created and thrown to prevent stack trace proliferation, while all other exceptions are propagated unchanged.
- Added support for TaskCanceledException, TimeoutException and OperationCanceledException in this drop

Microsoft.Azure.Cosmos/src/Routing/AsyncCacheNonBlocking.cs

…ing `CosmosClientOptions` (#5073)

# Pull Request Template

## Description

This PR enables `AsyncCache` and `AsyncCacheNonBlocking` Exception Handling Using `CosmosClientOptions`. This new client option will eventually be used to get the exception handling optimization (see PR #5069) shipped behind this new client option: `EnableAsyncCacheExceptionSharing`. By default this option is set to `true`.

## Type of change

Please delete options that are not relevant.

- [x] New feature (non-breaking change which adds functionality)

## Closing issues

To automatically close an issue: closes #IssueNumber

---------

Co-authored-by: Kiran Kumar Kolli <[email protected]>

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/ClientTests.cs

ananth7592 · 2025-03-24T17:05:45Z

@Pilchie, the change that added ICloneable support to CosmosException and CosmosOperationCanceledException was a contract change, so a new contract JSON had to be generated.

ananth7592 · 2025-03-24T17:48:47Z

Fixed the line endings in the contract files from CRLF to LF, thereby correcting the number of changes shown in the contract file.

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/ExceptionHandlingUtilityTests.cs

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/Routing/AsyncCacheNonBlockingTests.cs

kundadebdatta

LGTM.

ananth7592 requested review from kirankumarkolli, FabianMeiswinkel, sboshra and Pilchie as code owners March 17, 2025 23:58

ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch 2 times, most recently from f96dd1f to 1ef6ec3 Compare March 18, 2025 00:02

NaluTripician reviewed Mar 18, 2025

Pilchie previously approved these changes Mar 18, 2025

Microsoft.Azure.Cosmos/src/ExceptionHandlingUtility.cs (outdated, resolved)

Microsoft.Azure.Cosmos/src/ExceptionHandlingUtility.cs (outdated, resolved)

kirankumarkolli reviewed Mar 18, 2025

Microsoft.Azure.Cosmos/src/ExceptionHandlingUtility.cs (resolved)

kirankumarkolli reviewed Mar 18, 2025

Microsoft.Azure.Cosmos/src/ExceptionHandlingUtility.cs (outdated, resolved)

kirankumarkolli reviewed Mar 18, 2025

Microsoft.Azure.Cosmos/src/Routing/AsyncCache.cs (outdated, resolved)

kundadebdatta reviewed Mar 18, 2025

Microsoft.Azure.Cosmos/src/ExceptionHandlingUtility.cs (outdated, resolved)

ananth7592 dismissed Pilchie’s stale review via 235c8f1 March 18, 2025 21:35

ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch from 1ef6ec3 to 235c8f1 Compare March 18, 2025 21:35

kundadebdatta mentioned this pull request Mar 20, 2025

[INTERNAL] AsyncCache: Adds Parameter to Enable Exception Handling Using CosmosClientOptions #5073

Merged

1 task

ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch 2 times, most recently from b927aaf to 6e9fdee Compare March 20, 2025 19:01

ananth7592 changed the title ~~Enhancement: Adds Async Cache and Async Cache Nonblocking exception handling changes~~ Enhancement: Adds support for stack trace optimization during exceptions for Async Cache and Async Cache Nonblocking Mar 20, 2025

kirankumarkolli reviewed Mar 20, 2025

Microsoft.Azure.Cosmos/src/ExceptionHandlingUtility.cs (outdated, resolved)

kirankumarkolli reviewed Mar 20, 2025

Microsoft.Azure.Cosmos/src/ExceptionHandlingUtility.cs (outdated, resolved)

kirankumarkolli reviewed Mar 20, 2025

Microsoft.Azure.Cosmos/src/ExceptionHandlingUtility.cs (outdated, resolved)

ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch from 867fbf3 to 182c3a0 Compare March 20, 2025 20:29

kirankumarkolli reviewed Mar 20, 2025

Microsoft.Azure.Cosmos/src/Routing/AsyncCache.cs (outdated, resolved)

kirankumarkolli reviewed Mar 20, 2025

Microsoft.Azure.Cosmos/src/Routing/AsyncCacheNonBlocking.cs (outdated, resolved)

kundadebdatta added 2 commits March 20, 2025 14:27

Merge branch 'master' into ananth/openai-exception-handling-asynccache

cd88058

Code changes to merge master and hook exception stack optimization through client options.

a66c65c

ananth7592 requested a review from khdang as a code owner March 20, 2025 22:03

kirankumarkolli reviewed Mar 24, 2025

Microsoft.Azure.Cosmos/src/ExceptionHandlingUtility.cs (outdated, resolved)

kirankumarkolli reviewed Mar 24, 2025

Microsoft.Azure.Cosmos/src/Routing/AsyncCache.cs (resolved)

kirankumarkolli reviewed Mar 24, 2025

Microsoft.Azure.Cosmos/src/Routing/AsyncCache.cs (outdated, resolved)

kirankumarkolli reviewed Mar 24, 2025

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/ClientTests.cs (outdated, resolved)

ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch from b3df593 to 281d61e Compare March 24, 2025 17:42

kirankumarkolli reviewed Mar 24, 2025

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/ClientTests.cs (outdated, resolved)

kirankumarkolli reviewed Mar 24, 2025

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/ExceptionHandlingUtilityTests.cs (outdated, resolved)

kirankumarkolli reviewed Mar 24, 2025

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/Routing/AsyncCacheNonBlockingTests.cs (resolved)

Pilchie previously approved these changes Mar 24, 2025

ananth7592 dismissed Pilchie’s stale review via 10fb8b3 March 24, 2025 18:54

ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch from 1311468 to 10fb8b3 Compare March 24, 2025 18:54

Modified the ExceptionHandlingUtility to TryClone model and letting the caller throw the exception

8ab0077

ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch from 10fb8b3 to 8ab0077 Compare March 24, 2025 19:00

Merge branch 'master' into ananth/openai-exception-handling-asynccache

dd621d3

kirankumarkolli reviewed Mar 24, 2025

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/ClientTests.cs (outdated, resolved)

kirankumarkolli changed the title ~~Enhancement: Adds support for stack trace optimization during exceptions for Async Cache and Async Cache Nonblocking~~ AsyncCache: Adds support for stack trace optimization during exceptions for AsyncCache and AsyncCacheNonblocking Mar 24, 2025

kirankumarkolli reviewed Mar 25, 2025

Microsoft.Azure.Cosmos/src/ExceptionHandlingUtility.cs (outdated, resolved)

kirankumarkolli and others added 2 commits March 25, 2025 14:40

Fixing the unit test

8384882

Added support for all OperationCanceledException base types to ExceptionHandlingUtility

dfc7b78

ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch from bd93ecf to dfc7b78 Compare March 25, 2025 22:20

Updating unit tests

bea502e

kirankumarkolli previously approved these changes Mar 26, 2025

Including two more types

644bc99

kirankumarkolli dismissed their stale review via 644bc99 March 26, 2025 00:34

kirankumarkolli approved these changes Mar 26, 2025

kundadebdatta approved these changes Mar 26, 2025

kirankumarkolli merged commit 175443c into master Mar 26, 2025
26 checks passed

kirankumarkolli deleted the ananth/openai-exception-handling-asynccache branch March 26, 2025 04:32
