Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AsyncCache: Adds support for stack trace optimization during exceptions for AsyncCache and AsyncCacheNonblocking #5069

Merged
merged 10 commits into from
Mar 26, 2025

Conversation

ananth7592
Copy link
Contributor

@ananth7592 ananth7592 commented Mar 17, 2025

Pull Request Template

Description

This PR addresses the issue of shared exception objects in the AsyncCache class, which was causing significant performance degradation and memory bloat recently for open AI .

The fix involves creating shallow copies of known offenders and rethrowing the rest. This approach ensures that each AsyncCache.GetAsync call receives its own exception instance, preventing the dangerous growth of the exception object's StackTrace and reducing GC overhead.

All ICloneable implemenations of exceptions will benefit from this.

https://learn.microsoft.com/en-us/dotnet/api/system.icloneable?view=net-9.0

Key changes:

  • Added support to post-process the exception in above mentioned two cache classes that creates a shallow copy of specific exception types to prevent stack trace proliferation and rethrows them, while propagating all other exceptions unchanged.
  • Added support for any exception that is ICloneable supported , TimeoutException and TaskCanceledException in this drop
  • Improved exception handling to avoid shared exception objects across threads.

Feature is enabled by default. To turn it off, set EnableAsyncCacheExceptionNoSharing CosmosClientOptions to false

Proof of testing:

Repro Steps
To reproduce the issue, I followed these high-level steps:

  • Create a Cosmos client with the necessary settings to preempt gateway calls for address refresh to fail with a specific exception.

  • Create a TaskCompletionSource that waits until all the threads are created.

  • Initiate calls to createItem that will eventually block, with one of them erroring out with the pre-set exception.

  • Observe the exception storm as the same exception trace is copied over to all the threads.

After the fix

  • Tested with TimeoutException (2K Threads) and TaskCanceledException (10K threads) using TaskCompletionSource with up to 10,000 threads without issues on the devbox.

  • Implemented the fix in AsyncNonBlockingCache by creating shallow copies of known offenders and rethrowing the rest.

  • Re-ran the setup and observed a significant reduction in the number of stack trace frames.

TaskCancelledExcpetion with 10K threads ( Even the file size is giving away how bad the stack trace profileration problem was )

TaskCancExceptionWithFix.log

Filesize: 10 KB
Standard Output: 
With cache fix and 10K threads 4923736 bytes allocated in (Gen0:0 1:0 2:0) Frames=2

TaskCancExceptionWithoutFix.log

Filesize: 1,168KB
Standard Output: 
Without cache fix and 10K threads 5612888 bytes allocated in (Gen0:0 1:0 2:0) Frames=2

TImeoutException with 2K threads

TimeoutExceptionWithFix.log
TimeoutExceptionWithoutFix.log

Type of change

Please delete options that are not relevant.

  • [] Bug fix (non-breaking change which fixes an issue)
  • [] New feature (non-breaking change which adds functionality)
  • [] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [] This change requires a documentation update

Closing issues

To automatically close an issue: closes #IssueNumber

@ananth7592 ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch 2 times, most recently from f96dd1f to 1ef6ec3 Compare March 18, 2025 00:02
Copy link
Contributor

@NaluTripician NaluTripician left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any unit testing we can do to verify change?

@ananth7592
Copy link
Contributor Author

Is there any unit testing we can do to verify change?

This is more of a performance optimization and will be materialized as such in high scale scenarios. Hence, I performed a local repro with high concurrency and captured the performance improvements in the log attached.

Pilchie
Pilchie previously approved these changes Mar 18, 2025
Copy link
Member

@Pilchie Pilchie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me.

@ananth7592 ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch from 1ef6ec3 to 235c8f1 Compare March 18, 2025 21:35
@ananth7592 ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch 2 times, most recently from b927aaf to 6e9fdee Compare March 20, 2025 19:01
@ananth7592 ananth7592 changed the title Enhancement: Adds Async Cache and Async Cache Nonblocking exception handling changes Enhancement: Adds support for stack trace optimization during exceptions for Async Cache and Async Cache Nonblocking Mar 20, 2025
- Added support to post-process the exception in above mentioned two cache classes that
  creates a shallow copy of specific exception types to prevent stack trace proliferation and
  rethrows them, while propagating all other exceptions unchanged.
 - Added support for TaskCanceledException, TimeoutException and OperationCanceledException in this drop
@ananth7592 ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch from 867fbf3 to 182c3a0 Compare March 20, 2025 20:29
kirankumarkolli added a commit that referenced this pull request Mar 20, 2025
…ing `CosmosClientOptions` (#5073)

# Pull Request Template

## Description

This PR enables `AsyncCache` and `AsyncCacheNonBlocking` Exception
Handling Using `CosmosClientOptions`. This new client option will
eventually used to get exception handling optimization (see PR #5069)
shipped behind this new client option:
`EnableAsyncCacheExceptionSharing`.

By default this option is set to `true`.

## Type of change

Please delete options that are not relevant.

- [x] New feature (non-breaking change which adds functionality)

## Closing issues

To automatically close an issue: closes #IssueNumber

---------

Co-authored-by: Kiran Kumar Kolli <[email protected]>
@ananth7592 ananth7592 requested a review from khdang as a code owner March 20, 2025 22:03
@ananth7592
Copy link
Contributor Author

@Pilchie , The change that resulted in adding support for ICloneable for CosmosException and CosmosOperationCanceledException was a contract change and hence had to generate new contract JSON

@ananth7592 ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch from b3df593 to 281d61e Compare March 24, 2025 17:42
@ananth7592
Copy link
Contributor Author

Fixed the line endings on Contract files from CRLF to LF thereby correcting the number of changes seen in the contract file

Pilchie
Pilchie previously approved these changes Mar 24, 2025
@ananth7592 ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch from 1311468 to 10fb8b3 Compare March 24, 2025 18:54
@ananth7592 ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch from 10fb8b3 to 8ab0077 Compare March 24, 2025 19:00
@kirankumarkolli kirankumarkolli changed the title Enhancement: Adds support for stack trace optimization during exceptions for Async Cache and Async Cache Nonblocking AsyncCache: Adds support for stack trace optimization during exceptions for AsyncCache and AsyncCacheNonblocking Mar 24, 2025
@ananth7592 ananth7592 force-pushed the ananth/openai-exception-handling-asynccache branch from bd93ecf to dfc7b78 Compare March 25, 2025 22:20
Copy link
Member

@kundadebdatta kundadebdatta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@kirankumarkolli kirankumarkolli merged commit 175443c into master Mar 26, 2025
26 checks passed
@kirankumarkolli kirankumarkolli deleted the ananth/openai-exception-handling-asynccache branch March 26, 2025 04:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants