-
Notifications
You must be signed in to change notification settings - Fork 507
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AsyncCache: Adds support for stack trace optimization during exceptions for AsyncCache and AsyncCacheNonblocking #5069
AsyncCache: Adds support for stack trace optimization during exceptions for AsyncCache and AsyncCacheNonblocking #5069
Conversation
f96dd1f
to
1ef6ec3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any unit testing we can do to verify change?
This is more of a performance optimization and will be materialized as such in high scale scenarios. Hence, I performed a local repro with high concurrency and captured the performance improvements in the log attached. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me.
1ef6ec3
to
235c8f1
Compare
b927aaf
to
6e9fdee
Compare
- Added support to post-process the exception in above mentioned two cache classes that creates a shallow copy of specific exception types to prevent stack trace proliferation and rethrows them, while propagating all other exceptions unchanged. - Added support for TaskCanceledException, TimeoutException and OperationCanceledException in this drop
867fbf3
to
182c3a0
Compare
…ing `CosmosClientOptions` (#5073) # Pull Request Template ## Description This PR enables `AsyncCache` and `AsyncCacheNonBlocking` Exception Handling Using `CosmosClientOptions`. This new client option will eventually used to get exception handling optimization (see PR #5069) shipped behind this new client option: `EnableAsyncCacheExceptionSharing`. By default this option is set to `true`. ## Type of change Please delete options that are not relevant. - [x] New feature (non-breaking change which adds functionality) ## Closing issues To automatically close an issue: closes #IssueNumber --------- Co-authored-by: Kiran Kumar Kolli <[email protected]>
Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/ClientTests.cs
Outdated
Show resolved
Hide resolved
@Pilchie , The change that resulted in adding support for |
b3df593
to
281d61e
Compare
Fixed the line endings on Contract files from CRLF to LF thereby correcting the number of changes seen in the contract file |
Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/ClientTests.cs
Outdated
Show resolved
Hide resolved
Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/ExceptionHandlingUtilityTests.cs
Outdated
Show resolved
Hide resolved
Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/Routing/AsyncCacheNonBlockingTests.cs
Show resolved
Hide resolved
1311468
to
10fb8b3
Compare
…he caller throw the exception
10fb8b3
to
8ab0077
Compare
Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/ClientTests.cs
Outdated
Show resolved
Hide resolved
bd93ecf
to
dfc7b78
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Pull Request Template
Description
This PR addresses the issue of shared exception objects in the AsyncCache class, which was causing significant performance degradation and memory bloat recently for open AI .
The fix involves creating shallow copies of known offenders and rethrowing the rest. This approach ensures that each AsyncCache.GetAsync call receives its own exception instance, preventing the dangerous growth of the exception object's StackTrace and reducing GC overhead.
All ICloneable implemenations of exceptions will benefit from this.
https://learn.microsoft.com/en-us/dotnet/api/system.icloneable?view=net-9.0
Key changes:
Feature is enabled by default. To turn it off, set EnableAsyncCacheExceptionNoSharing CosmosClientOptions to false
Proof of testing:
Repro Steps
To reproduce the issue, I followed these high-level steps:
Create a Cosmos client with the necessary settings to preempt gateway calls for address refresh to fail with a specific exception.
Create a TaskCompletionSource that waits until all the threads are created.
Initiate calls to createItem that will eventually block, with one of them erroring out with the pre-set exception.
Observe the exception storm as the same exception trace is copied over to all the threads.
After the fix
Tested with TimeoutException (2K Threads) and TaskCanceledException (10K threads) using TaskCompletionSource with up to 10,000 threads without issues on the devbox.
Implemented the fix in AsyncNonBlockingCache by creating shallow copies of known offenders and rethrowing the rest.
Re-ran the setup and observed a significant reduction in the number of stack trace frames.
TaskCancelledExcpetion with 10K threads ( Even the file size is giving away how bad the stack trace profileration problem was )
TaskCancExceptionWithFix.log
Filesize: 10 KB
Standard Output:
With cache fix and 10K threads 4923736 bytes allocated in (Gen0:0 1:0 2:0) Frames=2
TaskCancExceptionWithoutFix.log
Filesize: 1,168KB
Standard Output:
Without cache fix and 10K threads 5612888 bytes allocated in (Gen0:0 1:0 2:0) Frames=2
TImeoutException with 2K threads
TimeoutExceptionWithFix.log
TimeoutExceptionWithoutFix.log
Type of change
Please delete options that are not relevant.
Closing issues
To automatically close an issue: closes #IssueNumber