[Bug] Race conditions when CCA is singleton cause cache misses #3285
Description
Which version of MSAL.NET are you using?
Microsoft.Identity.Client 4.42.0
Microsoft.Identity.Client.Extensions.Msal 2.16.5
Microsoft.Identity.Web.TokenCache 1.23.1
Overview
- Use a confidential client app in a web site scenario (S2S should repro this too). The app must be a singleton
- Configure some distributed cache (can be MemoryDistributedCache, but I recommend a file based DistributedCache)
- Make many requests in parallel
Actual: race conditions occur which make 2 accounts land in a single cache node. This causes cache misses later.
Possible cause: cache accessor objects are set per application, and not per request. Multiple requests share the same accessor.
Workaround: create the application object for each request.
Repro
We have a .NET Framework MVC app using hybrid flow with OWIN to authenticate users and then retrieve access tokens via authorization code/refresh token. The tokens are fetched for each of 3 downstream APIs with separate app registrations in Azure. The app used to use ADAL, and at that time it had a naive token cache that was serialized and written to the auth cookie, but after adding a third API endpoint the cookie would end up too large and cause a 400 error, so we moved to MSAL and a distributed token cache using Azure Cache for Redis with the default L1/L2 setup. The tokens are fetched with a ConfidentialClientApplication that is singleton-scoped. We have no session state provider; "state" is light, and values are stored as claims in the identity and persist in the auth cookie.
We've had persistent issues with exceptions related to refresh tokens not being found in the token cache. Because of this, the app is set up to ensure a token for each of these APIs is able to be acquired silently at the beginning of each (MVC) request, and if it is not (via caught MsalUiRequiredExceptions), the user is redirected through the MS auth endpoint to get a new auth code and attempt to fetch and re-cache new tokens that can then be acquired silently. We've also implemented an eager refresh to check the expiration of the token and force refresh to get a new one if it will expire in less than 30 minutes. Even with these measures in place, frequent MsalUiRequiredExceptions persist and, rarely, more pernicious issues, like auth loops and "cannot log you in" pages if the tokens are repeatedly unable to be acquired silently.
One trend that we've seen is that the issues seem to be exacerbated by successive cache hits in a short duration. A theory I've had is that somehow there's contention while writing to the cache if, say, 2 requests happen at the same time for the same user, the token cannot be acquired silently, there are 2 calls for a new auth code, and a new refresh token is acquired for both, but maybe the first is invalidated because the second was provided and the first ends up in the cache. I'm grasping at straws, as I'm not sure how the underlying cache sequence works, if after the token fetched in L1 is tried and fails if it looks in L2 or goes straight to throwing an MsalUiRequiredException, if when a new token is cached if it's written to L1 first and then pushed to Redis, or what.
Another thing we've discovered is that we're seeing users' tokens cached under a key that doesn't match the user's account ID. This again leads me to think that there's some kind of thread-unsafe behavior with the cache, rescoping the CCA in DI to per-request and transient did not help prevent this from happening.
Expected behavior
A valid refresh token is always available in the token cache.
An MsalUiRequiredException is almost never thrown after a user has signed into the app.
Actual behavior
Frequent MsalUiRequiredExceptions.
Rare authentication loops and "Could not sign you in" pages after tokens for downstream APIs are repeatedly unable to be acquired silently even after auth code redemption.