Background token refresh ignores cancellation, causing threads to block for hours #6053

@jayesh-a-shah

Library version used

4.59.0 through 4.84.2 (latest)

.NET version

All (reproduced on net462 and net8.0)

Scenario

ManagedIdentityClient - managed identity (also affects ConfidentialClient, OBO, and Silent flows)

Is this a new or an existing app?

The behavior has been observed in a large-scale production service.

Issue description

PR #4471 (fixing #4473) changed the ProcessFetchInBackground lambda from:

() => GetAccessTokenAsync(cancellationToken, logger)

to:

() =>
{
    using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    return GetAccessTokenAsync(tokenSource.Token, logger);
}

This lambda is not async. The using var declaration disposes tokenSource as soon as the lambda body reaches its return statement, which is before the Task returned by GetAccessTokenAsync completes. After disposal, the linked token is disconnected from its parent: cancelling the parent CTS no longer cancels the linked token.
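The disconnection can be checked in isolation, without the semaphore. The following is a minimal sketch, not MSAL code; the class name LinkedTokenDisconnectDemo and the helper LinkedTokenSeesParentCancel are hypothetical names used only for illustration.

```csharp
using System;
using System.Threading;

static class LinkedTokenDisconnectDemo
{
    // Hypothetical helper: reports whether a linked token observes
    // cancellation of its parent, optionally disposing the linked CTS first.
    public static bool LinkedTokenSeesParentCancel(bool disposeLinkedFirst)
    {
        using var parent = new CancellationTokenSource();
        var linked = CancellationTokenSource.CreateLinkedTokenSource(parent.Token);

        // Capture the token before disposal; reading linked.Token after
        // Dispose would throw ObjectDisposedException.
        CancellationToken linkedToken = linked.Token;

        if (disposeLinkedFirst)
        {
            // Disposal unregisters the linked source from the parent token.
            linked.Dispose();
        }

        parent.Cancel();
        bool observed = linkedToken.IsCancellationRequested;

        linked.Dispose(); // calling Dispose a second time is harmless
        return observed;
    }

    static void Main()
    {
        Console.WriteLine(LinkedTokenSeesParentCancel(disposeLinkedFirst: false)); // True
        Console.WriteLine(LinkedTokenSeesParentCancel(disposeLinkedFirst: true));  // False
    }
}
```

The second call mirrors the non-async lambda: the token is handed to the callee, but its source has already been disposed by the time the parent is cancelled.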

Inside GetAccessTokenAsync, the code calls:

await s_semaphoreSlim.WaitAsync(cancellationToken).ConfigureAwait(false);

where s_semaphoreSlim is a static SemaphoreSlim(1,1) — a process-wide single-concurrency lock. Because the linked token is disconnected, WaitAsync can only complete when the semaphore is released — it can never be cancelled.

How this causes thread starvation

Azure.Core's BearerTokenAuthenticationPolicy has a background token refresh mechanism. When a cached token approaches expiry, it spawns a background refresh with a 30-second timeout:

// BearerTokenAuthenticationPolicy.cs, line 37
private static readonly TimeSpan _tokenRefreshRetryDelay = TimeSpan.FromSeconds(30);

// line 371 — background refresh path
var cts = new CancellationTokenSource(_tokenRefreshRetryDelay);

This 30s CTS is passed through Azure.Identity → MSAL → ExecuteAsync(30sToken). When MSAL's NeedsRefresh() is true, it returns the cached token to the caller and spawns a Task.Run via ProcessFetchInBackground to refresh the token in the background. The lambda captures the 30s token.

Without the bug (before 4.59.0): The cancellation token was passed directly. If the background task couldn't acquire the semaphore within 30s, the parent CTS would fire, WaitAsync would throw OperationCanceledException, and the task would exit the semaphore queue. The queue stayed small.

With the bug (4.59.0+): The linked token is disconnected from the 30s parent CTS. The background task calls WaitAsync(disconnectedToken). The 30s timeout fires but has no effect — the task remains in the semaphore queue indefinitely, waiting for the semaphore to be released by whoever is ahead of it.

When the token endpoint (IMDS) is temporarily unreachable, the semaphore holder spends ~100s in its retry loop before failing and releasing the semaphore. During the ~20-minute NeedsRefresh window, each incoming GetToken call spawns a new background task that joins the queue and cannot be cancelled out of it, so 100+ tasks accumulate.

These tasks drain one at a time: each acquires the semaphore, calls IMDS, fails after ~100s of retries, and releases the semaphore. Foreground threads that need a token enter the same queue behind all accumulated tasks. With ~139 tasks ahead, each taking ~100s, a foreground thread waits ~232 minutes.
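The wait estimate above is a simple product of queue depth and per-task time. A back-of-envelope sketch, using the approximate figures from this issue (139 queued tasks, ~100s each); ConvoyEstimate and EstimateWait are hypothetical names, and the model assumes strictly serial draining with no task leaving the queue early:

```csharp
using System;
using System.Globalization;

static class ConvoyEstimate
{
    // Hypothetical helper: total time to drain a strictly serial queue.
    public static TimeSpan EstimateWait(int tasksAhead, double secondsPerTask)
        => TimeSpan.FromSeconds(tasksAhead * secondsPerTask);

    static void Main()
    {
        TimeSpan wait = EstimateWait(tasksAhead: 139, secondsPerTask: 100);

        // 13,900 seconds is about 231.7 minutes.
        Console.WriteLine(wait.TotalMinutes.ToString("F1", CultureInfo.InvariantCulture) + " min"); // 231.7 min
    }
}
```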

Affected code locations (all introduced by PR #4471)

  1. ManagedIdentityAuthRequest.cs: ExecuteAsync
  2. ClientCredentialRequest.cs: ExecuteAsync
  3. OnBehalfOfRequest.cs: ExecuteAsync
  4. CacheSilentStrategy.cs: ExecuteAsync

Production observation

In a production service targeting net462, 4 foreground threads were blocked for 189–232 minutes behind ~139 accumulated background tasks when IMDS was temporarily unreachable on one node. All 4 threads completed within 1 second after IMDS recovered — confirming they were queued behind the convoy, not individually stuck.

Thread  Entered Queue  Completed  Duration
7032    20:48:12       00:40:32   232.3 min
4388    20:50:06       00:40:33   230.4 min
8096    20:50:11       00:40:33   230.4 min
2032    21:30:41       00:40:33   189.9 min

Fix

Make the lambda async and await the inner call:

async () =>
{
    using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    return await GetAccessTokenAsync(tokenSource.Token, logger).ConfigureAwait(false);
}

In an async method, using var disposal happens after the await completes (the compiler transforms it into a state machine). The linked token stays connected to the parent CTS for the entire duration of the async operation. Background tasks exit the semaphore queue after 30s instead of accumulating permanently.
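The difference in disposal timing can be observed directly with a disposable that records when it is disposed. This is an illustrative sketch only; DisposalOrderDemo, Tracker, InnerAsync, and the two wrappers are hypothetical names standing in for the lambda and GetAccessTokenAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class DisposalOrderDemo
{
    // Records the moment of disposal in a shared log.
    sealed class Tracker : IDisposable
    {
        private readonly List<string> _log;
        public Tracker(List<string> log) => _log = log;
        public void Dispose() => _log.Add("disposed");
    }

    // Stand-in for the inner async call; finishes only once the gate opens.
    static async Task<int> InnerAsync(Task gate, List<string> log)
    {
        await gate.ConfigureAwait(false);
        log.Add("inner completed");
        return 42;
    }

    // Non-async wrapper: the tracker is disposed when the method returns the Task.
    static Task<int> NonAsyncWrapper(Task gate, List<string> log)
    {
        using var tracker = new Tracker(log);
        return InnerAsync(gate, log);
    }

    // Async wrapper: the tracker is disposed only after the awaited call finishes.
    static async Task<int> AsyncWrapper(Task gate, List<string> log)
    {
        using var tracker = new Tracker(log);
        return await InnerAsync(gate, log).ConfigureAwait(false);
    }

    public static async Task<string> RunAsync(bool useAsyncWrapper)
    {
        var log = new List<string>();
        var gate = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

        Task<int> pending = useAsyncWrapper
            ? AsyncWrapper(gate.Task, log)
            : NonAsyncWrapper(gate.Task, log);

        gate.SetResult(true); // let the inner operation finish
        await pending.ConfigureAwait(false);
        return string.Join(", ", log);
    }

    static async Task Main()
    {
        Console.WriteLine(await RunAsync(useAsyncWrapper: false)); // disposed, inner completed
        Console.WriteLine(await RunAsync(useAsyncWrapper: true));  // inner completed, disposed
    }
}
```

In the non-async wrapper the resource is gone before the inner operation has done any work, which is the same ordering that leaves the linked token disconnected in the buggy lambda.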

Standalone reproduction

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static readonly SemaphoreSlim s_semaphoreSlim = new SemaphoreSlim(1, 1);

    static async Task<string> GetAccessTokenAsync(CancellationToken cancellationToken)
    {
        await s_semaphoreSlim.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            await Task.Delay(2000, CancellationToken.None).ConfigureAwait(false);
            return "token-value";
        }
        finally
        {
            s_semaphoreSlim.Release();
        }
    }

    // BUG: using var disposes linked CTS before async op completes.
    // After disposal, the link to the parent CTS is broken — the linked
    // token can NEVER be cancelled — so WaitAsync waits indefinitely.
    static Task<string> BuggyFetchAction(CancellationToken cancellationToken)
    {
        using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        return GetAccessTokenAsync(tokenSource.Token);
    }

    // FIX: Making the method async means using var disposal happens AFTER
    // the await completes. The linked CTS stays alive, so parent cancellation
    // propagates correctly and WaitAsync can be cancelled.
    static async Task<string> FixedFetchAction(CancellationToken cancellationToken)
    {
        using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        return await GetAccessTokenAsync(tokenSource.Token).ConfigureAwait(false);
    }

    static async Task Main()
    {
        Console.WriteLine("=== BUGGY: tasks should cancel at 3s but DON'T ===");
        await s_semaphoreSlim.WaitAsync();
        var parentCts = new CancellationTokenSource(TimeSpan.FromSeconds(3));
        var tasks = new List<Task>();
        for (int i = 0; i < 5; i++)
        {
            int id = i;
            tasks.Add(Task.Run(async () =>
            {
                var sw = Stopwatch.StartNew();
                try { await BuggyFetchAction(parentCts.Token); Console.WriteLine($"  Task {id}: completed in {sw.ElapsedMilliseconds}ms (NOT cancelled!)"); }
                catch (OperationCanceledException) { Console.WriteLine($"  Task {id}: CANCELLED after {sw.ElapsedMilliseconds}ms"); }
            }));
        }
        await Task.Delay(5000);
        s_semaphoreSlim.Release();
        await Task.WhenAny(Task.WhenAll(tasks), Task.Delay(20000));

        Console.WriteLine("\n=== FIXED: tasks correctly cancel at 3s ===");
        await s_semaphoreSlim.WaitAsync();
        parentCts = new CancellationTokenSource(TimeSpan.FromSeconds(3));
        tasks.Clear();
        for (int i = 0; i < 5; i++)
        {
            int id = i;
            tasks.Add(Task.Run(async () =>
            {
                var sw = Stopwatch.StartNew();
                try { await FixedFetchAction(parentCts.Token); Console.WriteLine($"  Task {id}: completed in {sw.ElapsedMilliseconds}ms"); }
                catch (OperationCanceledException) { Console.WriteLine($"  Task {id}: CANCELLED after {sw.ElapsedMilliseconds}ms"); }
            }));
        }
        await Task.Delay(5000);
        s_semaphoreSlim.Release();
        await Task.WhenAny(Task.WhenAll(tasks), Task.Delay(20000));
    }
}

Output (both net462 and net8.0):

=== BUGGY: tasks should cancel at 3s but DON'T ===
  Task 0: completed in 7016ms (NOT cancelled!)
  Task 3: completed in 9017ms (NOT cancelled!)
  Task 1: completed in 11023ms (NOT cancelled!)
  Task 4: completed in 13031ms (NOT cancelled!)
  Task 2: completed in 15041ms (NOT cancelled!)

=== FIXED: tasks correctly cancel at 3s ===
  Task 0: CANCELLED after 3013ms
  Task 1: CANCELLED after 3013ms
  Task 2: CANCELLED after 3013ms
  Task 3: CANCELLED after 3013ms
  Task 4: CANCELLED after 3013ms

Expected behavior

Background proactive refresh tasks should respect the parent cancellation token's timeout. When the parent CTS fires, WaitAsync should throw OperationCanceledException and the task should exit the semaphore queue.

Regression

Introduced in 4.59.0 by PR #4471. Versions before 4.59.0 passed cancellationToken directly without a linked CTS and did not have this issue.
