# VALIDATION — RNTBD Dispatcher idle-timer thread-pool starvation fix

## 1. Summary

`Dispatcher.OnIdleTimer` ran on a thread-pool worker and called
`WaitTask(receiveTaskCopy)`, which performs `Task.Wait()` on a
receive loop that does not complete until the connection is
disposed. Under sustained scale, each channel's idle-timer fire
parked one worker, starving the pool (issue #4393). The fix
converts the timer callback to async (`OnIdleTimerAsync` +
`WaitTaskAsync`), wires it up via `.ContinueWith(...).Unwrap()`, and
preserves the sync `WaitTask` for any other callers. The evidence is
a deterministic unit test (5/5 runs passing) that directly
demonstrates the calling-thread behavior change, plus one
integration run during the investigation that reproduced the
production pathology end-to-end (48 s probe latency, 46-thread
pool growth) in this environment.
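
The starvation mechanism is not .NET-specific. As a minimal Python analog (illustrative only, not from the SDK or this PR), blocking waits issued from inside pool workers consume the pool until even a trivial probe cannot run:

```python
import concurrent.futures
import threading
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
never = threading.Event()  # stands in for the never-completing receive loop

# Pre-fix shape: each "idle timer callback" performs a blocking wait on
# work that never finishes, permanently parking one pool worker.
for _ in range(4):
    pool.submit(never.wait)  # analogous to Task.Wait() on the receive task

time.sleep(0.2)  # give the blocking callbacks time to occupy all workers
probe = pool.submit(lambda: "done")  # a latency probe, like the tests use
finished, _ = concurrent.futures.wait([probe], timeout=0.5)
print(len(finished))  # 0: the probe never ran because the pool is starved

never.set()  # release the parked workers so the script can exit cleanly
```

With four workers and four permanently blocked callbacks, the probe sits in the queue indefinitely; the .NET thread pool mitigates this by injecting threads, which is exactly the 46-thread growth signature reported below.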

Issue: https://github.com/Azure/azure-cosmos-dotnet-v3/issues/4393

## 2. The fix

Three edits to `Microsoft.Azure.Cosmos/src/direct/Dispatcher.cs`.

**Edit 1 — `OnIdleTimer` becomes `OnIdleTimerAsync`:**

```csharp
private async Task OnIdleTimerAsync(Task precedentTask)
{
    DefaultTrace.TraceInformation(
        "[RNTBD Dispatcher {0}][{1}] Idle timer fired.",
        this.ConnectionCorrelationId, this);

    Task receiveTaskCopy = null;
    // ... unchanged body that ends with:
    await this.WaitTaskAsync(receiveTaskCopy, "receive loop")
        .ConfigureAwait(false);
}
```

The omitted body of `OnIdleTimerAsync` is byte-for-byte identical
to the pre-fix `OnIdleTimer` body: lock blocks remain synchronous,
no `await` appears inside any lock, and the existing
`Debug.Assert(!Monitor.IsEntered(...))` lock guards on `callLock`
and `connectionLock` are preserved.

**Edit 2 — new `WaitTaskAsync` alongside the still-present sync `WaitTask`:**

```csharp
private async Task WaitTaskAsync(Task t, string description)
{
    if (t == null) { return; }
    try
    {
        Debug.Assert(!Monitor.IsEntered(this.callLock));
        Debug.Assert(!Monitor.IsEntered(this.connectionLock));
        await t.ConfigureAwait(false);
    }
    catch (Exception e)
    {
        DefaultTrace.TraceWarning(
            "[RNTBD Dispatcher {0}][{1}] Parallel task failed: {2}. " +
            "Exception: {3}: {4}",
            this.ConnectionCorrelationId, this, description,
            e.GetType().Name, e.Message);
    }
}
```

**Edit 3 — `ScheduleIdleTimer` uses `.Unwrap()`:**

```csharp
private void ScheduleIdleTimer(TimeSpan timeToIdle)
{
    Debug.Assert(Monitor.IsEntered(this.connectionLock));
    this.idleTimer = this.idleTimerPool.GetPooledTimer((int)timeToIdle.TotalSeconds);
    // IMPORTANT: .Unwrap() is essential here. Without it, idleTimerTask
    // would be Task<Task> and would complete when OnIdleTimerAsync
    // returns its inner Task (at the first await), not when it
    // finishes. StopIdleTimer() waits on idleTimerTask during
    // shutdown; if idleTimerTask completes early, shutdown proceeds
    // while OnIdleTimerAsync is still running, causing
    // use-after-dispose on the connection. Do not remove .Unwrap().
    this.idleTimerTask = this.idleTimer.StartTimerAsync()
        .ContinueWith(this.OnIdleTimerAsync, TaskContinuationOptions.OnlyOnRanToCompletion)
        .Unwrap();
    // ... existing failure-trace ContinueWith chain unchanged
}
```
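
The completes-too-early hazard that the comment warns about has a direct analog in Python's asyncio, shown here as a hedged sketch (hypothetical names, not from the SDK): a wrapper task that merely schedules inner work completes as soon as the inner task is returned, like `Task<Task>` without `.Unwrap()`, and only an explicit await of the inner task chains completion the way `.Unwrap()` does.

```python
import asyncio

async def inner():
    # Stands in for the real body of OnIdleTimerAsync.
    await asyncio.sleep(0.1)

async def outer_no_unwrap():
    # Analog of the ContinueWith result without .Unwrap(): the wrapper
    # completes as soon as the inner work is scheduled, not when it ends.
    return asyncio.create_task(inner())

async def main():
    wrapper = asyncio.create_task(outer_no_unwrap())
    inner_task = await wrapper            # wrapper is already done here
    completed_too_early = wrapper.done() and not inner_task.done()
    await inner_task                      # the "Unwrap": chain to the inner work
    return completed_too_early, inner_task.done()

too_early, inner_done = asyncio.run(main())
print(too_early, inner_done)  # True True
```

A shutdown path that waited only on `wrapper` would proceed while `inner` is still running, which is precisely the use-after-dispose race the C# comment describes.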

## 3. Why this bug is hard to test

The minimum effective idle-timer arm in this codebase is **750
seconds** (12.5 minutes): `StoreClientFactory` enforces a 600 s
floor on `IdleTcpConnectionTimeout`, and `Connection` adds a
`2 * (sendHang + receiveHang) = 150 s` race buffer on top before
the dispatcher idle timer is armed. Any single-client test that
wants to observe an idle-timer fire must therefore idle the SDK
for at least that long; CI suites do not. Beyond runtime, the
production trigger requires sustained scale and timing conditions
across many channels — connection counts, partition distribution,
and backend replica state — that a single test client cannot
synthesize on demand. This is why #4393 sat from 2024 to 2026
without being caught: nobody runs a 13-minute idle test in CI,
and even when one is run, the conditions that turn "timer fires"
into "thread pool starves" are not deterministic from one client.
The testing strategy below reflects this: deterministic unit-test
evidence at the changed line, an integration test as a wiring
guard, plus the one investigation run that did reproduce the
end-to-end pathology, preserved as an artifact.
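
The 750-second figure is just the sum of the two components above; a quick sketch of the arithmetic (variable names are illustrative, not the SDK's identifiers):

```python
# Floor enforced by StoreClientFactory on IdleTcpConnectionTimeout.
idle_timeout_floor_s = 600
# Race buffer added by Connection: 2 * (sendHang + receiveHang) = 150 s.
race_buffer_s = 150

minimum_arm_s = idle_timeout_floor_s + race_buffer_s
print(minimum_arm_s, minimum_arm_s / 60)  # 750 12.5
```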

## 4. Evidence

### 4a. Unit test (canonical evidence)

`Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/DispatcherIdleTimerFixTests.cs`
isolates the exact line the fix changes. It uses
`InternalsVisibleTo` and reflection to invoke `WaitTaskAsync`
and `WaitTask` directly on a `Dispatcher` instance, passing each
a `TaskCompletionSource<bool>().Task` that is never completed —
the same shape `OnIdleTimer` saw in production when it called
`.Wait()` on a receive loop. Probes measure thread-pool behavior
during the wait window.

Result over 5/5 runs: the **async path** returns control to the
caller almost immediately (the `await` yields the calling thread;
the awaited task remains pending in the background, with 10+
distinct thread IDs observed while the task remained in
`WaitingForActivation` over the probe window). The **sync path**
blocks the calling thread for the full measurement window —
directly demonstrating the pre-fix pathology that, multiplied
across N idle channels in production, exhausts the pool. This
test runs in under 1 second per case and is the canonical proof
that the fix changes the thread-blocking behavior at the line
that matters.
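
The calling-thread contrast the test measures can be sketched outside .NET as well. A hypothetical Python analog (not the actual test) uses a future that is never completed as the stand-in for the receive-loop task:

```python
import concurrent.futures
import threading
import time

never_done = concurrent.futures.Future()  # analog of the never-completed TCS task

def wait_async(f):
    # Async-path analog: hand the wait to a watcher; the caller returns at once.
    threading.Thread(target=f.result, daemon=True).start()

start = time.monotonic()
wait_async(never_done)
async_elapsed = time.monotonic() - start  # caller was released immediately

start = time.monotonic()
try:
    # Sync-path analog: block the calling thread, as WaitTask's Task.Wait() did.
    never_done.result(timeout=0.3)
except concurrent.futures.TimeoutError:
    pass
sync_elapsed = time.monotonic() - start  # caller was blocked for the window

print(async_elapsed < 0.1, sync_elapsed >= 0.3)  # True True
```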

### 4b. Integration test (regression guard)

`Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/RntbdIdleTimerStarvationTests.cs`
opens N=50 channels against a live Azure Cosmos DB account by
forcing `MaxRequestsPerTcpConnection=1` and firing N concurrent
reads, then waits 810 seconds for every channel's idle timer to
fire. It asserts:

- `delta < 10` (peak − baseline thread count) — guards against
  the thread-pool growth signature of the production bug
- `timersFired > 0` — guards against `.Unwrap()` chain
  breakage; if a future change removes `.Unwrap()`,
  `OnIdleTimerAsync` stops being reached via the timer
  continuation and this assertion catches it. Counted via a
  `TraceListener` attached (by reflection) to
  `DefaultTrace.TraceSource`, observing the
  `TraceInformation` line at `OnIdleTimerAsync` entry.
  Reflection is used because `DefaultTrace.TraceSource` is
  internal; adding a public test seam was considered and
  rejected as unjustified API expansion for a regression-only
  assertion.
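
The trace-counting approach is the familiar "attach a listener, count a known line" pattern. A hypothetical Python analog using the standard `logging` module (names are illustrative, not the SDK's):

```python
import logging

class CountingHandler(logging.Handler):
    """Counts records containing a known marker line, like the test's
    TraceListener counts the OnIdleTimerAsync entry trace."""
    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.count = 0

    def emit(self, record):
        if self.needle in record.getMessage():
            self.count += 1

trace = logging.getLogger("rntbd.trace")  # stand-in for DefaultTrace.TraceSource
handler = CountingHandler("Idle timer fired")
trace.addHandler(handler)

trace.warning("[RNTBD Dispatcher 1][dispatcher] Idle timer fired.")
trace.warning("[RNTBD Dispatcher 1][dispatcher] Unrelated trace line.")
print(handler.count)  # 1, i.e. the timersFired > 0 assertion in miniature
```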

The previous max-probe-latency assertion was dropped: across six
clean runs of the integration test on the unpatched baseline, the
latency clustered at ~1001 ms regardless of N. That is a
probe-implementation ceiling artifact, not a Dispatcher signal;
asserting on it would either flap or test the wrong thing.

### 4c. Bug-exists-in-environment evidence

A single integration run on the unpatched baseline at N=50
against the same live account, with `Stopwatch` instrumentation
around every `.Wait()` call inside `WaitTask`, produced:

- Max probe latency: **48,093 ms**
- Thread count delta: **46** (baseline → peak during the idle-fire window)
- Probes never executed: 2
- `.Wait()` durations:
  - Median: 12.5 s
  - Mean: 21.2 s
  - p95: 64.1 s
  - p99: 67.6 s
  - Max: 67.6 s
- 67/110 `.Wait()` calls blocked ≥1 s; 57/110 blocked ≥10 s

This run reproduced the production pathology once. Six
subsequent runs (3 at N=50, 3 at N=200) did not reproduce it;
all six showed the ~1001 ms ceiling and delta=0. We interpret
this as confirmation that the bug exists in this environment
but is not reliably triggerable from a single test client at
the scales we tested. The numbers above are preserved verbatim
as an artifact; we did not chase the question of why one run
reproduced and others did not (likely Azure-side timing,
network conditions, or backend replica state — outside our
control).

## 5. Results table

| Test | N | Max latency | Thread delta | Result |
|------|---|-------------|--------------|--------|
| Unit test (sync path) | n/a | (blocks calling thread) | n/a | Demonstrates pre-fix behavior |
| Unit test (async path) | n/a | (yields calling thread) | n/a | 5/5 PASS |
| Integration baseline (single dramatic run) | 50 | 48,093 ms | 46 | Bug reproduced |
| Integration baseline (6 subsequent runs) | 50, 200 | ~1001 ms | 0 | Bug not reproduced |
| Integration fix (sanity run) | 50 | 900 ms | 0 | PASS |

## 6. What this PR does NOT do

- Does not add `DisposeAsync` or `IAsyncDisposable` to any class
- Does not modify `Channel`, `ChannelDictionary`,
  `LoadBalancingChannel`, `LoadBalancingPartition`,
  `LbChannelState`, or `IChannel`
- Does not change `bool disposed` to `int disposed` or add
  `Interlocked.CompareExchange` guards
- Does not modify any public API surface
- Does not change the behavior of the existing sync `WaitTask`
  (preserved untouched in case other callers exist)
- Only modifies `Dispatcher.cs`, plus one new unit-test file and
  one reshaped integration-test file

## 7. Limitations

- Tested only on Linux (matches the production bug environment;
  Windows was not tested in this round)
- The test endpoint was a live Azure Cosmos DB account; emulator
  validation was not performed
- The integration test does not gate on starvation behavior,
  because starvation could not be reliably reproduced at N=50 or
  N=200 from a single test client
- The single dramatic baseline reproduction is one data point;
  we did not pursue why that run reproduced and others did not
  (likely Azure-side timing, network conditions, or backend
  replica state — outside our control)
- The async disposal path described in the original PR
  description (Path 2 in issue #4393) is not addressed by this
  PR and remains a separate follow-up