# VALIDATION — RNTBD Dispatcher idle-timer thread-pool starvation fix

## 1. Summary

`Dispatcher.OnIdleTimer` ran on a thread-pool worker and called
`WaitTask(receiveTaskCopy)`, which performs `Task.Wait()` on a
receive loop that does not complete until the connection is
disposed. Under sustained scale, every channel's idle-timer fire
parked one worker, starving the pool (issue #4393). The fix
converts the timer callback to async (`OnIdleTimerAsync` +
`WaitTaskAsync`), wires it via `.ContinueWith(...).Unwrap()`, and
preserves the sync `WaitTask` for any other callers. Evidence: a
deterministic unit test (5/5) that directly demonstrates the
calling-thread behavior change, plus one integration run during
the investigation that reproduced the production pathology
end-to-end (48 s probe latency, 46-thread pool growth) in this
environment.

Issue: https://github.com/Azure/azure-cosmos-dotnet-v3/issues/4393
## 2. The fix

Three edits to `Microsoft.Azure.Cosmos/src/direct/Dispatcher.cs`.

**Edit 1 — `OnIdleTimer` becomes `OnIdleTimerAsync`:**

```csharp
private async Task OnIdleTimerAsync(Task precedentTask)
{
    DefaultTrace.TraceInformation(
        "[RNTBD Dispatcher {0}][{1}] Idle timer fired.",
        this.ConnectionCorrelationId, this);

    Task receiveTaskCopy = null;
    // ... unchanged body that ends with:
    await this.WaitTaskAsync(receiveTaskCopy, "receive loop")
        .ConfigureAwait(false);
}
```

The omitted body of `OnIdleTimerAsync` is byte-for-byte identical
to the pre-fix `OnIdleTimer` body: lock blocks remain synchronous,
no `await` appears inside any lock, and the existing
`Debug.Assert(!Monitor.IsEntered(...))` lock guards on `callLock`
and `connectionLock` are preserved.

**Edit 2 — new `WaitTaskAsync` alongside the still-present sync `WaitTask`:**

```csharp
private async Task WaitTaskAsync(Task t, string description)
{
    if (t == null) { return; }
    try
    {
        Debug.Assert(!Monitor.IsEntered(this.callLock));
        Debug.Assert(!Monitor.IsEntered(this.connectionLock));
        await t.ConfigureAwait(false);
    }
    catch (Exception e)
    {
        DefaultTrace.TraceWarning(
            "[RNTBD Dispatcher {0}][{1}] Parallel task failed: {2}. " +
            "Exception: {3}: {4}",
            this.ConnectionCorrelationId, this, description,
            e.GetType().Name, e.Message);
    }
}
```
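
For contrast, the preserved sync `WaitTask` differs only at the
wait itself: `t.Wait()` instead of `await t`, which is exactly the
call that parked one worker per channel. A sketch reconstructed
from this report's description (not copied from `Dispatcher.cs`):

```csharp
// Pre-existing sync path, kept for any other callers. The only
// behavioral difference from WaitTaskAsync is the wait itself:
// Task.Wait() blocks the calling thread-pool worker until t
// completes, which for a stuck receive loop is "until dispose".
private void WaitTask(Task t, string description)
{
    // ... same null check, lock asserts, and catch/trace shape as
    // WaitTaskAsync above, but the wait is:
    t.Wait(); // blocks the calling thread
}
```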

**Edit 3 — `ScheduleIdleTimer` uses `.Unwrap()`:**

```csharp
private void ScheduleIdleTimer(TimeSpan timeToIdle)
{
    Debug.Assert(Monitor.IsEntered(this.connectionLock));
    this.idleTimer = this.idleTimerPool.GetPooledTimer((int)timeToIdle.TotalSeconds);
    // IMPORTANT: .Unwrap() is essential here. Without it, idleTimerTask
    // would be Task<Task> and would complete when OnIdleTimerAsync
    // returns its inner Task (at the first await), not when it
    // finishes. StopIdleTimer() waits on idleTimerTask during
    // shutdown; if idleTimerTask completes early, shutdown proceeds
    // while OnIdleTimerAsync is still running, causing
    // use-after-dispose on the connection. Do not remove .Unwrap().
    this.idleTimerTask = this.idleTimer.StartTimerAsync()
        .ContinueWith(this.OnIdleTimerAsync, TaskContinuationOptions.OnlyOnRanToCompletion)
        .Unwrap();
    // ... existing failure-trace ContinueWith chain unchanged
}
```
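
The `Task<Task>` trap that comment guards against can be seen in
isolation. A minimal standalone console sketch (plain TPL
behavior, not SDK code; `InnerWorkAsync` stands in for
`OnIdleTimerAsync`):

```csharp
using System;
using System.Threading.Tasks;

static class UnwrapDemo
{
    // Stand-in for OnIdleTimerAsync: an async method whose work
    // continues well past its first await.
    static async Task InnerWorkAsync(Task precedent)
    {
        await Task.Delay(500);
        Console.WriteLine("inner work finished");
    }

    static async Task Main()
    {
        // Without Unwrap: ContinueWith(Func<Task, Task>) yields
        // Task<Task>. The outer task completes as soon as
        // InnerWorkAsync returns its Task (at the first await),
        // not when the inner work finishes.
        Task<Task> wrapped = Task.Delay(10).ContinueWith(InnerWorkAsync);
        await wrapped;
        // Prints False: the 500 ms of inner work is still running.
        Console.WriteLine($"outer done; inner done? {wrapped.Result.IsCompleted}");

        // With Unwrap: the flattened task completes only when
        // InnerWorkAsync has fully finished.
        Task unwrapped = Task.Delay(10).ContinueWith(InnerWorkAsync).Unwrap();
        await unwrapped; // "inner work finished" has printed by now
    }
}
```

Awaiting the un-unwrapped outer task from `StopIdleTimer()` is
exactly the early-completion shutdown race the comment describes.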

## 3. Why this bug is hard to test

The minimum effective idle-timer arm in this codebase is **750
seconds** (12.5 minutes): `StoreClientFactory` enforces a 600 s
floor on `IdleTcpConnectionTimeout`, and `Connection` adds a
`2 * (sendHang + receiveHang) = 150 s` race buffer on top before
the dispatcher idle timer is armed (the arithmetic is spelled out
below). Any single-client test that wants to observe an idle-timer
fire must idle the SDK for at least that long; CI suites do not.
Beyond runtime, the production trigger requires sustained scale
and timing conditions across many channels — connection counts,
partition distribution, and backend replica state — that a single
test client cannot synthesize on demand. This is why #4393 sat
from 2024 to 2026 without being caught: nobody runs a 13-minute
idle test in CI, and even when one is run, the conditions that
turn "timer fires" into "thread pool starves" are not
deterministic from one client. The testing strategy below reflects
this: deterministic unit-test evidence at the changed line, an
integration test as a wiring guard, plus the one investigation run
that did reproduce the end-to-end pathology, preserved as an
artifact.
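
The 750 s figure decomposes as below. The individual
hang-detection values are assumptions based on our reading of the
direct-mode defaults; only the 600 s floor and the 150 s buffer
total are stated by this report:

```csharp
// Back-of-envelope for the minimum idle-timer arm. The 600 s floor
// comes from StoreClientFactory (see above); sendHang = 10 s and
// receiveHang = 65 s are ASSUMED direct-mode hang-detection defaults.
double idleFloorSeconds = 600;                        // floor on IdleTcpConnectionTimeout
double sendHangSeconds = 10, receiveHangSeconds = 65; // assumed defaults
double raceBufferSeconds = 2 * (sendHangSeconds + receiveHangSeconds); // 150 s
double minimumArmSeconds = idleFloorSeconds + raceBufferSeconds;       // 750 s = 12.5 min
```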

## 4. Evidence

### 4a. Unit test (canonical evidence)

`Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/DispatcherIdleTimerFixTests.cs`
isolates the exact line the fix changes. It uses
`InternalsVisibleTo` and reflection to invoke `WaitTaskAsync` and
`WaitTask` directly on a `Dispatcher` instance, passing each a
`TaskCompletionSource<bool>().Task` that is never completed — the
same shape `OnIdleTimer` saw in production when it called
`.Wait()` on a receive loop. Probes measure thread-pool behavior
during the wait window; the test's shape is sketched below.
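A condensed sketch of the async-path case. Names and setup are
simplified: `CreateDispatcher()` is a hypothetical stand-in for
the real test's construction, and the usual `System.Reflection`
and MSTest usings are assumed. This is illustrative, not the
actual test file:

```csharp
[TestMethod]
public async Task WaitTaskAsync_YieldsCallingThread_WhenTaskNeverCompletes()
{
    Dispatcher dispatcher = CreateDispatcher(); // hypothetical helper

    // Same shape as the stuck receive loop: a task that never completes.
    Task neverCompletes = new TaskCompletionSource<bool>().Task;

    MethodInfo waitTaskAsync = typeof(Dispatcher).GetMethod(
        "WaitTaskAsync", BindingFlags.Instance | BindingFlags.NonPublic);

    // The async path returns a pending Task immediately; the calling
    // thread is not parked. (The sync WaitTask, invoked the same way,
    // blocks right here for the full measurement window.)
    Task pending = (Task)waitTaskAsync.Invoke(
        dispatcher, new object[] { neverCompletes, "probe" });

    Assert.IsFalse(pending.IsCompleted); // still waiting in the background
    await Task.Delay(100);
    Assert.IsFalse(pending.IsCompleted); // and it never blocked us to check
}
```
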
Result over 5/5 runs: the **async path** returns control to the
caller immediately (the `await` yields the calling thread; the
awaited task remains pending in the background, with 10+ distinct
thread IDs observed in `WaitingForActivation` over the probe
window). The **sync path** blocks the calling thread for the full
measurement window — directly demonstrating the pre-fix pathology
that, multiplied across N idle channels in production, exhausts
the pool. This test runs in under 1 second per case and is the
canonical proof that the fix changes the thread-blocking behavior
at the line that matters.

### 4b. Integration test (regression guard)

`Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/RntbdIdleTimerStarvationTests.cs`
opens N=50 channels against a live Azure Cosmos DB account by
forcing `MaxRequestsPerTcpConnection=1` and firing N concurrent
reads (client configuration sketched at the end of this
subsection), then waits 810 seconds for every channel's idle timer
to fire. It asserts:

- `delta < 10` (peak − baseline thread count) — guards against the
  thread-pool growth signature of the production bug.
- `timersFired > 0` — guards against `.Unwrap()` chain breakage:
  if a future change removes `.Unwrap()`, `OnIdleTimerAsync` stops
  being reached via the timer continuation, and this assertion
  catches it. The count comes from a `TraceListener` attached (by
  reflection) to `DefaultTrace.TraceSource`, observing the
  `TraceInformation` line at `OnIdleTimerAsync` entry; a sketch of
  the listener follows this list. Reflection is used because
  `DefaultTrace.TraceSource` is internal; adding a public test
  seam was considered and rejected as unjustified API expansion
  for a regression-only assertion.
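A sketch of the `timersFired` listener. The listener class below
is a plausible shape; the commented attach code uses type and
member names that are our assumptions about the SDK's internals,
not a documented API:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Counts idle-timer fires by watching for the trace line emitted at
// OnIdleTimerAsync entry. TraceInformation routes through the
// listener's WriteLine by default.
sealed class IdleTimerFireListener : TraceListener
{
    public int TimersFired;

    public override void Write(string message) { }

    public override void WriteLine(string message)
    {
        if (message != null && message.Contains("Idle timer fired."))
        {
            Interlocked.Increment(ref this.TimersFired);
        }
    }
}

// Attachment via reflection (assumed internal shape):
// Type defaultTrace = sdkAssembly.GetType("Microsoft.Azure.Documents.DefaultTrace");
// TraceSource source = (TraceSource)defaultTrace
//     .GetProperty("TraceSource", BindingFlags.Static | BindingFlags.Public | BindingFlags.NonPublic)
//     .GetValue(null);
// source.Listeners.Add(new IdleTimerFireListener());
```
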
The previous max-probe-latency assertion was dropped: across six
clean runs of the integration test on the unpatched baseline, the
latency clustered at ~1001 ms regardless of N, which is a
probe-implementation ceiling artifact, not a Dispatcher signal.
Asserting on it would either flap or test the wrong thing.
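
For reference, the channel-forcing client configuration described
above has roughly this shape (a sketch; `endpoint` and `authKey`
are placeholders, and `using Microsoft.Azure.Cosmos;` is assumed):

```csharp
// Force one request per TCP connection so N concurrent reads open
// N distinct channels, and use the minimum idle timeout so each
// channel's idle timer arms as early as the SDK allows.
CosmosClient client = new CosmosClient(endpoint, authKey, new CosmosClientOptions
{
    ConnectionMode = ConnectionMode.Direct,
    MaxRequestsPerTcpConnection = 1,
    IdleTcpConnectionTimeout = TimeSpan.FromSeconds(600), // StoreClientFactory floor
});
```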

### 4c. Bug-exists-in-environment evidence

A single integration run on the unpatched baseline at N=50 against
the same live account, with `Stopwatch` instrumentation around
every `.Wait()` call inside `WaitTask` (sketched after the
numbers), measured:

- Max probe latency: **48,093 ms**
- Thread count delta: **46** (baseline → peak during the idle-fire window)
- Probes never executed: 2
- `.Wait()` durations:
  - Median: 12.5 s
  - Mean: 21.2 s
  - p95: 64.1 s
  - p99: 67.6 s
  - Max: 67.6 s
- 67/110 `.Wait()` calls blocked ≥1 s; 57/110 blocked ≥10 s
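
The instrumentation was a throwaway diagnostic of roughly this
shape (a sketch, not preserved code; `waitDurations` is a
hypothetical collector, and `System.Diagnostics` and
`System.Collections.Concurrent` usings are assumed):

```csharp
// Temporary diagnostic wrapped around the pre-fix blocking wait in
// WaitTask; the collected durations feed the percentile summary above.
private static readonly ConcurrentBag<TimeSpan> waitDurations = new();

// ... inside WaitTask, replacing the bare t.Wait():
Stopwatch sw = Stopwatch.StartNew();
t.Wait();
sw.Stop();
waitDurations.Add(sw.Elapsed); // blocked-wait duration for this call
```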

This run reproduced the production pathology once. Six subsequent
runs (3 at N=50, 3 at N=200) did not reproduce it; all six showed
the ~1001 ms ceiling and delta=0. We interpret this as
confirmation that the bug exists in this environment but is not
reliably triggerable from a single test client at the scales we
tested. The numbers above are preserved verbatim as an artifact;
we did not chase the question of why one run reproduced and others
did not (likely Azure-side timing, network conditions, or backend
replica state — outside our control).

## 5. Results table

| Test | N | Max latency | Thread delta | Result |
|------|---|-------------|--------------|--------|
| Unit test (sync path) | n/a | (blocks calling thread) | n/a | Demonstrates pre-fix behavior |
| Unit test (async path) | n/a | (yields calling thread) | n/a | 5/5 PASS |
| Integration baseline (single dramatic run) | 50 | 48,093 ms | 46 | Bug reproduced |
| Integration baseline (6 subsequent runs) | 50, 200 | ~1001 ms | 0 | Bug not reproduced |
| Integration fix (sanity run) | 50 | 900 ms | 0 | PASS |

## 6. What this PR does NOT do

- Does not add `DisposeAsync` or `IAsyncDisposable` to any class
- Does not modify `Channel`, `ChannelDictionary`,
  `LoadBalancingChannel`, `LoadBalancingPartition`,
  `LbChannelState`, or `IChannel`
- Does not change `bool disposed` to `int disposed` or add
  `Interlocked.CompareExchange` guards
- Does not modify any public API surface
- Does not change the behavior of the existing sync `WaitTask`
  (preserved untouched in case other callers exist)
- Only modifies `Dispatcher.cs`, plus one new unit-test file and
  one reshaped integration-test file

## 7. Limitations

- Tested only on Linux (matches the production bug environment;
  Windows not tested in this round)
- Test endpoint was a live Azure Cosmos DB account; emulator
  validation not performed
- Integration test does not gate on starvation behavior because
  starvation could not be reliably reproduced at N=50 or N=200
  from a single test client
- The single dramatic baseline reproduction is one data point; we
  did not pursue investigation into why that run reproduced and
  others did not (likely Azure-side timing, network conditions, or
  backend replica state — outside our control)
- The async disposal path described in the original PR description
  (Path 2 in issue #4393) is not addressed by this PR and remains
  a separate follow-up
