Skip to content

RNTBD Dispatcher: Fixes idle timer thread pool starvation (#4393)#5817

Open
NaluTripician wants to merge 4 commits intomsdata/directfrom
users/ntripician/fix-dispatcher-idle-timer-starvation
Open

RNTBD Dispatcher: Fixes idle timer thread pool starvation (#4393)#5817
NaluTripician wants to merge 4 commits intomsdata/directfrom
users/ntripician/fix-dispatcher-idle-timer-starvation

Conversation

@NaluTripician
Copy link
Copy Markdown
Contributor

@NaluTripician NaluTripician commented Apr 28, 2026

Fixes #4393.

Converts Dispatcher.OnIdleTimer to async and wires it via
.ContinueWith(...).Unwrap() so idle timer callbacks no longer
block thread-pool workers on Task.Wait().

Three-edit change to Dispatcher.cs. Full validation, evidence, and
honest limitations in VALIDATION.md.

Marked draft because this branch is the staging copy; the
canonical fix will land in msdata/direct via a separate PR
informed by this validation work.

Validation summary

The fix (Dispatcher.cs, three edits):

  • OnIdleTimerOnIdleTimerAsync (sync → async Task)
  • New WaitTaskAsync alongside the still-present sync WaitTask
  • ScheduleIdleTimer updated to .ContinueWith(OnIdleTimerAsync, ...).Unwrap() with safety comment explaining why .Unwrap() is essential (without it the timer task completes early and disposal can race the still-running continuation)

The omitted body of OnIdleTimerAsync is byte-for-byte identical to the pre-fix OnIdleTimer body: lock blocks remain synchronous, no await inside any lock, and the existing Debug.Assert(!Monitor.IsEntered(...)) guards on callLock and connectionLock are preserved.

Why this bug is hard to test: The Dispatcher's idle timer arms with a 750-second minimum (StoreClientFactory.cs 600s floor + Connection.cs 2*(sendHang+receiveHang)=150s race buffer). Production starvation requires sustained scale and timing conditions. This is why issue #4393 sat from 2024 to 2026 without a CI catch — single-client tests cannot reliably synthesize the production trigger.

Evidence:

Test N Max latency Thread delta Result
Unit test (sync path) n/a (blocks calling thread) n/a Demonstrates pre-fix behavior
Unit test (async path) n/a (yields calling thread) n/a 5/5 PASS
Integration baseline (single dramatic run) 50 48,093ms 46 Bug reproduced
Integration baseline (6 subsequent runs) 50, 200 ~1001ms 0 Bug not reproduced
Integration fix (sanity run) 50 900ms 0 PASS
  • Unit test DispatcherIdleTimerFixTests (canonical evidence): uses InternalsVisibleTo + reflection to exercise WaitTaskAsync and WaitTask directly with a TaskCompletionSource that is never completed. Probe-based thread observation: async path yields the calling thread; sync path blocks it. 5/5 passing runs.
  • Integration test RntbdIdleTimerStarvationTests (regression guard): asserts wiring stays intact (timersFired > 0 via reflection-based TraceListener on DefaultTrace.TraceSource, observing the new TraceInformation line at OnIdleTimerAsync entry) and thread delta stays bounded (< 10). Latency assertion was dropped because the ~1001ms ceiling observed across 6 clean runs is a probe-implementation artifact, not a Dispatcher signal.
  • Bug-exists-in-environment evidence: one integration run during investigation reproduced the production pathology end-to-end on the unpatched baseline at N=50: 48,093ms max probe latency, 46-thread pool growth, 2 probes never executed, .Wait() durations median 12.5s / mean 21.2s / p95 64.1s / max 67.6s, 67/110 calls blocked ≥1s. Six subsequent runs (3 at N=50, 3 at N=200) did not reproduce — bug is real in this environment but not reliably triggerable from a single test client at the scales we tested.

What this PR does NOT do:

  • Does not add DisposeAsync / IAsyncDisposable
  • Does not modify Channel, ChannelDictionary, LoadBalancingChannel, LoadBalancingPartition, LbChannelState, or IChannel
  • Does not change bool disposed to int disposed or add Interlocked.CompareExchange guards
  • Does not modify any public API surface
  • Does not change behavior of the existing sync WaitTask (preserved untouched in case other callers exist)
  • Only modifies Dispatcher.cs plus one new unit test plus one reshaped integration test (and supporting Linux test-run scaffolding under scripts/)

Limitations:

  • Tested only on Linux (matches production bug environment; Windows not tested in this round)
  • Test endpoint was a live Azure Cosmos DB account; emulator not validated
  • Integration test does not gate on starvation behavior because starvation could not be reliably reproduced at N=50 or N=200 from a single test client
  • The single dramatic baseline reproduction is one data point; we did not pursue why that run reproduced and others did not (likely Azure-side timing, network conditions, or backend replica state — outside our control)
  • The async disposal path described as Path 2 in issue [FEATURE REQ] Avoid blocking Asynchronous calls as it leads to thread starvation. #4393 is not addressed by this PR and remains a separate follow-up

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
1 pipeline(s) were filtered out due to trigger conditions.

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
1 pipeline(s) were filtered out due to trigger conditions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant