fix(stress): per-fixture hang watchdog + WaitForIdle cap raise#431
Merged
Conversation
Two race-driven flakes surfaced in the 500x selftest stress run on main (GitHub Actions run 26513480330): 4/10000 iterations failed, with 2 of 4 in NativeDocking_* fixtures. WaitForIdleAsync silently returned after 10 low-priority dispatcher yields whether or not the render loop had settled. Under CI VM contention this dropped tests onto a half-rendered tree, producing e.g. SplitDocArea_CloseNonLast finding no `body:d1` on iter 3/500. Raise the default cap to 50 and log a Debug.WriteLine when it hits, so the next flake is greppable instead of silent. The host-level HangWatchdogLoop used a global 60 s constant — equal to EventSubscriptionLeakBaseline's own FixtureTimeout. The two raced on the worst-case CI tick (~60 s for 200 renders + 200 reconciles) and the watchdog won, FailFasting the shard instead of letting the fixture's graceful timeout fire first. Make the watchdog threshold per-fixture: max(60 s, FixtureTimeout + 30 s). The fixture's own budget always gets first crack; dispatcher-starvation FailFast only fires after that. Also: bump EventSubscriptionLeakBaseline FixtureTimeout 60 → 120 s (auto-bumps its watchdog to 150 s under the new rule), and add heartbeat H.Check calls every 25 cycles so a future hang reveals which loop window it lived in rather than just "no progress". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens the selftest stress infrastructure to reduce race-driven flakes under CI contention by improving “wait until UI settles” behavior and making the hang watchdog threshold fixture-aware (so long-budget fixtures don’t get fail-fasted prematurely).
Changes:
- Increase
ReactorHost.WaitForIdleAsyncyield cap (10 → 50) and add a diagnostic trace when the cap is hit. - Make the hang watchdog threshold per-fixture (
max(60s, FixtureTimeout + 30s)) instead of a single global ceiling. - Extend
EventSubscriptionLeakBaselinefixture timeout (60s → 120s) and add periodic TAP “heartbeat” checks for better log locality during stress.
Show a summary per file
| File | Description |
|---|---|
| src/Reactor/Hosting/ReactorHost.cs | Raises WaitForIdleAsync yield cap and logs when the cap is hit. |
| tests/Reactor.AppTests.Host/SelfTest/SelfTestRunner.cs | Introduces per-fixture hang thresholds for the watchdog and adjusts progress publication. |
| tests/Reactor.AppTests.Host/SelfTest/SelfTestFixtureBase.cs | Updates documentation to reflect the new watchdog threshold rule. |
| tests/Reactor.AppTests.Host/SelfTest/Fixtures/NativeDockingReliabilityFixture.cs | Increases leak baseline timeout and adds TAP heartbeat checks. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 4/4 changed files
- Comments generated: 4
- WaitForIdleAsync: use RunContinuationsAsynchronously so await continuations don't run inline on the dispatcher at Low priority (defeating the yield loop's purpose); honour TryEnqueue failures (queue shutdown) by completing the TCS rather than hanging. - SelfTestRunner: publish a baseline FixtureProgress *before* calling SelfTestFixtureRegistry.Create, so the watchdog can still attribute a hang if construction itself blocks. Upgrade the per-fixture threshold once FixtureTimeout is known. - Fix misleading comments in EventSubscriptionLeakBaseline that claimed the heartbeat H.Check calls reset the watchdog — they don't; the watchdog uses elapsed-since-fixture-start. Heartbeats are log breadcrumbs only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
Stress run summary for https://github.com/microsoft/microsoft-ui-reactor/actions/runs/26526776797:
Failed tests observed during the run:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two race-driven flakes surfaced in the recent 500x selftest stress run on
main(CI Stress run 26513480330): 4 of 10000 iterations failed (~99.96% pass), with 2 of 4 inNativeDocking_*fixtures. Investigation showed both root causes were in test infrastructure, not the new docking window manager work (#418/#420/#421/#415).Root causes
ReactorHost.WaitForIdleAsync(maxYields: 10)silently returned after 10 low-priority dispatcher yields whether or not the render loop had actually settled. Under CI VM contention, callers likeHarness.Render()moved on against a half-rendered tree — producing flakes likeNativeDocking_RoleAware_SplitDocArea_CloseNonLastfailing toFindText("body:d1")on iteration 3/500.SelfTestRunner.HangWatchdogLoopused a global 60 s constant equal toEventSubscriptionLeakBaseline's ownFixtureTimeout. On worst-case CI ticks (~60 s for 200 renders + 200 reconciles, as the fixture's own comment acknowledged), the two raced — the watchdog won and FailFasted the shard instead of letting the fixture's graceful timeout fire.Changes
src/Reactor/Hosting/ReactorHost.csWaitForIdleAsyncdefault cap 10 → 50;Debug.WriteLinewhen the cap actually hits so the next flake leaves a greppable trailtests/.../SelfTest/SelfTestRunner.csFixtureProgresscarries a per-fixtureHangThreshold = max(60 s, FixtureTimeout + 30 s); watchdog uses that, not the global constanttests/.../SelfTest/SelfTestFixtureBase.cstests/.../Fixtures/NativeDockingReliabilityFixture.csEventSubscriptionLeakBaselineFixtureTimeout60 → 120 s; heartbeatH.Checkafter warmup and every 25 cyclesThe heartbeat
H.Checkcalls don't reset the watchdog (it's still elapsed-since-fixture-start), but they makeoklines visible in the log so a future hang reveals which 25-cycle window it lived in, not just "no progress".Test plan
src/Reactorandtests/Reactor.AppTests.Hostboth compile clean (0 errors, only pre-existing XML-comment / nullability warnings)NativeDocking_*fixtures orDataGrid_KeyboardAndPrivateRenderPathsEventSubscriptionLeakBaselineper-iteration time stays well under the 120 s budget (target: still ~15 s local, ~30-60 s under CI contention)[Reactor.WaitForIdle] yield cap hitlines — if any appear, the 50-yield cap is being approached and we'll want to investigate the underlying render loop🤖 Generated with Claude Code