Commit 7558665

Add off-dispatcher AOT hang watchdog and clean up stale selftest skips (#358)
* Add off-dispatcher AOT hang watchdog and clean up stale selftest skips

Provides a debugging story for AOT selftests that hang the host app, and restores test coverage for fixtures that were skipped speculatively but actually pass under AOT today.

Watchdog (tests/Reactor.AppTests.Host/SelfTest/SelfTestRunner.cs):
* Background Thread tracks the currently running fixture as an immutable FixtureProgress record published via Volatile.Read/Write -- no multi-field consistency race when the publisher and watcher overlap.
* Polls every 1s and FailFasts after HangTimeout (default 60s, configurable via REACTOR_SELFTEST_HANG_TIMEOUT_SECONDS, 0 disables; auto-disabled when Debugger.IsAttached).
* On hang, writes "Bail out! HANG_DETECTED: <name> ran <s>s ..." to stdout and stderr before Environment.FailFast so a Watson minidump is produced when DOTNET_DbgEnableMiniDump=1.

CLI / environment:
* --no-aot-skip on the Host (Program.cs) flips SelfTestRunner.SkipAotPatterns so a developer can run a single DefaultAotSkipPatterns entry against the AOT-published binary.
* REACTOR_SELFTEST_HOST_EXE on the MSTest harness (Reactor.SelfTests/SelfTestBatch.cs) overrides the auto-discovered Host binary so the same harness validates the AOT-published .exe from a publish/ directory.

Parent attribution (SelfTestBatch.cs):
* Parses HANG_DETECTED: <name> from both stdout and stderr; on process timeout or hang signal, attributes the failure to the named fixture with a repro command + dump-env instructions, instead of cascading through _initError (which would fail every unrelated fixture).
* Falls back to the last "# Running:" line when the watchdog did not fire in time.

Skip-list cleanup (SelfTestRunner.cs):
* Removes 84 stale entries that pass under AOT today.
* Replaces the previous mix of explicit names and Prefix* wildcards with 108 explicit fixture names grouped by failure mode (NATIVE_CRASH vs ASSERT_FAIL) so the skip list documents what is actually broken.

Empirical probe (tests/Reactor.AppTests.Host/probe-aot-skips.ps1):
* New script that expands DefaultAotSkipPatterns via --list-fixtures and runs each entry in isolation under --no-aot-skip, categorising by exit code + watchdog signal, writing a CSV.
* Categories observed on this baseline: 84 PASS (removed), 61 NATIVE_CRASH (0xC0000409 / STATUS_STACK_BUFFER_OVERRUN), 47 ASSERT_FAIL, 0 HANG. The hang watchdog stays as a CI safety net (synthetic Thread.Sleep injection validated end-to-end) even though no current entry actually hangs.

Docs (docs/aot-support.md):
* New "Debugging an AOT selftest hang" section covering the three-bucket failure-mode taxonomy, minidump env-var setup, isolated-repro workflow, watchdog tuning, and the probe script.

Validation:
* JIT selftests: 735/735 passing, no false positives.
* Synthetic hang (Thread.Sleep(int.MaxValue) injected into one fixture with REACTOR_SELFTEST_HANG_TIMEOUT_SECONDS=3): watchdog fired in 3s, process exited 0xE0434F4D (COR_E_FAILFAST), stack pointed at HangWatchdogLoop. Synthetic hang reverted.
* Cleaned AOT run: 627 fixtures, 2358 ok checks, 108 skipped, 1 known pre-existing flaky assertion, 0 hangs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR #358 review feedback

* SelfTestBatch.cs: Set _abortedReason on hang/timeout so the Fixture test method reports never-executed fixtures as Assert.Inconclusive instead of cascading "was not reported by the Host" failures across every fixture downstream of the hang.
* SelfTestRunner.cs: Fix SkipAotPatterns XML doc -- true means patterns are applied (default), false means ignored. Doc had it inverted.
* SelfTestRunner.cs + docs/aot-support.md: Fix stale references to `.aot_runs/probe_skips.ps1`; the script lives at tests/Reactor.AppTests.Host/probe-aot-skips.ps1.
* probe-aot-skips.ps1: STATUS_STACK_BUFFER_OVERRUN exit code is -1073740791 (0xC0000409 widens to Int64 in PowerShell, so `-eq 0xC0000409` never matched a negative Process.ExitCode). Verified locally that a known crasher now categorises as NATIVE_CRASH via the explicit constant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 3c3bed9 commit 7558665

5 files changed

Lines changed: 516 additions & 186 deletions

docs/aot-support.md

Lines changed: 63 additions & 0 deletions
@@ -42,6 +42,69 @@ These subsystems compile cleanly with `IsAotCompatible=true` (the warnings are s
- **Suppressions are temporary.** Every `[UnconditionalSuppressMessage("Trimming", ...)]` or `("AOT", ...)` in this repo is a TODO. The justification field names the reflection use; tracking is folded into issue #70.
- **The benchmark canary.** `tests/stress_perf/StressPerf.Reactor` (and the `StressPerf.Direct`/`ReactorGrid` siblings) set `PublishAot=true`. If they stop publishing, an AOT regression has landed in the framework.

## Debugging an AOT selftest hang

`tests/Reactor.AppTests.Host` maintains an explicit skip list of fixtures that hang, crash, or assert-fail under NativeAOT (`SelfTestRunner.DefaultAotSkipPatterns`). When you remove an entry from that list and the published Host hangs, crashes, or asserts instead of producing output, use the following workflow.

### Failure mode at a glance

Probing each `DefaultAotSkipPatterns` entry in isolation (via the `tests/Reactor.AppTests.Host/probe-aot-skips.ps1` helper) reveals three buckets. Pick the matching workflow:

| Bucket | Symptom | Debug step |
|---|---|---|
| **Hang** (dispatcher starvation) | Fixture's `RunAsync()` synchronously blocks the UI thread; the in-band 15 s `Task.Delay` watchdog cannot fire because the dispatcher isn't pumping. | The off-dispatcher hang watchdog (60 s default, configurable via `REACTOR_SELFTEST_HANG_TIMEOUT_SECONDS`) writes `Bail out! HANG_DETECTED: <fixture> …` to stdout and stderr, flushes, then calls `Environment.FailFast`. With `DOTNET_DbgEnableMiniDump=1` set, this produces a Watson minidump. |
| **Native crash** | Process exits with `0xC0000409` (`STATUS_STACK_BUFFER_OVERRUN`), the AOT runtime's `FailFast` for unhandled managed exceptions. No `# Total failures:` line because the process terminated abruptly. | Set both `DOTNET_DbgEnableMiniDump=1` and `COMPlus_DbgEnableMiniDump=1` (matching `DevtoolsStressE2ERunner`) before launching. Open the resulting `.dmp` with `dotnet-dump analyze` or WinDbg; look at the dispatcher (UI) thread's stack for the throwing call. |
| **Assertion failure** | TAP output already shows `not ok <name> - <reason>`; process exits 1 cleanly. | No special tooling needed: read the TAP failure line. The fixture and check name are in the message. |
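
The native-crash exit code is easy to mis-compare, because `Process.ExitCode` is a signed 32-bit `int` while `0xC0000409` does not fit in one. The following is a minimal C# sketch of the comparison, not code from the repo; the class and member names (`ExitCodes`, `IsStackBufferOverrun`, `NaiveCompare`) are hypothetical.

```csharp
using System;

public static class ExitCodes
{
    // STATUS_STACK_BUFFER_OVERRUN reinterpreted as the signed 32-bit value
    // that Process.ExitCode reports: -1073740791.
    public const int StackBufferOverrun = unchecked((int)0xC0000409);

    // Correct check: compare int against int.
    public static bool IsStackBufferOverrun(int exitCode) =>
        exitCode == StackBufferOverrun;

    // Pitfall: comparing int against uint promotes both operands to long,
    // so a negative exit code can never equal 3221226505.
    public static bool NaiveCompare(int exitCode)
    {
        uint status = 0xC0000409;
        return exitCode == status;
    }
}
```

With an exit code of `-1073740791`, `IsStackBufferOverrun` returns `true` and `NaiveCompare` returns `false`, which mirrors the widening mismatch described for the PowerShell probe.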

### Parent attribution

The MSTest harness (`Reactor.SelfTests.SelfTestBatch`) parses the `HANG_DETECTED` signal from both stdout and stderr, and on process timeout falls back to the last `# Running:` line. Either way, the failure surfaces against the named fixture in the failing test's detail with a copy-pasteable repro command. It does *not* cascade through `_initError` (which would mark every unrelated fixture failed).
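
The attribution order described above (explicit `HANG_DETECTED` signal first, last `# Running:` line as fallback) can be sketched roughly as follows. This is an illustration only, not the harness's actual code: `HangAttribution` and `FindSuspect` are hypothetical names, and the sketch assumes fixture names contain no whitespace and that progress lines have the shape `# Running: <fixture>`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HangAttribution
{
    // Assumed line shape, based on the watchdog message quoted in the docs:
    //   Bail out! HANG_DETECTED: <fixture> ran <s>s ...
    private static readonly Regex HangLine =
        new Regex(@"HANG_DETECTED:\s*(?<name>\S+)", RegexOptions.Compiled);

    // Assumption: progress lines look like "# Running: <fixture>".
    private const string RunningPrefix = "# Running:";

    // Returns the fixture to blame, or null if the output names none.
    public static string? FindSuspect(IEnumerable<string> outputLines)
    {
        string? lastRunning = null;
        foreach (var line in outputLines)
        {
            // An explicit watchdog signal wins over the fallback.
            var match = HangLine.Match(line);
            if (match.Success)
                return match.Groups["name"].Value;

            // Remember the most recent fixture that started running.
            if (line.StartsWith(RunningPrefix, StringComparison.Ordinal))
                lastRunning = line.Substring(RunningPrefix.Length).Trim();
        }
        return lastRunning;
    }
}
```

Feeding the combined stdout and stderr lines through one function keeps the fallback deterministic: the watchdog signal is preferred whenever it is present, and the last started fixture is used only when the process timed out before the watchdog fired.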

### Capturing a dump

Set these env vars before launching the Host:

```text
DOTNET_DbgEnableMiniDump=1
DOTNET_DbgMiniDumpType=2
DOTNET_DbgMiniDumpName=%TEMP%\reactor-selftest-%p.dmp
COMPlus_DbgEnableMiniDump=1
COMPlus_DbgMiniDumpType=2
COMPlus_DbgMiniDumpName=%TEMP%\reactor-selftest-%p.dmp
```

Both `Environment.FailFast` (hang path) and the AOT runtime's unhandled-exception fast-fail (crash path) honour these vars.

### Isolated repro

Once you have the offending fixture name, repro it standalone against the AOT-published binary:

```powershell
dotnet publish tests/Reactor.AppTests.Host -p:PublishAotInternal=true -p:Platform=x64 -r win-x64 -c Release
$env:DOTNET_DbgEnableMiniDump=1
$env:DOTNET_DbgMiniDumpName="$env:TEMP\reactor-hang-%p.dmp"
& "<publish-dir>\Reactor.AppTests.Host.exe" --self-test --no-aot-skip --filter <FixtureName>
```

`--no-aot-skip` bypasses the entire skip list so the targeted fixture actually runs. To drive the MSTest harness against the AOT-published binary, point it at the publish output:

```powershell
$env:REACTOR_SELFTEST_HOST_EXE="<publish-dir>\Reactor.AppTests.Host.exe"
dotnet test tests/Reactor.SelfTests
```

### Disabling the watchdog while debugging

When stepping through a fixture in a debugger, set `REACTOR_SELFTEST_HANG_TIMEOUT_SECONDS=0` to suppress the hang watchdog entirely. (It also auto-disables whenever `Debugger.IsAttached` returns true at poll time.)
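
The two opt-outs above (a zero timeout and an attached debugger) can be pictured as a pure decision evaluated on each poll. The sketch below is an assumption-laden illustration rather than the code in `SelfTestRunner.cs`: the `FixtureProgress` record name comes from the commit message, but its fields, the `HangWatchdogSketch` class, and the handling of unset, unparseable, or negative timeout values are all hypothetical.

```csharp
#nullable enable
using System;
using System.Threading;

// Immutable snapshot: the publisher swaps the whole reference, so the watcher
// never pairs one fixture's name with another fixture's start time.
// The field names here are assumptions.
public sealed record FixtureProgress(string Name, DateTime StartedUtc);

public static class HangWatchdogSketch
{
    private static FixtureProgress? _current;

    // Publisher side: called when a fixture starts (or with null when idle).
    public static void Publish(FixtureProgress? progress) =>
        Volatile.Write(ref _current, progress);

    // Watcher side: called from the background thread on each poll.
    public static FixtureProgress? Snapshot() => Volatile.Read(ref _current);

    // Null means "watchdog disabled". Zero disables, as documented; treating
    // negative values as disabled and falling back to 60 s for unset or
    // unparseable values are assumptions of this sketch.
    public static TimeSpan? ParseTimeout(string? raw)
    {
        if (string.IsNullOrWhiteSpace(raw) || !int.TryParse(raw, out var seconds))
            return TimeSpan.FromSeconds(60);
        return seconds <= 0 ? null : TimeSpan.FromSeconds(seconds);
    }

    // The debugger flag is passed in per poll, matching "at poll time".
    public static bool ShouldFailFast(
        FixtureProgress? progress, DateTime nowUtc, TimeSpan? timeout, bool debuggerAttached)
    {
        if (timeout is null || debuggerAttached || progress is null)
            return false;
        return nowUtc - progress.StartedUtc >= timeout.Value;
    }
}
```

Keeping the decision free of threads and environment access makes it straightforward to unit-test; a real watchdog loop would read the environment variable once, then call `Snapshot()` and `ShouldFailFast(...)` with `Debugger.IsAttached` on every poll.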

### Categorising the skip list (`tests/Reactor.AppTests.Host/probe-aot-skips.ps1`)

The repo includes a probe script that runs every `DefaultAotSkipPatterns` entry in isolation under `--no-aot-skip` and writes a CSV summarising whether each one passes, hangs, crashes natively, or assert-fails. Use it after AOT framework changes to find stale skips that have started passing, and to triage what's still broken.

```powershell
pwsh -NoProfile -File tests\Reactor.AppTests.Host\probe-aot-skips.ps1
```

## When in doubt

If you need a feature listed in the "does not work" table and you're publishing AOT, file an issue against #70 with your scenario. The fix in most cases is a source generator pass; what gets prioritized is driven by who's hitting the wall.

tests/Reactor.AppTests.Host/Program.cs

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,8 @@
     var filterIdx = Array.IndexOf(args, "--filter");
     if (filterIdx >= 0 && filterIdx + 1 < args.Length)
         SelfTestRunner.Filter = args[filterIdx + 1];
+    if (args.Contains("--no-aot-skip"))
+        SelfTestRunner.SkipAotPatterns = false;
     SelfTestRunner.RunAll();
 }
 else if (args.Contains("--devtools-stress"))
