Stress test fixes: visual demo budgets, GC marshaling, InfiniteBasic pump #396

Merged: codemonkeychris merged 1 commit into main from fix/stress-cluster-t-and-i on May 24, 2026

@codemonkeychris
Collaborator

Summary

Three independent fixes for the post-#395 stress-failure landscape (CI run 26348014478, 6/20 failed shards, 7 total failures). After landing #395 (Cluster A), the remaining failures break into:

  • Cluster T (5 / 7): fixtures timing out at the 15s watchdog
  • Cluster I (1 / 7): AsyncResource.InfiniteBasic_MultiplePagesFetched (got 4)
  • Cluster F (1 / 7): FloatingTitleBar_PaneBodyVisible (does not reproduce in isolation; not addressed here)

This PR addresses Cluster T and Cluster I.

1. Tighten visual-demo timing budgets (Cluster T1)

NativeDocking_SplitterProgrammaticVisualDemo was designed with an explicit ~9.7s timing budget (per the comment block at the top of the fixture) against a 15s timeout, assuming ~80ms/render. Under CI load, renders can exceed 200ms, pushing total runtime over the timeout. NativeDocking_PerGroupDropTargetVisualDemo has the same shape with smaller margins.

Reduced pacing delays:

  • Splitter: initial 700ms → 200ms; per-nudge 200ms → 60ms; after-final 400ms → 120ms. New budget ~4.4s.
  • PerGroup: initial 50ms → 20ms; per-step 20ms → 5ms; observer 30ms → 5ms.

Smoke-tested locally: Splitter at 4.8s wall-clock (~1.8s fixture runtime), PerGroup at 7.2s wall-clock (~4.2s fixture runtime), both well under the 15s timeout. The pacing is still slow enough to follow when watching the demos during manual debugging.
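To make the budget arithmetic concrete, here is a minimal sketch of how a pacing budget scales with render cost. The class name, the method, and especially the nudge count of 30 are hypothetical: the fixture's real step count is not stated in this PR, and 30 was picked only because it lands near the quoted ~9.7s and ~4.4s figures when each nudge costs one render.

```csharp
using System;

public static class PacingBudget
{
    // Rough model (an assumption, not the fixture's code): one initial delay,
    // then for each nudge a pacing delay plus one render, then a final delay.
    public static int TotalMs(int initialMs, int perNudgeMs, int afterFinalMs,
                              int nudges, int renderMs)
    {
        return initialMs + nudges * (perNudgeMs + renderMs) + afterFinalMs;
    }
}
```

Under that assumed model, the old Splitter delays (700/200/400) give 9500ms at 80ms per render and 16100ms at 300ms per render, which is past the 15s timeout. The new delays (200/60/120) give 4520ms and 11120ms respectively, so the fixture keeps headroom even when renders are several times slower than planned.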

2. Marshal GC.WaitForPendingFinalizers off the UI dispatcher (Cluster T2)

AsyncResource.Framerate.DataGridScroll, Framerate.DataGridEditMutation, and PropertyGrid_Target_Switching were timing out at 15s but should normally complete in 1-3s. Every Framerate fixture (and several others) starts and/or ends with GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();. Running that on the UI dispatcher thread is a known deadlock anti-pattern: a finalizer that needs to release a UI-thread-affine RCW (e.g. a WinUI control) marshals back to the dispatcher, but the dispatcher is blocked inside WaitForPendingFinalizers waiting for that very finalizer.

Migrated all 11 call sites onto a background thread via await Task.Run(() => { GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect(); }). The UI dispatcher keeps pumping while the finalizer queue drains, so a finalizer that marshals back to the dispatcher can complete instead of deadlocking.

Files affected: AsyncResourceFramerateFixtures.cs (6), AsyncResourceFixtures.cs, AsyncInfiniteResourceFramerateFixtures.cs, DataGridParityFixtures.cs, NativeDockingReliabilityFixture.cs (2).
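Since the same three-call pattern now appears at 11 call sites, one option is to wrap it in a shared helper. The sketch below is only an illustration: the class and method names are hypothetical, and this PR inlines the Task.Run call at each site rather than introducing a helper.

```csharp
using System;
using System.Threading.Tasks;

public static class GcTestHelpers
{
    // Hypothetical helper (not part of this PR). Runs the full
    // collect / drain finalizers / collect sequence on a thread-pool thread
    // so that the awaiting (UI) thread is never blocked inside
    // WaitForPendingFinalizers.
    public static Task CollectOffDispatcherAsync()
    {
        return Task.Run(() =>
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
        });
    }
}
```

A fixture would then call await GcTestHelpers.CollectOffDispatcherAsync(); where it previously ran the three calls inline. The continuation resumes on the caller's context as usual, so the code after the await needs no changes.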

The fix is speculative: the local repro (40 iter × 13 Framerate fixtures, 0 hits) is inconclusive against the ~0.075% CI rate. As a follow-up, I suggest adding DOTNET_DbgEnableMiniDump=1 to the stress workflow env so the next CI hit (if any) yields a definitive dump.
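A quick calculation shows why zero local hits says little. The sketch below assumes the ~0.075% figure is a per-fixture-run failure probability and that runs are independent; both are assumptions, and the helper name is hypothetical.

```csharp
using System;

public static class FlakeMath
{
    // Probability of seeing zero failures across `runs` independent runs
    // when each run fails with probability `perRunFailureRate`.
    public static double ProbabilityOfZeroHits(double perRunFailureRate, int runs)
    {
        return Math.Pow(1.0 - perRunFailureRate, runs);
    }
}
```

With the numbers above, FlakeMath.ProbabilityOfZeroHits(0.00075, 40 * 13) is roughly 0.68. A clean 520-run local pass is therefore the more likely outcome even if the hang were still present, which is why the local result cannot confirm the fix.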

3. Extra render pump in InfiniteBasic page-walk (Cluster I)

InfiniteBasic_MultiplePagesFetched (got 4) failed at shard 9 iter 10. The fixture pumps 2 renders after each ItemAt() call. The fetcher has a 10ms Task.Delay plus an Apply continuation; under CI load, that chain occasionally outlives the 2-pump budget. The fixture's existing comment already acknowledges the analogous race for page 1 and pumps three times in the initial-page path. This PR extends the same pattern to the loop.
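A fixed pump count still encodes a timing guess. As a possible alternative (not what this PR does, which simply adds a third pump), a fixture could pump until the expected state appears, up to a cap. The helper below is a hypothetical sketch; pumpOnce stands in for whatever render call the harness provides, whose actual signature is not shown in this PR.

```csharp
using System;
using System.Threading.Tasks;

public static class PumpHelpers
{
    // Hypothetical helper: pumps up to maxPumps times, stopping early as soon
    // as isDone() reports true. Returns whether the condition was met.
    public static async Task<bool> PumpUntilAsync(
        Func<Task> pumpOnce, Func<bool> isDone, int maxPumps)
    {
        for (int i = 0; i < maxPumps; i++)
        {
            if (isDone())
            {
                return true;
            }
            await pumpOnce();
        }
        // Final check after the last pump.
        return isDone();
    }
}
```

The trade-off is that a condition-based pump hides how many renders the fetch actually needed, whereas a fixed count keeps the fixture's timing expectations explicit.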

Test plan

  • dotnet build — 0 errors, only pre-existing warnings
  • Smoke test affected fixtures locally; all clean
  • Run CI Stress workflow on this branch (200 iter × 20 shards, target=selftests)
  • Confirm Cluster T and Cluster I hits drop to zero

🤖 Generated with Claude Code

…pump

Three independent fixes for the post-#395 stress-failure landscape (CI run
26348014478):

1. Tighten visual-demo pacing so SplitterProgrammaticVisualDemo (~9.7s) and
   PerGroupDropTargetVisualDemo (~9s under load) fit comfortably under a 5s
   target instead of brushing the 15s fixture timeout.

2. Marshal GC.Collect+WaitForPendingFinalizers+GC.Collect onto a background
   thread via Task.Run at all 11 selftest call sites. Running this pattern
   on the UI dispatcher thread is a known deadlock anti-pattern: a finalizer
   that needs to release a UI-thread-affine RCW marshals back to the
   dispatcher, but the dispatcher is blocked inside WaitForPendingFinalizers.
   Speculatively addresses the Framerate.DataGridScroll, Framerate.DataGridEditMutation,
   and PropertyGrid_Target_Switching hangs.

3. Add a third Harness.Render pump to InfiniteBasic's page-walk loop. The
   fetcher's 10ms Task.Delay plus the Apply continuation occasionally
   outlives two pumps on a loaded CI VM; the fixture already pumps three
   times in the initial-page path for the same reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Diff excerpt (the same change at each call site shown):

-   GC.Collect();
-   GC.WaitForPendingFinalizers();
-   GC.Collect();
+   await Task.Run(() =>
+   {
+       GC.Collect();
+       GC.WaitForPendingFinalizers();
+       GC.Collect();
+   });

At one call site the hunk also shows the last line of an explanatory comment above the call:

    // dispatcher were blocked inside WaitForPendingFinalizers.

@codemonkeychris codemonkeychris merged commit 9b33e44 into main May 24, 2026
34 of 39 checks passed
@codemonkeychris codemonkeychris deleted the fix/stress-cluster-t-and-i branch May 24, 2026 03:30
codemonkeychris added a commit that referenced this pull request May 24, 2026
…d scroll poll (#397)

Three independent stress-flake fixes (Clusters T, O, C) per
INVESTIGATION.md follow-up to PR #396:

Cluster T — bump FixtureTimeout to 30s for three render-pump-heavy
fixtures whose budgets are tight on loaded CI runners but not
pathological:
  - DataGridParityFixtures.HookPagingFramerateScroll
  - AsyncResourceFramerateFixtures.DataGridEditMutation
  - NativeDockingSmokeFixture.PerGroupDropTargetVisualDemo

Cluster O — fix race in ReconcileHighlightOverlay.RefreshOrAdd.
DispatcherQueueTimer.Stop() can't dequeue a Tick that the dispatcher
queue has already dispatched, so a stale tick can tear down a sprite
the refresh still wants alive. Now each refresh swaps in a fresh timer
with its own Tick lambda; the stale tick checks ReferenceEquals on
ah.Timer and bails when its identity no longer matches.

Cluster C — DataGrid_ScrollPopulatesData was relying on a fixed 800ms
Render(800) for the scroll-settle → fetch → realization chain (not
tracked by _renderPending). Now waits 800ms baseline (preserves the
fetch-trigger window so ScrollPop_MultipleFetches still triggers) then
polls cells for up to 3s.

Validated locally: all 27 OverlayLifecycle_* fixtures pass, all 5
ScrollPop_* sub-checks pass, 15x local stress on both fixtures
together is clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>