
Stress fixes: framerate/PerGroup timeouts, overlay tick race, DataGrid scroll poll #397

Merged: codemonkeychris merged 1 commit into main from fix/stress-cluster-t-o-c on May 24, 2026

@codemonkeychris (Collaborator)

Summary

Three independent stress-flake fixes, bundled per the cadence set by #396.
Each implements a Net-next-step recorded in INVESTIGATION.md after #396.

  • Cluster T (Path A) — FixtureTimeout => 30s on three render-pump-heavy
    fixtures whose mandatory wall-clock floors are tight against the default
    15 s budget on loaded CI runners. Not deadlocks; budget violations.
    • DataGridParityFixtures.HookPagingFramerateScroll
    • AsyncResourceFramerateFixtures.DataGridEditMutation
    • NativeDockingSmokeFixture.PerGroupDropTargetVisualDemo
  • Cluster O — ReconcileHighlightOverlay tick race. DispatcherQueueTimer.Stop() cannot rescind a Tick the dispatcher queue has already dispatched, so a stale tick can tear down a sprite the new refresh still owns. Fix swaps in a fresh timer on refresh; the stale tick's identity check fails and it bails.
  • Cluster C — DataGrid_ScrollPopulatesData poll. Fixed 800 ms Render
    occasionally undershoots realization on CI → cells=0. Now keeps the
    800 ms baseline (so the scroll-settle window still fires fetches, and the
    sibling ScrollPop_MultipleFetches sanity check still passes) and polls
    cells for up to 3 s.

Test plan

  • dotnet build clean across Reactor + AppTests.Host.
  • OverlayLifecycle_* self-test family (27 fixtures): all green locally.
  • DataGrid_ScrollPopulatesData single-shot: all 5 sub-assertions green
    (cells=41 / 41, calls=7).
  • 15-iter local stress over both fixtures together: 0/15 failures.
  • CI Stress (200 iter × 20 shards selftests on this branch) to confirm
    Cluster T/O/C all stay at zero hits.

🤖 Generated with Claude Code

…d scroll poll

Three independent stress-flake fixes (Clusters T, O, C) per
INVESTIGATION.md follow-up to PR #396:

Cluster T — bump FixtureTimeout to 30s for three render-pump-heavy
fixtures whose budgets are tight on loaded CI runners but not
pathological:
  - DataGridParityFixtures.HookPagingFramerateScroll
  - AsyncResourceFramerateFixtures.DataGridEditMutation
  - NativeDockingSmokeFixture.PerGroupDropTargetVisualDemo

Cluster O — fix race in ReconcileHighlightOverlay.RefreshOrAdd.
DispatcherQueueTimer.Stop() can't dequeue a Tick that the dispatcher
queue has already dispatched, so a stale tick can tear down a sprite
the refresh still wants alive. Now each refresh swaps in a fresh timer
with its own Tick lambda; the stale tick checks ReferenceEquals on
ah.Timer and bails when its identity no longer matches.
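
The identity guard described above can be sketched without WinUI. This is a minimal illustration, not the code from ReconcileHighlightOverlay.cs: FakeTimer, Highlight, FireInto and the plain Queue<Action> are hypothetical stand-ins for DispatcherQueueTimer, the overlay entry and the dispatcher queue. Only the pattern (fresh timer per refresh, ReferenceEquals check inside the Tick lambda) is taken from the description.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for DispatcherQueueTimer. FireInto models a Tick that
// the queue has already dispatched; a later Stop() cannot take it back.
public sealed class FakeTimer
{
    public event Action Tick = delegate { };
    public bool Running { get; private set; }
    public void Start() => Running = true;
    public void Stop() => Running = false;

    public void FireInto(Queue<Action> queue)
    {
        if (Running) queue.Enqueue(() => Tick());
    }
}

// Hypothetical overlay entry: the currently owned timer plus a sprite flag.
public sealed class Highlight
{
    public FakeTimer Timer = new FakeTimer();
    public bool SpriteAlive = true;
}

public static class StaleTickDemo
{
    public static FakeTimer CreateExpiryTimer(Highlight ah)
    {
        var timer = new FakeTimer();
        timer.Tick += () =>
        {
            // Identity guard: a tick from a replaced timer must not tear down the sprite.
            if (!ReferenceEquals(ah.Timer, timer)) return;
            timer.Stop();
            ah.SpriteAlive = false;
        };
        return timer;
    }

    // Returns true when the sprite survives a stale tick that runs after a refresh.
    public static bool Run()
    {
        var queue = new Queue<Action>();
        var ah = new Highlight();
        ah.Timer = CreateExpiryTimer(ah);
        ah.Timer.Start();
        ah.Timer.FireInto(queue);      // tick dispatched but not yet executed

        ah.Timer.Stop();               // refresh: Stop() cannot remove the queued tick
        var fresh = CreateExpiryTimer(ah);
        ah.Timer = fresh;              // swap in a fresh timer
        fresh.Start();

        while (queue.Count > 0) queue.Dequeue()();   // stale tick runs and bails
        return ah.SpriteAlive;
    }
}
```

In this model, removing the ReferenceEquals check makes Run() return false, which is the teardown the race produced.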

Cluster C — DataGrid_ScrollPopulatesData was relying on a fixed 800ms
Render(800) for the scroll-settle → fetch → realization chain (not
tracked by _renderPending). Now waits 800ms baseline (preserves the
fetch-trigger window so ScrollPop_MultipleFetches still triggers) then
polls cells for up to 3s.
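
A minimal sketch of the baseline-then-poll shape, assuming a render callback that pumps for a given number of milliseconds. WaitBaselineThenPoll and its parameters are hypothetical names, not the fixture's actual helper, and the 50 ms step used below is an assumption since the message only states the 800 ms baseline and the 3 s budget.

```csharp
using System;

public static class PollDemo
{
    // Pump once for the baseline window, then keep pumping in small steps
    // until the condition holds or the poll budget is spent.
    public static bool WaitBaselineThenPoll(
        Action<int> render, Func<bool> condition,
        int baselineMs, int budgetMs, int stepMs)
    {
        render(baselineMs);            // baseline: lets the scroll-settle window fire fetches
        int waited = 0;
        while (!condition() && waited < budgetMs)
        {
            render(stepMs);            // one more pump, then re-check
            waited += stepMs;
        }
        return condition();
    }
}
```

With a real harness the call might look like `PollDemo.WaitBaselineThenPoll(ms => H.Render(ms), () => CountEmpCells() >= 4, 800, 3000, 50)`, where H.Render(int) and CountEmpCells are assumed names. The fast path costs the same 800 ms as before; only slow runs pay for extra pumps.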

Validated locally: all 27 OverlayLifecycle_* fixtures pass, all 5
ScrollPop_* sub-checks pass, 15x local stress on both fixtures
together is clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR bundles three stress-flake fixes across Reactor selftests and a dev overlay: (1) increases fixture timeouts for a few render-pump-heavy selftests that can exceed the default watchdog under CI load, (2) fixes a ReconcileHighlightOverlay expiry-timer race by swapping timers on refresh and guarding against stale ticks, and (3) hardens DataGrid_ScrollPopulatesData by polling for realized cells after a baseline settle delay.

Changes:

  • Increase FixtureTimeout to 30s for three known long-running stress fixtures.
  • Make ReconcileHighlightOverlay timer expiry resilient to stale dispatcher ticks by creating a fresh timer on refresh and adding an identity guard.
  • Replace a fixed post-scroll wait with baseline wait + polling for realized DataGrid cells to avoid cells=0 under CI load.

| File | Description |
| --- | --- |
| tests/Reactor.AppTests.Host/SelfTest/Fixtures/NativeDockingSmokeFixture.cs | Extends timeout for PerGroupDropTargetVisualDemo to reduce CI watchdog flakes. |
| tests/Reactor.AppTests.Host/SelfTest/Fixtures/DataGridScrollFixtures.cs | Adds polling after scroll to wait for realized “Emp-” cells (keeps 800ms baseline). |
| tests/Reactor.AppTests.Host/SelfTest/Fixtures/DataGridParityFixtures.cs | Extends timeout for HookPagingFramerateScroll fixture. |
| tests/Reactor.AppTests.Host/SelfTest/Fixtures/AsyncResourceFramerateFixtures.cs | Extends timeout for DataGridEditMutation fixture. |
| src/Reactor/Hosting/ReconcileHighlightOverlay.cs | Swaps expiry timer on refresh and guards against stale timer ticks tearing down active sprites. |

Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 2

Comment on lines 151 to +154
try { existing.Timer.Stop(); } catch { }
try { existing.Timer.Start(); } catch { }
var refreshedTimer = CreateExpiryTimer(target, existing);
existing.Timer = refreshedTimer;
refreshedTimer.Start();
Comment on lines 132 to 133
H.Check($"ScrollPop_DataVisible (cells={visibleEmpCells.Count})",
visibleEmpCells.Count >= 4);
codemonkeychris merged commit aab59c3 into main on May 24, 2026 (16 checks passed)
codemonkeychris deleted the fix/stress-cluster-t-o-c branch on May 24, 2026 13:48
codemonkeychris added a commit that referenced this pull request May 24, 2026
Two independent stress-flake fixes per INVESTIGATION.md session 4
follow-up to PR #397:

Cluster F (4/1000 hits in run 26351376710) — FloatingTitleBar_PaneBodyVisible
asserted body TextBlock visibility after a single Harness.Render()
following DockFloatingWindow.Open. Chrome (TabView, TitleBar absence,
TabStripFooter) materializes on that first pump, but the inner pane
content lags one or two pumps behind on a loaded CI runner. Same
realization-race family as Cluster C; apply the same poll pattern
(2s budget, 50ms between pumps) and annotate the check name with the
observed body count for future-hit diagnostics.

Cluster T-new (2/1000 hits) — EventSubscriptionLeakBaseline timing out
at its 30s override. Local timing measurement across 3 fresh-process
runs: 14.5 / 15.6 / 15.4 s (avg 15.2s) on a dev box. The fixture is
100 mount/unmount cycles x 2 Harness.Render() = 200 renders + 200
reconcile passes; the work itself is substantial. CI VMs under load
have been measured at 2-4x slowdown elsewhere in the doc, easily
overshooting the 30s budget. Not a hang — just budget vs CI variance.
Bump FixtureTimeout to 60s (~4x local baseline) with a comment block
explaining the math.
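
The budget arithmetic above can be reproduced with a short sketch. BudgetMath and its method names are hypothetical; the inputs are the three local timings and the 2-4x slowdown range quoted in this message.

```csharp
using System;
using System.Linq;

public static class BudgetMath
{
    // Mean of the measured local run times, in seconds.
    public static double MeanSeconds(params double[] runs) => runs.Average();

    // Projected wall-clock time on a runner that is `slowdown` times slower.
    public static double Projected(double localSeconds, double slowdown) =>
        localSeconds * slowdown;
}
```

For example, `BudgetMath.MeanSeconds(14.5, 15.6, 15.4)` gives roughly 15.2 s. Projecting that at 2x lands just over the old 30 s override, and at 4x it lands near 60 s, so the new budget covers most of the quoted range but leaves little margin at its upper end.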

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codemonkeychris added a commit that referenced this pull request May 25, 2026
Eliminates the dominant stress-failure mode (CI VM variance overshooting
tight per-fixture budgets). 5 of 8 failures in CI Stress run 26364109146
were 15 s timeout trips across 5 different fixtures - the pattern that
per-fixture overrides in PRs #395, #397, #399 chased without converging.

Local timing measurements (documented in INVESTIGATION.md) show the
Framerate.* family completing in ~7 s on a dev box; CI VMs under
contention run 2-4x slower, putting fixture work at ~28 s on the worst
tick. 30 s matches the explicit override budget already used by four
fixtures and leaves a clean signal gap below the 60 s
HangWatchdogLoop dump-on-hang threshold.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>