[BOO] Sync after profiling to fix flaky event loss#1329

Draft
rkayaith wants to merge 1 commit into
iree-org:mainfrom
rkayaith:fix-boo-profiler-event-loss

rkayaith (Member) commented Mar 16, 2026

Synchronize the GPU in the BOO driver's profiling loop. We've been seeing flaky failures in tests/kernel/boo/driver/cli_test.py where dispatches are not captured by the profiler, and I found that synchronizing in two places resolves the issue:

  1. Before each prof.step(): ensures in-flight GPU kernels complete before the profiler advances its state
  2. After the profiler context exits: ensures profiler cleanup is complete before the next profiling session starts

The torch profiler context already synchronizes in its __exit__ so that all events are captured, so the remaining event loss appears to come from needing an additional synchronize after the profiler's cleanup has finished.
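
For reviewers, the two synchronization points described above can be sketched roughly as follows. This is an illustrative sketch only, not the BOO driver's actual code: `profile_with_syncs`, `run_once`, `make_profiler`, and `synchronize` are hypothetical names, and the profiler and sync function are passed in as callables so the ordering can be checked without a GPU.

```python
from typing import Any, Callable, ContextManager


def profile_with_syncs(
    run_once: Callable[[], None],
    make_profiler: Callable[[], ContextManager[Any]],
    synchronize: Callable[[], None],
    num_steps: int,
) -> Any:
    """Hypothetical helper showing where the two syncs go.

    run_once:      launches the workload being profiled (assumed name).
    make_profiler: returns a context manager whose value has a step() method.
    synchronize:   blocks until all queued GPU work has finished.
    """
    with make_profiler() as prof:
        for _ in range(num_steps):
            run_once()
            # Sync 1: let in-flight kernels finish before the profiler
            # advances its state.
            synchronize()
            prof.step()
    # Sync 2: runs after the profiler's __exit__ (and its cleanup) has
    # completed, before any later profiling session starts.
    synchronize()
    return prof
```

With PyTorch, one would presumably pass something like a `torch.profiler.profile` factory as `make_profiler` and `torch.cuda.synchronize` as `synchronize`; the exact profiler arguments the driver uses are not shown in this PR description.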

Fixes #1327

rkayaith force-pushed the fix-boo-profiler-event-loss branch from f4747c8 to ec10ac6 on March 16, 2026 18:19

Development

Successfully merging this pull request may close these issues.

test_main[args2-0] flaky on GPU CI due to incomplete torch profiler capture
