- Isolate the runtime cost of workgroup barriers and synchronization placement.
- How much overhead comes from barriers themselves, and does that cost depend on how work is tiled?
- `flat_loop_no_barrier`
- `tiled_regions_no_barrier`
- `flat_loop_with_barrier`
- `tiled_regions_with_barrier`
- Use the same logical output and arithmetic while changing whether the kernel runs as a flat loop or staged tiled regions.
- Add or remove barriers without changing the final output contract.
- Median GPU time by synchronization strategy.
- Barrier overhead relative to the no-barrier forms.
- Placement sensitivity between flat and tiled execution shapes.
- A barrier cost is only meaningful relative to the work it protects.
- This experiment helps explain why some shared-memory kernels underperform even when their memory traffic looks favorable on paper: the barriers needed to stage data can cost more than the traffic they save.
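
The three measurements listed above (median GPU time, barrier overhead, placement sensitivity) can be derived from raw timing samples roughly as follows. This is a minimal sketch, not the experiment's actual analysis code: the `summarize` helper, the input layout (variant name mapped to a list of GPU times in milliseconds), and the definition of placement sensitivity as the difference between the tiled and flat overhead ratios are all assumptions made for illustration. The sample numbers are placeholders, not measured results.

```python
from statistics import median

# Execution shapes compared by the experiment; each has a
# "<shape>_no_barrier" and a "<shape>_with_barrier" variant.
SHAPES = ("flat_loop", "tiled_regions")


def summarize(samples_ms):
    """Hypothetical helper: reduce raw GPU timings to the three metrics.

    samples_ms maps a variant name (e.g. "flat_loop_with_barrier")
    to a list of GPU times in milliseconds.
    """
    # Median GPU time per synchronization strategy.
    med = {name: median(times) for name, times in samples_ms.items()}

    # Barrier overhead relative to the matching no-barrier form,
    # expressed as a fraction (0.25 means 25% slower with barriers).
    overhead = {
        shape: med[f"{shape}_with_barrier"] / med[f"{shape}_no_barrier"] - 1.0
        for shape in SHAPES
    }

    # Assumed definition: how much more (or less) the tiled shape pays
    # for barriers than the flat shape does.
    sensitivity = overhead["tiled_regions"] - overhead["flat_loop"]
    return med, overhead, sensitivity


if __name__ == "__main__":
    # Placeholder timings for illustration only; not real measurements.
    placeholder = {
        "flat_loop_no_barrier": [1.9, 2.0, 2.1],
        "flat_loop_with_barrier": [2.4, 2.5, 2.6],
        "tiled_regions_no_barrier": [1.4, 1.5, 1.6],
        "tiled_regions_with_barrier": [2.2, 2.3, 2.4],
    }
    med, overhead, sensitivity = summarize(placeholder)
    for shape in SHAPES:
        print(f"{shape}: barrier overhead {overhead[shape]:.1%}")
    print(f"placement sensitivity: {sensitivity:+.1%}")
```

Reporting overhead as a ratio against the matching no-barrier form, rather than as an absolute time difference, keeps the barrier cost tied to the amount of work it protects, which is what makes the flat and tiled shapes comparable.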