- Compare direct global reads against explicit workgroup-memory tiling for a stencil-like workload.
- When does local staging outperform direct global access once halo loads and barriers are included?
- `direct_global`
- `shared_tiled`
- reuse-radius sweep
- Use the same 1D sliding-window sum for both implementations.
- Sweep neighborhood radius while keeping output count, arithmetic, and validation rules fixed.
- Median GPU time by variant and radius.
- Speedup of `shared_tiled` vs `direct_global`.
- Shared-memory footprint and estimated global-traffic reduction.
- Shared memory wins only when reuse is large enough to amortize halo loading and barrier cost.
- This is the baseline for all later shared-memory tuning experiments.
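
The "estimated global-traffic reduction" metric and the fixed-output-count radius sweep can be sketched with a small CPU-side model. This is a minimal illustration, not the benchmark's actual kernels: the function names `sliding_sum_direct`, `sliding_sum_tiled`, and `sweep`, the tile width of 256, and the 4-byte element size are all assumptions chosen for the example. The model counts logical global loads only. It ignores hardware caches (which absorb part of the redundant loads in the direct variant) and barrier cost, so the ratio it reports is an upper bound on traffic savings rather than a predicted speedup.

```python
def sliding_sum_direct(x, radius):
    """Model of the direct variant: every output reads its full window
    of 2*radius + 1 elements straight from 'global' memory."""
    width = 2 * radius + 1
    n_out = len(x) - 2 * radius          # input is pre-padded with the halo
    out, loads = [], 0
    for i in range(n_out):
        window = x[i:i + width]
        loads += len(window)             # one global load per window element
        out.append(sum(window))
    return out, loads


def sliding_sum_tiled(x, radius, tile):
    """Model of the tiled variant: each tile stages its outputs' inputs
    plus a halo of `radius` elements per side once, then reuses them."""
    width = 2 * radius + 1
    n_out = len(x) - 2 * radius
    out, loads = [], 0
    for start in range(0, n_out, tile):
        count = min(tile, n_out - start)
        staged = x[start:start + count + 2 * radius]   # tile + halo
        loads += len(staged)             # each staged element loaded once
        # On a GPU, a workgroup barrier would sit here before the reads.
        for j in range(count):
            out.append(sum(staged[j:j + width]))
    return out, loads


def sweep(n_out=1024, tile=256, radii=(1, 2, 4, 8, 16), bytes_per_elem=4):
    """Sweep radius with the output count held fixed; validate that both
    variants agree and report the load counts and staging footprint."""
    rows = []
    for r in radii:
        x = list(range(n_out + 2 * r))   # hypothetical integer test input
        d_out, d_loads = sliding_sum_direct(x, r)
        t_out, t_loads = sliding_sum_tiled(x, r, tile)
        assert d_out == t_out, "variants must produce identical sums"
        rows.append({
            "radius": r,
            "direct_loads": d_loads,
            "tiled_loads": t_loads,
            "traffic_reduction": d_loads / t_loads,
            "shared_bytes": (tile + 2 * r) * bytes_per_elem,
        })
    return rows


if __name__ == "__main__":
    for row in sweep():
        print(row)
```

In this model the direct variant issues `n_out * (2 * radius + 1)` loads while the tiled variant issues roughly `n_out + 2 * radius * num_tiles`, so the estimated reduction grows with radius. The measured speedup can still fall well short of this ratio at small radii, where halo loads and the barrier are not amortized.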