- Measure the direct cost of breaking contiguous lane access into wider strides.
- How quickly does throughput fall as address stride increases and coalescing quality drops?
- Variants: stride_1, stride_2, stride_4, stride_8, stride_16
- Keep arithmetic, output count, and useful bytes fixed while changing only the source index stride.
- Run the sweep at sizes large enough to stay in the bandwidth-bound region.
- Median GPU time by stride.
- Throughput and effective GB/s by stride.
- Slowdown curve relative to stride_1.
- Coalescing is a first-order performance rule on bandwidth-bound kernels.
- The resulting curve is a concrete demonstration of why strided access wastes transactions and cache lines.
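
To make the wasted-transaction argument concrete, here is a minimal host-side sketch in C++ that counts how many memory sectors a single warp touches at each stride. It is a simplified model, not the benchmark itself, and every constant in it is an assumption that does not come from this document: 32 lanes per warp, 4-byte elements, 32-byte sectors, and a sector-aligned base address. The function names `sectors_per_warp` and `useful_fraction` are hypothetical. The index mapping mirrors the setup described above: each lane still consumes one element, and only the source index is multiplied by the stride.

```cpp
#include <cstddef>
#include <set>

// Assumed hardware model (not taken from the benchmark description):
// 32 lanes per warp, 4-byte elements, 32-byte sectors, aligned base address.
constexpr std::size_t kLanes = 32;
constexpr std::size_t kElemBytes = 4;
constexpr std::size_t kSectorBytes = 32;

// Count the distinct sectors one warp touches when lane i reads
// source element i * stride (the only thing the sweep changes).
std::size_t sectors_per_warp(std::size_t stride) {
    std::set<std::size_t> sectors;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        std::size_t byte_offset = lane * stride * kElemBytes;
        sectors.insert(byte_offset / kSectorBytes);
    }
    return sectors.size();
}

// Fraction of fetched bytes the warp actually consumes.
// Useful bytes stay fixed at kLanes * kElemBytes for every stride.
double useful_fraction(std::size_t stride) {
    double useful = static_cast<double>(kLanes * kElemBytes);
    double fetched = static_cast<double>(sectors_per_warp(stride) * kSectorBytes);
    return useful / fetched;
}
```

Under these assumptions, `sectors_per_warp` returns 4, 8, 16, 32, and 32 for strides 1, 2, 4, 8, and 16, so the useful fraction falls from 1.0 to 0.125 and then stops falling once every lane lands in its own sector. The measured slowdown curve need not match this model exactly, since caches, DRAM page behavior, and the actual sector and line sizes of the target GPU all affect it.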