- Measure how shared-memory stride changes throughput because of bank conflicts.
- How severe is the slowdown from conflict-heavy strides, and how much does padding recover?
Variants: `stride_1`, `stride_2`, `stride_4`, `stride_8`, `stride_16`, `stride_32`, `padded_fix`
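
As a rough guide to why the stride variants listed above should differ, the sketch below models which bank each lane touches. It is an illustration under stated assumptions, not the benchmark's kernel: 32 banks of 4-byte words and 32 lanes issuing together are typical of many GPUs but are assumed here, and the padding scheme (one pad word after every 32 words) is only one common way a variant like `padded_fix` might be built. The helper names `padded_index` and `conflict_degree` are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, one 4-byte word per bank per cycle
NUM_LANES = 32  # assumption: 32 lanes issue their reads together


def padded_index(word_index, num_banks=NUM_BANKS):
    """Map a logical word index to a padded one: skip one word per num_banks words."""
    return word_index + word_index // num_banks


def conflict_degree(stride, padded=False, num_banks=NUM_BANKS, lanes=NUM_LANES):
    """Return the largest number of lanes that land on the same bank.

    Lane i reads logical word i * stride. A result of 1 means conflict-free;
    a result of k means the access is serialized into roughly k passes.
    Every lane reads a distinct address here, so broadcast does not apply.
    """
    banks = []
    for lane in range(lanes):
        word = lane * stride
        if padded:
            word = padded_index(word, num_banks)
        banks.append(word % num_banks)
    return max(Counter(banks).values())


if __name__ == "__main__":
    for s in (1, 2, 4, 8, 16, 32):
        print(f"stride_{s}: {conflict_degree(s)}-way")
    print(f"stride_32 with padding: {conflict_degree(32, padded=True)}-way")
```

Under this model the conflict degree equals the stride for these power-of-two strides, and padding brings the stride-32 case back to conflict-free. Real hardware may differ in bank count, word width, or scheduling, so the model predicts a trend to check against measurements and does not replace them.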
- Load one workgroup tile into shared memory and reread it with a configurable stride.
- Keep global-memory payload fixed so the difference is dominated by on-chip behavior.
- Median GPU time by stride.
- Slowdown relative to `stride_1`.
- Padding recovery relative to `stride_32`.
- Shared memory is not automatically fast; bank conflicts can erase the intended gain.
- Padding should be justified by a measured recovery, not by habit.
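
One way to make "measured recovery" concrete is sketched below. It computes the three reported metrics from per-variant timing samples. The function name `summarize`, the input layout (a dict mapping variant name to a list of GPU times), and the definition of recovery as the ratio of the `stride_32` median to the `padded_fix` median are assumptions for illustration; the source does not spell out the exact formula.

```python
from statistics import median


def summarize(samples, baseline="stride_1", worst="stride_32", fixed="padded_fix"):
    """Summarize timing samples per variant.

    samples: dict of variant name -> list of GPU times (any consistent unit).
    Returns (medians, slowdowns, recovery), where
      slowdowns[v] = median(v) / median(baseline)
      recovery     = median(worst) / median(fixed)   # assumed definition
    """
    medians = {name: median(times) for name, times in samples.items()}
    base = medians[baseline]
    slowdowns = {name: value / base for name, value in medians.items()}
    recovery = medians[worst] / medians[fixed]
    return medians, slowdowns, recovery
```

A recovery near the `stride_32` slowdown would indicate that padding removes most of the conflict penalty, while a recovery near 1 would suggest the padding is not earning its extra shared-memory footprint.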