- Move beyond 1D buffers into matrix-shaped access and transpose behavior.
- How much of the performance lost to the naive transpose's poor locality does tiling recover?
- Variants: `row_major_copy`, `naive_transpose`, `tiled_transpose`, `tiled_transpose_padded`.
- Use a square matrix sized from the scratch budget.
- Keep input data and output semantics fixed while comparing naive and tiled access patterns.
- Median GPU time by variant.
- Effective GB/s by variant.
- Speedup of tiled transpose over naive transpose.
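The three metrics above can be derived from raw timing samples roughly as follows. This is a minimal sketch, not the project's actual reporting code: the function names are hypothetical, and it assumes each variant reads every element once and writes it once, so the bytes moved are `2 * n * n * elem_bytes`.

```python
from statistics import median


def effective_gbps(n: int, elem_bytes: int, time_ms: float) -> float:
    """Effective bandwidth in GB/s for an n x n matrix.

    Assumption: one read plus one write per element, so the traffic
    counted is 2 * n * n * elem_bytes regardless of access pattern.
    """
    bytes_moved = 2 * n * n * elem_bytes
    seconds = time_ms * 1e-3
    return bytes_moved / seconds / 1e9


def median_speedup(naive_ms: list[float], tiled_ms: list[float]) -> float:
    """Speedup of the tiled variant over the naive one, using median times."""
    return median(naive_ms) / median(tiled_ms)
```

Feeding the median time of `row_major_copy` into `effective_gbps` gives a practical ceiling for the transpose variants, since a plain copy moves the same number of bytes with ideal locality.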
- This experiment connects the project's memory story to image-style and rendering-adjacent workloads.
- The padded tiled variant shows whether bank-conflict mitigation matters in the transpose case.
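Why padding can matter is easiest to see with a small bank-mapping model. The sketch below is illustrative only: it assumes 32 shared-memory banks of one 4-byte word each (typical of NVIDIA GPUs, but not stated in this document) and assumes `tiled_transpose_padded` pads each tile row by one element, which is the common technique. The function name is hypothetical.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, one 4-byte word per bank


def max_conflict_degree(tile_dim: int, row_stride: int, column: int = 0,
                        num_banks: int = NUM_BANKS) -> int:
    """Worst-case number of threads hitting the same bank when
    tile_dim threads each read one element of a single tile column.

    Thread `row` reads word index row * row_stride + column, and the
    bank is that index modulo num_banks.
    """
    banks = Counter((row * row_stride + column) % num_banks
                    for row in range(tile_dim))
    return max(banks.values())


unpadded = max_conflict_degree(tile_dim=32, row_stride=32)  # 32-way conflict
padded = max_conflict_degree(tile_dim=32, row_stride=33)    # conflict-free
```

Under these assumptions, a 32-wide row stride sends every thread in a column read to the same bank, while a stride of 33 spreads them across all 32 banks. Whether that difference shows up in measured GPU time is exactly what the padded variant is meant to reveal.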