- Compare common vector-friendly storage choices against tighter scalarized layouts.
- How much performance is lost when `vec3`- and `vec4`-style layouts carry padding that the kernel does not need?
- Layout variants: `split_scalars`, `vec3_padded`, `vec4`.
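The outline does not say what each record contains. The sketch below assumes each record holds three `f32` values and that WGSL/std430-style alignment applies. Under those rules a `vec3<f32>` is 12 bytes but 16-byte aligned, so an array of them has a 16-byte stride. A `vec4<f32>` holding the same three values leaves one lane unused. `Layout`, `layouts` and `padding_overhead` are hypothetical helpers written only for illustration.

```python
from dataclasses import dataclass

F32_BYTES = 4
USEFUL_COMPONENTS = 3  # assumption: each record carries three f32 values


@dataclass(frozen=True)
class Layout:
    name: str
    stored_bytes_per_record: int  # bytes the layout actually occupies per record


def round_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple (used for array stride)."""
    return ((value + multiple - 1) // multiple) * multiple


def layouts() -> list:
    useful = USEFUL_COMPONENTS * F32_BYTES  # 12 useful bytes per record
    return [
        # Three separate, tightly packed f32 arrays: no padding at all.
        Layout("split_scalars", useful),
        # vec3<f32> has 16-byte alignment, so the array stride rounds 12 -> 16.
        Layout("vec3_padded", round_up(useful, 16)),
        # vec4<f32> storing three useful values plus one unused lane.
        Layout("vec4", 4 * F32_BYTES),
    ]


def padding_overhead(layout: Layout, useful_bytes: int) -> float:
    """Extra bytes moved relative to the useful payload (0.0 = no padding)."""
    return layout.stored_bytes_per_record / useful_bytes - 1.0


if __name__ == "__main__":
    useful = USEFUL_COMPONENTS * F32_BYTES
    for layout in layouts():
        print(f"{layout.name:14s} {layout.stored_bytes_per_record:3d} B/record  "
              f"overhead {padding_overhead(layout, useful):.1%}")
```

Under these assumptions, both padded variants move 16 bytes for every 12 useful bytes. That is roughly one third more memory traffic than the scalarized layout on a bandwidth-bound kernel.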
- Use the same logical per-record values and the same arithmetic in every variant.
- Change only the storage representation and validate all outputs against the same CPU reference.
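The method section requires identical values and arithmetic across all variants, with every output checked against one CPU reference. The sketch below shows one way to do that. It packs the same records into a split layout and a padded layout, reads back only the three useful lanes, and compares the results with a tolerance-based check. `kernel_math` is a hypothetical stand-in for the real per-record arithmetic. The readback lists here also stand in for actual GPU output. A tolerance is used because GPU `f32` results will not match a double-precision CPU reference bit for bit.

```python
import math
from typing import List, Sequence, Tuple

Record = Tuple[float, float, float]


def kernel_math(rec: Record) -> float:
    # Hypothetical per-record arithmetic; substitute the real kernel's formula.
    x, y, z = rec
    return x * x + y * y + z * z


def cpu_reference(records: Sequence[Record]) -> List[float]:
    return [kernel_math(r) for r in records]


def pack_split(records: Sequence[Record]):
    # One tightly packed array per component (the split_scalars idea).
    return ([r[0] for r in records], [r[1] for r in records], [r[2] for r in records])


def pack_padded(records: Sequence[Record], stride_floats: int = 4) -> List[float]:
    # One record per stride; lanes beyond the three useful values are padding.
    buf: List[float] = []
    for r in records:
        buf.extend(r)
        buf.extend([0.0] * (stride_floats - 3))
    return buf


def unpack_padded(buf: Sequence[float], stride_floats: int = 4) -> List[Record]:
    # Read only the useful lanes of each record, skipping the padding.
    return [tuple(buf[i:i + 3]) for i in range(0, len(buf), stride_floats)]


def mismatches(output: Sequence[float], reference: Sequence[float],
               rel_tol: float = 1e-5, abs_tol: float = 1e-6) -> List[int]:
    """Return indices where output disagrees with the CPU reference."""
    if len(output) != len(reference):
        raise ValueError(f"length mismatch: {len(output)} vs {len(reference)}")
    return [i for i, (o, r) in enumerate(zip(output, reference))
            if not math.isclose(o, r, rel_tol=rel_tol, abs_tol=abs_tol)]


if __name__ == "__main__":
    records = [(float(i), 0.5 * i, -1.0) for i in range(8)]
    reference = cpu_reference(records)
    xs, ys, zs = pack_split(records)
    split_out = [kernel_math(r) for r in zip(xs, ys, zs)]  # stand-in for GPU readback
    padded_out = [kernel_math(r) for r in unpack_padded(pack_padded(records))]
    assert not mismatches(split_out, reference)
    assert not mismatches(padded_out, reference)
```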
- Median GPU time by layout.
- Useful-payload GB/s by layout (counting only the bytes the kernel consumes, not padding).
- Padding overhead relative to the scalarized baseline.
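The three metrics can be derived from raw per-run timings roughly as sketched below. `summarize`, its parameters, and the `"split_scalars"` baseline key are hypothetical. The overhead here is defined as the extra median time relative to the scalarized baseline. That is one reasonable reading of "padding overhead"; a byte-count ratio is another.

```python
import statistics
from typing import Dict, List


def summarize(timings_ms: Dict[str, List[float]], useful_bytes: int,
              baseline: str = "split_scalars") -> Dict[str, Dict[str, float]]:
    """Per-layout median time, useful-payload GB/s, and overhead vs baseline.

    useful_bytes: payload bytes the kernel logically needs per dispatch,
    identical for every layout (padding is deliberately excluded).
    """
    # Median is robust to occasional slow runs (clock ramp-up, preemption).
    medians = {name: statistics.median(ts) for name, ts in timings_ms.items()}
    base_ms = medians[baseline]
    summary = {}
    for name, med_ms in medians.items():
        summary[name] = {
            "median_ms": med_ms,
            # Payload bytes / seconds / 1e9; padded layouts move more raw bytes
            # in the same time, so they score lower here.
            "useful_gb_per_s": useful_bytes / (med_ms * 1e-3) / 1e9,
            # Extra time relative to the scalarized baseline (0.0 = same speed).
            "overhead_vs_baseline": med_ms / base_ms - 1.0,
        }
    return summary
```

Because the numerator counts only payload, a padded layout running at the same raw DRAM bandwidth as the scalarized one still reports lower useful GB/s. That gap is exactly what the experiment is trying to expose.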
- Vector convenience and alignment hygiene can cost real bandwidth when the shader does not use the extra padded bytes.
- This experiment helps decide when explicit scalar packing is worth the added code complexity.