This index points to the detailed, lecture-note-style plan for each core experiment.
- Start with Experiments 01-05 to stabilize benchmark methodology and execution-model intuition.
- Continue with Experiments 06-10 for layout and alignment design rules.
- Use Experiments 11-15 to map access patterns, cache behavior, and bandwidth saturation.
- Use Experiments 16-20 to build architecture-aware optimization intuition.
- Finish with Experiments 21-25 to assemble practical parallel primitives and capstone systems.
- Experiment 06: AoS vs SoA
- Experiment 11: Coalesced vs Strided Access
- Experiment 14: Read Reuse and Cache Locality
- Experiment 15: Bandwidth Saturation Sweep
- Experiment 16: Shared or Workgroup Memory Tiling
- Experiment 01: Dispatch Basics: Minimal Vulkan compute dispatch, correctness path, and baseline GPU timing.
- Experiment 02: Local Size Sweep: Workgroup sizing and execution efficiency tradeoffs.
- Experiment 03: Memory Copy Baseline: Raw buffer read/write/copy throughput characterization.
- Experiment 04: Sequential Indexing: Ideal contiguous thread-to-data mapping as a good-path baseline.
- Experiment 05: Global ID Mapping Variants: Direct, offset, and grid-stride mapping behavior.
- Experiment 06: AoS vs SoA: Array-of-Structures versus Structure-of-Arrays layout efficiency.
- Experiment 07: AoSoA or Blocked Layout: Hybrid layout balancing vector locality and contiguous field access.
- Experiment 08: std430 vs std140 vs Packed: Shader buffer layout standards and padding cost.
- Experiment 09: vec3, vec4, and Padding Costs: Impact of vector shape choice on storage efficiency and bandwidth.
- Experiment 10: Scalar Type Width Sweep: Precision-width tradeoffs across 32-bit, 16-bit, and narrower storage.
- Experiment 11: Coalesced vs Strided Access: Contiguous and strided load behavior.
- Experiment 12: Gather Access Pattern: Indirect indexed reads through an index buffer.
- Experiment 13: Scatter Access Pattern: Indirect indexed writes and contention behavior.
- Experiment 14: Read Reuse and Cache Locality: Temporal locality and reuse-distance effects.
- Experiment 15: Bandwidth Saturation Sweep: Scaling data volume until achieved bandwidth reaches its practical plateau.
- Experiment 16: Shared or Workgroup Memory Tiling: Staging data in on-chip memory for reuse.
- Experiment 17: Tile Size Sweep: Tradeoff between reuse, shared-memory pressure, and occupancy.
- Experiment 18: Register Pressure Proxy Study: Effect of increased per-thread temporary state.
- Experiment 19: Branch Divergence: Control-flow divergence within warp or wave execution.
- Experiment 20: Barrier and Synchronization Cost: Synchronization overhead characterization.
- Experiment 21: Parallel Reduction: Reduction patterns from naive to tree-based and shared-memory-optimized variants.
- Experiment 22: Prefix Sum or Scan: Inclusive/exclusive scan as a foundational parallel primitive.
- Experiment 23: Histogram and Atomic Contention: Atomic update contention and privatization strategies.
- Experiment 24: Stream Compaction: Flag, scan, and compact-write pipeline.
- Experiment 25: Spatial Binning or Clustered Culling Capstone: Rendering-style compute pipeline combining prior primitives.
- Experiment 26: Warp-Level Coalescing Alignment: Aligned vs misaligned contiguous accesses at warp granularity.
- Experiment 27: Cache Thrashing, Random vs Sequential: Healthy locality versus deliberate cache defeat.
- Experiment 28: Device-Local vs Host-Visible Heap Placement: Dispatch-only and end-to-end cost of host-visible buffers versus staged device-local placement.
- Experiment 29: Shared Memory Bank Conflict Study: Stride-driven shared-memory bank conflicts and the padding fix.
- Experiment 30: Subgroup Reduction Variants: Compare shared-tree reduction with subgroup-assisted reduction.
- Experiment 31: Subgroup Scan Variants: Block-local inclusive scan using shared memory versus subgroup intrinsics.
- Experiment 32: Subgroup Stream Compaction Variants: Per-workgroup compaction using shared atomics versus subgroup ballot ranking.
- Experiment 33: 2D Locality and Transpose Study: Row-major copy versus naive and tiled transpose access patterns.
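
As a concrete reference for the "flag, scan, and compact-write" pipeline named in Experiment 24, a serial CPU version might look like the sketch below. It is only an illustration of the three stages: the function name `compact` and its `keep` predicate are hypothetical, and the experiment's actual GPU implementation is not described in this index and may differ. A CPU reference of this kind could plausibly be used to check GPU output for correctness.

```python
from itertools import accumulate


def compact(values, keep):
    """Serial reference for stream compaction (hypothetical helper).

    values: input sequence
    keep:   predicate returning True for elements that survive
    """
    # Stage 1 (flag): 1 where the element survives, 0 otherwise.
    flags = [1 if keep(v) else 0 for v in values]

    # Stage 2 (scan): an exclusive prefix sum over the flags gives each
    # surviving element its output slot.
    inclusive = list(accumulate(flags))
    offsets = [inc - f for inc, f in zip(inclusive, flags)]

    # Stage 3 (compact-write): scatter survivors to their slots.
    total = inclusive[-1] if inclusive else 0
    out = [None] * total
    for v, f, o in zip(values, flags, offsets):
        if f:
            out[o] = v
    return out
```

For example, `compact([3, -1, 4, -1, 5], lambda x: x > 0)` returns `[3, 4, 5]`, preserving input order. On a GPU each stage would typically become a parallel pass, with the scan stage being the primitive covered by Experiment 22.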