- Concept: Warp/wave-level intrinsics versus shared-memory patterns.
- Why this matters: Subgroup features can remove barriers and temporary memory overhead in key primitives.
- Central question: Where do subgroup operations outperform shared-memory implementations?
By the end of this investigation, you should be able to:
- justify why this systems-level problem matters in practical GPU pipelines
- design a controlled benchmark matrix with clear independent variables
- interpret results without confusing correlation and causation
- extract design rules and limitations suitable for portfolio presentation
- Start with a pipeline-level mental model, not just a kernel-level view.
- Identify resource bottlenecks: memory traffic, synchronization, occupancy pressure, and control-flow efficiency.
- Separate algorithmic cost from implementation artifacts.
- Record assumptions and known unknowns before running the benchmarks.
Subgroup intrinsics win for reductions/prefix helpers when subgroup assumptions match hardware behavior.
Primitive type, subgroup availability/size, data size, fallback strategy.
- Fixed benchmark harness and timing method (GPU timestamp queries).
- Fixed data generation seeds per scenario where reproducibility is needed.
- Fixed correctness oracle per variant.
Runtime, synchronization count, temp memory footprint, portability constraints.
- Implement minimally correct baseline variant first.
- Add one optimized variant at a time to preserve causal clarity.
- Add deterministic correctness tests and edge-case datasets.
- Run warmup plus repeated measured runs for each matrix point.
- Export raw data and metadata to versioned result files.
- Generate charts and write a short interpretation section with caveats.
- Which stage or operation dominates total cost and why?
- Which tuning parameter is most sensitive?
- Which findings are likely architecture-dependent?
- What would change in a production rendering/compute pipeline?
Subgroup vs shared-memory charts, portability notes, implementation policy.
Minimum artifact set:
- one core chart
- one summary table
- one short conclusions page with limitations
- Frame conclusions as measured observations plus reasoned interpretation.
- Avoid claiming universal behavior from one GPU unless cross-GPU validated.
- Highlight tradeoffs and failure modes, not just best numbers.