Skip to content

Latest commit

 

History

History
61 lines (49 loc) · 2.66 KB

File metadata and controls

61 lines (49 loc) · 2.66 KB

Advanced Investigation 06: Subgroup Operations Study

1. Lecture Focus

  • Concept: Warp/wave-level intrinsics versus shared-memory patterns.
  • Why this matters: Subgroup features can remove barriers and temporary memory overhead in key primitives.
  • Central question: Where do subgroup operations outperform shared-memory implementations?

2. Learning Objectives

By the end of this investigation, you should be able to:

  • justify why this systems-level problem matters in practical GPU pipelines
  • design a controlled benchmark matrix with clear independent variables
  • interpret results without confusing correlation and causation
  • extract design rules and limitations suitable for portfolio presentation

3. Theory Primer (Lecture Notes)

  • Start with a pipeline-level mental model, not just a kernel-level view.
  • Identify resource bottlenecks: memory traffic, synchronization, occupancy pressure, and control-flow efficiency.
  • Separate algorithmic cost from implementation artifacts.
  • Record assumptions and known unknowns before running the benchmarks.

4. Hypothesis

Subgroup intrinsics win for reductions/prefix helpers when subgroup assumptions match hardware behavior.

5. Experimental Design

Independent variables

Primitive type, subgroup availability/size, data size, fallback strategy.

Controlled variables

  • Fixed benchmark harness and timing method (GPU timestamp queries).
  • Fixed data generation seeds per scenario where reproducibility is needed.
  • Fixed correctness oracle per variant.

Metrics

Runtime, synchronization count, temp memory footprint, portability constraints.

6. Implementation Plan

  1. Implement minimally correct baseline variant first.
  2. Add one optimized variant at a time to preserve causal clarity.
  3. Add deterministic correctness tests and edge-case datasets.
  4. Run warmup plus repeated measured runs for each matrix point.
  5. Export raw data and metadata to versioned result files.
  6. Generate charts and write a short interpretation section with caveats.

7. Analysis Prompts

  • Which stage or operation dominates total cost and why?
  • Which tuning parameter is most sensitive?
  • Which findings are likely architecture-dependent?
  • What would change in a production rendering/compute pipeline?

8. Deliverables

Subgroup vs shared-memory charts, portability notes, implementation policy.

Minimum artifact set:

  • one core chart
  • one summary table
  • one short conclusions page with limitations

9. Portfolio Framing Notes

  • Frame conclusions as measured observations plus reasoned interpretation.
  • Avoid claiming universal behavior from one GPU unless cross-GPU validated.
  • Highlight tradeoffs and failure modes, not just best numbers.