Advanced Investigation 06: Subgroup Operations Study

1. Lecture Focus

Concept: Warp/wave-level intrinsics versus shared-memory patterns.
Why this matters: Subgroup features can remove barriers and temporary memory overhead in key primitives.
Central question: Where do subgroup operations outperform shared-memory implementations?
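To make the barrier and temporary-memory claim concrete, here is a minimal CPU-side sketch that models one workgroup reducing its values in two ways and counts the barriers and scratch slots each would need. It is not GPU code. The function names, the workgroup size of 256, and the subgroup size of 32 are illustrative assumptions, and the barrier counts follow the textbook formulation of each pattern for a power-of-two workgroup.

```python
# CPU-side model of one workgroup reducing its values; not GPU code.
# All names and sizes here are illustrative assumptions.

def shared_memory_reduce(values):
    """Tree reduction through a scratch array, one barrier per halving step.

    Assumes len(values) is a power of two.
    Returns (result, barrier_count, scratch_slots).
    """
    scratch = list(values)          # one scratch slot per invocation
    barriers = 1                    # barrier after the initial store to scratch
    stride = len(scratch) // 2
    while stride > 0:
        for i in range(stride):     # invocations with id < stride are active
            scratch[i] += scratch[i + stride]
        barriers += 1               # each step must be visible before the next
        stride //= 2
    return scratch[0], barriers, len(values)


def subgroup_reduce(values, subgroup_size):
    """Two-level reduction: a subgroup-wide add, then one pass over the partials.

    Returns (result, barrier_count, scratch_slots).
    """
    partials = [
        sum(values[start:start + subgroup_size])   # stands in for a subgroup add intrinsic
        for start in range(0, len(values), subgroup_size)
    ]
    barriers = 1                    # partials must be visible before they are combined
    return sum(partials), barriers, len(partials)


if __name__ == "__main__":
    data = list(range(256))         # workgroup of 256 invocations (assumed size)
    print(shared_memory_reduce(data))
    print(subgroup_reduce(data, subgroup_size=32))
```

In this model the shared-memory version needs one barrier per halving step plus one after the initial store, while the subgroup version needs a single barrier and one scratch slot per subgroup. Whether that translates into a measurable runtime win is what the investigation is meant to find out.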

2. Learning Objectives

By the end of this investigation, you should be able to:

justify why the choice between subgroup and shared-memory primitives matters in practical GPU pipelines
design a controlled benchmark matrix with clearly defined independent variables
interpret results without confusing correlation with causation
extract design rules and limitations suitable for portfolio presentation

3. Theory Primer (Lecture Notes)

Start with a pipeline-level mental model, not just a kernel-level view.
Identify resource bottlenecks: memory traffic, synchronization, occupancy pressure, and control-flow efficiency.
Separate algorithmic cost from implementation artifacts.
Record assumptions and known unknowns before running the benchmarks.

4. Hypothesis

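Subgroup intrinsics outperform shared-memory implementations for reductions and prefix-sum helpers, provided the subgroup size and availability the kernel assumes match what the hardware actually provides.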
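One way the "assumptions match hardware" condition can fail is a hard-coded subgroup size. The sketch below models a shuffle-down add tree on the CPU and shows what lane 0 ends up holding when the loop is written for 32 lanes but the hardware subgroup has 64. The function name is hypothetical, and treating out-of-range reads as contributing zero is a modelling choice. Real APIs differ on what such reads return, but lane 0 never performs one here.

```python
# CPU model of a shuffle-down reduction; illustrative only, not GPU code.

def shuffle_down_reduce(lanes, assumed_size):
    """Return lane 0's value after an add tree hard-coded for `assumed_size` lanes.

    `lanes` holds one value per hardware lane, so len(lanes) is the real
    subgroup size, while `assumed_size` is what the kernel author baked in.
    """
    vals = list(lanes)
    n = len(vals)
    offset = assumed_size // 2
    while offset > 0:
        # Every lane adds the value currently held by lane (i + offset).
        # Out-of-range sources contribute 0 in this model (an assumption).
        vals = [
            vals[i] + (vals[i + offset] if i + offset < n else 0)
            for i in range(n)
        ]
        offset //= 2
    return vals[0]


if __name__ == "__main__":
    hardware_lanes = [1] * 64                        # a 64-wide subgroup of ones
    print(shuffle_down_reduce(hardware_lanes, 64))   # assumption matches hardware
    print(shuffle_down_reduce(hardware_lanes, 32))   # assumption too small
```

With the 32-lane assumption, lane 0 only ever accumulates lanes 0 through 31, so half of the subgroup is silently dropped. This is the kind of portability failure the correctness tests and fallback strategy in this plan need to catch.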

5. Experimental Design

Independent variables

Primitive type, subgroup availability/size, data size, fallback strategy.
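A benchmark matrix over these variables can be generated rather than written by hand, which keeps every combination explicit. The following is a sketch only: the primitive names, subgroup sizes, data sizes, and fallback labels are placeholder assumptions, not values prescribed by this plan.

```python
from itertools import product

# Placeholder levels for each independent variable (all assumed).
PRIMITIVES = ["reduce", "inclusive_scan"]
SUBGROUP_SIZES = [None, 32, 64]          # None models "subgroup ops unavailable"
DATA_SIZES = [2**10, 2**16, 2**20]       # element counts
FALLBACKS = ["shared_memory", "none"]


def build_matrix():
    """Return one dict per benchmark point, skipping combinations that cannot run."""
    points = []
    for primitive, subgroup_size, data_size, fallback in product(
        PRIMITIVES, SUBGROUP_SIZES, DATA_SIZES, FALLBACKS
    ):
        # Without subgroup support and without a fallback there is nothing to run.
        if subgroup_size is None and fallback == "none":
            continue
        points.append(
            {
                "primitive": primitive,
                "subgroup_size": subgroup_size,
                "data_size": data_size,
                "fallback": fallback,
            }
        )
    return points


if __name__ == "__main__":
    matrix = build_matrix()
    print(len(matrix), "benchmark points")
    print(matrix[0])
```

Filtering impossible combinations at generation time keeps gaps in the result tables intentional instead of accidental.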

Controlled variables

Fixed benchmark harness and timing method (GPU timestamp queries).
Fixed data generation seeds per scenario where reproducibility is needed.
Fixed correctness oracle per variant.
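The seed and oracle controls can be sketched on the host side as follows. The function names are hypothetical. Integer inputs are an assumption chosen so that results compare exactly; floating-point inputs would need a tolerance because GPU variants may add in a different order.

```python
import random


def make_input(seed, n):
    """Generate n reproducible integers from a private, seeded generator."""
    rng = random.Random(seed)       # private RNG, so global random state is untouched
    return [rng.randrange(0, 1000) for _ in range(n)]


def oracle_reduce(values):
    """Reference result for a sum reduction."""
    return sum(values)


def oracle_inclusive_scan(values):
    """Reference result for an inclusive prefix sum, computed sequentially."""
    out = []
    running = 0
    for v in values:
        running += v
        out.append(running)
    return out


def matches_oracle(variant_output, values):
    """True when a variant's inclusive scan equals the sequential reference."""
    return variant_output == oracle_inclusive_scan(values)


if __name__ == "__main__":
    data = make_input(seed=7, n=8)
    print(data == make_input(seed=7, n=8))              # same seed, same data
    print(matches_oracle(oracle_inclusive_scan(data), data))
```

Using the same oracle for every variant means a disagreement points at the variant under test and not at the reference.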

Metrics

Runtime, synchronization count, temp memory footprint, portability constraints.

6. Implementation Plan

Implement a minimally correct baseline variant first.
Add one optimized variant at a time to preserve causal clarity.
Add deterministic correctness tests and edge-case datasets.
Run warmup plus repeated measured runs for each matrix point.
Export raw data and metadata to versioned result files.
Generate charts and write a short interpretation section with caveats.
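The warmup, repeated-measurement, and export steps above can be sketched as a small host-side harness. This is a sketch under stated assumptions: the host clock stands in for the GPU timestamp queries the plan calls for, and the function names, record fields, and metadata keys are hypothetical.

```python
import json
import statistics
import time


def measure(run_once, warmup=3, repeats=10, clock=time.perf_counter):
    """Run `run_once` warmup times unmeasured, then return `repeats` timed samples.

    `clock` is injectable so a GPU timestamp source (or a fake clock in tests)
    can replace the host timer used here as a stand-in.
    """
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(repeats):
        start = clock()
        run_once()
        samples.append(clock() - start)
    return samples


def to_record(point, samples, metadata):
    """Bundle raw samples with the matrix point and run metadata for export."""
    return {
        "point": point,
        "metadata": metadata,
        "samples_s": samples,                       # raw data is kept, not just summaries
        "median_s": statistics.median(samples),
        "min_s": min(samples),
    }


if __name__ == "__main__":
    samples = measure(lambda: sum(range(10_000)))   # CPU workload as a placeholder
    record = to_record(
        {"primitive": "reduce", "data_size": 10_000},
        samples,
        {"device": "cpu-stand-in", "schema_version": 1},
    )
    print(json.dumps(record, indent=2))
```

Keeping the raw samples alongside the median lets later analysis recompute statistics or spot outliers without rerunning the benchmark.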

7. Analysis Prompts

Which stage or operation dominates total cost and why?
Which tuning parameter are the results most sensitive to?
Which findings are likely architecture-dependent?
What would change in a production rendering/compute pipeline?

8. Deliverables

Subgroup vs shared-memory charts, portability notes, implementation policy.

Minimum artifact set:

one core chart
one summary table
one short conclusions page with limitations

9. Portfolio Framing Notes

Frame conclusions as measured observations plus reasoned interpretation.
Avoid claiming universal behavior from one GPU unless cross-GPU validated.
Highlight tradeoffs and failure modes, not just best numbers.
