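- Concept: Compose culling, compaction, bucketing, and generation of indirect-argument-like records into a single pipeline.
- Why this matters: GPU-driven renderers build their submissions from these same stages, so composing them exercises the design decisions behind modern rendering architectures.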
- Central question: What bottlenecks appear when primitive stages are composed into a submission-ready pipeline?
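To make the four stages concrete, here is a minimal sequential CPU model of the composed pipeline. It is a sketch under stated assumptions, not a GPU implementation: the `Obj` fields, the 1D visibility range, and the record layout in `make_args` are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    obj_id: int
    x: float       # 1D position, standing in for a bounding volume (assumption)
    material: int  # bucket key (assumption)


def cull(objs, lo, hi):
    """Stage 1: one visibility flag per object (1 = inside [lo, hi])."""
    return [1 if lo <= o.x <= hi else 0 for o in objs]


def exclusive_scan(values):
    """Exclusive prefix sum; out[i] is the sum of values[0..i-1]."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out, running


def compact(objs, flags):
    """Stage 2: scatter surviving objects to their scanned offsets."""
    offsets, count = exclusive_scan(flags)
    dense = [None] * count
    for o, f, off in zip(objs, flags, offsets):
        if f:
            dense[off] = o
    return dense


def bucket(dense, num_buckets):
    """Stage 3: counting sort by material key (stable within a bucket)."""
    counts = [0] * num_buckets
    for o in dense:
        counts[o.material] += 1
    starts, _ = exclusive_scan(counts)
    cursor = list(starts)
    ordered = [0] * len(dense)
    for o in dense:
        ordered[cursor[o.material]] = o.obj_id
        cursor[o.material] += 1
    return counts, starts, ordered


def make_args(counts, starts):
    """Stage 4: one argument-like record per non-empty bucket."""
    return [
        {"bucket": b, "first": starts[b], "count": c}
        for b, c in enumerate(counts)
        if c > 0
    ]


def run_pipeline(objs, lo, hi, num_buckets):
    flags = cull(objs, lo, hi)
    dense = compact(objs, flags)
    counts, starts, ordered = bucket(dense, num_buckets)
    return ordered, make_args(counts, starts)
```

Because each stage consumes the previous stage's output buffer, the model also shows where a GPU version would need intermediate storage and synchronization points. The same functions can double as a reference when checking GPU results.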
By the end of this investigation, you should be able to:
- justify why this systems-level problem matters in practical GPU pipelines
- design a controlled benchmark matrix with clear independent variables
- interpret results without confusing correlation and causation
- extract design rules and limitations suitable for portfolio presentation
- Start with a pipeline-level mental model, not just a kernel-level view.
- Identify resource bottlenecks: memory traffic, synchronization, occupancy pressure, and control-flow efficiency.
- Separate algorithmic cost from implementation artifacts.
- Record assumptions and known unknowns before running the benchmarks.
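One way to write down a pipeline-level model before benchmarking is a first-order memory-traffic estimate per stage. The sketch below is deliberately crude: the byte sizes, the per-stage read/write counts, and the function name `estimate_traffic` are assumptions for illustration, and it ignores caches, atomics, and synchronization cost.

```python
def estimate_traffic(n, visibility, obj_bytes=32, word_bytes=4):
    """Rough bytes moved per stage for n objects at a given visibility ratio.

    Assumed access pattern (hypothetical, adjust to the real kernels):
      cull    : read n bounds, write n flags
      scan    : read n flags, write n offsets
      compact : read n flags and n offsets, write `visible` ids
      bucket  : read `visible` ids and keys, write `visible` ordered ids
    """
    visible = round(n * visibility)
    stage_bytes = {
        "cull": n * obj_bytes + n * word_bytes,
        "scan": 2 * n * word_bytes,
        "compact": 2 * n * word_bytes + visible * word_bytes,
        "bucket": 3 * visible * word_bytes,
    }
    total = sum(stage_bytes.values())
    # Share of total traffic is the prediction to compare against measured time.
    shares = {name: b / total for name, b in stage_bytes.items()}
    return stage_bytes, shares
```

Recording such an estimate up front gives the measured results something to disagree with: a stage whose runtime share is far above its traffic share is a candidate for an implementation artifact rather than an algorithmic cost.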
Hypothesis: Scan/compaction stages and list construction dominate end-to-end cost unless ordering and buffering are tuned.
Independent variables: stage variant, visibility ratio, bucket strategy, and ordering coherence.
- Fixed benchmark harness and timing method (GPU timestamp queries).
- Fixed data generation seeds per scenario where reproducibility is needed.
- Fixed correctness oracle per variant.
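The variables and controls above can be sketched as a benchmark matrix with per-scenario seeds and a reference oracle. The level names, the seed derivation, and the interpretation of "coherent" as "all visible items contiguous" are assumptions made for this example.

```python
import itertools
import random
import zlib

# Hypothetical levels; substitute the variants actually under test.
STAGE_VARIANTS = ["baseline", "optimized"]
VISIBILITY_RATIOS = [0.1, 0.5, 0.9]
BUCKET_STRATEGIES = ["strategy_a", "strategy_b"]
ORDERINGS = ["coherent", "shuffled"]


def scenario_seed(visibility, ordering):
    """Seed depends only on the variables that define the input data, so every
    stage variant and bucket strategy sees identical data for a scenario."""
    return zlib.crc32(f"{visibility}|{ordering}".encode())


def build_matrix():
    """Full factorial matrix: one dict per (variant, ratio, strategy, ordering)."""
    return [
        {
            "variant": v,
            "visibility": r,
            "bucket_strategy": b,
            "ordering": o,
            "seed": scenario_seed(r, o),
        }
        for v, r, b, o in itertools.product(
            STAGE_VARIANTS, VISIBILITY_RATIOS, BUCKET_STRATEGIES, ORDERINGS
        )
    ]


def make_flags(n, visibility, ordering, seed):
    """Deterministic visibility flags with an exact visible count."""
    visible = round(n * visibility)
    flags = [1] * visible + [0] * (n - visible)
    if ordering == "shuffled":
        random.Random(seed).shuffle(flags)
    return flags


def reference_compaction(flags):
    """Oracle: indices of visible items in input order."""
    return [i for i, f in enumerate(flags) if f]
```

Using `zlib.crc32` rather than the built-in `hash` keeps seeds stable across interpreter runs, since string hashing in Python is randomized per process by default.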
Metrics: per-stage runtime, end-to-end throughput, and each stage's share of total runtime (bottleneck share).
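A minimal sketch of how these three quantities could be derived from per-stage timings. The function name and dictionary keys are hypothetical, and it assumes stages run back to back without overlap, so that stage times sum to the end-to-end time.

```python
def summarize(stage_ms, num_objects):
    """Derive throughput and bottleneck share from per-stage milliseconds."""
    total_ms = sum(stage_ms.values())
    shares = {name: ms / total_ms for name, ms in stage_ms.items()}
    return {
        "total_ms": total_ms,
        # Objects processed per second over the whole pipeline.
        "objects_per_second": num_objects * 1000.0 / total_ms,
        "shares": shares,
        # Stage with the largest share of total runtime.
        "bottleneck": max(shares, key=shares.get),
    }
```

If stages overlap on the GPU, the end-to-end time should be measured separately rather than summed, and the shares read as approximate.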
- Implement a minimally correct baseline variant first.
- Add one optimized variant at a time to preserve causal clarity.
- Add deterministic correctness tests and edge-case datasets.
- Run warmup plus repeated measured runs for each matrix point.
- Export raw data and metadata to versioned result files.
- Generate charts and write a short interpretation section with caveats.
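The warmup, repeated-run, and export steps above can be sketched as a small host-side harness. This example times a callable with `time.perf_counter` purely as a CPU stand-in; in the real harness the injected `timer` would be replaced by GPU timestamp query results. The record layout and `schema_version` field are assumptions.

```python
import json
import statistics
import time


def measure(fn, warmup=3, repeats=10, timer=time.perf_counter):
    """Run fn `warmup` times untimed, then return `repeats` samples in ms."""
    for _ in range(warmup):
        fn()
    samples_ms = []
    for _ in range(repeats):
        start = timer()
        fn()
        samples_ms.append((timer() - start) * 1000.0)
    return samples_ms


def make_record(point, samples_ms, metadata):
    """Keep raw samples next to the summary so charts can be regenerated."""
    return {
        "point": point,            # the matrix point that was measured
        "samples_ms": samples_ms,  # raw data, never only the aggregate
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "metadata": metadata,      # e.g. GPU name, driver, commit (caller-supplied)
    }


def export(records, path):
    """Write all records to one JSON file suitable for version control."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"schema_version": 1, "records": records}, f, indent=2)
```

Storing the raw samples rather than only the median keeps later questions answerable, such as whether a variant is slower on average or merely noisier.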
- Which stage or operation dominates total cost and why?
- Which tuning parameter is most sensitive?
- Which findings are likely architecture-dependent?
- What would change in a production rendering/compute pipeline?
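For the sensitivity question, one simple option is to compare how much the metric moves when a single parameter varies and the others are held fixed. The sketch below uses the mean max/min ratio as the score; that choice, the function name, and the record keys are assumptions, and other sensitivity measures may suit the data better.

```python
from collections import defaultdict


def sensitivity(records, params, metric="median_ms"):
    """For each parameter, average the max/min metric ratio across groups of
    records that agree on every other parameter."""
    result = {}
    for p in params:
        groups = defaultdict(list)
        for r in records:
            key = tuple(r[q] for q in params if q != p)
            groups[key].append(r[metric])
        ratios = [max(v) / min(v) for v in groups.values() if len(v) > 1]
        # A score of 1.0 means the parameter had no measurable effect.
        result[p] = sum(ratios) / len(ratios) if ratios else 1.0
    return result
```

A high score identifies a parameter worth a finer sweep; it does not by itself explain why the parameter matters, which still needs per-stage timings.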
Deliverables: pipeline stage table, bottleneck map, and architecture notes.
Minimum artifact set:
- one core chart
- one summary table
- one short conclusions page with limitations
- Frame conclusions as measured observations plus reasoned interpretation.
- Avoid claiming universal behavior from one GPU unless the result has been validated on other GPUs.
- Highlight tradeoffs and failure modes, not just best numbers.