- Approximate the cost of rising register pressure without changing the overall memory pattern.
- How much performance is lost as the kernel carries more live temporaries?
- Variants: temp_4, temp_8, temp_16, temp_32
- Keep the same memory traffic and output contract while increasing the amount of per-thread temporary state.
- Validate every variant against the same deterministic CPU reference.
- Median GPU time by temporary-count variant.
- Slowdown relative to the lightest register-pressure path.
- Evidence of occupancy or scheduling cliffs.
- This is a proxy experiment, so read the results as sensitivity to register pressure rather than as a literal register-count measurement.
- Large slowdowns indicate that arithmetic-only optimizations can still fail if they inflate live state.
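
The validation step (every variant checked against one deterministic CPU reference) might look roughly like the Python model below. It is only a sketch under stated assumptions, not the experiment's actual kernel: `TERMS`, `term`, `cpu_reference`, and `variant_model` are hypothetical names, and the idea of spreading a fixed amount of work across N live accumulators is one possible way to raise per-thread state while keeping the output identical.

```python
MASK = 0xFFFFFFFF   # emulate 32-bit unsigned wraparound
TERMS = 32          # hypothetical fixed amount of work per element


def term(x, k):
    """Cheap deterministic integer mix (an assumed stand-in for the real math)."""
    return ((x ^ (k * 0x9E3779B9)) * 2654435761 + k) & MASK


def cpu_reference(inputs):
    """Single reference: one running sum per element, no extra temporaries."""
    return [sum(term(x, k) for k in range(TERMS)) & MASK for x in inputs]


def variant_model(inputs, temps):
    """Model of a temp_N variant: same terms, spread over `temps` accumulators.

    All accumulators stay live until the final reduction, which is what a
    GPU kernel would need registers for. Modular integer addition is
    associative, so the result matches the reference for any `temps`.
    """
    out = []
    for x in inputs:
        acc = [0] * temps
        for k in range(TERMS):
            slot = k % temps
            acc[slot] = (acc[slot] + term(x, k)) & MASK
        out.append(sum(acc) & MASK)
    return out


# Deterministic input: a fixed linear congruential sequence, no randomness.
inputs = [(i * 1103515245 + 12345) & MASK for i in range(64)]
expected = cpu_reference(inputs)

# One pass/fail flag per variant, all checked against the same reference.
results = {
    f"temp_{t}": variant_model(inputs, t) == expected
    for t in (4, 8, 16, 32)
}
print(results)
```

Integer arithmetic is used here so the comparison can be exact. A floating-point kernel would need a tolerance instead, because regrouping the sum changes rounding.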
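
The reported metrics (median GPU time per variant, slowdown relative to the lightest path, and possible cliffs) could be derived from raw timing samples along these lines. This is a sketch, not the project's tooling: `summarize`, the `cliff_ratio` threshold, and the choice of temp_4 as the baseline are assumptions, and the sample numbers are placeholders rather than measurements.

```python
from statistics import median


def summarize(samples_ms, baseline="temp_4", cliff_ratio=1.5):
    """Return per-variant medians, slowdowns vs. baseline, and cliff steps.

    samples_ms: dict mapping variant name -> list of timings in ms,
                ordered from lightest to heaviest variant.
    cliff_ratio: assumed threshold; a step between adjacent variants whose
                 median grows by at least this factor is flagged as a cliff.
    """
    medians = {name: median(ts) for name, ts in samples_ms.items()}
    base = medians[baseline]
    slowdown = {name: m / base for name, m in medians.items()}

    names = list(samples_ms)  # dicts preserve insertion order
    cliffs = [
        (a, b)
        for a, b in zip(names, names[1:])
        if medians[b] / medians[a] >= cliff_ratio
    ]
    return medians, slowdown, cliffs


# Placeholder timings for illustration only; NOT measured results.
placeholder_ms = {
    "temp_4": [1.0, 1.1, 0.9],
    "temp_8": [1.1, 1.0, 1.2],
    "temp_16": [1.3, 1.2, 1.4],
    "temp_32": [2.6, 2.5, 2.7],
}

medians, slowdown, cliffs = summarize(placeholder_ms)
for name in placeholder_ms:
    print(f"{name}: median={medians[name]:.2f} ms, slowdown={slowdown[name]:.2f}x")
print("cliff steps:", cliffs)
```

A flagged step only shows where time jumps disproportionately. Attributing it to occupancy loss or spilling would still need profiler or compiler output, consistent with the proxy caveat above.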