perf(gpu): avoid temporary half-value staging in LogUp GKR by peter941221 · Pull Request #2806 · succinctlabs/sp1

peter941221 · 2026-05-21T15:58:16Z

This PR removes temporary half-value struct materialization inside the measured LogUp GKR circuit-layer hotspot. In the strongest same-session local A/B, fix_and_sum_circuit_layer.gpu_exec moved from 300.732us to 179.397us, replayed sum_as_poly moved from 259.818us to 146.141us, and the fused boundary timing (sumcheck.finalize_univariate.total_after_explicit_sync) moved from 461.293us to 282.597us.

What changed

This is a narrow hot-path change in sp1-gpu/crates/sys/include/logup_gkr/round.cuh.

Inside sumAsPolyCircuitLayerInner, the patch stops materializing a temporary CircuitValues valuesHalf struct just to evaluate the half-point. Instead, it computes the four half-point scalars directly and feeds them into a scalar helper that preserves the existing sumAsPoly arithmetic.

The patch does not change transcript order, kernel topology, output layout, or finalize logic.
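The shape of the change can be illustrated with a minimal host-side C++ sketch. This is not the code in round.cuh: the toy prime field, the field names n0/n1/d0/d1, the per-point formula in termScalar, and the convention that the half-point is the unscaled sum of the zero and one values are all assumptions made for illustration. Only the names CircuitValues and valuesHalf come from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Toy prime field (assumption); the real kernel uses the project's own field types.
constexpr uint64_t P = 2013265921ULL;
inline uint64_t add(uint64_t a, uint64_t b) { return (a + b) % P; }
inline uint64_t mul(uint64_t a, uint64_t b) { return (a * b) % P; }

// Hypothetical stand-in for CircuitValues: two numerators and two denominators.
struct CircuitValues { uint64_t n0, n1, d0, d1; };

// Hypothetical per-point term; the real sumAsPoly arithmetic may differ.
inline uint64_t termScalar(uint64_t n0, uint64_t n1, uint64_t d0, uint64_t d1,
                           uint64_t lambda) {
    return add(mul(lambda, add(mul(n0, d1), mul(n1, d0))), mul(d0, d1));
}

// Struct-taking wrapper, mirroring the pre-patch call shape.
inline uint64_t termStruct(const CircuitValues& v, uint64_t lambda) {
    return termScalar(v.n0, v.n1, v.d0, v.d1, lambda);
}

// Before: build a temporary valuesHalf struct only to evaluate it once.
inline uint64_t halfBefore(const CircuitValues& zero, const CircuitValues& one,
                           uint64_t lambda) {
    CircuitValues valuesHalf{add(zero.n0, one.n0), add(zero.n1, one.n1),
                             add(zero.d0, one.d0), add(zero.d1, one.d1)};
    return termStruct(valuesHalf, lambda);
}

// After: compute the four half-point scalars and pass them straight through.
inline uint64_t halfAfter(const CircuitValues& zero, const CircuitValues& one,
                          uint64_t lambda) {
    return termScalar(add(zero.n0, one.n0), add(zero.n1, one.n1),
                      add(zero.d0, one.d0), add(zero.d1, one.d1), lambda);
}

// Self-check: both forms must agree, since only the staging changes.
inline void selfCheck() {
    CircuitValues zero{3, 5, 7, 11}, one{13, 17, 19, 23};
    assert(halfBefore(zero, one, 29) == halfAfter(zero, one, 29));
}
```

Both functions compute the same value by construction; the only difference is whether an intermediate struct is staged. Whether that staging costs registers or local memory on a given GPU depends on the compiler, which is why the measurements in this PR matter more than the sketch.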

Why this target

This patch came out of profiling, not blind arithmetic tuning.

2.1 Fine-grained timing first made finalize_univariate look hot.

2.2 Forced-sync stage attribution then moved the remaining cost into the fused circuit-layer producer.

2.3 Safe replay attribution on the real fused output showed that replayed sum_as_poly alone was almost as expensive as the whole fused producer.

In the local baseline replay run:

fix_and_sum_circuit_layer.gpu_exec: 220.702us

fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec: 216.164us

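In that run, replayed sum_as_poly accounted for about 98% of the fused producer's time (216.164us of 220.702us). That is why this PR targets sumAsPolyCircuitLayerInner instead of another finalize-only rewrite or a broader transition rewrite.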

Validation

I restaged the winning A0 cut onto a clean branch from main@98a376e87ec9dd5c3ae3495b98846bf921d6035b and reran compile validation on the outgoing branch.

Outgoing branch:

peter941221:perf/logup-gkr-a0

Outgoing commit:

f933bda3810b2f0dfe0788175df9518f7e657956

Command:

wsl.exe -d Ubuntu-24.04 -- bash -lc "cd <temporary WSL verification worktree> && cargo check -p sp1-gpu-logup-gkr"

Result:

passed

Benchmark evidence

The strongest evidence is a same-session baseline versus patched comparison on the same main base.

Baseline:

fix_and_sum_circuit_layer.gpu_exec: 300.732us

fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec: 259.818us

sumcheck.finalize_univariate.total_after_explicit_sync: 461.293us

prove/random/core_2^20: 289.20-409.77ms

Patched:

fix_and_sum_circuit_layer.gpu_exec: 179.397us

fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec: 146.141us

sumcheck.finalize_univariate.total_after_explicit_sync: 282.597us

prove/random/core_2^20: 124.68-240.74ms

Criterion reported "Performance has improved" on that patched run.
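For readers who prefer relative deltas, the stage-level numbers above can be converted with a small helper. This is a convenience sketch, not part of the patch or the benchmark harness; the function names are hypothetical, and the inputs are the baseline and patched timings quoted above.

```cpp
#include <cassert>
#include <cstdio>

// Percentage reduction from a baseline timing to a patched timing.
inline double reductionPercent(double baselineUs, double patchedUs) {
    assert(baselineUs > 0.0);  // a zero baseline would make the ratio meaningless
    return 100.0 * (baselineUs - patchedUs) / baselineUs;
}

// Prints the relative reduction for the three stage metrics from the A/B run.
inline void printStageReductions() {
    std::printf("fix_and_sum_circuit_layer.gpu_exec: %.1f%%\n",
                reductionPercent(300.732, 179.397));
    std::printf("sum_as_poly_replay.gpu_exec: %.1f%%\n",
                reductionPercent(259.818, 146.141));
    std::printf("finalize_univariate.total_after_explicit_sync: %.1f%%\n",
                reductionPercent(461.293, 282.597));
}
```

On these inputs the reductions come out to roughly 40%, 44%, and 39% respectively. The end-to-end prove/random/core_2^20 ranges are left out deliberately, since they are reported as intervals rather than point values.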

A later patched confirmation kept the same stage-level direction:

fix_and_sum_circuit_layer.gpu_exec: 177.666us

fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec: 144.523us

sumcheck.finalize_univariate.total_after_explicit_sync: 307.568us

Why this PR stays small

I also tried a broader follow-up that removed the local valuesZero and valuesOne struct staging too, and it regressed clearly.

Versus the stronger A0 run:

fix_and_sum_circuit_layer.gpu_exec: 179.397us -> 271.304us

fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec: 146.141us -> 231.833us

sumcheck.finalize_univariate.total_after_explicit_sync: 282.597us -> 407.165us

The same end-to-end run also regressed to prove/random/core_2^20 = 389.69-468.00ms.

So the reviewer story here is intentionally narrow: removing the temporary half-value staging helped, while flattening the rest of the path did not.

Caveat

The claim is strongest at the stage level; the end-to-end result is secondary.

The stage signal is repeatable locally. The end-to-end prove/random/core_2^20 result still has enough variance that I do not want to oversell it beyond the measured data above.

peter941221 · 2026-05-21T15:59:33Z

This is the narrow A0 cut from the replay-driven hotspot narrowing chain.
The attached card is the shortest version of the story: baseline A versus the winning A0 cut, with the three core stage metrics and their deltas.
Strongest same-session local numbers:

fix_and_sum_circuit_layer.gpu_exec: 300.732us -> 179.397us

fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec: 259.818us -> 146.141us

sumcheck.finalize_univariate.total_after_explicit_sync: 461.293us -> 282.597us

perf(gpu): avoid temporary LogUp half-value struct

2dc9248

peter941221 force-pushed the perf/logup-gkr-a0 branch from f933bda to 2dc9248 on June 2, 2026 02:35

Merge branch 'main' into perf/logup-gkr-a0

b22b665

peter941221 wants to merge 2 commits into succinctlabs:main from peter941221:perf/logup-gkr-a0
