perf(gpu): avoid temporary half-value staging in LogUp GKR#2806

Open
peter941221 wants to merge 2 commits into succinctlabs:main from peter941221:perf/logup-gkr-a0


@peter941221 peter941221 commented May 21, 2026


This PR removes temporary half-value struct materialization inside the measured LogUp GKR circuit-layer hotspot. In the strongest same-session local A/B comparison, fix_and_sum_circuit_layer.gpu_exec moved from 300.732us to 179.397us, the replayed sum_as_poly moved from 259.818us to 146.141us, and the fused boundary timing (sumcheck.finalize_univariate.total_after_explicit_sync) moved from 461.293us to 282.597us.

  1. What changed

This is a narrow hot-path change in sp1-gpu/crates/sys/include/logup_gkr/round.cuh.

Inside sumAsPolyCircuitLayerInner, the patch stops materializing a temporary CircuitValues valuesHalf struct just to evaluate the half-point. Instead, it computes the four half-point scalars directly and feeds them into a scalar helper that preserves the existing sumAsPoly arithmetic.

The patch does not change transcript order, kernel topology, output layout, or finalize logic.
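The shape of the change can be sketched in plain host-side C++ rather than the actual CUDA from round.cuh. Everything named below is an assumption made for illustration: the toy modulus, the `CircuitValues` field names, the `half` interpolation, and the `term` formula standing in for the real sumAsPoly arithmetic. The sketch only shows that staging a `valuesHalf` struct and passing four scalars yield the same value; it does not model why the scalar form was faster on the GPU.

```cpp
#include <cstdint>

// Toy prime modulus, chosen only for illustration (not taken from the PR).
constexpr uint64_t P = 2013265921ULL;
constexpr uint64_t INV2 = (P + 1) / 2;  // 2 * INV2 == 1 (mod P)

constexpr uint64_t add(uint64_t a, uint64_t b) { return (a + b) % P; }
constexpr uint64_t mul(uint64_t a, uint64_t b) { return (a * b) % P; }  // a, b < 2^31

// Value at the midpoint of the line through (0, a) and (1, b).
constexpr uint64_t half(uint64_t a, uint64_t b) { return mul(add(a, b), INV2); }

// Hypothetical stand-in for CircuitValues; the field names are assumed.
struct CircuitValues {
    uint64_t numerator0, numerator1, denominator0, denominator1;
};

// Hypothetical scalar helper standing in for the sumAsPoly arithmetic.
constexpr uint64_t term(uint64_t n0, uint64_t n1, uint64_t d0, uint64_t d1,
                        uint64_t lambda) {
    return add(add(mul(n0, d1), mul(n1, d0)), mul(lambda, mul(d0, d1)));
}

// Struct-taking wrapper, as the staged path would call it.
constexpr uint64_t termFromStruct(const CircuitValues& v, uint64_t lambda) {
    return term(v.numerator0, v.numerator1, v.denominator0, v.denominator1, lambda);
}

// Before: build a temporary valuesHalf struct, then evaluate it.
constexpr uint64_t halfTermStaged(const CircuitValues& zero,
                                  const CircuitValues& one, uint64_t lambda) {
    CircuitValues valuesHalf{half(zero.numerator0, one.numerator0),
                             half(zero.numerator1, one.numerator1),
                             half(zero.denominator0, one.denominator0),
                             half(zero.denominator1, one.denominator1)};
    return termFromStruct(valuesHalf, lambda);
}

// After: compute the four half-point scalars and pass them directly.
constexpr uint64_t halfTermScalar(const CircuitValues& zero,
                                  const CircuitValues& one, uint64_t lambda) {
    return term(half(zero.numerator0, one.numerator0),
                half(zero.numerator1, one.numerator1),
                half(zero.denominator0, one.denominator0),
                half(zero.denominator1, one.denominator1), lambda);
}

// Arbitrary sample inputs; both shapes must agree.
constexpr CircuitValues kZero{3, 5, 7, 11};
constexpr CircuitValues kOne{13, 17, 19, 23};
static_assert(halfTermStaged(kZero, kOne, 29) == halfTermScalar(kZero, kOne, 29),
              "staged and scalar paths must agree");
```

Because the two functions differ only in whether the intermediate struct exists, any timing difference on the device would come from how the compiler handles that temporary, which is what the measurements in this PR are about.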

  2. Why this target

This patch came out of profiling, not blind arithmetic tuning.

2.1 Fine-grained timing first made finalize_univariate look hot.

2.2 Forced-sync stage attribution then moved the remaining cost into the fused circuit-layer producer.

2.3 Safe replay attribution on the real fused output showed that replayed sum_as_poly alone was almost as expensive as the whole fused producer.

In the local baseline replay run:

fix_and_sum_circuit_layer.gpu_exec: 220.702us

fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec: 216.164us

That is why this PR targets sumAsPolyCircuitLayerInner instead of another finalize-only rewrite or a broader transition rewrite.
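The attribution argument reduces to one ratio, sketched here in C++ with the two baseline replay numbers quoted above. The helper name `replayShare` is hypothetical, and the ratio is only a rough indicator because the replay is a separate measurement rather than a strict sub-interval of the fused kernel.

```cpp
// Fraction of the fused producer's time that the replayed stage alone accounts for.
constexpr double replayShare(double replayUs, double fusedUs) {
    return replayUs / fusedUs;
}

// Baseline replay run from this PR description.
constexpr double kFusedUs = 220.702;   // fix_and_sum_circuit_layer.gpu_exec
constexpr double kReplayUs = 216.164;  // ...sum_as_poly_replay.gpu_exec

// The replayed sum_as_poly is within a few percent of the whole fused producer.
static_assert(replayShare(kReplayUs, kFusedUs) > 0.97, "replay dominates the fused stage");
static_assert(replayShare(kReplayUs, kFusedUs) < 0.99, "replay is still below the fused total");
```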

  3. Validation

I restaged the winning A0 cut onto a clean branch from main@98a376e87ec9dd5c3ae3495b98846bf921d6035b and reran compile validation on the outgoing branch.

Outgoing branch:

peter941221:perf/logup-gkr-a0

Outgoing commit:

f933bda3810b2f0dfe0788175df9518f7e657956

Command:

wsl.exe -d Ubuntu-24.04 -- bash -lc "cd <temporary WSL verification worktree> && cargo check -p sp1-gpu-logup-gkr"

Result:

passed

  4. Benchmark evidence

The strongest evidence is a same-session baseline versus patched comparison on the same main base.

Baseline:

fix_and_sum_circuit_layer.gpu_exec: 300.732us

fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec: 259.818us

sumcheck.finalize_univariate.total_after_explicit_sync: 461.293us

prove/random/core_2^20: 289.20-409.77ms

Patched:

fix_and_sum_circuit_layer.gpu_exec: 179.397us

fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec: 146.141us

sumcheck.finalize_univariate.total_after_explicit_sync: 282.597us

prove/random/core_2^20: 124.68-240.74ms

Criterion reported "Performance has improved." on that patched run.

A later patched confirmation kept the same stage-level direction:

fix_and_sum_circuit_layer.gpu_exec: 177.666us

fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec: 144.523us

sumcheck.finalize_univariate.total_after_explicit_sync: 307.568us
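For reviewers who prefer relative deltas, the same-session stage numbers can be turned into percentage reductions with a few lines of C++. The struct and function names are hypothetical; the timings are copied from the baseline and patched lists above.

```cpp
// One stage metric with its baseline and patched timings in microseconds.
struct StageTiming {
    const char* name;
    double baselineUs;
    double patchedUs;
};

// Same-session A/B numbers from this PR description.
constexpr StageTiming kStages[] = {
    {"fix_and_sum_circuit_layer.gpu_exec", 300.732, 179.397},
    {"fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec", 259.818, 146.141},
    {"sumcheck.finalize_univariate.total_after_explicit_sync", 461.293, 282.597},
};

// Percentage of the baseline time removed by the patch.
constexpr double reductionPercent(const StageTiming& s) {
    return 100.0 * (s.baselineUs - s.patchedUs) / s.baselineUs;
}

// Each stage drops by somewhere between 35% and 45% in this run.
static_assert(reductionPercent(kStages[0]) > 35.0 && reductionPercent(kStages[0]) < 45.0,
              "gpu_exec reduction");
static_assert(reductionPercent(kStages[1]) > 35.0 && reductionPercent(kStages[1]) < 45.0,
              "sum_as_poly_replay reduction");
static_assert(reductionPercent(kStages[2]) > 35.0 && reductionPercent(kStages[2]) < 45.0,
              "finalize_univariate reduction");
```

On these inputs the three reductions come out at roughly 40%, 44%, and 39%. These are single-session figures, not averages over repeated runs.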

  5. Why this PR stays small

I also tried a broader follow-up that removed the local valuesZero and valuesOne struct staging too, and it regressed clearly.

Versus the stronger A0 run:

fix_and_sum_circuit_layer.gpu_exec: 179.397us -> 271.304us

fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec: 146.141us -> 231.833us

sumcheck.finalize_univariate.total_after_explicit_sync: 282.597us -> 407.165us

The same end-to-end run also regressed to prove/random/core_2^20 = 389.69-468.00ms.

So the scope of this PR is intentionally narrow: removing the temporary half-value staging helped, while flattening the rest of the path did not.

  6. Caveat

The stage-level result is the primary claim here; the end-to-end result is secondary.

The stage signal is repeatable locally. The end-to-end prove/random/core_2^20 result still has enough variance that I do not want to oversell it beyond the measured data above.
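One way to read the end-to-end numbers without overselling them is to check whether the reported prove/random/core_2^20 ranges overlap. The sketch below treats each reported range as a plain closed interval, which is an assumption about what the ranges mean; the names `Range` and `overlaps` are hypothetical.

```cpp
// A reported timing range in milliseconds, treated as a closed interval.
struct Range {
    double loMs;
    double hiMs;
};

// Two closed intervals overlap when each starts before the other ends.
constexpr bool overlaps(Range a, Range b) {
    return a.loMs <= b.hiMs && b.loMs <= a.hiMs;
}

// prove/random/core_2^20 ranges quoted in this PR description.
constexpr Range kBaseline{289.20, 409.77};
constexpr Range kPatched{124.68, 240.74};
constexpr Range kBroaderFollowUp{389.69, 468.00};

// The patched range sits entirely below the baseline range.
static_assert(!overlaps(kBaseline, kPatched), "patched and baseline are disjoint");
// The broader follow-up overlaps the baseline, so it is not clearly separated from it.
static_assert(overlaps(kBaseline, kBroaderFollowUp), "follow-up overlaps baseline");
```

Disjoint ranges in a single session are suggestive, but each range here is more than 100ms wide, which is the variance this caveat refers to.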

@peter941221

Author
  1. This is the narrow A0 cut from the replay-driven hotspot narrowing chain.

  2. The attached card is the shortest version of the story: baseline A versus the winning A0 cut, with the three core stage metrics and their deltas.

  3. Strongest same-session local numbers:

fix_and_sum_circuit_layer.gpu_exec: 300.732us -> 179.397us

fix_and_sum_circuit_layer.sum_as_poly_replay.gpu_exec: 259.818us -> 146.141us

sumcheck.finalize_univariate.total_after_explicit_sync: 461.293us -> 282.597us

[Image: LogUp GKR A0 benchmark card]
