test(gpu): chunk_efficiency example — measure column-load amplification by tamirhemo · Pull Request #2797 · succinctlabs/sp1

tamirhemo · 2026-05-17T23:01:34Z

Stacked on top of #2795 (feat/gpu-zerocheck-v2). Measurement only — does not touch the zerocheck prover path.

What

Adds sp1-gpu/crates/air/examples/chunk_efficiency.rs: for every RiscvAir
chip, runs the v2 builder → analysis → chunker and reports how many column
loads the chunker emits relative to the irreducible floor.

cargo run --release --example chunk_efficiency -p sp1-gpu-air

Per chip, at ColumnLeaf granularity ((source, col) — local vs next count separately):

refd — distinct column leaves referenced (union over all chunks). Each must be loaded once; the floor.
loaded — Σ over chunks of |chunk.leafset|. The fused sequential kernel materialises one register slot per leaf per chunk, so a column shared by K chunks is fetched K times.
reload x = loaded / refd. 1.0 = perfect, >1.0 = the split forces redundant re-fetches.
ovr — chunks that are a single over-budget constraint (oversize singleton).
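
As a rough illustration of how these four numbers relate, the sketch below computes them from a list of chunk leafsets. The `ColumnLeaf` alias, the `ChunkStats` struct, and the `(leafset, constraint count)` input shape are assumptions made for this example; they are not the types used in chunk_efficiency.rs.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a column leaf: (source, column index).
/// The source tag separates local from next, so the same column index
/// under two sources counts as two leaves.
type ColumnLeaf = (u8, u32);

/// Per-chip numbers, named after the columns of the report.
#[derive(Debug, PartialEq)]
struct ChunkStats {
    refd: usize,
    loaded: usize,
    ovr: usize,
}

impl ChunkStats {
    /// reload x = loaded / refd (taken as 1.0 for a chip with no leaves).
    fn reload(&self) -> f64 {
        if self.refd == 0 {
            1.0
        } else {
            self.loaded as f64 / self.refd as f64
        }
    }
}

/// Each chunk is passed as (leafset, number of constraints in the chunk).
fn chunk_stats(chunks: &[(BTreeSet<ColumnLeaf>, usize)], max_leafset: usize) -> ChunkStats {
    let mut union = BTreeSet::new();
    let mut loaded = 0;
    let mut ovr = 0;
    for (leafset, n_constraints) in chunks {
        union.extend(leafset.iter().copied()); // distinct leaves across all chunks
        loaded += leafset.len(); // one load per leaf per chunk
        if *n_constraints == 1 && leafset.len() > max_leafset {
            ovr += 1; // a single constraint that exceeds the budget on its own
        }
    }
    ChunkStats { refd: union.len(), loaded, ovr }
}

fn main() {
    let a = BTreeSet::from([(0, 0), (0, 1), (1, 0)]);
    let b = BTreeSet::from([(0, 1), (0, 2)]);
    let stats = chunk_stats(&[(a, 2), (b, 1)], 64);
    println!("{:?} reload {:.2}x", stats, stats.reload());
}
```

In this toy input the leaf `(0, 1)` appears in both chunks, so it is counted twice in loaded but once in refd.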

Finding

At the production default CHUNKER_MAX_LEAFSET=64:

max_leafset	machine-wide column loads	distinct columns	reload x
64 (default)	501,094	53,338	9.39x
128	263,422	53,338	4.94x
256	235,688	53,338	4.42x
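
The reload column can be re-derived from the two counts in each row. The snippet below is only a small check of that arithmetic; the `reload` helper is a name chosen for this example.

```rust
/// reload x = loaded / refd.
fn reload(loaded: u64, refd: u64) -> f64 {
    loaded as f64 / refd as f64
}

fn main() {
    let refd = 53_338; // distinct columns, identical in every row
    for (max_leafset, loaded) in [(64u32, 501_094u64), (128, 263_422), (256, 235_688)] {
        let r = reload(loaded, refd);
        // Overhead is the redundant share above the floor, in percent.
        let overhead = (r - 1.0) * 100.0;
        println!("max_leafset={max_leafset:>3}  reload {r:.2}x  overhead {overhead:.0}%");
    }
}
```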

Two distinct causes, separable via the ovr column:

Oversize singletons: the wide field-arithmetic precompiles (Bls12381 / Bn254 / Ed
field ops) have individual constraints touching >64 columns. For example,
Bls12381Fp2AddSubAssign has 132 of its 151 chunks oversize, with reload 20.8x.
These loads are inherent to the budget, not greedy-packer fragmentation.
Fragmentation: KeccakPermute has zero oversize singletons yet still reloads
4.9x. Its constraints fit the budget but overlap so heavily that the greedy
first-fit-decreasing packer can't tile them cleanly.
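
Both causes can be reproduced on a toy input. The sketch below is a generic greedy first-fit-decreasing packer under a leafset budget, written for illustration only; it is not the sp1-gpu-air chunker, and `pack_ffd`, `stats`, and the `Leaf` ids are hypothetical names.

```rust
use std::collections::BTreeSet;

type Leaf = u32; // hypothetical column-leaf id
type LeafSet = BTreeSet<Leaf>;

/// Greedy first-fit-decreasing: visit constraints largest-first and put each
/// into the first chunk whose leafset, after the union, still fits `budget`.
fn pack_ffd(constraints: &[LeafSet], budget: usize) -> Vec<LeafSet> {
    let mut order: Vec<&LeafSet> = constraints.iter().collect();
    order.sort_by(|a, b| b.len().cmp(&a.len())); // stable sort, largest first
    let mut chunks: Vec<LeafSet> = Vec::new();
    for c in order {
        let slot = chunks.iter().position(|ch| ch.union(c).count() <= budget);
        match slot {
            Some(i) => chunks[i].extend(c.iter().copied()),
            // Nothing fits: open a new chunk. A constraint wider than the
            // budget always lands here and stays alone (oversize singleton).
            None => chunks.push(c.clone()),
        }
    }
    chunks
}

/// Returns (loaded, refd, ovr) for a packing.
fn stats(chunks: &[LeafSet], budget: usize) -> (usize, usize, usize) {
    let loaded: usize = chunks.iter().map(|ch| ch.len()).sum();
    let refd = chunks.iter().flatten().copied().collect::<LeafSet>().len();
    // Only a lone oversize constraint can exceed the budget in this packer.
    let ovr = chunks.iter().filter(|ch| ch.len() > budget).count();
    (loaded, refd, ovr)
}

fn main() {
    // Three 3-leaf constraints that overlap pairwise in a ring.
    let cs = vec![
        LeafSet::from([0, 1, 2]),
        LeafSet::from([2, 3, 4]),
        LeafSet::from([4, 5, 0]),
    ];
    for budget in [2, 4, 6] {
        let chunks = pack_ffd(&cs, budget);
        let (loaded, refd, ovr) = stats(&chunks, budget);
        println!("budget={budget} chunks={} loaded={loaded} refd={refd} ovr={ovr}", chunks.len());
    }
}
```

With a budget of 2 every constraint is an oversize singleton; with a budget of 4 none is oversize, yet the pairwise overlaps still force 9 loads for 6 distinct leaves; with a budget of 6 everything fits in one chunk and each leaf is loaded once.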

Not done here (deliberately)

No chunker change. Raising max_leafset cuts reloads but cross-tiers the fused
kernel's MAX_REGS template (128 → 512), doubling the per-thread regs[]
footprint — a real tradeoff the chunker docs already call out. Any improvement
(better packer, budget retune, routing oversize constraints to the escape-valve
lowering) should be a separate PR measured against GPU wall-time, not column
counts alone.

🤖 Generated with Claude Code

Adds the v2 zerocheck path: a DAG-native IR (sp1-gpu-air::v2) with a shape-aware chunker, fused-kernel lowering, and CUDA kernels under zerocheck_v2/. v2 is a drop-in replacement for the legacy zerocheck and is faster across all measured SP1 workloads (e2e v6/rsp: ~9% on core, ~8% on compressed).

Key properties:
- Machine-stable bytecode: assertion alpha indices are stored chip-relative; the cluster-dependent shift is applied at launch. This lets the compiled + uploaded bytecode be cached once per machine.
- Flat per-machine upload: every chunk's bytecode is concatenated into a small fixed set of device buffers, uploaded once at prover construction instead of per shard.
- Empty chips are skipped, so per-round work is O(non-empty chips).
- Chunker budget is keyed off the real register-pressure resource (max_leafset) rather than a synthetic work cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`read_layout_from_json` returned chips in file-array order. The downstream cluster is a `BTreeSet<Chip>` (chip-name order) and the prover's per-chip trace-offset computation walks it in that order, so a layout whose array wasn't already name-sorted produced a trace whose columns didn't line up — wrong `main_ptr`, out-of-bounds device reads. Sorting entries by name on load makes the bench robust to any input ordering. (v2 proving itself was correct; this only affected the synthetic-trace bench harness.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
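
The ordering mismatch described in this commit can be shown in miniature. The sketch below is an assumption-laden model, not the bench harness: `Entry`, `offsets`, and `sort_by_name` are hypothetical, a `BTreeSet<String>` stands in for `BTreeSet<Chip>`, and the chip names and widths are placeholders.

```rust
use std::collections::BTreeSet;

/// Hypothetical layout entry: a chip name and its trace width in columns.
#[derive(Clone, Debug)]
struct Entry {
    name: String,
    width: usize,
}

/// Starting column offset of each chip when traces are laid out in slice order.
fn offsets(entries: &[Entry]) -> Vec<(String, usize)> {
    let mut next = 0;
    entries
        .iter()
        .map(|e| {
            let start = next;
            next += e.width;
            (e.name.clone(), start)
        })
        .collect()
}

/// Sort on load so slice order agrees with a name-ordered BTreeSet.
fn sort_by_name(entries: &mut [Entry]) {
    entries.sort_by(|a, b| a.name.cmp(&b.name));
}

fn main() {
    // File-array order that is not name-sorted.
    let mut layout = vec![
        Entry { name: "KeccakPermute".into(), width: 3 },
        Entry { name: "AddSub".into(), width: 2 },
    ];
    // The consumer walks chips in name order.
    let cluster: BTreeSet<String> = layout.iter().map(|e| e.name.clone()).collect();

    println!("file order:   {:?}", offsets(&layout));
    sort_by_name(&mut layout);
    println!("sorted order: {:?}", offsets(&layout));

    // After sorting, the layout and the cluster iterate in the same order.
    assert!(layout.iter().map(|e| &e.name).eq(cluster.iter()));
}
```

In file order the first chip's columns start at offset 0, while a consumer iterating the set expects the name-first chip there; sorting on load removes the disagreement.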

…-use

- generate_json_data asserts that the synthesized trace's column order matches the cluster (BTreeSet) iteration order, so a future change to Chip's ordering fails loudly instead of silently corrupting the trace.
- test_zerocheck_v2_real_traces now proves the same shard cluster against both the real machine and a 5000-chip padded machine, and asserts the per-round cost doesn't scale with machine size.

Measured: ~914us/round (122-chip machine) vs ~930us (5000-chip machine). The per-shard work is pay-per-use: proportional to the active cluster (29 non-empty of 36), not the machine size.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Dumps every RiscvAir chip as a JSON layout entry (height 0) so the zerocheck bench's JSON source can be fed hand-tuned chip/height distributions — used to verify v2 has no per-active-chip cost blowup (12 vs 122 active chips at matched round count: 84.6 vs 84.7 ms). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Remove the legacy SSA-tape zerocheck (the `SymbolicProverFolder` → `Instruction32` → optimizer → `Instruction16` pipeline, `BlockAir`, and the per-shard per-chunk upload) and make the DAG-native fused-kernel implementation the only zerocheck path.

Drops the `v2`/`_v2`/`V2` suffixes across modules, structs, files, kernels, and env vars now that there is a single implementation:
- `sp1-gpu-air::v2` module -> `sp1-gpu-air::ir`
- `sp1-gpu-zerocheck` `v2.rs` -> `prover.rs`
- `sp1-gpu-sys` `v2_kernels` -> `kernels`, `zerocheck_v2_*` CUDA kernels -> `zerocheck_*`
- `MachineBytecodeV2` -> `MachineBytecode`, `zerocheck_v2` -> `zerocheck`, and related symbols

Removes the `SP1_GPU_ZEROCHECK_V1` fallback branch and the now-dead v1 constraint-codegen machinery from the shard prover and prover components.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…cation

Adds a measurement-only example that reports, per RiscvAir chip, how many column loads the zerocheck chunker emits (Σ chunk leafsets) versus how many distinct columns the chip's constraints actually reference. The ratio (`reload x`) is a direct measure of chunking quality: 1.0 = each column loaded once, >1.0 = the chunk split forces redundant re-fetches. Pure addition: touches nothing in the zerocheck prover path.

Finding at the default CHUNKER_MAX_LEAFSET=64: machine-wide column loads are ~9.4x the distinct-column floor (839% overhead). The wide field-arithmetic precompiles dominate: their constraints individually exceed the 64-leaf budget (oversize singletons), so the loads are inherent to the budget, not just greedy-packer fragmentation. KeccakPermute (no oversize singletons) still reloads ~4.9x from fragmentation alone.

Raising the budget to 256 drops the machine-wide factor to ~4.4x but cross-tiers the kernel template (MAX_REGS 128 -> 512), a tradeoff that needs a GPU A/B before any change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tamirhemo and others added 7 commits May 15, 2026 19:29

Merge remote-tracking branch 'origin/main' into feat/gpu-zerocheck-v2

3c7f6be

tamirhemo mentioned this pull request May 17, 2026

perf(gpu): seed-and-grow zerocheck chunker #2798

Closed

Base automatically changed from feat/gpu-zerocheck-v2 to main June 9, 2026 19:48

tamirhemo wants to merge 7 commits into main from perf/gpu-chunk-efficiency
