test(gpu): chunk_efficiency example — measure column-load amplification #2797

Open
tamirhemo wants to merge 7 commits into main from perf/gpu-chunk-efficiency

@tamirhemo

Stacked on top of #2795 (feat/gpu-zerocheck-v2). Measurement only — does not touch the zerocheck prover path.

What

Adds sp1-gpu/crates/air/examples/chunk_efficiency.rs: for every RiscvAir
chip, runs the v2 builder → analysis → chunker and reports how many column
loads the chunker emits relative to the irreducible floor.

cargo run --release --example chunk_efficiency -p sp1-gpu-air

Per chip, at ColumnLeaf granularity ((source, col) — local vs next count separately):

  • refd — distinct column leaves referenced (union over all chunks). Each must be loaded once; the floor.
  • loaded = Σ over chunks of |chunk.leafset|. The fused sequential kernel materialises one register slot per leaf per chunk, so a column shared by K chunks is fetched K times.
  • reload x = loaded / refd. 1.0 = perfect, >1.0 = the split forces redundant re-fetches.
  • ovr — chunks that are a single over-budget constraint (oversize singleton).
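The four metrics above can be sketched as follows. This is a minimal illustration, not the code in chunk_efficiency.rs: the `Source`, `ColumnLeaf`, `Chunk`, and `ChipStats` types and the `chip_stats` function are hypothetical stand-ins, and the sketch assumes a chunk exposes its leafset and its constraint count.

```rust
use std::collections::BTreeSet;

/// Which row a column is read from (hypothetical stand-in for `source`).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Source {
    Local,
    Next,
}

/// A column leaf is (source, col): local and next reads count separately.
type ColumnLeaf = (Source, usize);

/// Hypothetical view of one chunk: its leafset and how many constraints it holds.
struct Chunk {
    leafset: BTreeSet<ColumnLeaf>,
    num_constraints: usize,
}

#[derive(Debug, PartialEq)]
struct ChipStats {
    refd: usize,   // distinct leaves over all chunks (the floor)
    loaded: usize, // sum of per-chunk leafset sizes
    ovr: usize,    // single-constraint chunks that exceed the budget
}

impl ChipStats {
    /// reload x = loaded / refd. Returns NaN for a chip with no leaves.
    fn reload(&self) -> f64 {
        self.loaded as f64 / self.refd as f64
    }
}

fn chip_stats(chunks: &[Chunk], max_leafset: usize) -> ChipStats {
    // Union of all leafsets: each leaf counted once.
    let refd = chunks
        .iter()
        .flat_map(|c| c.leafset.iter().copied())
        .collect::<BTreeSet<ColumnLeaf>>()
        .len();
    // A leaf shared by K chunks contributes K here.
    let loaded: usize = chunks.iter().map(|c| c.leafset.len()).sum();
    let ovr = chunks
        .iter()
        .filter(|c| c.num_constraints == 1 && c.leafset.len() > max_leafset)
        .count();
    ChipStats { refd, loaded, ovr }
}
```

For two chunks that share one local column, `loaded` exceeds `refd` by one and `reload()` rises above 1.0 accordingly.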

Finding

At the production default CHUNKER_MAX_LEAFSET=64:

| max_leafset | machine-wide column loads | distinct columns | reload x |
|---|---|---|---|
| 64 (default) | 501,094 | 53,338 | 9.39x |
| 128 | 263,422 | 53,338 | 4.94x |
| 256 | 235,688 | 53,338 | 4.42x |
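The reload column follows directly from the other two. A small sketch of the arithmetic, with hypothetical helper names `reload_factor` and `overhead_percent`:

```rust
/// Reload factor: total column loads divided by distinct columns referenced.
fn reload_factor(loaded: u64, refd: u64) -> f64 {
    loaded as f64 / refd as f64
}

/// Redundant loads expressed as a percentage of the distinct-column floor.
fn overhead_percent(loaded: u64, refd: u64) -> f64 {
    (reload_factor(loaded, refd) - 1.0) * 100.0
}
```

With the figures above, `reload_factor(501_094, 53_338)` rounds to 9.39, and `overhead_percent` for the same row comes to about 839%.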

Two distinct causes, separable via the ovr column:

  • Oversize singletons: the wide field-arithmetic precompiles (Bls12381 / Bn254 / Ed
    field ops) have individual constraints touching >64 columns, e.g. in
    Bls12381Fp2AddSubAssign 132 of 151 chunks are oversize (reload 20.8x). The loads
    are inherent to the budget, not greedy-packer fragmentation.
  • Fragmentation: KeccakPermute has zero oversize singletons yet still
    reloads 4.9x. Its constraints fit the budget but overlap so heavily that the greedy
    first-fit-decreasing packer can't tile them cleanly.
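Both causes can be reproduced on toy inputs. The sketch below is a generic first-fit-decreasing packer over leaf sets, written for illustration only; it is not the chunker in this repository, and `Leaf`, `pack_ffd`, and `reload` are hypothetical names. It assumes a constraint joins the first chunk whose merged leafset stays within the budget and otherwise opens a new chunk, so an over-budget constraint always ends up alone.

```rust
use std::collections::BTreeSet;

/// Hypothetical leaf id; stands in for a (source, col) pair.
type Leaf = u32;

/// Greedy first-fit-decreasing packing under a leafset budget.
fn pack_ffd(constraints: &[BTreeSet<Leaf>], max_leafset: usize) -> Vec<BTreeSet<Leaf>> {
    // Largest leafsets first (stable sort keeps input order among ties).
    let mut order: Vec<&BTreeSet<Leaf>> = constraints.iter().collect();
    order.sort_by(|a, b| b.len().cmp(&a.len()));

    let mut chunks: Vec<BTreeSet<Leaf>> = Vec::new();
    for c in order {
        // First chunk whose union with this constraint still fits the budget.
        let slot = chunks
            .iter()
            .position(|chunk| chunk.union(c).count() <= max_leafset);
        match slot {
            Some(i) => chunks[i].extend(c.iter().copied()),
            // Nothing fits (or the constraint is itself over budget): new chunk.
            None => chunks.push(c.clone()),
        }
    }
    chunks
}

/// loaded / refd over a set of chunks.
fn reload(chunks: &[BTreeSet<Leaf>]) -> f64 {
    let loaded: usize = chunks.iter().map(|c| c.len()).sum();
    let refd = chunks.iter().flatten().collect::<BTreeSet<_>>().len();
    loaded as f64 / refd as f64
}
```

With a budget of 4, the three constraints {0,1,2}, {0,3,4}, {0,5,6} each fit on their own but cannot be merged, so column 0 is loaded three times with no oversize chunk involved. A single constraint over {0,...,5} under the same budget becomes an oversize singleton instead.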

Not done here (deliberately)

No chunker change. Raising max_leafset cuts reloads but cross-tiers the fused
kernel's MAX_REGS template (128 → 512), doubling the per-thread regs[]
footprint — a real tradeoff the chunker docs already call out. Any improvement
(better packer, budget retune, routing oversize constraints to the escape-valve
lowering) should be a separate PR measured against GPU wall-time, not column
counts alone.

🤖 Generated with Claude Code

tamirhemo and others added 7 commits May 15, 2026 19:29
Adds the v2 zerocheck path: a DAG-native IR (sp1-gpu-air::v2) with a
shape-aware chunker, fused-kernel lowering, and CUDA kernels under
zerocheck_v2/. v2 is a drop-in replacement for the legacy zerocheck and
is faster across all measured SP1 workloads (e2e v6/rsp: ~9% on core,
~8% on compressed).

Key properties:
- Machine-stable bytecode: assertion alpha indices are stored
  chip-relative; the cluster-dependent shift is applied at launch. This
  lets the compiled + uploaded bytecode be cached once per machine.
- Flat per-machine upload: every chunk's bytecode is concatenated into
  a small fixed set of device buffers, uploaded once at prover
  construction instead of per shard.
- Empty chips are skipped, so per-round work is O(non-empty chips).
- Chunker budget is keyed off the real register-pressure resource
  (max_leafset) rather than a synthetic work cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
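To illustrate the machine-stable bytecode idea in the commit above, here is a rough sketch. It assumes, without confirmation from the source, that a chip's shift is the number of assertions belonging to the chips ahead of it in the cluster; `launch_shifts` and `absolute_alpha_index` are hypothetical names, not functions from the crate.

```rust
/// Exclusive prefix sum over per-chip assertion counts, in cluster order.
/// Assumption: this is what the cluster-dependent shift looks like.
fn launch_shifts(assertion_counts: &[u32]) -> Vec<u32> {
    let mut shifts = Vec::with_capacity(assertion_counts.len());
    let mut running = 0u32;
    for &n in assertion_counts {
        shifts.push(running);
        running += n;
    }
    shifts
}

/// The cached bytecode keeps `chip_relative`; the shift is added only at launch.
fn absolute_alpha_index(chip_relative: u32, shift: u32) -> u32 {
    shift + chip_relative
}
```

Because only the shift depends on the cluster, the stored indices stay valid for any cluster the chip appears in, which is what allows a single cached upload per machine.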
`read_layout_from_json` returned chips in file-array order. The
downstream cluster is a `BTreeSet<Chip>` (chip-name order) and the
prover's per-chip trace-offset computation walks it in that order, so a
layout whose array wasn't already name-sorted produced a trace whose
columns didn't line up — wrong `main_ptr`, out-of-bounds device reads.
Sorting entries by name on load makes the bench robust to any input
ordering. (v2 proving itself was correct; this only affected the
synthetic-trace bench harness.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
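The fix described in the commit above amounts to sorting on load so that array order agrees with set iteration order. A minimal sketch, assuming chips order by name as plain strings; `LayoutEntry`, `sort_layout_by_name`, and `matches_cluster_order` are hypothetical and do not reflect the real signature of `read_layout_from_json`:

```rust
use std::collections::BTreeSet;

/// Hypothetical layout entry: a chip name and its trace height.
#[derive(Debug, Clone, PartialEq)]
struct LayoutEntry {
    name: String,
    height: usize,
}

/// Sort entries by chip name so they line up with name-ordered set iteration.
fn sort_layout_by_name(mut entries: Vec<LayoutEntry>) -> Vec<LayoutEntry> {
    entries.sort_by(|a, b| a.name.cmp(&b.name));
    entries
}

/// True when the entries are already in the order a BTreeSet of names iterates.
/// Duplicate names make this return false, since the set drops them.
fn matches_cluster_order(entries: &[LayoutEntry]) -> bool {
    let cluster: BTreeSet<&str> = entries.iter().map(|e| e.name.as_str()).collect();
    cluster
        .iter()
        .copied()
        .eq(entries.iter().map(|e| e.name.as_str()))
}
```

A check like `matches_cluster_order` is the kind of condition that can be asserted after trace generation, so an ordering mismatch fails loudly rather than producing misaligned columns.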
…-use

- generate_json_data asserts the synthesized trace's column order
  matches the cluster (BTreeSet) iteration order, so a future change to
  Chip's ordering fails loudly instead of silently corrupting the trace.
- test_zerocheck_v2_real_traces now proves the same shard cluster
  against both the real machine and a 5000-chip padded machine, and
  asserts the per-round cost doesn't scale with machine size. Measured:
  ~914us/round (122-chip machine) vs ~930us (5000-chip machine) — the
  per-shard work is pay-per-use, proportional to the active cluster
  (29 non-empty of 36) not the machine size.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dumps every RiscvAir chip as a JSON layout entry (height 0) so the
zerocheck bench's JSON source can be fed hand-tuned chip/height
distributions — used to verify v2 has no per-active-chip cost blowup
(12 vs 122 active chips at matched round count: 84.6 vs 84.7 ms).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the legacy SSA-tape zerocheck (the `SymbolicProverFolder` →
`Instruction32` → optimizer → `Instruction16` pipeline, `BlockAir`, and
the per-shard per-chunk upload) and make the DAG-native fused-kernel
implementation the only zerocheck path.

Drops the `v2`/`_v2`/`V2` suffixes across modules, structs, files,
kernels, and env vars now that there is a single implementation:

- `sp1-gpu-air::v2` module -> `sp1-gpu-air::ir`
- `sp1-gpu-zerocheck` `v2.rs` -> `prover.rs`
- `sp1-gpu-sys` `v2_kernels` -> `kernels`, `zerocheck_v2_*` CUDA
  kernels -> `zerocheck_*`
- `MachineBytecodeV2` -> `MachineBytecode`, `zerocheck_v2` -> `zerocheck`,
  and related symbols

Removes the `SP1_GPU_ZEROCHECK_V1` fallback branch and the now-dead v1
constraint-codegen machinery from the shard prover and prover
components.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cation

Adds a measurement-only example that reports, per RiscvAir chip, how many
column loads the zerocheck chunker emits (Σ chunk leafsets) versus how many
distinct columns the chip's constraints actually reference. The ratio
(`reload x`) is a direct measure of chunking quality: 1.0 = each column
loaded once, >1.0 = the chunk split forces redundant re-fetches.

Pure addition — touches nothing in the zerocheck prover path.

Finding at the default CHUNKER_MAX_LEAFSET=64: machine-wide column loads are
~9.4x the distinct-column floor (839% overhead). The wide field-arithmetic
precompiles dominate — their constraints individually exceed the 64-leaf
budget (oversize singletons), so the loads are inherent to the budget, not
just greedy-packer fragmentation. KeccakPermute (no oversize singletons)
still reloads ~4.9x from fragmentation alone. Raising the budget to 256
drops the machine-wide factor to ~4.4x but cross-tiers the kernel template
(MAX_REGS 128 -> 512) — a tradeoff that needs a GPU A/B before any change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Base automatically changed from feat/gpu-zerocheck-v2 to main June 9, 2026 19:48