test(gpu): chunk_efficiency example — measure column-load amplification #2797
Open
tamirhemo wants to merge 7 commits into
Adds the v2 zerocheck path: a DAG-native IR (sp1-gpu-air::v2) with a shape-aware chunker, fused-kernel lowering, and CUDA kernels under zerocheck_v2/. v2 is a drop-in replacement for the legacy zerocheck and is faster across all measured SP1 workloads (e2e v6/rsp: ~9% on core, ~8% on compressed).

Key properties:
- Machine-stable bytecode: assertion alpha indices are stored chip-relative; the cluster-dependent shift is applied at launch. This lets the compiled + uploaded bytecode be cached once per machine.
- Flat per-machine upload: every chunk's bytecode is concatenated into a small fixed set of device buffers, uploaded once at prover construction instead of per shard.
- Empty chips are skipped, so per-round work is O(non-empty chips).
- Chunker budget is keyed off the real register-pressure resource (max_leafset) rather than a synthetic work cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`read_layout_from_json` returned chips in file-array order. The downstream cluster is a `BTreeSet<Chip>` (chip-name order) and the prover's per-chip trace-offset computation walks it in that order, so a layout whose array wasn't already name-sorted produced a trace whose columns didn't line up — wrong `main_ptr`, out-of-bounds device reads. Sorting entries by name on load makes the bench robust to any input ordering. (v2 proving itself was correct; this only affected the synthetic-trace bench harness.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
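A minimal sketch of the ordering fix described in this commit, assuming a hypothetical `LayoutEntry` type and helper names (`sort_by_name`, `trace_offsets`); the real JSON schema and `Chip` type are not shown here. It illustrates why sorting on load makes the per-chip offsets agree with a name-ordered `BTreeSet` walk:

```rust
use std::collections::BTreeSet;

// Hypothetical layout entry, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
struct LayoutEntry {
    name: String,
    width: usize,
    height: usize,
}

/// Sort entries by chip name so they follow the same order as a
/// name-ordered `BTreeSet`, regardless of the order in the input file.
fn sort_by_name(mut entries: Vec<LayoutEntry>) -> Vec<LayoutEntry> {
    entries.sort_by(|a, b| a.name.cmp(&b.name));
    entries
}

/// Starting offset of each chip in a flat trace buffer, walking the
/// entries in the order given (each chip occupies width * height cells).
fn trace_offsets(entries: &[LayoutEntry]) -> Vec<(String, usize)> {
    let mut offset = 0usize;
    entries
        .iter()
        .map(|e| {
            let start = offset;
            offset += e.width * e.height;
            (e.name.clone(), start)
        })
        .collect()
}

fn entry(name: &str, width: usize, height: usize) -> LayoutEntry {
    LayoutEntry { name: name.to_string(), width, height }
}

fn main() {
    // File-array order that is not name-sorted.
    let file_order = vec![entry("Program", 3, 4), entry("Add", 2, 8), entry("Cpu", 5, 2)];
    let sorted = sort_by_name(file_order.clone());

    // The sorted entries match the iteration order of a BTreeSet of names.
    let names: BTreeSet<String> = file_order.iter().map(|e| e.name.clone()).collect();
    assert!(sorted.iter().map(|e| &e.name).eq(names.iter()));

    // Offsets computed in file order and in name order disagree for "Add".
    println!("file order: {:?}", trace_offsets(&file_order));
    println!("name order: {:?}", trace_offsets(&sorted));
}
```

Without the sort, a consumer that walks chips in name order would read each chip's columns from the offset computed for a different chip, which is the misalignment the commit message describes.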
…-use

- generate_json_data asserts the synthesized trace's column order matches the cluster (BTreeSet) iteration order, so a future change to Chip's ordering fails loudly instead of silently corrupting the trace.
- test_zerocheck_v2_real_traces now proves the same shard cluster against both the real machine and a 5000-chip padded machine, and asserts the per-round cost doesn't scale with machine size.

Measured: ~914us/round (122-chip machine) vs ~930us (5000-chip machine). The per-shard work is pay-per-use, proportional to the active cluster (29 non-empty of 36), not the machine size.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dumps every RiscvAir chip as a JSON layout entry (height 0) so the zerocheck bench's JSON source can be fed hand-tuned chip/height distributions — used to verify v2 has no per-active-chip cost blowup (12 vs 122 active chips at matched round count: 84.6 vs 84.7 ms). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the legacy SSA-tape zerocheck (the `SymbolicProverFolder` → `Instruction32` → optimizer → `Instruction16` pipeline, `BlockAir`, and the per-shard per-chunk upload) and make the DAG-native fused-kernel implementation the only zerocheck path.

Drops the `v2`/`_v2`/`V2` suffixes across modules, structs, files, kernels, and env vars now that there is a single implementation:
- `sp1-gpu-air::v2` module -> `sp1-gpu-air::ir`
- `sp1-gpu-zerocheck` `v2.rs` -> `prover.rs`
- `sp1-gpu-sys` `v2_kernels` -> `kernels`, `zerocheck_v2_*` CUDA kernels -> `zerocheck_*`
- `MachineBytecodeV2` -> `MachineBytecode`, `zerocheck_v2` -> `zerocheck`, and related symbols

Removes the `SP1_GPU_ZEROCHECK_V1` fallback branch and the now-dead v1 constraint-codegen machinery from the shard prover and prover components.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cation

Adds a measurement-only example that reports, per RiscvAir chip, how many column loads the zerocheck chunker emits (Σ chunk leafsets) versus how many distinct columns the chip's constraints actually reference. The ratio (`reload x`) is a direct measure of chunking quality: 1.0 = each column loaded once, >1.0 = the chunk split forces redundant re-fetches. Pure addition: touches nothing in the zerocheck prover path.

Finding at the default CHUNKER_MAX_LEAFSET=64: machine-wide column loads are ~9.4x the distinct-column floor (839% overhead). The wide field-arithmetic precompiles dominate: their constraints individually exceed the 64-leaf budget (oversize singletons), so the loads are inherent to the budget, not just greedy-packer fragmentation. KeccakPermute (no oversize singletons) still reloads ~4.9x from fragmentation alone.

Raising the budget to 256 drops the machine-wide factor to ~4.4x but cross-tiers the kernel template (MAX_REGS 128 -> 512), a tradeoff that needs a GPU A/B before any change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
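The measurement itself can be sketched in a few lines. This is an illustrative sketch, not the code of the example in this PR: the `Leaf` encoding, `ChunkStats`, and `chunk_stats` are hypothetical names, and a chunk is counted as oversize whenever its leafset exceeds the budget, which assumes the chunker only exceeds the budget for single-constraint chunks.

```rust
use std::collections::BTreeSet;

/// A column leaf as (source, column index). The encoding 0 = local row,
/// 1 = next row is an assumption made for this sketch.
type Leaf = (u8, usize);

#[derive(Debug, PartialEq)]
struct ChunkStats {
    /// Distinct leaves referenced across all chunks (the floor).
    refd: usize,
    /// Sum of leafset sizes over chunks (loads actually emitted).
    loaded: usize,
    /// Chunks whose leafset exceeds the budget.
    oversize: usize,
}

impl ChunkStats {
    /// Reload factor: 1.0 means every leaf is loaded exactly once.
    /// Returns NaN when no leaves are referenced.
    fn reload(&self) -> f64 {
        self.loaded as f64 / self.refd as f64
    }
}

/// Compute the per-chip statistics from the leafset of each chunk.
fn chunk_stats(chunks: &[BTreeSet<Leaf>], max_leafset: usize) -> ChunkStats {
    let refd = chunks.iter().flatten().collect::<BTreeSet<_>>().len();
    let loaded: usize = chunks.iter().map(|c| c.len()).sum();
    let oversize = chunks.iter().filter(|c| c.len() > max_leafset).count();
    ChunkStats { refd, loaded, oversize }
}

fn leaves(xs: &[Leaf]) -> BTreeSet<Leaf> {
    xs.iter().copied().collect()
}

fn main() {
    // Three chunks that share columns: 5 distinct leaves, 9 loads.
    let chunks = vec![
        leaves(&[(0, 0), (0, 1), (0, 2)]),
        leaves(&[(0, 1), (0, 2), (1, 0)]),
        leaves(&[(0, 2), (1, 0), (1, 1)]),
    ];
    let stats = chunk_stats(&chunks, 4);
    assert_eq!(stats, ChunkStats { refd: 5, loaded: 9, oversize: 0 });
    println!("{:?} reload = {:.2}x", stats, stats.reload());
}
```

In the toy input, leaf `(0, 2)` appears in all three chunks and is therefore counted three times in `loaded` but once in `refd`, which is the amplification the ratio is meant to expose.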
Stacked on top of #2795 (`feat/gpu-zerocheck-v2`). Measurement only: does not touch the zerocheck prover path.

What

Adds `sp1-gpu/crates/air/examples/chunk_efficiency.rs`: for every `RiscvAir` chip, runs the v2 builder → analysis → chunker and reports how many column loads the chunker emits relative to the irreducible floor.

Per chip, at `ColumnLeaf` granularity (`(source, col)`; local vs next count separately):
- `refd`: distinct column leaves referenced (union over all chunks). Each must be loaded once; this is the floor.
- `loaded`: `Σ over chunks of |chunk.leafset|`. The fused sequential kernel materialises one register slot per leaf per chunk, so a column shared by K chunks is fetched K times.
- `reload x`: `loaded / refd`. 1.0 = perfect, >1.0 = the split forces redundant re-fetches.
- `ovr`: chunks that are a single over-budget constraint (oversize singleton).

Finding
At the production default `CHUNKER_MAX_LEAFSET=64`, redundant column loads have two distinct causes, separable via the `ovr` column:
- Oversize singletons: the wide field-arithmetic precompiles have individual constraints touching >64 columns, e.g. `Bls12381Fp2AddSubAssign` is 132/151 oversize chunks, `reload 20.8x`. The loads are inherent to the budget, not greedy-packer fragmentation.
- Fragmentation: `KeccakPermute` has zero oversize singletons yet still shows `reload 4.9x`. Its constraints fit the budget but overlap so heavily that the greedy first-fit-decreasing packer can't tile them cleanly.
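To make the two causes concrete, here is a toy first-fit-decreasing packer under a leafset budget. It is a sketch under stated assumptions, not the chunker in this repository: `pack_ffd`, `Chunk`, and the rule of giving every over-budget constraint its own chunk are hypothetical, and leaves are plain column indices for brevity.

```rust
use std::cmp::Reverse;
use std::collections::BTreeSet;

type Leaf = usize; // column index only, for brevity

#[derive(Debug)]
struct Chunk {
    leaves: BTreeSet<Leaf>,
    constraints: Vec<usize>,
    oversize: bool,
}

fn leafset(cols: &[Leaf]) -> BTreeSet<Leaf> {
    cols.iter().copied().collect()
}

/// Greedy first-fit-decreasing: visit constraints from largest to smallest
/// leafset and put each into the first chunk whose merged leafset still
/// fits the budget. Over-budget constraints become singleton chunks.
fn pack_ffd(constraints: &[BTreeSet<Leaf>], budget: usize) -> Vec<Chunk> {
    let mut order: Vec<usize> = (0..constraints.len()).collect();
    order.sort_by_key(|&i| Reverse(constraints[i].len())); // stable sort
    let mut chunks: Vec<Chunk> = Vec::new();
    for i in order {
        let leaves = &constraints[i];
        if leaves.len() > budget {
            chunks.push(Chunk { leaves: leaves.clone(), constraints: vec![i], oversize: true });
            continue;
        }
        let slot = chunks
            .iter()
            .position(|c| !c.oversize && c.leaves.union(leaves).count() <= budget);
        match slot {
            Some(j) => {
                chunks[j].leaves.extend(leaves.iter().copied());
                chunks[j].constraints.push(i);
            }
            None => chunks.push(Chunk { leaves: leaves.clone(), constraints: vec![i], oversize: false }),
        }
    }
    chunks
}

/// (loaded, refd) for a set of chunks.
fn loaded_and_refd(chunks: &[Chunk]) -> (usize, usize) {
    let loaded = chunks.iter().map(|c| c.leaves.len()).sum();
    let refd = chunks.iter().flat_map(|c| c.leaves.iter()).collect::<BTreeSet<_>>().len();
    (loaded, refd)
}

fn main() {
    // Four overlapping constraints that each fit a budget of 4 leaves.
    let mut cs = vec![leafset(&[0, 1, 2]), leafset(&[1, 2, 3]), leafset(&[2, 3, 4]), leafset(&[0, 4, 5])];
    let (loaded, refd) = loaded_and_refd(&pack_ffd(&cs, 4));
    assert_eq!((loaded, refd), (10, 6)); // fragmentation alone

    // Add one constraint that exceeds the budget on its own.
    cs.push(leafset(&[0, 1, 2, 3, 4, 5, 6]));
    let chunks = pack_ffd(&cs, 4);
    let (loaded, refd) = loaded_and_refd(&chunks);
    assert_eq!((loaded, refd), (17, 7));
    assert_eq!(chunks.iter().filter(|c| c.oversize).count(), 1);
    println!("loaded = {loaded}, refd = {refd}, chunks = {}", chunks.len());
}
```

In the first case every constraint fits the budget, yet overlapping leafsets still push `loaded` above `refd`. In the second case the oversize constraint adds loads that no packing strategy can avoid at that budget.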
Not done here (deliberately)
No chunker change. Raising `max_leafset` cuts reloads but cross-tiers the fused kernel's `MAX_REGS` template (128 → 512), doubling the per-thread `regs[]` footprint, a real tradeoff the chunker docs already call out. Any improvement (better packer, budget retune, routing oversize constraints to the escape-valve lowering) should be a separate PR measured against GPU wall-time, not column counts alone.
🤖 Generated with Claude Code