perf(powdr): record-in-chip APC witness generation #2864
Draft
qwang98 wants to merge 5 commits into
Add the minimal per-invocation capture types that let the APC chip regenerate its witness by re-executing a block instead of materializing its events during tracing: sp1_jit::ReplayTrace (zero-copy replay over a shared read-oracle Arc), ApcInvocation/ApcInvocations (lean store with Arc<[MemValue]> + offsets), and the program's apc_indices_by_start_idx / start_pc_idx / num_cycles accessors. Removes the now-unused ApcEvents (events/apc.rs). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
TracingVM skips per-opcode event emission inside APC ranges and captures one ApcInvocation per block. A single current_skip tracks the in-progress invocation, resolved on range-exit / fresh-start / success / shard-end so loops and aborts stay correct (aborted blocks are replayed as software). CoreExecutionState borrows registers; a lightweight untracked replay VM regenerates blocks. ExecutionRecordSnapshot is trimmed to the three fields apply_calls/capture actually read. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ApcChip replays each captured ApcInvocation to synthesize its per-opcode witness, caching the generated trace in generate_dependencies for reuse in generate_trace (no ApcEvents extraction, no double generation). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
APC chips are pure byte-lookup producers with commutative, independent output, so run them in parallel and merge before the (order-independent) rest. This is the dominant generate_dependencies cost under record-in-chip. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test_add_apc_prove exercises the capture/replay path with overlapping-APC skip; test_apc_core_rsp validates record-in-chip end-to-end on the real RSP program at APC=12 via the shared test_e2e helper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
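The parallelization argument in the commits above rests on the APC chips' byte-lookup output being commutative and independent. A minimal sketch of that idea follows, using only the standard library. `ByteLookup`, `LookupCounts`, `chip_dependencies`, `merge`, and `parallel_dependencies` are hypothetical names for illustration and are not the types or functions used in this PR.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical stand-in for a byte-lookup event: (opcode, a, b).
type ByteLookup = (u8, u8, u8);
/// Multiplicity map: how many times each lookup was requested.
type LookupCounts = HashMap<ByteLookup, u64>;

/// Toy "chip": every byte of every invocation requests one lookup.
fn chip_dependencies(invocations: &[Vec<u8>]) -> LookupCounts {
    let mut counts = LookupCounts::new();
    for inv in invocations {
        for &b in inv {
            *counts.entry((0, b, 0)).or_insert(0) += 1;
        }
    }
    counts
}

/// Merge by summing multiplicities. Integer addition is commutative and
/// associative, so the order in which partial maps are merged is irrelevant.
fn merge(into: &mut LookupCounts, from: LookupCounts) {
    for (key, count) in from {
        *into.entry(key).or_insert(0) += count;
    }
}

/// Run each chip on its own scoped thread, then fold the partial maps.
fn parallel_dependencies(chips: &[Vec<Vec<u8>>]) -> LookupCounts {
    let partials: Vec<LookupCounts> = thread::scope(|s| {
        let handles: Vec<_> = chips
            .iter()
            .map(|chip| s.spawn(move || chip_dependencies(chip)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("chip thread panicked"))
            .collect()
    });
    let mut total = LookupCounts::new();
    for partial in partials {
        merge(&mut total, partial);
    }
    total
}
```

Because each chip writes only to its own partial map, no locking is needed during generation; the only sequential step is the final fold, and merging the partial maps in any order gives the same totals.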
Record-in-chip APC witness generation
Stacked on #2781 (APC support in prover and sdk). Replaces the "extract `ApcEvents` from the full software trace" path with record-in-chip: during tracing, the executor skips per-opcode event emission inside APC ranges and captures a minimal `ApcInvocation` per block; the APC chip then regenerates that block's witness by re-executing it, caching the generated trace in `generate_dependencies` for reuse in `generate_trace`.

Why
The speedup is entirely CPU-side (so the GPU proving path benefits too, since trace generation runs there as well):

- No `ApcEvents` extraction from the software trace.
- APC trace generated once in `generate_dependencies` and reused in `generate_trace` (no double generation).
- APC-chip `generate_dependencies` parallelized (APC chips are commutative, independent byte-lookup producers).
- `reads: Arc<[MemValue]>` shared across a shard's invocations; `ExecutionRecordSnapshot` trimmed to the 3 fields actually read.

Correctness
A single `current_skip` tracks the in-progress invocation and is resolved on range-exit / fresh-start / success / shard-end, so loops and aborts stay correct: an aborted block's skipped prefix is replayed as software (rollback), while a successful block is regenerated by the chip (B1).

Verification (RSP, real `rsp-client` program, block 21740136, APC=12, cuda)

Measured against this same #2781 head with the `ApcEvents` path (APC=12), 2 clean unloaded runs each:

- Result: `rc=0` / `rc=0`, no `CumulativeSumMismatch`.
- `prove`, where shard records + traces are generated, is what record-in-chip changes (record-gen + trace-gen; the GPU proving portion is unchanged).
- `execute` time is not affected: `client.execute` only produces the cycle/gas `ExecutionReport` via `MinimalExecutorRunner` / `GasEstimatingVM` (no shard records, no APC witnesses), so it's identical between the two paths (any delta there is run-to-run variance).
- `proof time` is wall-clock (`tokio::Instant`) around `client.prove_with_mode`; single-sample, so expect some run-to-run GPU variance.
- `cargo fmt` + `cargo clippy -D warnings` clean on the touched crates.

Commits (5, dependency order)
1. feat(apc): record-in-chip data model - `ReplayTrace`, `ApcInvocation`
2. feat(executor): capture record-in-chip APC invocations during tracing
3. feat(machine): regenerate APC chip trace from record-in-chip invocations
4. perf(hypercube): parallelize APC-chip `generate_dependencies`
5. test(apc): record-in-chip prove tests (APC unit + RSP core e2e)

🤖 Generated with Claude Code
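To illustrate the capture/replay split and the shared `reads: Arc<[MemValue]>` described above, here is a rough, self-contained sketch. It assumes a toy "block" that simply accumulates the values it reads. `MemValue` is a stand-in newtype, and `InvocationSketch`, `Row`, `capture`, `freeze`, and `replay` are hypothetical names; none of this is the actual `sp1_jit::ReplayTrace` or `ApcInvocation` API.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the real `MemValue`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct MemValue(u32);

/// Lean per-block capture: a window into the shard-wide read log.
struct InvocationSketch {
    reads: Arc<[MemValue]>, // shared across invocations, never copied
    offset: usize,          // index of the first read used by this block
    len: usize,             // number of reads the block consumed
}

/// One regenerated witness row of the toy block.
#[derive(Debug, PartialEq)]
struct Row {
    operand: u32,
    acc: u32,
}

/// Tracing side: log only the values the block read and return the
/// (offset, len) window. No per-opcode rows are materialized here.
fn capture(all_reads: &mut Vec<MemValue>, inputs: &[u32]) -> (usize, usize) {
    let offset = all_reads.len();
    all_reads.extend(inputs.iter().map(|&v| MemValue(v)));
    (offset, inputs.len())
}

/// End of shard: freeze the log into one shared allocation and hand
/// every invocation a cheap reference-counted pointer to it.
fn freeze(all_reads: Vec<MemValue>, spans: &[(usize, usize)]) -> Vec<InvocationSketch> {
    let shared: Arc<[MemValue]> = all_reads.into();
    spans
        .iter()
        .map(|&(offset, len)| InvocationSketch { reads: Arc::clone(&shared), offset, len })
        .collect()
}

/// Chip side: re-execute the toy block against the read oracle to
/// rebuild its rows (here: a wrapping running sum of the reads).
fn replay(inv: &InvocationSketch) -> Vec<Row> {
    let mut acc = 0u32;
    inv.reads[inv.offset..inv.offset + inv.len]
        .iter()
        .map(|v| {
            acc = acc.wrapping_add(v.0);
            Row { operand: v.0, acc }
        })
        .collect()
}
```

In this sketch the per-invocation cost is one `Arc` clone plus two integers, and the rows exist only once `replay` is called. That is the trade the description outlines: less data captured during tracing in exchange for re-executing the block at trace-generation time.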