perf(powdr): record-in-chip APC witness generation by qwang98 · Pull Request #2864 · succinctlabs/sp1

qwang98 · 2026-07-02T04:56:34Z

Record-in-chip APC witness generation

Stacked on #2781 (APC support in prover and sdk). Replaces the "extract ApcEvents from the full software trace" path with record-in-chip. During tracing, the executor skips per-opcode event emission inside APC ranges and captures a minimal ApcInvocation per block. The APC chip then regenerates that block's witness by re-executing it, caching the generated trace in generate_dependencies for reuse in generate_trace.

Why

The speedup is entirely CPU-side, so the GPU proving path benefits too, since trace generation runs on the CPU there as well:

No redundant ApcEvents extraction from the software trace.
Trace cached in generate_dependencies and reused in generate_trace (no double generation).
APC-chip generate_dependencies parallelized (APC chips are commutative, independent byte-lookup producers).
Leaner capture: zero-copy reads (an Arc<[MemValue]> shared across a shard's invocations), and ExecutionRecordSnapshot trimmed to the 3 fields actually read.

Correctness

A single current_skip tracks the in-progress invocation and is resolved on range-exit / fresh-start / success / shard-end, so loops and aborts stay correct: an aborted block's skipped prefix is replayed as software (rollback), and a successful block is regenerated by the chip (B1).
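The resolution rules above can be sketched as a small state machine. This is an illustrative model only, not the PR's TracingVM: `Block`, `Skip`, `Resolved`, and `Tracer` are hypothetical names, blocks are assumed to be straight-line ranges of instruction indices, and the "replay as software" step is reduced to returning a `Rollback` value.

```rust
/// A straight-line APC block over instruction indices (hypothetical shape).
#[derive(Debug, Clone, Copy)]
struct Block { start: u64, len: u64 }

/// The single in-progress invocation.
#[derive(Debug, Clone, Copy)]
struct Skip { apc_id: usize, start_cycle: u64, skipped: u64 }

/// How an in-progress skip was resolved.
#[derive(Debug, PartialEq, Eq)]
enum Resolved {
    /// Block completed: the chip regenerates its witness later.
    Invocation { apc_id: usize, start_cycle: u64 },
    /// Block abandoned: the skipped prefix must be replayed as software.
    Rollback { start_cycle: u64, skipped: u64 },
}

#[derive(Default)]
struct Tracer { current_skip: Option<Skip>, cycle: u64 }

impl Tracer {
    /// Process one executed instruction and report any resolutions.
    fn step(&mut self, pc_idx: u64, blocks: &[Block]) -> Vec<Resolved> {
        let mut out = Vec::new();
        // Range-exit: the instruction does not continue the current block.
        if let Some(skip) = self.current_skip {
            if pc_idx != blocks[skip.apc_id].start + skip.skipped {
                out.push(Resolved::Rollback { start_cycle: skip.start_cycle, skipped: skip.skipped });
                self.current_skip = None;
            }
        }
        // Fresh-start: the instruction is the first one of some block.
        if self.current_skip.is_none() {
            if let Some(apc_id) = blocks.iter().position(|b| b.start == pc_idx) {
                self.current_skip = Some(Skip { apc_id, start_cycle: self.cycle, skipped: 0 });
            }
        }
        // Success: the whole block was skipped without leaving its range.
        if let Some(mut skip) = self.current_skip.take() {
            skip.skipped += 1;
            if skip.skipped == blocks[skip.apc_id].len {
                out.push(Resolved::Invocation { apc_id: skip.apc_id, start_cycle: skip.start_cycle });
            } else {
                self.current_skip = Some(skip);
            }
        }
        self.cycle += 1;
        out
    }

    /// Shard-end: an unfinished block cannot be handed to the chip.
    fn end_shard(&mut self) -> Option<Resolved> {
        self.current_skip
            .take()
            .map(|s| Resolved::Rollback { start_cycle: s.start_cycle, skipped: s.skipped })
    }
}
```

In this model a loop that jumps back to a block's first instruction mid-block produces a `Rollback` for the abandoned prefix and immediately opens a fresh skip, which is why a single `current_skip` slot is enough.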

Verification (RSP, real `rsp-client` program, block 21740136, APC=12, cuda)

Measured against this same #2781 head with the ApcEvents path (APC=12), 2 clean unloaded runs each:

| metric | #2781 baseline (ApcEvents) | this PR (record-in-chip) |
| --- | --- | --- |
| core prove | 49.2 / 53.0 s | 41.5 / 43.1 s (~−17%) |
| peak host RAM | 23.98 / 23.28 GB | 24.16 / 22.08 GB (comparable) |
| result | `rc=0` | `rc=0`, no `CumulativeSumMismatch` |

The speedup is in prove, where shard records and traces are generated. That is what record-in-chip changes (record-gen + trace-gen); the GPU proving portion is unchanged. Execute time is not affected: client.execute only produces the cycle/gas ExecutionReport via MinimalExecutorRunner/GasEstimatingVM, with no shard records and no APC witnesses, so it is identical between the two paths (any delta there is run-to-run variance).
Abort rate ~6.6% of APC invocations (aborted blocks fall back to software; in line with the ~10% seen at APC=88).
Proof time is wall-clock (tokio::time::Instant) around client.prove_with_mode; each run is a single sample, so expect some run-to-run GPU variance.
cargo fmt + cargo clippy -D warnings clean on the touched crates.

Commits (5, dependency order)

feat(apc): record-in-chip data model — ReplayTrace, ApcInvocation
feat(executor): capture record-in-chip APC invocations during tracing
feat(machine): regenerate APC chip trace from record-in-chip invocations
perf(hypercube): parallelize APC-chip generate_dependencies
test(apc): record-in-chip prove tests (APC unit + RSP core e2e)

🤖 Generated with Claude Code

Add the minimal per-invocation capture types that let the APC chip regenerate its witness by re-executing a block instead of materializing its events during tracing: sp1_jit::ReplayTrace (zero-copy replay over a shared read-oracle Arc), ApcInvocation/ApcInvocations (lean store with Arc<[MemValue]> + offsets), and the program's apc_indices_by_start_idx / start_pc_idx / num_cycles accessors. Removes the now-unused ApcEvents (events/apc.rs).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
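A minimal sketch of the "Arc<[MemValue]> + offsets" idea follows. It is not the PR's ApcInvocation or ReplayTrace: the `MemValue` fields, `Invocation`, `Capture`, and their methods are all hypothetical stand-ins. It only shows how every invocation in a shard can hold a window into one shared read buffer instead of owning its own copy.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the executor's memory value type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MemValue { clk: u64, value: u64 }

/// One captured block: a window into the shard-wide read buffer.
#[derive(Debug, Clone)]
struct Invocation {
    apc_id: usize,
    reads: Arc<[MemValue]>, // shared by every invocation in the shard
    start: usize,
    end: usize,
}

impl Invocation {
    /// The reads that belong to this invocation, borrowed without copying.
    fn window(&self) -> &[MemValue] {
        &self.reads[self.start..self.end]
    }
}

/// Accumulates reads for a shard, then freezes them into shared storage.
#[derive(Default)]
struct Capture {
    reads: Vec<MemValue>,
    windows: Vec<(usize, usize, usize)>, // (apc_id, start, end)
}

impl Capture {
    /// Append one block's reads and remember where they landed.
    fn record(&mut self, apc_id: usize, reads: &[MemValue]) {
        let start = self.reads.len();
        self.reads.extend_from_slice(reads);
        self.windows.push((apc_id, start, self.reads.len()));
    }

    /// Move the buffer into one Arc; each invocation clones only the pointer.
    fn finish(self) -> Vec<Invocation> {
        let shared: Arc<[MemValue]> = Arc::from(self.reads);
        self.windows
            .into_iter()
            .map(|(apc_id, start, end)| Invocation {
                apc_id,
                reads: Arc::clone(&shared),
                start,
                end,
            })
            .collect()
    }
}
```

Cloning an `Arc` only bumps a reference count, so the per-invocation cost in this sketch is a pointer plus two offsets, regardless of how many reads the block performed.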

TracingVM skips per-opcode event emission inside APC ranges and captures one ApcInvocation per block. A single current_skip tracks the in-progress invocation, resolved on range-exit / fresh-start / success / shard-end so loops and aborts stay correct (aborted blocks are replayed as software). CoreExecutionState borrows registers; a lightweight untracked replay VM regenerates blocks. ExecutionRecordSnapshot is trimmed to the three fields apply_calls/capture actually read.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ApcChip replays each captured ApcInvocation to synthesize its per-opcode witness, caching the generated trace in generate_dependencies for reuse in generate_trace (no ApcEvents extraction, no double generation).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
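The "generate once, reuse in the second pass" pattern might look roughly like the sketch below. It is an assumption-laden toy rather than the real ApcChip: `CachingChip`, the two-column row layout, and the squaring "replay" are invented for illustration, and the cache lives on the chip itself, whereas the PR may key it differently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical chip whose trace is expensive to regenerate.
struct CachingChip {
    invocations: Vec<u32>,
    cached: OnceLock<Vec<[u32; 2]>>,
    generations: AtomicUsize, // counts how often the replay actually ran
}

impl CachingChip {
    fn new(invocations: Vec<u32>) -> Self {
        Self { invocations, cached: OnceLock::new(), generations: AtomicUsize::new(0) }
    }

    /// Replay every invocation into trace rows, at most once.
    fn replay(&self) -> &Vec<[u32; 2]> {
        self.cached.get_or_init(|| {
            self.generations.fetch_add(1, Ordering::Relaxed);
            self.invocations.iter().map(|&x| [x, x.wrapping_mul(x)]).collect()
        })
    }

    /// First pass: derive byte lookups from the (now cached) rows.
    fn generate_dependencies(&self) -> Vec<u8> {
        self.replay().iter().flatten().map(|&cell| (cell & 0xff) as u8).collect()
    }

    /// Second pass: return the rows without replaying again.
    fn generate_trace(&self) -> Vec<[u32; 2]> {
        self.replay().clone()
    }

    fn generations(&self) -> usize {
        self.generations.load(Ordering::Relaxed)
    }
}
```

Calling `generate_dependencies` and then `generate_trace` on the same chip runs the replay closure a single time; `OnceLock` also keeps that guarantee if the two passes happen on different threads.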

APC chips are pure byte-lookup producers with commutative, independent output, so run them in parallel and merge before the (order-independent) rest. This is the dominant generate_dependencies cost under record-in-chip.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
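Why commutativity makes the parallel run safe can be shown with a small sketch. It assumes each chip's output is a multiset of byte lookups, modeled here as a `HashMap<u8, u64>` of multiplicities; `Lookups`, `chip_lookups`, `merge`, and both driver functions are hypothetical, and plain `std::thread::scope` stands in for whatever parallelism the PR actually uses.

```rust
use std::collections::HashMap;
use std::thread;

/// Byte value -> multiplicity (hypothetical model of a chip's lookups).
type Lookups = HashMap<u8, u64>;

/// Count the lookups one chip emits for its rows.
fn chip_lookups(rows: &[u8]) -> Lookups {
    let mut counts = Lookups::new();
    for &byte in rows {
        *counts.entry(byte).or_insert(0) += 1;
    }
    counts
}

/// Adding multiplicities is commutative and associative,
/// so the merge order cannot change the result.
fn merge(mut acc: Lookups, other: Lookups) -> Lookups {
    for (byte, n) in other {
        *acc.entry(byte).or_insert(0) += n;
    }
    acc
}

/// One scoped thread per chip, then a single merge.
fn parallel_dependencies(chips: &[Vec<u8>]) -> Lookups {
    thread::scope(|s| {
        let handles: Vec<_> = chips
            .iter()
            .map(|rows| s.spawn(move || chip_lookups(rows)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("chip worker panicked"))
            .fold(Lookups::new(), merge)
    })
}

/// Reference implementation for comparison.
fn sequential_dependencies(chips: &[Vec<u8>]) -> Lookups {
    chips.iter().map(|rows| chip_lookups(rows)).fold(Lookups::new(), merge)
}
```

Because the merged result is a sum of per-chip counts, the parallel and sequential versions agree for any scheduling of the worker threads.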

test_add_apc_prove exercises the capture/replay path with overlapping-APC skip; test_apc_core_rsp validates record-in-chip end-to-end on the real RSP program at APC=12 via the shared test_e2e helper.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

qwang98 and others added 5 commits July 2, 2026 12:48

qwang98 changed the title ~~perf(apc): record-in-chip APC witness generation~~ perf(powdr): record-in-chip APC witness generation Jul 2, 2026

qwang98 wants to merge 5 commits into powdr-labs/apc-support-in-prover-and-sdk from powdr-labs/apc-record-in-chip
