perf(powdr): record-in-chip APC witness generation #2864

Draft
qwang98 wants to merge 5 commits into powdr-labs/apc-support-in-prover-and-sdk from powdr-labs/apc-record-in-chip

@qwang98 qwang98 commented Jul 2, 2026

Collaborator

Record-in-chip APC witness generation

Stacked on #2781 (APC support in prover and sdk). Replaces the "extract ApcEvents from the full software trace" path with record-in-chip: during tracing the executor skips per-opcode event emission inside APC ranges and captures a minimal ApcInvocation per block; the APC chip then regenerates that block's witness by re-executing it, caching the generated trace in generate_dependencies for reuse in generate_trace.

Why

The speedup is entirely CPU-side (so the GPU proving path benefits too, since trace generation runs there as well):

  • No redundant ApcEvents extraction from the software trace.
  • Trace cached in generate_dependencies and reused in generate_trace (no double generation).
  • APC-chip generate_dependencies parallelized (APC chips are commutative, independent byte-lookup producers).
  • Leaner capture: reads are zero-copy, with one Arc<[MemValue]> shared across a shard's invocations, and ExecutionRecordSnapshot is trimmed to the 3 fields actually read.
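The "no double generation" point can be illustrated with a minimal sketch. It assumes a hypothetical `SketchChip` type (not the real ApcChip) whose trace is built once, on whichever of the two calls runs first, and then reused. The `builds` counter and the doubling "trace" are illustrative assumptions only.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for an APC chip; not the real sp1/powdr type.
pub struct SketchChip {
    /// Trace cached by the first call that needs it.
    cached: OnceLock<Vec<u32>>,
    /// Counts how often the expensive build actually ran.
    pub builds: AtomicUsize,
}

impl SketchChip {
    pub fn new() -> Self {
        Self { cached: OnceLock::new(), builds: AtomicUsize::new(0) }
    }

    /// Placeholder for the expensive replay; here it just doubles each input.
    fn build_trace(&self, input: &[u32]) -> Vec<u32> {
        self.builds.fetch_add(1, Ordering::Relaxed);
        input.iter().map(|x| x.wrapping_mul(2)).collect()
    }

    /// Builds (and caches) the trace, returning the number of rows produced.
    pub fn generate_dependencies(&self, input: &[u32]) -> usize {
        self.cached.get_or_init(|| self.build_trace(input)).len()
    }

    /// Reuses the cached trace if present instead of rebuilding it.
    pub fn generate_trace(&self, input: &[u32]) -> Vec<u32> {
        self.cached.get_or_init(|| self.build_trace(input)).clone()
    }
}
```

Calling `generate_dependencies` and then `generate_trace` on the same chip leaves `builds` at 1 in this sketch. How the real chip keys and stores its cache is not shown here.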

Correctness

A single current_skip tracks the in-progress invocation, resolved on range-exit / fresh-start / success / shard-end, so loops and aborts stay correct — an aborted block's skipped prefix is replayed as software (rollback), a successful block is regenerated by the chip (B1).
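The resolution rules above can be sketched as a small state machine. This is a simplified model under stated assumptions, not the TracingVM code: there is a single APC block covering `len` consecutive 4-byte instructions from `start`, the trace is a plain list of program counters, and the `Emitted` enum and `trace_shard` function are hypothetical names.

```rust
/// What the sketch tracer emits; both variants are illustrative.
#[derive(Debug, PartialEq, Eq)]
pub enum Emitted {
    /// Ordinary per-opcode event for the instruction at this pc.
    Software(u32),
    /// One compact record for a fully executed block starting at this pc.
    ApcInvocation(u32),
}

/// Walks `pcs` for one shard, skipping events inside the block
/// [start, start + 4 * len) and resolving the in-progress skip on
/// range-exit, fresh-start, success, or shard-end. Assumes `len >= 1`.
pub fn trace_shard(pcs: &[u32], start: u32, len: u32) -> Vec<Emitted> {
    let mut out = Vec::new();
    // At most one invocation is ever in progress.
    let mut current_skip: Option<Vec<u32>> = None;

    for &pc in pcs {
        let expected = current_skip.as_ref().map(|s| start + 4 * s.len() as u32);
        if expected != Some(pc) {
            // Range-exit or fresh-start: roll back the skipped prefix as software.
            if let Some(skipped) = current_skip.take() {
                out.extend(skipped.into_iter().map(Emitted::Software));
            }
            if pc != start {
                out.push(Emitted::Software(pc));
                continue;
            }
            current_skip = Some(Vec::new());
        }
        let skip = current_skip.as_mut().expect("skip is in progress here");
        skip.push(pc);
        if skip.len() as u32 == len {
            // Success: the whole block ran, so the chip regenerates it.
            current_skip = None;
            out.push(Emitted::ApcInvocation(start));
        }
    }
    // Shard-end: an unfinished block is replayed as software.
    if let Some(skipped) = current_skip.take() {
        out.extend(skipped.into_iter().map(Emitted::Software));
    }
    out
}
```

With `start = 100` and `len = 3`, the trace `[100, 104, 200]` aborts and yields three software events, while `[100, 104, 100, 104, 108]` rolls back the first attempt and records one invocation for the second.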

Verification (RSP, real rsp-client program, block 21740136, APC=12, cuda)

Measured against this same #2781 head with the ApcEvents path (APC=12), 2 clean unloaded runs each:

| metric | #2781 baseline (ApcEvents) | this PR (record-in-chip) |
| --- | --- | --- |
| core prove | 49.2 / 53.0 s | 41.5 / 43.1 s (~−17%) |
| peak host RAM | 23.98 / 23.28 GB | 24.16 / 22.08 GB (comparable) |
| result | rc=0 | rc=0, no CumulativeSumMismatch |
  • The speedup is in prove, where shard records + traces are generated — that's what record-in-chip changes (record-gen + trace-gen; the GPU proving portion is unchanged). execute time is not affected: client.execute only produces the cycle/gas ExecutionReport via MinimalExecutorRunner/GasEstimatingVM — no shard records, no APC witnesses — so it's identical between the two paths (any delta there is run-to-run variance).
  • Abort rate ~6.6% of APC invocations (aborted blocks fall back to software; in line with the ~10% seen at APC=88).
  • proof time is wall-clock (tokio::Instant) around client.prove_with_mode; single-sample, so expect some run-to-run GPU variance.
  • cargo fmt + cargo clippy -D warnings clean on the touched crates.

Commits (5, dependency order)

  1. feat(apc): record-in-chip data model — ReplayTrace, ApcInvocation
  2. feat(executor): capture record-in-chip APC invocations during tracing
  3. feat(machine): regenerate APC chip trace from record-in-chip invocations
  4. perf(hypercube): parallelize APC-chip generate_dependencies
  5. test(apc): record-in-chip prove tests (APC unit + RSP core e2e)

🤖 Generated with Claude Code

qwang98 and others added 5 commits July 2, 2026 12:48
Add the minimal per-invocation capture types that let the APC chip
regenerate its witness by re-executing a block instead of materializing
its events during tracing: sp1_jit::ReplayTrace (zero-copy replay over a
shared read-oracle Arc), ApcInvocation/ApcInvocations (lean store with
Arc<[MemValue]> + offsets), and the program's apc_indices_by_start_idx /
start_pc_idx / num_cycles accessors. Removes the now-unused ApcEvents
(events/apc.rs).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
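As a rough illustration of the "shared read-oracle Arc plus offsets" layout described in this commit, the sketch below stores every invocation's reads in one buffer and hands out views that only bump a reference count. All names here (`InvocationBuilder`, `Invocations`, `ReplayView`) and the `MemValue = u64` alias are hypothetical stand-ins, not the actual sp1_jit or APC types.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the real memory value type.
pub type MemValue = u64;

/// Accumulates reads for all invocations of a shard into one flat buffer.
#[derive(Default)]
pub struct InvocationBuilder {
    reads: Vec<MemValue>,
    /// Start index of each invocation's reads within `reads`.
    offsets: Vec<usize>,
}

impl InvocationBuilder {
    pub fn push_invocation(&mut self, reads: &[MemValue]) {
        self.offsets.push(self.reads.len());
        self.reads.extend_from_slice(reads);
    }

    /// Freezes the buffer into a shared slice (one move into the Arc).
    pub fn finish(self) -> Invocations {
        Invocations { reads: Arc::from(self.reads), offsets: self.offsets }
    }
}

/// Lean store: one shared buffer plus per-invocation offsets.
pub struct Invocations {
    reads: Arc<[MemValue]>,
    offsets: Vec<usize>,
}

impl Invocations {
    /// Returns a replay cursor over invocation `i` without copying its reads.
    pub fn replay(&self, i: usize) -> ReplayView {
        let start = self.offsets[i];
        let end = self.offsets.get(i + 1).copied().unwrap_or(self.reads.len());
        ReplayView { reads: Arc::clone(&self.reads), pos: start, end }
    }
}

/// Cursor that serves recorded reads in order during re-execution.
pub struct ReplayView {
    reads: Arc<[MemValue]>,
    pos: usize,
    end: usize,
}

impl ReplayView {
    pub fn next_read(&mut self) -> Option<MemValue> {
        if self.pos < self.end {
            let value = self.reads[self.pos];
            self.pos += 1;
            Some(value)
        } else {
            None
        }
    }

    /// True when both views point at the same underlying allocation.
    pub fn shares_buffer_with(&self, other: &ReplayView) -> bool {
        Arc::ptr_eq(&self.reads, &other.reads)
    }
}
```

The design choice illustrated is that each view costs one `Arc::clone` and two indices, so the per-invocation capture stays small regardless of how many reads the block performs.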
TracingVM skips per-opcode event emission inside APC ranges and captures
one ApcInvocation per block. A single current_skip tracks the in-progress
invocation, resolved on range-exit / fresh-start / success / shard-end so
loops and aborts stay correct (aborted blocks are replayed as software).
CoreExecutionState borrows registers; a lightweight untracked replay VM
regenerates blocks. ExecutionRecordSnapshot is trimmed to the three fields
apply_calls/capture actually read.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ApcChip replays each captured ApcInvocation to synthesize its per-opcode
witness, caching the generated trace in generate_dependencies for reuse in
generate_trace (no ApcEvents extraction, no double generation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
APC chips are pure byte-lookup producers with commutative, independent
output, so run them in parallel and merge before the (order-independent)
rest. This is the dominant generate_dependencies cost under record-in-chip.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
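Why commutativity makes this parallelization safe can be shown with a small sketch. It assumes each chip's output is a map from byte value to lookup multiplicity, and it uses `std::thread::scope` purely for illustration; the PR does not say which threading mechanism the real code uses, and `chip_lookups`, `merge`, and `parallel_dependencies` are hypothetical names.

```rust
use std::collections::HashMap;
use std::thread;

/// Byte value -> number of lookups requested (illustrative representation).
pub type LookupCounts = HashMap<u8, u64>;

/// Stand-in for one chip's dependency generation: count the bytes it looks up.
pub fn chip_lookups(bytes: &[u8]) -> LookupCounts {
    let mut counts = HashMap::new();
    for &b in bytes {
        *counts.entry(b).or_insert(0) += 1;
    }
    counts
}

/// Merging adds multiplicities, so the result does not depend on merge order.
pub fn merge(mut into: LookupCounts, from: LookupCounts) -> LookupCounts {
    for (byte, n) in from {
        *into.entry(byte).or_insert(0) += n;
    }
    into
}

/// Runs every chip on its own scoped thread, then merges the partial results.
pub fn parallel_dependencies(chips: &[Vec<u8>]) -> LookupCounts {
    let partials: Vec<LookupCounts> = thread::scope(|s| {
        let handles: Vec<_> = chips
            .iter()
            .map(|chip| s.spawn(move || chip_lookups(chip)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("chip worker panicked"))
            .collect()
    });
    partials.into_iter().fold(LookupCounts::new(), merge)
}
```

Because addition of multiplicities is commutative and associative, reordering the chips yields the same merged map, which is the property the commit message relies on.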
test_add_apc_prove exercises the capture/replay path with overlapping-APC
skip; test_apc_core_rsp validates record-in-chip end-to-end on the real RSP
program at APC=12 via the shared test_e2e helper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@qwang98 qwang98 changed the title perf(apc): record-in-chip APC witness generation perf(powdr): record-in-chip APC witness generation Jul 2, 2026