10x cheaper C2/PRU: CuZK Proving engine #1043

Draft
magik6k wants to merge 33 commits into main from feat/cuzk

Conversation

magik6k commented Feb 20, 2026

Summary

Integrate the cuzk persistent GPU SNARK proving daemon with Curio's task scheduler via gRPC. When enabled, Curio delegates PoRep C2, SnapDeals prove, and PSProve SNARK computations to the cuzk daemon instead of spawning per-proof child processes through ffiselect.

  • Add gRPC client (lib/cuzk/) and SealCalls methods (lib/ffi/cuzk_funcs.go) for PoRep and SnapDeals
  • Wire cuzk into PoRep, SnapDeals, and PSProve tasks with backpressure via GetStatus
  • Add make cuzk build target (NOT in default BINS — CI unaffected)
  • Vendor bellpepper-core and supraseal-c2 crate files so git clone && make cuzk works
  • Add user-facing documentation under documentation/en/experimental-features/

What is cuzk

cuzk is a persistent Rust daemon that keeps Groth16 SRS parameters (~47 GiB for 32 GiB PoRep) resident in CUDA-pinned host memory across proofs. The current ffiselect model spawns a fresh process per proof, loading the SRS from scratch each time (30-90s). cuzk eliminates this overhead entirely.
Beyond SRS residency, cuzk implements a 13-phase optimization pipeline that achieves 2.8x throughput over the ffiselect baseline (37.7s/proof vs ~89s on RTX 5070 Ti). The key architectural contributions are pipelined partition synthesis, dual-worker GPU interlock, PCIe transfer optimization, and a split async GPU proving API.

Architecture

Curio (Go)                          cuzk daemon (Rust/CUDA)
─────────────                       ──────────────────────
tasks/seal,snap,proofshare          persistent process
        │                                   │
    gRPC client ────── unix/TCP ──────► gRPC server
  (lib/cuzk/client.go)                     │
                                    ┌──────┴───────┐
                                    │   Scheduler   │
                                    │  (priority Q) │
                                    └──────┬───────┘
                                           │
                                    ┌──────┴───────┐
                                    │  GPU Workers  │
                                    │  (per device) │
                                    └──────────────┘

Vanilla proofs are generated locally in Curio (this requires the sector data on disk), then sent to cuzk for SNARK computation. The returned proof is verified locally before submission.

Pipelining

The cuzk engine pipelines work at three levels:

1. Partition-Level Synthesis → GPU Pipeline (Phase 7)

Instead of synthesizing all 10 partitions of a sector as a batch before any GPU work begins (the ffiselect model), cuzk spawns partition_workers concurrent synthesis tasks. Each produces a single partition's ProvingAssignment (~13.6 GiB) and sends it through a bounded channel to the GPU worker:

Synthesis Workers:  Part 0   Part 1   Part 2  ...
                       │        │        │
                       ▼        ▼        ▼
GPU Channel:        ─── P0 ── P1 ── P2 ───►
                                        │
GPU Worker:         Prove P0  Prove P1  Prove P2  ...

The GPU processes partition N while workers synthesize partition N+1..N+k. With partition_workers=10, all 10 partitions synthesize concurrently and the GPU is continuously fed.
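
A minimal sketch of this shape, assuming stand-in synthesize_partition/gpu_prove helpers (the real implementation lives in pipeline.rs and moves ~13.6 GiB ProvingAssignments, not byte vectors):

```rust
use std::sync::mpsc::sync_channel;

fn synthesize_partition(part: usize) -> Vec<u8> { vec![part as u8] } // stand-in
fn gpu_prove(part: usize, _assignment: &[u8]) { println!("proved partition {part}"); } // stand-in

fn prove_pipelined(num_partitions: usize, lookahead: usize) {
    // Bounded channel: senders block once `lookahead` partitions are
    // buffered, capping the live synthesis output.
    let (tx, rx) = sync_channel::<(usize, Vec<u8>)>(lookahead);

    std::thread::scope(|s| {
        for part in 0..num_partitions {
            let tx = tx.clone();
            s.spawn(move || {
                let assignment = synthesize_partition(part); // CPU-heavy
                tx.send((part, assignment)).unwrap();        // blocks when full
            });
        }
        drop(tx); // channel closes once all synthesis threads finish

        // GPU worker consumes partitions as they arrive.
        for (part, assignment) in rx {
            gpu_prove(part, &assignment);
        }
    });
}
```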

2. Dual-Worker GPU Interlock (Phase 8)

Inside generate_groth16_proofs_c(), each partition has ~1.3s of CPU preprocessing (pointer setup, bitmap population) before ~3.3s of CUDA kernels, followed by ~0.7s of CPU epilogue. Phase 8 narrows the C++ mutex to cover only the CUDA kernel region and runs two GPU workers per device:

Worker A:  ─ CPU prep ─══ CUDA ══─ epilogue ─
Worker B:              ─ CPU prep ─══ CUDA ══─ epilogue ─
GPU:                   ████ A ████ ████ B ████

This achieves 100% GPU utilization — zero idle gaps between partitions in steady state.
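
The interlock pattern itself is small; here is a runnable sketch with stand-in cpu_prep/cuda_kernels/epilogue functions, where a shared mutex brackets only the kernel region:

```rust
use std::sync::{Arc, Mutex};

fn cpu_prep(_job: u32) {}      // stand-in, ~1.3s in the real code
fn cuda_kernels(_job: u32) {}  // stand-in, ~3.3s, the only serialized part
fn epilogue(_job: u32) {}      // stand-in, ~0.7s

fn worker(gpu_lock: Arc<Mutex<()>>, jobs: Vec<u32>) {
    for job in jobs {
        cpu_prep(job); // runs outside the lock, overlaps the peer's kernels
        {
            let _gpu = gpu_lock.lock().unwrap();
            cuda_kernels(job);
        } // lock released here, before the epilogue
        epilogue(job);
    }
}

fn main() {
    let lock = Arc::new(Mutex::new(()));
    let (a, b) = (lock.clone(), lock.clone());
    let ta = std::thread::spawn(move || worker(a, vec![0, 2, 4]));
    let tb = std::thread::spawn(move || worker(b, vec![1, 3, 5]));
    ta.join().unwrap();
    tb.join().unwrap();
}
```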

3. Split Async GPU API (Phase 12)

Phase 12 splits the monolithic C++ prove call into prove_start() (GPU kernels) + finalize() (b_g2_msm CPU + proof assembly). The GPU worker releases the lock ~1.7s earlier per partition, immediately picking up the next synthesized partition while b_g2_msm runs in a spawned finalizer task:

GPU Worker: ══ CUDA ══  ══ CUDA ══  ══ CUDA ══
Finalizer:          b_g2_msm    b_g2_msm    b_g2_msm
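
The worker-loop shape looks roughly like the following sketch, with illustrative prove_start/finalize stand-ins for the real FFI wrappers:

```rust
struct Partition;
struct Pending; // stand-in for the C++ pending-proof handle
fn prove_start(_p: Partition) -> Pending { Pending } // GPU kernels, then lock release
fn finalize(_h: Pending) {}                          // b_g2_msm + proof assembly

async fn gpu_worker(mut rx: tokio::sync::mpsc::Receiver<Partition>) {
    while let Some(part) = rx.recv().await {
        // GPU phase runs on a blocking thread; returns as soon as the
        // GPU lock is released.
        let pending = tokio::task::spawn_blocking(move || prove_start(part))
            .await
            .unwrap();
        // Finalization is detached: the loop immediately picks up the
        // next synthesized partition.
        tokio::task::spawn_blocking(move || finalize(pending));
    }
}
```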

Memory Management

SRS Residency

The daemon pre-populates GROTH_PARAM_MEMORY_CACHE at startup. Since the process is long-lived, the ~47 GiB SRS stays pinned in CUDA host memory across all proofs. No per-proof disk I/O.

Per-Partition Working Set

Memory is proportional to partition_workers, not total partitions:

Pipeline stage    | Per-partition memory | Notes
----------------- | -------------------- | -----
During synthesis  | ~16 GiB              | 12 GiB a/b/c + 4 GiB aux
After prove_start | ~4 GiB               | a/b/c freed immediately; only aux + density remain
Pending finalize  | ~4 GiB               | Held by finalizer task

The formula: Peak RSS ≈ 69 + (partition_workers × 20) GiB.

Validated configurations:
  • 128 GiB system: pw=2, gw=1 → 110 GiB peak, 152s/proof
  • 256 GiB system: pw=7, gw=1 → 208 GiB peak, 53s/proof
  • 512 GiB system: pw=12, gw=2 → 400 GiB peak, 37.7s/proof
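
(As a check of the formula against the measured numbers: pw=2 predicts 69 + 2×20 = 109 GiB versus the measured 110 GiB, and pw=7 predicts 209 GiB versus the measured 208 GiB.)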

Backpressure

Three mechanisms prevent OOM at high concurrency:

  1. Early a/b/c free: prove_start() clears 12 GiB/partition immediately after GPU upload
  2. Channel capacity auto-scaling: bounded to max(synthesis_lookahead, partition_workers)
  3. Partition semaphore held through send: limits total in-flight synthesis outputs
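
A minimal sketch of mechanisms 2 and 3, with stand-in synthesize/Output types: the semaphore permit is held until the channel send succeeds, so at most partition_workers synthesis outputs are ever in flight.

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

struct Output; // stand-in for a synthesized partition
fn synthesize(_part: usize) -> Output { Output } // stand-in

async fn dispatch(parts: usize, pw: usize, lookahead: usize) {
    let cap = lookahead.max(pw); // channel capacity auto-scaling
    let (tx, mut rx) = mpsc::channel::<Output>(cap);
    let sem = Arc::new(Semaphore::new(pw));

    for part in 0..parts {
        let (tx, sem) = (tx.clone(), sem.clone());
        tokio::spawn(async move {
            let permit = sem.acquire_owned().await.unwrap();
            let out = tokio::task::spawn_blocking(move || synthesize(part))
                .await
                .unwrap();
            tx.send(out).await.unwrap(); // non-blocking while capacity >= pw
            drop(permit); // released only after the send lands in the buffer
        });
    }
    drop(tx);

    while let Some(_out) = rx.recv().await { /* GPU worker consumes here */ }
}
```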

CPU Locking / GPU Mutex

The C++ generate_groth16_proofs_c() originally used a static std::mutex that serialized the entire function. cuzk introduces:

  • Heap-allocated mutex (create_gpu_mutex() / destroy_gpu_mutex() FFI): one per physical GPU, managed by the engine. Passed through FFI as *mut c_void.
  • Narrowed scope: acquired before the per-GPU CUDA kernel launch, released after kernels complete but before prep_msm_thread.join() — b_g2_msm and proof assembly run outside the lock.
  • Backward compatible: if gpu_mtx is null, falls back to the function-local static mutex (for non-engine callers).

The dual-worker interlock (2 workers per GPU) alternates lock acquisition so Worker B's CPU prep runs while Worker A holds the lock for CUDA kernels, and vice versa.
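
The Rust-side shape of that FFI surface is roughly the following sketch; the exact signatures in the supraseal-c2 bindings may differ:

```rust
use std::ffi::c_void;

extern "C" {
    fn create_gpu_mutex() -> *mut c_void;   // heap-allocates a C++ std::mutex
    fn destroy_gpu_mutex(mtx: *mut c_void); // frees it
    // The proving entry point also takes the mutex pointer; passing null
    // selects the legacy function-local static mutex fallback.
}

// Mirrors the SendableGpuMutex idea: the raw pointer is shared across
// worker tasks, so it needs an explicit Send marker.
struct GpuMutex(*mut c_void);
unsafe impl Send for GpuMutex {}

impl GpuMutex {
    fn new() -> Self { Self(unsafe { create_gpu_mutex() }) }
}
impl Drop for GpuMutex {
    fn drop(&mut self) { unsafe { destroy_gpu_mutex(self.0) } }
}
```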

Task Integration Details

When [Cuzk] Address is set in Curio config:

Behavior      | Change
------------- | ------
TypeDetails() | GPU requirement zeroed, RAM set to 1 GiB (resource decisions delegated to cuzk)
CanAccept()   | Queries GetStatus → rejects if totalPending >= MaxPending
Do()          | Generates vanilla proof locally → sends to cuzk via Prove RPC → verifies returned proof locally

When Address is empty (default), all tasks behave exactly as before. No behavioral change for existing deployments.

Build

make cuzk          # builds extern/cuzk → ./cuzk binary (~1m51s from scratch)
make install-cuzk  # installs to /usr/local/bin
make clean         # includes cargo clean in extern/cuzk

make cuzk is NOT in the default BINS or BUILD_DEPS targets, so CI (which has no CUDA) is unaffected. Requires nvcc and cargo.

Files Changed

New files:

  • lib/cuzk/client.go — gRPC client wrapper (connect, Prove, GetStatus, HasCapacity)
  • lib/cuzk/proving.pb.go, proving_grpc.pb.go — generated protobuf/gRPC stubs
  • lib/ffi/cuzk_funcs.go — PoRepSnarkCuzk, ProveUpdateCuzk on SealCalls
  • documentation/en/experimental-features/cuzk-proving-daemon.md — user guide

Modified files:

  • deps/config/types.go — CuzkConfig struct + defaults
  • cmd/curio/tasks/tasks.go — creates cuzk.Client, passes to task constructors
  • tasks/seal/task_porep.go — cuzkClient field, Do/CanAccept/TypeDetails branches
  • tasks/snap/task_prove.go — same pattern
  • tasks/proofshare/task_prove.go — same + threaded through computeProof→computePoRep/computeSnap
  • Makefile — cuzk build/install/clean targets
  • .gitignore — /cuzk binary

Vendored crate files (for git clone && make cuzk):

  • extern/bellpepper-core/ — 13 files (full crate: Cargo.toml, src/, licenses)
  • extern/supraseal-c2/ — 8 files (Cargo.toml, build.rs, Cargo.lock, tests)

Implement the cuzk proving engine as a Rust workspace in extern/cuzk/
with 5 crates (proto, core, server, daemon, bench) and full gRPC API.

Phase 0 delivers:
- gRPC daemon (TCP + Unix socket) with 8 RPC endpoints
- Real PoRep C2 proving via filecoin-proofs-api + SupraSeal CUDA backend
- SRS parameter residency via GROTH_PARAM_MEMORY_CACHE (lazy populate)
- Priority scheduler with binary heap queue
- Prometheus metrics endpoint
- Bench tool for single proof submission, status, preload, metrics

E2E validated: Two consecutive 32GiB PoRep C2 proofs on RTX 5070 Ti —
116.8s cold (SRS from disk) → 92.8s warm (SRS cached), 20.5% improvement.
Both produced valid 1920-byte Groth16 proofs.
…f fix

Improve the cuzk daemon's debuggability and operational readiness
for Phase 1 multi-GPU work:

Observability:
- Add tracing spans (info_span) with job_id correlation throughout
  prover and engine; upstream filecoin-proofs logs now tagged per-job
- Split timing into deserialize vs proving (monolithic in Phase 0)
- Per proof-kind Prometheus counters and duration summaries
- GPU detection via nvidia-smi in GetStatus RPC (name, VRAM)
- Running job info shown in status and annotated on GPU

Correctness:
- Fix AwaitProof to register late listeners (was broken, always 404)
- Graceful shutdown via watch channel (drain, finish current proof)
- Per-kind completed/failed counters with ring buffer for durations

Tooling:
- Add 'batch' command to cuzk-bench (sequential + concurrent modes,
  throughput stats with avg/min/max/proofs-per-min)
- Refactor bench client connection into shared connect() helper
- Add cuzk.example.toml with documented configuration

E2E validated: 32GiB PoRep C2 proof completes in ~110s with full
job_id-correlated logging and per-kind metrics.
…heduling

Wire up WinningPoSt, WindowPoSt, and SnapDeals provers via filecoin-proofs-api:
- prove_winning_post: generate_winning_post_with_vanilla
- prove_window_post: generate_single_window_post_with_vanilla (per-partition)
- prove_snap_deals: generate_empty_sector_update_proof_with_vanilla

Multi-GPU worker pool:
- Auto-detect GPUs via nvidia-smi or use config gpus.devices list
- Spawn one async worker loop per GPU with CUDA_VISIBLE_DEVICES isolation
- Per-worker SRS affinity tracking (last_circuit_id for future routing)

Proto/API updates:
- Add repeated bytes vanilla_proofs field for PoSt/SnapDeals multi-proof inputs
- Rename SnapDeals fields to comm_r_old/comm_r_new/comm_d_new (raw 32-byte)
- Registered proof type enum conversion (FFI V1_1 ↔ proofs-api V1_2 mapping)

Bench tool updated:
- Supports all proof types with --vanilla (JSON array of base64 proofs)
- New flags: --registered-proof, --randomness, --comm-r-old/new, --comm-d-new

8 unit tests pass, 0 warnings, clean cargo check --no-default-features.
…napDeals

Add gen-vanilla subcommand to cuzk-bench for generating vanilla proof test
data from existing sealed sector data. This completes Phase 1 by enabling
end-to-end testing of all four proof types (WinningPoSt, WindowPoSt,
SnapDeals) without requiring Go/Curio.

Three sub-subcommands:
- winning-post: challenge selection + Merkle inclusion proofs (66 challenges)
- window-post: fallback challenges + vanilla proofs (10 challenges)
- snap-prove: partition proofs from original + updated sector data (16 partitions)

Key implementation details:
- filecoin-proofs-api added as optional dep behind 'gen-vanilla' feature flag
- CID commitment parsing via cid crate (bagboea4b5abc... → [u8;32])
- commdr.txt file format parsing (d:<CID> r:<CID>)
- Output format: JSON array of base64 strings (matches Go json.Marshal([][]byte))
- CPU-only, no GPU required (--no-default-features --features gen-vanilla)

Validated against /data/32gbench/ golden data:
- WinningPoSt: 164KB vanilla proof, 218KB JSON output
- WindowPoSt: 25KB vanilla proof, 33KB JSON output
- SnapDeals: 16 × 562KB partition proofs, 12MB JSON output

5 new unit tests (CID parsing, commdr format, JSON round-trip).
Fork bellperson 0.26.0 into extern/bellperson/ with minimal changes to
expose the synthesis/GPU split point for pipelined proving:

bellperson changes (3 files, ~130 lines changed):
- prover/mod.rs: Make ProvingAssignment struct and all fields pub
- prover/supraseal.rs: Make synthesize_circuits_batch() pub, add new
  prove_from_assignments() function (extracted GPU-phase code)
- groth16/mod.rs: Re-export ProvingAssignment, synthesize_circuits_batch,
  prove_from_assignments under cuda-supraseal feature

The internal two-phase architecture was already clean — synthesis runs
circuit.synthesize() on CPU (rayon parallel), producing ProvingAssignment
with a/b/c evaluation vectors + density trackers. GPU phase packs these
into raw pointer arrays and calls supraseal_c2::generate_groth16_proof().
We simply expose both phases as separate public functions.
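
A runnable toy illustrating just the call shape; the types and bodies below are stand-ins, not the real bellperson signatures (which are generic over the curve and carry circuits, SRS parameters, and randomness):

```rust
// Toy stand-in: the real ProvingAssignment holds a/b/c evaluation
// vectors plus density trackers.
struct ProvingAssignment {
    a: Vec<u64>,
}

fn synthesize_circuits_batch(n: usize) -> Vec<ProvingAssignment> {
    // CPU phase: bellperson runs circuit.synthesize() here, rayon-parallel.
    (0..n).map(|i| ProvingAssignment { a: vec![i as u64] }).collect()
}

fn prove_from_assignments(assignments: Vec<ProvingAssignment>) -> Vec<Vec<u8>> {
    // GPU phase: bellperson packs raw pointer arrays and calls
    // supraseal_c2::generate_groth16_proof().
    assignments
        .into_iter()
        .map(|p| p.a.iter().map(|&x| x as u8).collect())
        .collect()
}

fn main() {
    let assignments = synthesize_circuits_batch(10); // one per partition
    let proofs = prove_from_assignments(assignments);
    assert_eq!(proofs.len(), 10);
}
```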

cuzk workspace changes:
- Cargo.toml: Add [patch.crates-io] for bellperson fork, add bellperson
  as workspace dependency
- Cargo.lock: Updated to use local bellperson

Also includes cuzk-phase2-design.md with complete Phase 2 design:
- Per-partition pipeline strategy (13.6 GiB intermediate state instead of
  136 GiB for all 10 partitions)
- Memory budget analysis for 128 GiB vs 256 GiB machines
- SRS manager design using SuprasealParameters directly
- 7-step implementation plan
- Call chain comparison (Phase 1 monolithic vs Phase 2 pipelined)

All 8 existing cuzk tests pass. Zero new warnings from our changes.
Implement the core Phase 2 infrastructure: split monolithic seal_commit_phase2()
into separate CPU synthesis and GPU proving phases, connected via a pipeline.

New modules:
- srs_manager.rs: Direct SRS loading via SuprasealParameters (bypasses
  GROTH_PARAM_MEMORY_CACHE). CircuitId enum maps proof types to exact
  .params filenames. Supports preload, evict, memory budget tracking.

- pipeline.rs: Per-partition pipelined PoRep C2 proving. Each of the 10
  partitions is synthesized individually (~13.6 GiB intermediate state vs
  ~136 GiB for all 10 at once), then proven on GPU via bellperson's split
  API (synthesize_circuits_batch → prove_from_assignments).
  Enables PoRep pipelining on 128 GiB machines.

Engine changes:
- Engine now supports pipeline.enabled config flag
- When enabled, PoRep C2 jobs use pipelined prover with SrsManager
- When disabled, falls back to Phase 1 monolithic prover
- SRS preloading uses SrsManager in pipeline mode

Config additions:
- [pipeline] section: enabled, synthesis_lookahead
- synthesis_lookahead controls backpressure (partitions buffered)

Dependencies:
- Added direct deps on filecoin-proofs, storage-proofs-{core,porep,post,update},
  bellperson (fork), blstrs, ff, rayon, rand_core, filecoin-hashers
- Correct feature flag propagation (cuda-supraseal for core+bellperson,
  cuda for porep/post/update which lack cuda-supraseal)

Tests: 15 pass (12 existing + 3 new), 0 warnings from cuzk code.
Compiles with --no-default-features (no GPU required for check builds).
Rewrite pipeline.rs to use batch synthesis (all 10 PoRep partitions in
one rayon-parallel call + single GPU pass) instead of per-partition
sequential mode. This matches monolithic performance (~91s vs ~93s)
while enabling cross-proof overlap in the next step.

Add pipelined synthesis/prove functions for all 4 proof types:
- PoRep C2: batch mode (synthesize_porep_c2_batch + gpu_prove)
- WinningPoSt: inlined circuit construction (no private API needed)
- WindowPoSt: single-partition inlined circuit construction
- SnapDeals: all-partition circuit construction

Other changes:
- engine.rs: route all proof types through pipeline when enabled
- prover.rs: make 4 helper functions pub for pipeline.rs use
- Add bincode dep for PoSt/SnapDeals vanilla proof deserialization
Restructure the engine to use a two-stage pipeline architecture when
pipeline mode is enabled:

  Stage 1 (synthesis task): Pulls requests from the scheduler, runs
  CPU-bound circuit synthesis on a blocking thread, pushes the
  SynthesizedJob (intermediate state + SRS ref) to a bounded channel.

  Stage 2 (GPU workers): One per GPU, pull SynthesizedJob from the
  shared channel, run gpu_prove on a blocking thread pinned to their
  GPU via CUDA_VISIBLE_DEVICES, complete the job.

The bounded channel (capacity = synthesis_lookahead config, default 1)
provides backpressure: when GPU workers are busy and the channel is
full, the synthesis task blocks — preventing OOM from unbounded
pre-synthesized proofs.
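
A minimal sketch of this two-stage shape, with stand-in types and a stand-in pin_to_gpu helper (the real device pinning and job types live in engine.rs):

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};

struct SynthesizedJob { id: u64 }
fn synthesize(id: u64) -> SynthesizedJob { SynthesizedJob { id } } // stand-in
fn pin_to_gpu(_dev: &str) {} // stand-in for CUDA_VISIBLE_DEVICES isolation
fn gpu_prove(job: SynthesizedJob) { let _ = job.id; } // stand-in

async fn run(gpus: Vec<String>, lookahead: usize, jobs: u64) {
    let (tx, rx) = mpsc::channel::<SynthesizedJob>(lookahead);
    let rx = Arc::new(Mutex::new(rx)); // GPU workers share one receiver

    // Stage 2: one worker per GPU.
    for dev in gpus {
        let rx = rx.clone();
        tokio::spawn(async move {
            loop {
                let job = rx.lock().await.recv().await;
                let Some(job) = job else { break };
                let dev = dev.clone();
                tokio::task::spawn_blocking(move || {
                    pin_to_gpu(&dev);
                    gpu_prove(job);
                })
                .await
                .unwrap();
            }
        });
    }

    // Stage 1: synthesize on blocking threads; send blocks when the
    // bounded channel is full (backpressure).
    for id in 0..jobs {
        let job = tokio::task::spawn_blocking(move || synthesize(id))
            .await
            .unwrap();
        tx.send(job).await.unwrap();
    }
    drop(tx); // workers drain the channel and exit
}
```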

For PoRep 32G under continuous load, this enables:
  synth(N) | GPU(N) + synth(N+1) | GPU(N+1) + synth(N+2) | ...
  Steady-state: ~55s/proof (synthesis-bound) vs ~91s sequential

When pipeline.enabled = false, falls back to Phase 1 monolithic
workers (no overlap, full cycle per GPU worker).

Also updates the example config with improved pipeline documentation.
Add batch collector and multi-sector synthesis to the pipeline engine.
When max_batch_size > 1, same-type PoRep requests are accumulated and
processed as a single combined synthesis + GPU proving pass, amortizing
fixed GPU costs and improving SM utilization.

New files:
- batch_collector.rs: Accumulates same-circuit-type proof requests,
  flushes on max_batch_size or max_batch_wait_ms timeout. PoRep and
  SnapDeals are batchable; PoSt types bypass the collector entirely.

Pipeline changes:
- synthesize_porep_c2_multi(): Takes N sectors' C1 outputs, builds all
  N×10 partition circuits, synthesizes in one batch call. Returns
  combined SynthesizedProof + sector_boundaries for splitting results.
- split_batched_proofs(): Splits concatenated GPU output back into
  per-sector proof byte vectors using sector_boundaries.

Engine changes:
- Synthesis task now uses BatchCollector for batchable proof types.
  Races scheduler delivery against batch timeout. Non-batchable types
  (WinningPost, WindowPost) preempt-flush any pending batch and process
  immediately.
- SynthesizedJob extended with batch_requests and sector_boundaries.
- GPU worker handles batched results: splits proof output, notifies
  each sector's individual caller with its own proof bytes and timings.

Config:
- scheduler.max_batch_size controls batch limit (1=disabled, 2-3 typical)
- scheduler.max_batch_wait_ms controls accumulation window

Backward compatible: max_batch_size=1 (default) preserves Phase 2
single-sector behavior exactly. All 25 tests pass, 0 cuzk warnings.
…oughput

All Phase 3 E2E tests pass on RTX 5070 Ti:
- Timeout flush: BatchCollector correctly flushes after 30s wait
- Batch=2: 2 sectors synthesized as 20 circuits in 55s (same as 10),
  GPU 69s, yielding 62.7s/proof (1.42x vs baseline 89s)
- Overflow: 3 proofs with batch=2 shows correct batch+overflow+pipeline
- Non-batchable: WinningPoSt bypasses BatchCollector (0.8s total)

Memory: batch=2 peaks at 360 GiB (vs 203 GiB for single proof).
Updated roadmap table with measured numbers.
Synthesis optimizations (55.4s → 50.9s, -8.3%):
- Boolean::add_to_lc/sub_from_lc: eliminate temporary LC allocations in
  circuit gadget hot paths (Boolean::lc creates a fresh Vec on every call;
  the new methods append directly to an existing LC)
- Patched: UInt32::addmany, Num::add_bool_with_coeff, Boolean::enforce_equal,
  Boolean::sha256_ch, Boolean::sha256_maj, lookup3_xy,
  lookup3_xy_with_conditional_negation
- Vec recycling pool in ProvingAssignment::enforce for the 6 LC buffers
- Software prefetch in eval_with_trackers and LinearCombination::eval
- perf stat: 91B fewer instructions (-15.3%), 18.6B fewer branches (-26.7%)

GPU async deallocation (36s → 26s bellperson wrapper, -10s):
- Root cause: ~37 GB of C++ vectors (split_vectors, tail_msm_bases) and
  ~130 GB of Rust Vecs (ProvingAssignment a/b/c) freed synchronously in
  destructors after GPU proving, blocking return for ~10s of munmap() calls
- C++ fix: move split_vectors + tail_msm bases into detached std::thread
- Rust fix: spawn thread to drop provers/input_assignments/aux_assignments
- CUDA internal timing unchanged (~26s); overhead was pure deallocation
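
The Rust side of the fix reduces to a one-liner worth spelling out; a sketch under the assumption that the buffers are plain owned values:

```rust
// Move ownership of the large buffers into a detached thread so their
// munmap() cost happens off the proving return path.
fn drop_in_background<T: Send + 'static>(buffers: T) {
    std::thread::spawn(move || drop(buffers));
}
```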

Also: A4 (parallel B_G2 CPU MSM), D4 (per-MSM window objects),
CUDA timing instrumentation, synth-only microbenchmark tool.

E2E 32 GiB PoRep C2 on RTX 5070 Ti: 88.9s → 77.2s (-13.2%)
Pre-allocate ProvingAssignment Vecs (a, b, c, aux_assignment) to their
final capacity using hints cached from the first synthesis. Eliminates
~27 reallocation cycles per Vec per circuit.

Benchmarked: no measurable impact on 32 GiB PoRep C2 (50.65s with and
without hints). Rust's geometric doubling amortizes well at our scale,
and the ~265 GB of theoretical redundant copies are overlapped with
computation across 10 parallel circuits on 96 cores. Kept as defensive
code for memory-constrained environments.
Replace full circuit synthesis (alloc+enforce) with two-phase approach:
1. WitnessCS: witness-only generation (enforce is no-op)
2. CSR MatVec: pre-compiled sparse matrix × witness vector

New cuzk-pce crate with:
- RecordingCS: captures R1CS structure into CSR format (with tagged
  column encoding to handle interleaved alloc_input/enforce)
- CsrMatrix/PreCompiledCircuit: serializable CSR storage
- spmv_parallel: row-parallel sparse MatVec with rayon
- evaluate_pce: builds witness vector, evaluates A*w, B*w, C*w
- PreComputedDensity: density bitmaps extracted from CSR structure
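
A self-contained sketch of a row-parallel CSR sparse matrix-vector product in the spirit of spmv_parallel (u64 arithmetic stands in for the real field elements):

```rust
use rayon::prelude::*;

struct CsrMatrix {
    row_ptr: Vec<usize>, // len = rows + 1
    col_idx: Vec<u32>,
    vals: Vec<u64>,
}

fn spmv(m: &CsrMatrix, w: &[u64], out: &mut [u64]) {
    // Each row is independent, so rows parallelize cleanly across the pool.
    out.par_iter_mut().enumerate().for_each(|(row, o)| {
        let (start, end) = (m.row_ptr[row], m.row_ptr[row + 1]);
        *o = (start..end).fold(0u64, |acc, i| {
            acc.wrapping_add(m.vals[i].wrapping_mul(w[m.col_idx[i] as usize]))
        });
    });
}
```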

Pipeline integration:
- synthesize_auto() dispatcher: PCE fast path when cached, old path otherwise
- Static OnceLock caches per circuit type (porep-32g, winning-post, etc.)
- ProvingAssignment::from_pce() constructor in bellperson fork
- All 6 synthesis call sites switched to synthesize_auto()

Benchmark (pce-bench subcommand):
- Correctness: all 10 circuits × 130M constraints match bit-for-bit
- Baseline synthesis: 50.4s (10 circuits, old path)
- PCE synthesis:     35.5s (26.5s witness + 8.8s MatVec)
- Speedup:           1.42x
- PCE extraction:    46.9s (one-time cost, amortized over all future proofs)
- Peak RAM:          375 GB
Add PcePipeline subcommand to cuzk-bench for testing PCE memory behavior
under sequential and parallel pipelining modes:
- RSS tracking via /proc/self/status at each pipeline stage
- malloc_trim() between proofs for clean memory release
- Wave-based parallel execution using std::thread::scope (-j N flag)
- compare_old flag for A/B comparison in first iteration

Update cuzk-project.md with j=2 parallel pipeline benchmark results:
- 2 concurrent syntheses: 49s wall vs 71s sequential (1.45x wall speedup)
- Per-proof degradation: 46-49s (vs 35.5s j=1) due to BW contention
- Peak RSS: 407 GiB (2x working sets + PCE static + transient)
PCE disk persistence (raw binary format):
- New cuzk-pce::disk module with save_to_disk/load_from_disk
- Raw binary format (v2): 32-byte header + bulk byte dumps of CSR vectors
- 5.4x faster than bincode: 9.2s load vs 49.9s (from tmpfs, 25.7 GiB)
- Atomic writes (tmp + rename) to prevent corruption
- Header with magic/version/dimensions for quick validation
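
The atomic-write pattern is standard tmp-then-rename; a sketch with an illustrative save_atomic helper:

```rust
use std::{fs, io::Write, path::Path};

fn save_atomic(path: &Path, header: &[u8], body: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(header)?;
    f.write_all(body)?;
    f.sync_all()?; // flush before the rename makes it visible
    fs::rename(&tmp, path)?; // atomic on POSIX: readers see old or new file
    Ok(())
}
```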

Daemon integration:
- preload_pce_from_disk() called at engine startup (loads all PCE files)
- extract_and_cache_pce() now saves to disk after extraction
- Background PCE auto-extraction triggered after first old-path synthesis
- get_pce() made public for engine-level cache checking

Phase 6 design document (c2-optimization-proposal-6.md):
- Slotted partition pipeline: overlap synth/GPU at partition granularity
- slot_size=2 sweet spot: 41s latency (vs 69.5s batch), 54 GiB RAM (vs 136 GiB)
- Steady-state throughput unchanged (35.5s/proof, synthesis-bound)
- Multi-sector and multi-GPU extension paths documented

Measured (RTX 5070 Ti, 32 GiB PoRep):
- PCE save (NVMe): 22.3s, 1.2 GB/s
- PCE load (tmpfs): 9.2s, 3.0 GB/s
- PCE load (NVMe): ~13-15s estimated (3x faster than 47s extraction)
…esis

Redesign the slotted pipeline to truly pipeline partition synthesis with
GPU proving. All 10 partitions are synthesized in parallel (bounded by
channel capacity), and the GPU consumes them one at a time as they
arrive.

Key changes:
- prove_porep_c2_partitioned(): spawns one thread per partition via
  std::thread::scope, all run concurrently. Bounded sync_channel
  provides backpressure to limit live RAM.
- Each partition = 1 GPU call (num_circuits=1), which gives fast
  b_g2_msm (~0.4s multi-threaded vs ~23s for num_circuits>=2).
- ProofAssembler: indexed by partition number, supports out-of-order
  arrival, assembles in partition order.
- synthesize_partition(): single-partition synthesis helper.
- Backward-compatible prove_porep_c2_slotted() wrapper dispatches
  to partitioned path when slot_size < num_partitions.
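
A sketch of the out-of-order assembly keyed by partition index (illustrative types; the real assembler also carries timings and failure state):

```rust
struct ProofAssembler {
    slots: Vec<Option<Vec<u8>>>,
    remaining: usize,
}

impl ProofAssembler {
    fn new(partitions: usize) -> Self {
        Self { slots: vec![None; partitions], remaining: partitions }
    }

    /// Accepts partition proofs in any order; returns the concatenated
    /// multi-partition proof once the last one arrives.
    fn submit(&mut self, idx: usize, proof: Vec<u8>) -> Option<Vec<u8>> {
        if self.slots[idx].replace(proof).is_none() {
            self.remaining -= 1;
        }
        (self.remaining == 0).then(|| {
            self.slots
                .iter_mut()
                .flat_map(|s| s.take().unwrap()) // drain in partition order
                .collect()
        })
    }
}
```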

Benchmark results (32 GiB PoRep, 96-core Zen4, RTX 5070 Ti):
  max_concurrent=1: 72.0s, 71.3 GiB peak (5.42x overlap)
  max_concurrent=2: 72.7s, 86.8 GiB peak (5.38x overlap)
  max_concurrent=3: 71.9s, 86.8 GiB peak (5.37x overlap)
  batch-all:        62.3s, 228.5 GiB peak (no overlap)

Pipelined mode uses 3.2x less RAM (71 vs 228 GiB) with only ~16%
latency overhead. GPU takes ~3.8s/partition vs 25.5s batch-all total.
…ispatcher

Add timeline instrumentation for waterfall visualization of the proving
pipeline. Events (SYNTH_START/END, CHAN_SEND, GPU_PICKUP/START/END) are
emitted as CSV to stderr with millisecond offsets from engine start,
enabling precise analysis of GPU utilization and idle gaps.

Add synthesis_concurrency config parameter that controls how many proofs
can be synthesized simultaneously on the CPU. When synthesis takes longer
than GPU proving (39s vs 27s), the GPU idles ~12s between proofs with
sequential synthesis. With concurrency=2, overlapping syntheses can keep
the GPU continuously fed.

Implementation uses tokio::sync::Semaphore to limit concurrent synthesis
tasks. When concurrency=1 (default), behavior is identical to the old
sequential loop. When >1, each batch is spawned as an independent task
with semaphore-guarded concurrency.

Benchmark results (PoRep C2, 5-proof runs):
  concurrency=1: 45.3s/proof, 70.9% GPU utilization (baseline)
  concurrency=2, j=2: 42.2s/proof, 77.8% GPU utilization (+7%)
  concurrency=2, j=3: 43.1s/proof, 90.7% GPU utilization (+5%)
  concurrency=2, j=4: 60.2s/proof (CPU contention, regression)

CPU contention between synthesis (rayon) and b_g2_msm (rayon) during GPU
proving limits the improvement. Thread pool isolation is the next step.
Add configurable thread pool partitioning to reduce CPU contention when
running parallel synthesis alongside GPU proving.

Two independent thread pools compete for CPU cores during proving:
  1. Rayon global pool — used by synthesis (bellperson, PCE SpMV)
  2. C++ groth16_pool (sppark) — used by b_g2_msm and preprocessing

Changes:
- groth16_cuda.cu: Convert static groth16_pool to lazy initialization
  via std::call_once, reading CUZK_GPU_THREADS env var for pool size.
  This allows the Rust caller to set the env var before first GPU call.
- groth16_srs.cuh: Update all pool references to use get_groth16_pool()
- config.rs: Add gpus.gpu_threads field (default 0 = all CPUs)
- daemon main.rs: Configure rayon global pool from synthesis.threads,
  set CUZK_GPU_THREADS from gpus.gpu_threads before engine start
- Cargo.toml: Add rayon dependency to cuzk-daemon
- cuzk.example.toml: Document thread isolation strategy
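
A Rust analogue of the C++ std::call_once pattern described above, assuming the same CUZK_GPU_THREADS contract (0 or unset means all CPUs):

```rust
use std::sync::OnceLock;

static GPU_POOL_THREADS: OnceLock<usize> = OnceLock::new();

fn gpu_pool_threads() -> usize {
    // Read the env var exactly once, on first use of the pool.
    *GPU_POOL_THREADS.get_or_init(|| {
        std::env::var("CUZK_GPU_THREADS")
            .ok()
            .and_then(|v| v.parse::<usize>().ok())
            .filter(|&n| n > 0) // 0 means "use all CPUs"
            .unwrap_or_else(|| {
                std::thread::available_parallelism()
                    .map(|n| n.get())
                    .unwrap_or(1)
            })
    })
}
```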

Benchmark results (PoRep C2 32G, 96C/192T + RTX 5070 Ti):
  Baseline (sequential, no isolation):      46.1s/proof, 70.9% GPU util
  Parallel c=2, j=2, no isolation:          46.0s/proof, 81.9% GPU util
  Parallel c=2, j=2, rayon=192, gpu=32:     44.9s/proof, 76.9% GPU util
  Parallel c=2, j=3, rayon=192, gpu=32:     42.8s/proof (best, +7.2%)

Thread isolation provides modest improvement (~2-3%). The dominant factor
remains synthesis thread scalability: 2 syntheses sharing the rayon pool
each get ~96 effective threads, inflating synth from 39s to 45-47s.
Higher pipeline fill (j=3) is more effective than thread partitioning.
Proposal 7 replaces the thundering-herd synthesis pattern (all 10
partitions start/finish simultaneously) with a synth worker pool that
processes partitions individually and feeds them to the GPU one at a time.

Key design points:
- 20 synth workers (configurable) each synthesize 1 partition (~29s)
- Workers submit to engine GPU channel; block if full (backpressure)
- GPU proves each partition with num_circuits=1 (b_g2_msm: 0.4s vs 25s)
- ProofAssembler in JobTracker accumulates partitions per job_id
- Cross-sector overlap: next sector's synth starts on free workers

Expected impact: 42.8s/proof → ~30s/proof steady-state (GPU-limited),
~100% GPU utilization, zero inter-sector GPU idle time.

~110 net new lines of code, primarily in engine.rs.
Implement the Phase 7 architecture from c2-optimization-proposal-7.md:
dispatches individual PoRep partitions as independent work units through
the engine's synthesis→GPU pipeline, eliminating the thundering-herd
pattern and enabling cross-sector pipelining.

Key changes:
- SynthesizedJob: add partition_index, total_partitions, parent_job_id
  fields for per-partition routing
- PartitionedJobState: new struct tracking per-job ProofAssembler,
  accumulated timings, and failure state
- PartitionWorkItem: work unit for spawn_blocking synthesis workers
- JobTracker: add assemblers map for in-progress partitioned proofs
- process_batch(): new Phase 7 dispatch path when partition_workers > 0
  and single-sector PoRep C2 — parses C1 once, registers assembler,
  dispatches 10 spawn_blocking tasks gated by partition_semaphore,
  returns immediately (non-blocking)
- GPU worker: partition-aware result routing — routes partition proofs
  to ProofAssembler, delivers final proof when all partitions complete,
  calls malloc_trim(0) after each partition to release memory
- Error handling: failed flag on PartitionedJobState, synthesis/GPU
  failure propagation, skip work for already-failed jobs
- Config: add synthesis.partition_workers (default 20), partition
  semaphore limiting concurrent synthesis workers
- Phase 6 slotted pipeline retained as fallback (partition_workers=0,
  slot_size>0)
- ParsedC1Output and parse_c1_output made pub for engine access
- synthesize_partition made pub for engine dispatch

Expected steady-state: 42.8s/proof → ~30s/proof (GPU-limited), ~100%
GPU utilization, zero cross-sector GPU idle gaps. Per-partition GPU
calls use num_circuits=1, making b_g2_msm 0.4s instead of 25s.
Proposal to eliminate per-partition GPU idle gaps by overlapping one
worker's CPU preamble/epilogue with another worker's CUDA kernel
execution. Two GPU workers per physical GPU share a fine-grained
mutex that brackets only the CUDA kernel region inside
generate_groth16_proofs_c.

Key findings:
- The static mutex in groth16_cuda.cu covers the entire function
  (~3.5s), but actual CUDA kernel time is ~2.1s. The remaining
  ~1.3s is CPU work (preprocessing, b_g2_msm, epilogue) that
  could overlap with the next partition's GPU execution.
- The sppark semaphore_t is a counting semaphore that latches
  notify() before wait(), confirming safe barrier semantics for
  the proposed restructuring.
- Recommended approach: pass mutex pointer from Rust through FFI,
  acquire before per-GPU thread launch, release after per-GPU
  thread join, leaving b_g2_msm and epilogue outside the lock.

Estimated impact: GPU efficiency ~64% → ~98%, throughput ~3-10%
improvement on top of Phase 7.
Narrow the C++ static mutex in generate_groth16_proofs_c to cover only
the CUDA kernel region (NTT+MSM, batch additions, tail MSMs). CPU
preprocessing and b_g2_msm now run outside the lock, allowing two GPU
workers to interleave: one does CPU work while the other runs CUDA.

Changes across 7 files (~195 lines):

- groth16_cuda.cu: Remove static mutex, add std::mutex* parameter,
  acquire lock before per-GPU thread launch, release after per-GPU
  join (before prep_msm_thread join). Add create/destroy_gpu_mutex
  C helpers for FFI allocation.

- supraseal-c2/lib.rs: Add gpu_mtx parameter to FFI decl and both
  generate_groth16_proof wrappers. Export alloc/free_gpu_mutex.

- bellperson supraseal.rs: Add GpuMutexPtr type, SendableGpuMutex
  wrapper, alloc/free helpers. Thread gpu_mutex through
  prove_from_assignments. Legacy callers pass null (fallback mutex).

- pipeline.rs: Thread GpuMutexPtr through gpu_prove(). Internal
  callers pass null_mut() for backward compatibility.

- engine.rs: Create one C++ mutex per GPU via alloc_gpu_mutex().
  Spawn gpu_workers_per_device workers per GPU (default 2), each
  sharing the same mutex address (as usize for Send safety).

- config.rs: Add gpus.gpu_workers_per_device (default 2).

Benchmark results (RTX 5070 Ti, 96-core Zen4, partition_workers=20):

  Single proof:  69.3s wall (GPU efficiency: 100.0% — zero idle gaps)
  Throughput c=5 j=3: 44.0s/proof (Phase 7: 50.7s → 13.2% improvement)
  Throughput c=5 j=2: 49.5s/proof (Phase 7: 59.8s → 17.2% improvement)

  partition_workers=30 regresses to 60.4s/proof due to CPU contention
  from 30 simultaneous synthesis workers starving GPU preprocessing.
Document three new phases of the pipelined SNARK proving engine:

- Phase 6: Pipelined partition proving (slot-based, 62x b_g2_msm speedup)
- Phase 7: Engine-level per-partition pipeline (cross-sector overlap)
- Phase 8: Dual-worker GPU interlock (100% GPU utilization)

Key benchmark findings:
- Optimal partition_workers=10-12 on 96-core machine (43.5s/proof → 37.4s)
- System is perfectly GPU-bound: throughput = serial CUDA kernel time
  (10 partitions × 3.75s = 37.5s vs measured 37.4s/proof)
- Cross-sector GPU transitions are seamless (<50ms after warmup)
- synthesis_concurrency>1 provides no benefit (synthesis already overlapped)

Update file references and related documents for Phases 6-8.
Two changes to reduce GPU SM idle time caused by PCIe transfers
inside the GPU mutex:

1. Pre-stage a/b/c polynomials (6 GiB) outside the mutex via
   cudaHostRegister + async upload on a dedicated copy stream.
   Overlaps with the other worker's CUDA kernels.

2. Deferred batch sync in Pippenger MSM: double-buffer host-side
   bucket results so GPU never waits for CPU to process the
   previous batch. Eliminates 8+ per-batch idle gaps per MSM.

Includes full PCIe transfer inventory (23.6 GiB HtoD per partition)
and expected 4-9% throughput improvement over Phase 8.
…uploads

- Pre-stage a/b/c polynomial uploads using cudaHostRegister + async DMA
  before GPU mutex acquisition (host pinning) and after (device alloc + upload)
- Memory-aware allocation: query cudaMemGetInfo after pool trim, only pre-stage
  if full 12 GiB (d_a + d_bc) fits with 512 MiB safety margin
- Double-buffered deferred batch sync in Pippenger MSM (sppark submodule):
  per-batch sync deferred to next iteration, overlapping DtoH with compute
- Early d_bc free inside per_gpu thread after NTT phase completes
- GPU resources cleaned up before mutex release, host pages unregistered after

Results (gw=1, pw=10, c=3, j=1):
- 32.1s/proof avg (14.2% improvement over Phase 8 baseline 37.4s)
- ntt_msm_h_ms: 2430ms -> 690ms (-71.6%)
- gpu_total_ms: 3746ms -> 1450ms (-61.3%)

gw=2 shows regression (41.0s) due to cudaDeviceSynchronize + pool trim
serialization — needs further investigation.
Add per-stage timing to prestage setup: sync_ms, trim_ms, alloc_ms, upload_ms.

Key findings with c=15 j=15 gw=1:
- Pre-staging overhead: 18ms avg (negligible - PCIe gen5 is fast)
- GPU kernels: 1824ms avg/partition
- CPU critical path (prep_msm + b_g2_msm): 2393ms avg/partition
- CPU is the bottleneck, not GPU — DDR5 bandwidth wall
  with 10 concurrent synthesis workers competing for memory
- Throughput: 41.3s/proof (steady-state)
- c=30 j=20 causes OOM/crash from memory pressure
Phase 9 cuts GPU kernel time 51% (3.7s→1.8s/partition) but steady-state
throughput only improves 14% (37.4→32.1s in isolation) because CPU
preprocessing (prep_msm + b_g2_msm = 2.4s/partition) is now the critical
path. At high concurrency, 10 synthesis workers saturate 8-channel DDR5
bandwidth, slowing CPU MSM operations 12-27% and limiting throughput to
~41s/proof.
Phase 10 (two-lock GPU interlock) was implemented, tested, and abandoned:
- 16 GB VRAM too small for 2 workers' pre-staged buffers
- CUDA memory APIs are device-global, serializing across streams
- Phase 9 already hides b_g2_msm behind GPU lock release

Phase 11 design spec identifies 3 sources of throughput degradation
(32.1s isolation → 38.0s at c=20 j=15) and proposes 3 interventions:
1. Serialize async_dealloc to bound TLB shootdown storms
2. Reduce groth16_pool to 32 threads to cut L3 thrashing
3. Memory-bandwidth throttle during b_g2_msm via shared atomic

Also reverts groth16_cuda.cu Phase 10 timing instrumentation back to
Phase 9 state.
Three interventions to reduce CPU memory subsystem contention at high
concurrency (c=20 j=15):

1. Serialize async_dealloc threads (static mutex in C++ and Rust) to
   prevent concurrent munmap() TLB shootdown storms. Alone: negligible.

2. Reduce groth16_pool from 192 to 32 threads (gpu_threads=32 config).
   Cuts b_g2_msm L3 cache footprint from ~1.1 GiB to ~192 MB. b_g2_msm
   slows from 0.5s to 1.7s but runs outside GPU lock. Best result:
   36.7s/proof (3.4% improvement over Phase 9 baseline of 38.0s).

3. Memory-bandwidth throttle: global AtomicI32 flag set by C++ around
   b_g2_msm, checked by Rust SpMV every 64 chunks with yield_now().
   No additional gain over Intervention 2 alone.

Also tested gw=3 (37.2s) and gw=4 (37.4s) — both worse due to CPU
contention from additional GPU workers.

Optimal config: gw=2, pw=10, gpu_threads=32 → 36.7s/proof.
Decouple b_g2_msm CPU computation from the GPU worker loop so the GPU
worker can pick up the next synthesized partition ~1.7s faster. The C++
generate_groth16_proofs_c is refactored into start (returns pending
handle after GPU lock release) + finalize (joins b_g2_msm, runs
epilogue). GPU workers spawn a separate tokio finalizer task and
immediately loop back for the next job.

Key changes:
- C++ groth16_pending_proof struct holds all shared state on the heap
- generate_groth16_proofs_start_c / finalize_groth16_proof_c split API
- Fix use-after-free: prep_msm_thread now reads provers_owned (heap
  copy) instead of the stack parameter that goes out of scope
- Rust FFI: start_groth16_proof, finish_groth16_proof, drop_pending_proof
- Bellperson: PendingProofHandle<E>, prove_start(), finish_pending_proof()
- Pipeline: gpu_prove_start() / gpu_prove_finish(), PendingGpuProof alias
- Engine: GPU worker restructured with spawned finalizer task; extracted
  process_partition_result() and process_monolithic_result() helpers
- SynthesisCapacityHint struct added (was referenced but undefined)
- Removed unused PR generic from start_groth16_proof FFI

Benchmark (gw=2 pw=10 gt=32, c=20 j=15): 37.1s/proof throughput
(vs 38.0s Phase 11 baseline, ~2.4% improvement).
…re fix

Three improvements on top of the Phase 12 split API (99c31c2):

1. Early a/b/c free: After prove_start returns (GPU done with NTT+MSM),
   clear prover.a/b/c evaluation vectors (~12 GiB per partition). Only
   density bitvecs + assignment data are kept for background b_g2_msm.

2. Channel capacity auto-scaling: Size the synthesis→GPU channel to
   max(synthesis_lookahead, partition_workers) instead of hardcoded 1.
   Completed syntheses drain into the channel buffer without blocking,
   preventing synthesis output pile-up on send().

3. Partition permit held through send: The partition semaphore permit
   is now held until AFTER the channel send succeeds (not just through
   synthesis). With channel capacity = pw, sends are non-blocking so
   this adds zero latency. Bounds total in-flight synthesis outputs
   to partition_workers, preventing unbounded memory growth.

Also adds buffer flight counters (atomic, tracing::debug) for memory
diagnostics and converts eprintln to log::debug in bellperson dealloc.

Benchmark results (20 proofs, j=20, gw=2, gt=32):
  pw=10: 38.5s/proof, 321 GiB peak RSS (was 367 GiB)
  pw=12: 37.7s/proof, 400 GiB peak RSS (was OOM at 668 GiB!)
  pw=14: 37.8s/proof, 457 GiB peak RSS
  pw=16: 38.4s/proof, 510 GiB peak RSS

Optimal config: pw=12 — best throughput with bounded memory.
Document Phase 12 split API and memory backpressure in cuzk-project.md:
- Split API architecture, use-after-free fix, early a/b/c free
- Memory backpressure (channel auto-scaling, permit-through-send)
- Buffer flight counters, memory budget analysis

Add low-memory benchmark sweep (pw=1/2/5/7/10/12 × gw=1/2):
- Memory scales as ~69 + pw×20 GiB (measured)
- 128 GiB: pw=2 gw=1 → 110 GiB peak, 152s/proof
- 256 GiB: pw=7 gw=1 → 208 GiB peak, 53s/proof
- 384 GiB: pw=10 gw=2 → 271 GiB peak, 43s/proof
- gw=2 adds no benefit below pw=10 (GPU synthesis-starved)

Update cuzk.example.toml with measured RAM-tier recommendations,
optimal defaults (gpu_threads=32, partition_workers=12), and
guidance for gw=1 vs gw=2 based on partition worker count.
Wire the cuzk persistent GPU SNARK proving daemon into Curio's harmony
task scheduler for PoRep C2, SnapDeals Prove, and PSProve tasks.

When configured, Curio delegates SNARK computations to cuzk over gRPC
instead of spawning per-proof child processes via ffiselect. Vanilla
proofs are still generated locally (require sector data on disk), then
sent to the daemon for GPU proving, then verified locally.

Go integration:
- lib/cuzk/: gRPC client wrapper (generated proto + client.go)
- lib/ffi/cuzk_funcs.go: PoRepSnarkCuzk, ProveUpdateCuzk on SealCalls
- deps/config/types.go: CuzkConfig (Address, MaxPending, ProveTimeout)
- cmd/curio/tasks/tasks.go: create cuzk.Client, pass to task constructors
- tasks/seal, tasks/snap, tasks/proofshare: cuzk branch in Do/CanAccept/TypeDetails

Build system:
- Makefile: add 'make cuzk' target (cargo build, requires nvcc+cargo)
- Deliberately not in BINS/BUILD_DEPS so CI is unaffected
- install-cuzk/uninstall-cuzk targets, cargo clean in make clean

Vendored Rust forks (complete crates for cargo build):
- extern/bellpepper-core: full crate (was partially tracked)
- extern/supraseal-c2: full crate (was partially tracked)
- extern/bellperson, extern/cuzk: already fully tracked

When Cuzk.Address is empty (default), behavior is identical to before.
No impact on existing deployments.