10x cheaper C2/PRU: CuZK Proving engine #1043

Draft
magik6k wants to merge 33 commits into main from feat/cuzk

Conversation

magik6k commented Feb 20, 2026

Summary

Integrate the cuzk persistent GPU SNARK proving daemon with Curio's task scheduler via gRPC. When enabled, Curio delegates PoRep C2, SnapDeals prove, and PSProve SNARK computations to the cuzk daemon instead of spawning per-proof child processes through ffiselect.

  • Add gRPC client (lib/cuzk/) and SealCalls methods (lib/ffi/cuzk_funcs.go) for PoRep and SnapDeals
  • Wire cuzk into PoRep, SnapDeals, and PSProve tasks with backpressure via GetStatus
  • Add make cuzk build target (NOT in default BINS — CI unaffected)
  • Vendor bellpepper-core and supraseal-c2 crate files so git clone && make cuzk works
  • Add user-facing documentation under documentation/en/experimental-features/

What is cuzk

cuzk is a persistent Rust daemon that keeps Groth16 SRS parameters (~47 GiB for 32 GiB PoRep) resident in CUDA-pinned host memory across proofs. The current ffiselect model spawns a fresh process per proof, loading the SRS from scratch each time (30-90s). cuzk eliminates this overhead entirely.
Beyond SRS residency, cuzk implements a 13-phase optimization pipeline that achieves 2.8x throughput over the ffiselect baseline (37.7s/proof vs ~89s on RTX 5070 Ti). The key architectural contributions are pipelined partition synthesis, dual-worker GPU interlock, PCIe transfer optimization, and a split async GPU proving API.

Architecture

Curio (Go)                          cuzk daemon (Rust/CUDA)
─────────────                       ──────────────────────
tasks/seal,snap,proofshare          persistent process
        │                                   │
    gRPC client ────── unix/TCP ──────► gRPC server
  (lib/cuzk/client.go)                     │
                                    ┌──────┴───────┐
                                    │   Scheduler   │
                                    │  (priority Q) │
                                    └──────┬───────┘
                                           │
                                    ┌──────┴───────┐
                                    │  GPU Workers  │
                                    │  (per device) │
                                    └──────────────┘

Vanilla proofs are generated locally in Curio (this requires the sector data on disk), then sent to cuzk for SNARK computation. The returned proof is verified locally before submission.

Pipelining

The cuzk engine pipelines work at three levels:

1. Partition-Level Synthesis → GPU Pipeline (Phase 7)

Instead of synthesizing all 10 partitions of a sector as a batch before any GPU work begins (the ffiselect model), cuzk spawns partition_workers concurrent synthesis tasks. Each produces a single partition's ProvingAssignment (~13.6 GiB) and sends it through a bounded channel to the GPU worker:

Synthesis Workers:  Part 0   Part 1   Part 2  ...
                       │        │        │
                       ▼        ▼        ▼
GPU Channel:        ─── P0 ── P1 ── P2 ───►
                                        │
GPU Worker:         Prove P0  Prove P1  Prove P2  ...

The GPU processes partition N while workers synthesize partition N+1..N+k. With partition_workers=10, all 10 partitions synthesize concurrently and the GPU is continuously fed.
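
A minimal sketch of this shape, assuming stand-in synthesize_partition/gpu_prove helpers (the real implementation lives in pipeline.rs and moves ~13.6 GiB ProvingAssignments, not byte vectors):

```rust
use std::sync::mpsc::sync_channel;

fn synthesize_partition(part: usize) -> Vec<u8> { vec![part as u8] } // stand-in
fn gpu_prove(part: usize, _assignment: &[u8]) { println!("proved partition {part}"); } // stand-in

fn prove_pipelined(num_partitions: usize, lookahead: usize) {
    // Bounded channel: senders block once `lookahead` partitions are
    // buffered, capping the live synthesis output.
    let (tx, rx) = sync_channel::<(usize, Vec<u8>)>(lookahead);

    std::thread::scope(|s| {
        for part in 0..num_partitions {
            let tx = tx.clone();
            s.spawn(move || {
                let assignment = synthesize_partition(part); // CPU-heavy
                tx.send((part, assignment)).unwrap();        // blocks when full
            });
        }
        drop(tx); // channel closes once all synthesis threads finish

        // GPU worker consumes partitions as they arrive.
        for (part, assignment) in rx {
            gpu_prove(part, &assignment);
        }
    });
}
```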

2. Dual-Worker GPU Interlock (Phase 8)

Inside generate_groth16_proofs_c(), each partition has ~1.3s of CPU preprocessing (pointer setup, bitmap population) before ~3.3s of CUDA kernels, followed by ~0.7s of CPU epilogue. Phase 8 narrows the C++ mutex to cover only the CUDA kernel region and runs two GPU workers per device:

Worker A:  ─ CPU prep ─══ CUDA ══─ epilogue ─
Worker B:              ─ CPU prep ─══ CUDA ══─ epilogue ─
GPU:                   ████ A ████ ████ B ████

This achieves 100% GPU utilization — zero idle gaps between partitions in steady state.
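
The interlock pattern itself is small; here is a runnable sketch with stand-in cpu_prep/cuda_kernels/epilogue functions, where a shared mutex brackets only the kernel region:

```rust
use std::sync::{Arc, Mutex};

fn cpu_prep(_job: u32) {}      // stand-in, ~1.3s in the real code
fn cuda_kernels(_job: u32) {}  // stand-in, ~3.3s, the only serialized part
fn epilogue(_job: u32) {}      // stand-in, ~0.7s

fn worker(gpu_lock: Arc<Mutex<()>>, jobs: Vec<u32>) {
    for job in jobs {
        cpu_prep(job); // runs outside the lock, overlaps the peer's kernels
        {
            let _gpu = gpu_lock.lock().unwrap();
            cuda_kernels(job);
        } // lock released here, before the epilogue
        epilogue(job);
    }
}

fn main() {
    let lock = Arc::new(Mutex::new(()));
    let (a, b) = (lock.clone(), lock.clone());
    let ta = std::thread::spawn(move || worker(a, vec![0, 2, 4]));
    let tb = std::thread::spawn(move || worker(b, vec![1, 3, 5]));
    ta.join().unwrap();
    tb.join().unwrap();
}
```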

3. Split Async GPU API (Phase 12)

Phase 12 splits the monolithic C++ prove call into prove_start() (GPU kernels) + finalize() (b_g2_msm CPU + proof assembly). The GPU worker releases the lock ~1.7s earlier per partition, immediately picking up the next synthesized partition while b_g2_msm runs in a spawned finalizer task:

GPU Worker: ══ CUDA ══  ══ CUDA ══  ══ CUDA ══
Finalizer:          b_g2_msm    b_g2_msm    b_g2_msm
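
The worker-loop shape looks roughly like the following sketch, with illustrative prove_start/finalize stand-ins for the real FFI wrappers:

```rust
struct Partition;
struct Pending; // stand-in for the C++ pending-proof handle
fn prove_start(_p: Partition) -> Pending { Pending } // GPU kernels, then lock release
fn finalize(_h: Pending) {}                          // b_g2_msm + proof assembly

async fn gpu_worker(mut rx: tokio::sync::mpsc::Receiver<Partition>) {
    while let Some(part) = rx.recv().await {
        // GPU phase runs on a blocking thread; returns as soon as the
        // GPU lock is released.
        let pending = tokio::task::spawn_blocking(move || prove_start(part))
            .await
            .unwrap();
        // Finalization is detached: the loop immediately picks up the
        // next synthesized partition.
        tokio::task::spawn_blocking(move || finalize(pending));
    }
}
```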

Memory Management

SRS Residency

The daemon pre-populates GROTH_PARAM_MEMORY_CACHE at startup. Since the process is long-lived, the ~47 GiB SRS stays pinned in CUDA host memory across all proofs. No per-proof disk I/O.

Per-Partition Working Set

Memory is proportional to partition_workers, not total partitions:

Pipeline stage    | Per-partition memory | Notes
----------------- | -------------------- | -----
During synthesis  | ~16 GiB              | 12 GiB a/b/c + 4 GiB aux
After prove_start | ~4 GiB               | a/b/c freed immediately; only aux + density remain
Pending finalize  | ~4 GiB               | Held by finalizer task

The formula: Peak RSS ≈ 69 + (partition_workers × 20) GiB.

Validated configurations:
  • 128 GiB system: pw=2, gw=1 → 110 GiB peak, 152s/proof
  • 256 GiB system: pw=7, gw=1 → 208 GiB peak, 53s/proof
  • 512 GiB system: pw=12, gw=2 → 400 GiB peak, 37.7s/proof
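
(As a check of the formula against the measured numbers: pw=2 predicts 69 + 2×20 = 109 GiB versus the measured 110 GiB, and pw=7 predicts 209 GiB versus the measured 208 GiB.)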

Backpressure

Three mechanisms prevent OOM at high concurrency:

  1. Early a/b/c free: prove_start() clears 12 GiB/partition immediately after GPU upload
  2. Channel capacity auto-scaling: bounded to max(synthesis_lookahead, partition_workers)
  3. Partition semaphore held through send: limits total in-flight synthesis outputs
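
A minimal sketch of mechanisms 2 and 3, with stand-in synthesize/Output types: the semaphore permit is held until the channel send succeeds, so at most partition_workers synthesis outputs are ever in flight.

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

struct Output; // stand-in for a synthesized partition
fn synthesize(_part: usize) -> Output { Output } // stand-in

async fn dispatch(parts: usize, pw: usize, lookahead: usize) {
    let cap = lookahead.max(pw); // channel capacity auto-scaling
    let (tx, mut rx) = mpsc::channel::<Output>(cap);
    let sem = Arc::new(Semaphore::new(pw));

    for part in 0..parts {
        let (tx, sem) = (tx.clone(), sem.clone());
        tokio::spawn(async move {
            let permit = sem.acquire_owned().await.unwrap();
            let out = tokio::task::spawn_blocking(move || synthesize(part))
                .await
                .unwrap();
            tx.send(out).await.unwrap(); // non-blocking while capacity >= pw
            drop(permit); // released only after the send lands in the buffer
        });
    }
    drop(tx);

    while let Some(_out) = rx.recv().await { /* GPU worker consumes here */ }
}
```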

CPU Locking / GPU Mutex

The C++ generate_groth16_proofs_c() originally used a static std::mutex that serialized the entire function. cuzk introduces:

  • Heap-allocated mutex (create_gpu_mutex() / destroy_gpu_mutex() FFI): one per physical GPU, managed by the engine. Passed through FFI as *mut c_void.
  • Narrowed scope: acquired before the per-GPU CUDA kernel launch, released after kernels complete but before prep_msm_thread.join() — b_g2_msm and proof assembly run outside the lock.
  • Backward compatible: if gpu_mtx is null, falls back to the function-local static mutex (for non-engine callers).

The dual-worker interlock (2 workers per GPU) alternates lock acquisition so Worker B's CPU prep runs while Worker A holds the lock for CUDA kernels, and vice versa.
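
The Rust-side shape of that FFI surface is roughly the following sketch; the exact signatures in the supraseal-c2 bindings may differ:

```rust
use std::ffi::c_void;

extern "C" {
    fn create_gpu_mutex() -> *mut c_void;   // heap-allocates a C++ std::mutex
    fn destroy_gpu_mutex(mtx: *mut c_void); // frees it
    // The proving entry point also takes the mutex pointer; passing null
    // selects the legacy function-local static mutex fallback.
}

// Mirrors the SendableGpuMutex idea: the raw pointer is shared across
// worker tasks, so it needs an explicit Send marker.
struct GpuMutex(*mut c_void);
unsafe impl Send for GpuMutex {}

impl GpuMutex {
    fn new() -> Self { Self(unsafe { create_gpu_mutex() }) }
}
impl Drop for GpuMutex {
    fn drop(&mut self) { unsafe { destroy_gpu_mutex(self.0) } }
}
```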

Task Integration Details

When [Cuzk] Address is set in Curio config:

Behavior      | Change
------------- | ------
TypeDetails() | GPU requirement zeroed, RAM set to 1 GiB (resource decisions delegated to cuzk)
CanAccept()   | Queries GetStatus → rejects if totalPending >= MaxPending
Do()          | Generates vanilla proof locally → sends to cuzk via Prove RPC → verifies returned proof locally

When Address is empty (default), all tasks behave exactly as before. No behavioral change for existing deployments.

Build

make cuzk          # builds extern/cuzk → ./cuzk binary (~1m51s from scratch)
make install-cuzk  # installs to /usr/local/bin
make clean         # includes cargo clean in extern/cuzk

make cuzk is NOT in the default BINS or BUILD_DEPS targets, so CI (which has no CUDA) is unaffected. Requires nvcc and cargo.

Files Changed

New files:

  • lib/cuzk/client.go — gRPC client wrapper (connect, Prove, GetStatus, HasCapacity)
  • lib/cuzk/proving.pb.go, proving_grpc.pb.go — generated protobuf/gRPC stubs
  • lib/ffi/cuzk_funcs.go — PoRepSnarkCuzk, ProveUpdateCuzk on SealCalls
  • documentation/en/experimental-features/cuzk-proving-daemon.md — user guide

Modified files:

  • deps/config/types.go — CuzkConfig struct + defaults
  • cmd/curio/tasks/tasks.go — creates cuzk.Client, passes to task constructors
  • tasks/seal/task_porep.go — cuzkClient field, Do/CanAccept/TypeDetails branches
  • tasks/snap/task_prove.go — same pattern
  • tasks/proofshare/task_prove.go — same + threaded through computeProof→computePoRep/computeSnap
  • Makefile — cuzk build/install/clean targets
  • .gitignore — /cuzk binary

Vendored crate files (for git clone && make cuzk):

  • extern/bellpepper-core/ — 13 files (full crate: Cargo.toml, src/, licenses)
  • extern/supraseal-c2/ — 8 files (Cargo.toml, build.rs, Cargo.lock, tests)

Implement the cuzk proving engine as a Rust workspace in extern/cuzk/
with 5 crates (proto, core, server, daemon, bench) and full gRPC API.

Phase 0 delivers:
- gRPC daemon (TCP + Unix socket) with 8 RPC endpoints
- Real PoRep C2 proving via filecoin-proofs-api + SupraSeal CUDA backend
- SRS parameter residency via GROTH_PARAM_MEMORY_CACHE (lazy populate)
- Priority scheduler with binary heap queue
- Prometheus metrics endpoint
- Bench tool for single proof submission, status, preload, metrics

E2E validated: Two consecutive 32GiB PoRep C2 proofs on RTX 5070 Ti —
116.8s cold (SRS from disk) → 92.8s warm (SRS cached), 20.5% improvement.
Both produced valid 1920-byte Groth16 proofs.
…f fix

Improve the cuzk daemon's debuggability and operational readiness
for Phase 1 multi-GPU work:

Observability:
- Add tracing spans (info_span) with job_id correlation throughout
  prover and engine; upstream filecoin-proofs logs now tagged per-job
- Split timing into deserialize vs proving (monolithic in Phase 0)
- Per proof-kind Prometheus counters and duration summaries
- GPU detection via nvidia-smi in GetStatus RPC (name, VRAM)
- Running job info shown in status and annotated on GPU

Correctness:
- Fix AwaitProof to register late listeners (was broken, always 404)
- Graceful shutdown via watch channel (drain, finish current proof)
- Per-kind completed/failed counters with ring buffer for durations

Tooling:
- Add 'batch' command to cuzk-bench (sequential + concurrent modes,
  throughput stats with avg/min/max/proofs-per-min)
- Refactor bench client connection into shared connect() helper
- Add cuzk.example.toml with documented configuration

E2E validated: 32GiB PoRep C2 proof completes in ~110s with full
job_id-correlated logging and per-kind metrics.
…heduling

Wire up WinningPoSt, WindowPoSt, and SnapDeals provers via filecoin-proofs-api:
- prove_winning_post: generate_winning_post_with_vanilla
- prove_window_post: generate_single_window_post_with_vanilla (per-partition)
- prove_snap_deals: generate_empty_sector_update_proof_with_vanilla

Multi-GPU worker pool:
- Auto-detect GPUs via nvidia-smi or use config gpus.devices list
- Spawn one async worker loop per GPU with CUDA_VISIBLE_DEVICES isolation
- Per-worker SRS affinity tracking (last_circuit_id for future routing)

Proto/API updates:
- Add repeated bytes vanilla_proofs field for PoSt/SnapDeals multi-proof inputs
- Rename SnapDeals fields to comm_r_old/comm_r_new/comm_d_new (raw 32-byte)
- Registered proof type enum conversion (FFI V1_1 ↔ proofs-api V1_2 mapping)

Bench tool updated:
- Supports all proof types with --vanilla (JSON array of base64 proofs)
- New flags: --registered-proof, --randomness, --comm-r-old/new, --comm-d-new

8 unit tests pass, 0 warnings, clean cargo check --no-default-features.
…napDeals

Add gen-vanilla subcommand to cuzk-bench for generating vanilla proof test
data from existing sealed sector data. This completes Phase 1 by enabling
end-to-end testing of all four proof types (WinningPoSt, WindowPoSt,
SnapDeals) without requiring Go/Curio.

Three sub-subcommands:
- winning-post: challenge selection + Merkle inclusion proofs (66 challenges)
- window-post: fallback challenges + vanilla proofs (10 challenges)
- snap-prove: partition proofs from original + updated sector data (16 partitions)

Key implementation details:
- filecoin-proofs-api added as optional dep behind 'gen-vanilla' feature flag
- CID commitment parsing via cid crate (bagboea4b5abc... → [u8;32])
- commdr.txt file format parsing (d:<CID> r:<CID>)
- Output format: JSON array of base64 strings (matches Go json.Marshal([][]byte))
- CPU-only, no GPU required (--no-default-features --features gen-vanilla)

Validated against /data/32gbench/ golden data:
- WinningPoSt: 164KB vanilla proof, 218KB JSON output
- WindowPoSt: 25KB vanilla proof, 33KB JSON output
- SnapDeals: 16 × 562KB partition proofs, 12MB JSON output

5 new unit tests (CID parsing, commdr format, JSON round-trip).
Fork bellperson 0.26.0 into extern/bellperson/ with minimal changes to
expose the synthesis/GPU split point for pipelined proving:

bellperson changes (3 files, ~130 lines changed):
- prover/mod.rs: Make ProvingAssignment struct and all fields pub
- prover/supraseal.rs: Make synthesize_circuits_batch() pub, add new
  prove_from_assignments() function (extracted GPU-phase code)
- groth16/mod.rs: Re-export ProvingAssignment, synthesize_circuits_batch,
  prove_from_assignments under cuda-supraseal feature

The internal two-phase architecture was already clean — synthesis runs
circuit.synthesize() on CPU (rayon parallel), producing ProvingAssignment
with a/b/c evaluation vectors + density trackers. GPU phase packs these
into raw pointer arrays and calls supraseal_c2::generate_groth16_proof().
We simply expose both phases as separate public functions.
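
A runnable toy illustrating just the call shape; the types and bodies below are stand-ins, not the real bellperson signatures (which are generic over the curve and carry circuits, SRS parameters, and randomness):

```rust
// Toy stand-in: the real ProvingAssignment holds a/b/c evaluation
// vectors plus density trackers.
struct ProvingAssignment {
    a: Vec<u64>,
}

fn synthesize_circuits_batch(n: usize) -> Vec<ProvingAssignment> {
    // CPU phase: bellperson runs circuit.synthesize() here, rayon-parallel.
    (0..n).map(|i| ProvingAssignment { a: vec![i as u64] }).collect()
}

fn prove_from_assignments(assignments: Vec<ProvingAssignment>) -> Vec<Vec<u8>> {
    // GPU phase: bellperson packs raw pointer arrays and calls
    // supraseal_c2::generate_groth16_proof().
    assignments
        .into_iter()
        .map(|p| p.a.iter().map(|&x| x as u8).collect())
        .collect()
}

fn main() {
    let assignments = synthesize_circuits_batch(10); // one per partition
    let proofs = prove_from_assignments(assignments);
    assert_eq!(proofs.len(), 10);
}
```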

cuzk workspace changes:
- Cargo.toml: Add [patch.crates-io] for bellperson fork, add bellperson
  as workspace dependency
- Cargo.lock: Updated to use local bellperson

Also includes cuzk-phase2-design.md with complete Phase 2 design:
- Per-partition pipeline strategy (13.6 GiB intermediate state instead of
  136 GiB for all 10 partitions)
- Memory budget analysis for 128 GiB vs 256 GiB machines
- SRS manager design using SuprasealParameters directly
- 7-step implementation plan
- Call chain comparison (Phase 1 monolithic vs Phase 2 pipelined)

All 8 existing cuzk tests pass. Zero new warnings from our changes.
Implement the core Phase 2 infrastructure: split monolithic seal_commit_phase2()
into separate CPU synthesis and GPU proving phases, connected via a pipeline.

New modules:
- srs_manager.rs: Direct SRS loading via SuprasealParameters (bypasses
  GROTH_PARAM_MEMORY_CACHE). CircuitId enum maps proof types to exact
  .params filenames. Supports preload, evict, memory budget tracking.

- pipeline.rs: Per-partition pipelined PoRep C2 proving. Each of the 10
  partitions is synthesized individually (~13.6 GiB intermediate state vs
  ~136 GiB for all 10 at once), then proven on GPU via bellperson's split
  API (synthesize_circuits_batch → prove_from_assignments).
  Enables PoRep pipelining on 128 GiB machines.

Engine changes:
- Engine now supports pipeline.enabled config flag
- When enabled, PoRep C2 jobs use pipelined prover with SrsManager
- When disabled, falls back to Phase 1 monolithic prover
- SRS preloading uses SrsManager in pipeline mode

Config additions:
- [pipeline] section: enabled, synthesis_lookahead
- synthesis_lookahead controls backpressure (partitions buffered)

Dependencies:
- Added direct deps on filecoin-proofs, storage-proofs-{core,porep,post,update},
  bellperson (fork), blstrs, ff, rayon, rand_core, filecoin-hashers
- Correct feature flag propagation (cuda-supraseal for core+bellperson,
  cuda for porep/post/update which lack cuda-supraseal)

Tests: 15 pass (12 existing + 3 new), 0 warnings from cuzk code.
Compiles with --no-default-features (no GPU required for check builds).
Rewrite pipeline.rs to use batch synthesis (all 10 PoRep partitions in
one rayon-parallel call + single GPU pass) instead of per-partition
sequential mode. This matches monolithic performance (~91s vs ~93s)
while enabling cross-proof overlap in the next step.

Add pipelined synthesis/prove functions for all 4 proof types:
- PoRep C2: batch mode (synthesize_porep_c2_batch + gpu_prove)
- WinningPoSt: inlined circuit construction (no private API needed)
- WindowPoSt: single-partition inlined circuit construction
- SnapDeals: all-partition circuit construction

Other changes:
- engine.rs: route all proof types through pipeline when enabled
- prover.rs: make 4 helper functions pub for pipeline.rs use
- Add bincode dep for PoSt/SnapDeals vanilla proof deserialization
Restructure the engine to use a two-stage pipeline architecture when
pipeline mode is enabled:

  Stage 1 (synthesis task): Pulls requests from the scheduler, runs
  CPU-bound circuit synthesis on a blocking thread, pushes the
  SynthesizedJob (intermediate state + SRS ref) to a bounded channel.

  Stage 2 (GPU workers): One per GPU, pull SynthesizedJob from the
  shared channel, run gpu_prove on a blocking thread pinned to their
  GPU via CUDA_VISIBLE_DEVICES, complete the job.

The bounded channel (capacity = synthesis_lookahead config, default 1)
provides backpressure: when GPU workers are busy and the channel is
full, the synthesis task blocks — preventing OOM from unbounded
pre-synthesized proofs.
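
A minimal sketch of this two-stage shape, with stand-in types and a stand-in pin_to_gpu helper (the real device pinning and job types live in engine.rs):

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};

struct SynthesizedJob { id: u64 }
fn synthesize(id: u64) -> SynthesizedJob { SynthesizedJob { id } } // stand-in
fn pin_to_gpu(_dev: &str) {} // stand-in for CUDA_VISIBLE_DEVICES isolation
fn gpu_prove(job: SynthesizedJob) { let _ = job.id; } // stand-in

async fn run(gpus: Vec<String>, lookahead: usize, jobs: u64) {
    let (tx, rx) = mpsc::channel::<SynthesizedJob>(lookahead);
    let rx = Arc::new(Mutex::new(rx)); // GPU workers share one receiver

    // Stage 2: one worker per GPU.
    for dev in gpus {
        let rx = rx.clone();
        tokio::spawn(async move {
            loop {
                let job = rx.lock().await.recv().await;
                let Some(job) = job else { break };
                let dev = dev.clone();
                tokio::task::spawn_blocking(move || {
                    pin_to_gpu(&dev);
                    gpu_prove(job);
                })
                .await
                .unwrap();
            }
        });
    }

    // Stage 1: synthesize on blocking threads; send blocks when the
    // bounded channel is full (backpressure).
    for id in 0..jobs {
        let job = tokio::task::spawn_blocking(move || synthesize(id))
            .await
            .unwrap();
        tx.send(job).await.unwrap();
    }
    drop(tx); // workers drain the channel and exit
}
```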

For PoRep 32G under continuous load, this enables:
  synth(N) | GPU(N) + synth(N+1) | GPU(N+1) + synth(N+2) | ...
  Steady-state: ~55s/proof (synthesis-bound) vs ~91s sequential

When pipeline.enabled = false, falls back to Phase 1 monolithic
workers (no overlap, full cycle per GPU worker).

Also updates the example config with improved pipeline documentation.
Add batch collector and multi-sector synthesis to the pipeline engine.
When max_batch_size > 1, same-type PoRep requests are accumulated and
processed as a single combined synthesis + GPU proving pass, amortizing
fixed GPU costs and improving SM utilization.

New files:
- batch_collector.rs: Accumulates same-circuit-type proof requests,
  flushes on max_batch_size or max_batch_wait_ms timeout. PoRep and
  SnapDeals are batchable; PoSt types bypass the collector entirely.

Pipeline changes:
- synthesize_porep_c2_multi(): Takes N sectors' C1 outputs, builds all
  N×10 partition circuits, synthesizes in one batch call. Returns
  combined SynthesizedProof + sector_boundaries for splitting results.
- split_batched_proofs(): Splits concatenated GPU output back into
  per-sector proof byte vectors using sector_boundaries.

Engine changes:
- Synthesis task now uses BatchCollector for batchable proof types.
  Races scheduler delivery against batch timeout. Non-batchable types
  (WinningPost, WindowPost) preempt-flush any pending batch and process
  immediately.
- SynthesizedJob extended with batch_requests and sector_boundaries.
- GPU worker handles batched results: splits proof output, notifies
  each sector's individual caller with its own proof bytes and timings.

Config:
- scheduler.max_batch_size controls batch limit (1=disabled, 2-3 typical)
- scheduler.max_batch_wait_ms controls accumulation window

Backward compatible: max_batch_size=1 (default) preserves Phase 2
single-sector behavior exactly. All 25 tests pass, 0 cuzk warnings.
…oughput

All Phase 3 E2E tests pass on RTX 5070 Ti:
- Timeout flush: BatchCollector correctly flushes after 30s wait
- Batch=2: 2 sectors synthesized as 20 circuits in 55s (same as 10),
  GPU 69s, yielding 62.7s/proof (1.42x vs baseline 89s)
- Overflow: 3 proofs with batch=2 shows correct batch+overflow+pipeline
- Non-batchable: WinningPoSt bypasses BatchCollector (0.8s total)

Memory: batch=2 peaks at 360 GiB (vs 203 GiB for single proof).
Updated roadmap table with measured numbers.
Synthesis optimizations (55.4s → 50.9s, -8.3%):
- Boolean::add_to_lc/sub_from_lc: eliminate temporary LC allocations in
  circuit gadget hot paths (Boolean::lc creates a fresh Vec on every call;
  the new methods append directly to an existing LC)
- Patched: UInt32::addmany, Num::add_bool_with_coeff, Boolean::enforce_equal,
  Boolean::sha256_ch, Boolean::sha256_maj, lookup3_xy,
  lookup3_xy_with_conditional_negation
- Vec recycling pool in ProvingAssignment::enforce for the 6 LC buffers
- Software prefetch in eval_with_trackers and LinearCombination::eval
- perf stat: 91B fewer instructions (-15.3%), 18.6B fewer branches (-26.7%)

GPU async deallocation (36s → 26s bellperson wrapper, -10s):
- Root cause: ~37 GB of C++ vectors (split_vectors, tail_msm_bases) and
  ~130 GB of Rust Vecs (ProvingAssignment a/b/c) freed synchronously in
  destructors after GPU proving, blocking return for ~10s of munmap() calls
- C++ fix: move split_vectors + tail_msm bases into detached std::thread
- Rust fix: spawn thread to drop provers/input_assignments/aux_assignments
- CUDA internal timing unchanged (~26s); overhead was pure deallocation
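
The Rust side of the fix reduces to a one-liner worth spelling out; a sketch under the assumption that the buffers are plain owned values:

```rust
// Move ownership of the large buffers into a detached thread so their
// munmap() cost happens off the proving return path.
fn drop_in_background<T: Send + 'static>(buffers: T) {
    std::thread::spawn(move || drop(buffers));
}
```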

Also: A4 (parallel B_G2 CPU MSM), D4 (per-MSM window objects),
CUDA timing instrumentation, synth-only microbenchmark tool.

E2E 32 GiB PoRep C2 on RTX 5070 Ti: 88.9s → 77.2s (-13.2%)
Pre-allocate ProvingAssignment Vecs (a, b, c, aux_assignment) to their
final capacity using hints cached from the first synthesis. Eliminates
~27 reallocation cycles per Vec per circuit.

Benchmarked: no measurable impact on 32 GiB PoRep C2 (50.65s with and
without hints). Rust's geometric doubling amortizes well at our scale,
and the ~265 GB of theoretical redundant copies are overlapped with
computation across 10 parallel circuits on 96 cores. Kept as defensive
code for memory-constrained environments.
Replace full circuit synthesis (alloc+enforce) with two-phase approach:
1. WitnessCS: witness-only generation (enforce is no-op)
2. CSR MatVec: pre-compiled sparse matrix × witness vector

New cuzk-pce crate with:
- RecordingCS: captures R1CS structure into CSR format (with tagged
  column encoding to handle interleaved alloc_input/enforce)
- CsrMatrix/PreCompiledCircuit: serializable CSR storage
- spmv_parallel: row-parallel sparse MatVec with rayon
- evaluate_pce: builds witness vector, evaluates A*w, B*w, C*w
- PreComputedDensity: density bitmaps extracted from CSR structure
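
A self-contained sketch of a row-parallel CSR sparse matrix-vector product in the spirit of spmv_parallel (u64 arithmetic stands in for the real field elements):

```rust
use rayon::prelude::*;

struct CsrMatrix {
    row_ptr: Vec<usize>, // len = rows + 1
    col_idx: Vec<u32>,
    vals: Vec<u64>,
}

fn spmv(m: &CsrMatrix, w: &[u64], out: &mut [u64]) {
    // Each row is independent, so rows parallelize cleanly across the pool.
    out.par_iter_mut().enumerate().for_each(|(row, o)| {
        let (start, end) = (m.row_ptr[row], m.row_ptr[row + 1]);
        *o = (start..end).fold(0u64, |acc, i| {
            acc.wrapping_add(m.vals[i].wrapping_mul(w[m.col_idx[i] as usize]))
        });
    });
}
```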

Pipeline integration:
- synthesize_auto() dispatcher: PCE fast path when cached, old path otherwise
- Static OnceLock caches per circuit type (porep-32g, winning-post, etc.)
- ProvingAssignment::from_pce() constructor in bellperson fork
- All 6 synthesis call sites switched to synthesize_auto()

Benchmark (pce-bench subcommand):
- Correctness: all 10 circuits × 130M constraints match bit-for-bit
- Baseline synthesis: 50.4s (10 circuits, old path)
- PCE synthesis:     35.5s (26.5s witness + 8.8s MatVec)
- Speedup:           1.42x
- PCE extraction:    46.9s (one-time cost, amortized over all future proofs)
- Peak RAM:          375 GB
Add PcePipeline subcommand to cuzk-bench for testing PCE memory behavior
under sequential and parallel pipelining modes:
- RSS tracking via /proc/self/status at each pipeline stage
- malloc_trim() between proofs for clean memory release
- Wave-based parallel execution using std::thread::scope (-j N flag)
- compare_old flag for A/B comparison in first iteration

Update cuzk-project.md with j=2 parallel pipeline benchmark results:
- 2 concurrent syntheses: 49s wall vs 71s sequential (1.45x wall speedup)
- Per-proof degradation: 46-49s (vs 35.5s j=1) due to BW contention
- Peak RSS: 407 GiB (2x working sets + PCE static + transient)
PCE disk persistence (raw binary format):
- New cuzk-pce::disk module with save_to_disk/load_from_disk
- Raw binary format (v2): 32-byte header + bulk byte dumps of CSR vectors
- 5.4x faster than bincode: 9.2s load vs 49.9s (from tmpfs, 25.7 GiB)
- Atomic writes (tmp + rename) to prevent corruption
- Header with magic/version/dimensions for quick validation
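
The atomic-write pattern is standard tmp-then-rename; a sketch with an illustrative save_atomic helper:

```rust
use std::{fs, io::Write, path::Path};

fn save_atomic(path: &Path, header: &[u8], body: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(header)?;
    f.write_all(body)?;
    f.sync_all()?; // flush before the rename makes it visible
    fs::rename(&tmp, path)?; // atomic on POSIX: readers see old or new file
    Ok(())
}
```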

Daemon integration:
- preload_pce_from_disk() called at engine startup (loads all PCE files)
- extract_and_cache_pce() now saves to disk after extraction
- Background PCE auto-extraction triggered after first old-path synthesis
- get_pce() made public for engine-level cache checking

Phase 6 design document (c2-optimization-proposal-6.md):
- Slotted partition pipeline: overlap synth/GPU at partition granularity
- slot_size=2 sweet spot: 41s latency (vs 69.5s batch), 54 GiB RAM (vs 136 GiB)
- Steady-state throughput unchanged (35.5s/proof, synthesis-bound)
- Multi-sector and multi-GPU extension paths documented

Measured (RTX 5070 Ti, 32 GiB PoRep):
- PCE save (NVMe): 22.3s, 1.2 GB/s
- PCE load (tmpfs): 9.2s, 3.0 GB/s
- PCE load (NVMe): ~13-15s estimated (3x faster than 47s extraction)
…esis

Redesign the slotted pipeline to truly pipeline partition synthesis with
GPU proving. All 10 partitions are synthesized in parallel (bounded by
channel capacity), and the GPU consumes them one at a time as they
arrive.

Key changes:
- prove_porep_c2_partitioned(): spawns one thread per partition via
  std::thread::scope, all run concurrently. Bounded sync_channel
  provides backpressure to limit live RAM.
- Each partition = 1 GPU call (num_circuits=1), which gives fast
  b_g2_msm (~0.4s multi-threaded vs ~23s for num_circuits>=2).
- ProofAssembler: indexed by partition number, supports out-of-order
  arrival, assembles in partition order.
- synthesize_partition(): single-partition synthesis helper.
- Backward-compatible prove_porep_c2_slotted() wrapper dispatches
  to partitioned path when slot_size < num_partitions.
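
A sketch of the out-of-order assembly keyed by partition index (illustrative types; the real assembler also carries timings and failure state):

```rust
struct ProofAssembler {
    slots: Vec<Option<Vec<u8>>>,
    remaining: usize,
}

impl ProofAssembler {
    fn new(partitions: usize) -> Self {
        Self { slots: vec![None; partitions], remaining: partitions }
    }

    /// Accepts partition proofs in any order; returns the concatenated
    /// multi-partition proof once the last one arrives.
    fn submit(&mut self, idx: usize, proof: Vec<u8>) -> Option<Vec<u8>> {
        if self.slots[idx].replace(proof).is_none() {
            self.remaining -= 1;
        }
        (self.remaining == 0).then(|| {
            self.slots
                .iter_mut()
                .flat_map(|s| s.take().unwrap()) // drain in partition order
                .collect()
        })
    }
}
```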

Benchmark results (32 GiB PoRep, 96-core Zen4, RTX 5070 Ti):
  max_concurrent=1: 72.0s, 71.3 GiB peak (5.42x overlap)
  max_concurrent=2: 72.7s, 86.8 GiB peak (5.38x overlap)
  max_concurrent=3: 71.9s, 86.8 GiB peak (5.37x overlap)
  batch-all:        62.3s, 228.5 GiB peak (no overlap)

Pipelined mode uses 3.2x less RAM (71 vs 228 GiB) with only ~16%
latency overhead. GPU takes ~3.8s/partition vs 25.5s batch-all total.
…ispatcher

Add timeline instrumentation for waterfall visualization of the proving
pipeline. Events (SYNTH_START/END, CHAN_SEND, GPU_PICKUP/START/END) are
emitted as CSV to stderr with millisecond offsets from engine start,
enabling precise analysis of GPU utilization and idle gaps.

Add synthesis_concurrency config parameter that controls how many proofs
can be synthesized simultaneously on the CPU. When synthesis takes longer
than GPU proving (39s vs 27s), the GPU idles ~12s between proofs with
sequential synthesis. With concurrency=2, overlapping syntheses can keep
the GPU continuously fed.

Implementation uses tokio::sync::Semaphore to limit concurrent synthesis
tasks. When concurrency=1 (default), behavior is identical to the old
sequential loop. When >1, each batch is spawned as an independent task
with semaphore-guarded concurrency.

Benchmark results (PoRep C2, 5-proof runs):
  concurrency=1: 45.3s/proof, 70.9% GPU utilization (baseline)
  concurrency=2, j=2: 42.2s/proof, 77.8% GPU utilization (+7%)
  concurrency=2, j=3: 43.1s/proof, 90.7% GPU utilization (+5%)
  concurrency=2, j=4: 60.2s/proof (CPU contention, regression)

CPU contention between synthesis (rayon) and b_g2_msm (rayon) during GPU
proving limits the improvement. Thread pool isolation is the next step.
Add configurable thread pool partitioning to reduce CPU contention when
running parallel synthesis alongside GPU proving.

Two independent thread pools compete for CPU cores during proving:
  1. Rayon global pool — used by synthesis (bellperson, PCE SpMV)
  2. C++ groth16_pool (sppark) — used by b_g2_msm and preprocessing

Changes:
- groth16_cuda.cu: Convert static groth16_pool to lazy initialization
  via std::call_once, reading CUZK_GPU_THREADS env var for pool size.
  This allows the Rust caller to set the env var before first GPU call.
- groth16_srs.cuh: Update all pool references to use get_groth16_pool()
- config.rs: Add gpus.gpu_threads field (default 0 = all CPUs)
- daemon main.rs: Configure rayon global pool from synthesis.threads,
  set CUZK_GPU_THREADS from gpus.gpu_threads before engine start
- Cargo.toml: Add rayon dependency to cuzk-daemon
- cuzk.example.toml: Document thread isolation strategy
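
A Rust analogue of the C++ std::call_once pattern described above, assuming the same CUZK_GPU_THREADS contract (0 or unset means all CPUs):

```rust
use std::sync::OnceLock;

static GPU_POOL_THREADS: OnceLock<usize> = OnceLock::new();

fn gpu_pool_threads() -> usize {
    // Read the env var exactly once, on first use of the pool.
    *GPU_POOL_THREADS.get_or_init(|| {
        std::env::var("CUZK_GPU_THREADS")
            .ok()
            .and_then(|v| v.parse::<usize>().ok())
            .filter(|&n| n > 0) // 0 means "use all CPUs"
            .unwrap_or_else(|| {
                std::thread::available_parallelism()
                    .map(|n| n.get())
                    .unwrap_or(1)
            })
    })
}
```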

Benchmark results (PoRep C2 32G, 96C/192T + RTX 5070 Ti):
  Baseline (sequential, no isolation):      46.1s/proof, 70.9% GPU util
  Parallel c=2, j=2, no isolation:          46.0s/proof, 81.9% GPU util
  Parallel c=2, j=2, rayon=192, gpu=32:     44.9s/proof, 76.9% GPU util
  Parallel c=2, j=3, rayon=192, gpu=32:     42.8s/proof (best, +7.2%)

Thread isolation provides modest improvement (~2-3%). The dominant factor
remains synthesis thread scalability: 2 syntheses sharing the rayon pool
each get ~96 effective threads, inflating synth from 39s to 45-47s.
Higher pipeline fill (j=3) is more effective than thread partitioning.
Proposal 7 replaces the thundering-herd synthesis pattern (all 10
partitions start/finish simultaneously) with a synth worker pool that
processes partitions individually and feeds them to the GPU one at a time.

Key design points:
- 20 synth workers (configurable) each synthesize 1 partition (~29s)
- Workers submit to engine GPU channel; block if full (backpressure)
- GPU proves each partition with num_circuits=1 (b_g2_msm: 0.4s vs 25s)
- ProofAssembler in JobTracker accumulates partitions per job_id
- Cross-sector overlap: next sector's synth starts on free workers

Expected impact: 42.8s/proof → ~30s/proof steady-state (GPU-limited),
~100% GPU utilization, zero inter-sector GPU idle time.

~110 net new lines of code, primarily in engine.rs.
Implement the Phase 7 architecture from c2-optimization-proposal-7.md:
dispatches individual PoRep partitions as independent work units through
the engine's synthesis→GPU pipeline, eliminating the thundering-herd
pattern and enabling cross-sector pipelining.

Key changes:
- SynthesizedJob: add partition_index, total_partitions, parent_job_id
  fields for per-partition routing
- PartitionedJobState: new struct tracking per-job ProofAssembler,
  accumulated timings, and failure state
- PartitionWorkItem: work unit for spawn_blocking synthesis workers
- JobTracker: add assemblers map for in-progress partitioned proofs
- process_batch(): new Phase 7 dispatch path when partition_workers > 0
  and single-sector PoRep C2 — parses C1 once, registers assembler,
  dispatches 10 spawn_blocking tasks gated by partition_semaphore,
  returns immediately (non-blocking)
- GPU worker: partition-aware result routing — routes partition proofs
  to ProofAssembler, delivers final proof when all partitions complete,
  calls malloc_trim(0) after each partition to release memory
- Error handling: failed flag on PartitionedJobState, synthesis/GPU
  failure propagation, skip work for already-failed jobs
- Config: add synthesis.partition_workers (default 20), partition
  semaphore limiting concurrent synthesis workers
- Phase 6 slotted pipeline retained as fallback (partition_workers=0,
  slot_size>0)
- ParsedC1Output and parse_c1_output made pub for engine access
- synthesize_partition made pub for engine dispatch

Expected steady-state: 42.8s/proof → ~30s/proof (GPU-limited), ~100%
GPU utilization, zero cross-sector GPU idle gaps. Per-partition GPU
calls use num_circuits=1, making b_g2_msm 0.4s instead of 25s.
Proposal to eliminate per-partition GPU idle gaps by overlapping one
worker's CPU preamble/epilogue with another worker's CUDA kernel
execution. Two GPU workers per physical GPU share a fine-grained
mutex that brackets only the CUDA kernel region inside
generate_groth16_proofs_c.

Key findings:
- The static mutex in groth16_cuda.cu covers the entire function
  (~3.5s), but actual CUDA kernel time is ~2.1s. The remaining
  ~1.3s is CPU work (preprocessing, b_g2_msm, epilogue) that
  could overlap with the next partition's GPU execution.
- The sppark semaphore_t is a counting semaphore that latches
  notify() before wait(), confirming safe barrier semantics for
  the proposed restructuring.
- Recommended approach: pass mutex pointer from Rust through FFI,
  acquire before per-GPU thread launch, release after per-GPU
  thread join, leaving b_g2_msm and epilogue outside the lock.

Estimated impact: GPU efficiency ~64% → ~98%, throughput ~3-10%
improvement on top of Phase 7.
Narrow the C++ static mutex in generate_groth16_proofs_c to cover only
the CUDA kernel region (NTT+MSM, batch additions, tail MSMs). CPU
preprocessing and b_g2_msm now run outside the lock, allowing two GPU
workers to interleave: one does CPU work while the other runs CUDA.

Changes across 7 files (~195 lines):

- groth16_cuda.cu: Remove static mutex, add std::mutex* parameter,
  acquire lock before per-GPU thread launch, release after per-GPU
  join (before prep_msm_thread join). Add create/destroy_gpu_mutex
  C helpers for FFI allocation.

- supraseal-c2/lib.rs: Add gpu_mtx parameter to FFI decl and both
  generate_groth16_proof wrappers. Export alloc/free_gpu_mutex.

- bellperson supraseal.rs: Add GpuMutexPtr type, SendableGpuMutex
  wrapper, alloc/free helpers. Thread gpu_mutex through
  prove_from_assignments. Legacy callers pass null (fallback mutex).

- pipeline.rs: Thread GpuMutexPtr through gpu_prove(). Internal
  callers pass null_mut() for backward compatibility.

- engine.rs: Create one C++ mutex per GPU via alloc_gpu_mutex().
  Spawn gpu_workers_per_device workers per GPU (default 2), each
  sharing the same mutex address (as usize for Send safety).

- config.rs: Add gpus.gpu_workers_per_device (default 2).

Benchmark results (RTX 5070 Ti, 96-core Zen4, partition_workers=20):

  Single proof:  69.3s wall (GPU efficiency: 100.0% — zero idle gaps)
  Throughput c=5 j=3: 44.0s/proof (Phase 7: 50.7s → 13.2% improvement)
  Throughput c=5 j=2: 49.5s/proof (Phase 7: 59.8s → 17.2% improvement)

  partition_workers=30 regresses to 60.4s/proof due to CPU contention
  from 30 simultaneous synthesis workers starving GPU preprocessing.
Document three new phases of the pipelined SNARK proving engine:

- Phase 6: Pipelined partition proving (slot-based, 62x b_g2_msm speedup)
- Phase 7: Engine-level per-partition pipeline (cross-sector overlap)
- Phase 8: Dual-worker GPU interlock (100% GPU utilization)

Key benchmark findings:
- Optimal partition_workers=10-12 on 96-core machine (43.5s/proof → 37.4s)
- System is perfectly GPU-bound: throughput = serial CUDA kernel time
  (10 partitions × 3.75s = 37.5s vs measured 37.4s/proof)
- Cross-sector GPU transitions are seamless (<50ms after warmup)
- synthesis_concurrency>1 provides no benefit (synthesis already overlapped)

Update file references and related documents for Phases 6-8.
Two changes to reduce GPU SM idle time caused by PCIe transfers
inside the GPU mutex:

1. Pre-stage a/b/c polynomials (6 GiB) outside the mutex via
   cudaHostRegister + async upload on a dedicated copy stream.
   Overlaps with the other worker's CUDA kernels.

2. Deferred batch sync in Pippenger MSM: double-buffer host-side
   bucket results so GPU never waits for CPU to process the
   previous batch. Eliminates 8+ per-batch idle gaps per MSM.

Includes full PCIe transfer inventory (23.6 GiB HtoD per partition)
and expected 4-9% throughput improvement over Phase 8.
…uploads

- Pre-stage a/b/c polynomial uploads using cudaHostRegister + async DMA
  before GPU mutex acquisition (host pinning) and after (device alloc + upload)
- Memory-aware allocation: query cudaMemGetInfo after pool trim, only pre-stage
  if full 12 GiB (d_a + d_bc) fits with 512 MiB safety margin
- Double-buffered deferred batch sync in Pippenger MSM (sppark submodule):
  per-batch sync deferred to next iteration, overlapping DtoH with compute
- Early d_bc free inside per_gpu thread after NTT phase completes
- GPU resources cleaned up before mutex release, host pages unregistered after

Results (gw=1, pw=10, c=3, j=1):
- 32.1s/proof avg (14.2% improvement over Phase 8 baseline 37.4s)
- ntt_msm_h_ms: 2430ms -> 690ms (-71.6%)
- gpu_total_ms: 3746ms -> 1450ms (-61.3%)

gw=2 shows regression (41.0s) due to cudaDeviceSynchronize + pool trim
serialization — needs further investigation.
Add per-stage timing to prestage setup: sync_ms, trim_ms, alloc_ms, upload_ms.

Key findings with c=15 j=15 gw=1:
- Pre-staging overhead: 18ms avg (negligible - PCIe gen5 is fast)
- GPU kernels: 1824ms avg/partition
- CPU critical path (prep_msm + b_g2_msm): 2393ms avg/partition
- CPU is the bottleneck, not GPU — DDR5 bandwidth wall
  with 10 concurrent synthesis workers competing for memory
- Throughput: 41.3s/proof (steady-state)
- c=30 j=20 causes OOM/crash from memory pressure
Phase 9 cuts GPU kernel time 51% (3.7s→1.8s/partition) but steady-state
throughput only improves 14% (37.4→32.1s in isolation) because CPU
preprocessing (prep_msm + b_g2_msm = 2.4s/partition) is now the critical
path. At high concurrency, 10 synthesis workers saturate 8-channel DDR5
bandwidth, slowing CPU MSM operations 12-27% and limiting throughput to
~41s/proof.
Phase 10 (two-lock GPU interlock) was implemented, tested, and abandoned:
- 16 GB VRAM too small for 2 workers' pre-staged buffers
- CUDA memory APIs are device-global, serializing across streams
- Phase 9 already hides b_g2_msm behind GPU lock release

Phase 11 design spec identifies 3 sources of throughput degradation
(32.1s isolation → 38.0s at c=20 j=15) and proposes 3 interventions:
1. Serialize async_dealloc to bound TLB shootdown storms
2. Reduce groth16_pool to 32 threads to cut L3 thrashing
3. Memory-bandwidth throttle during b_g2_msm via shared atomic

Also reverts groth16_cuda.cu Phase 10 timing instrumentation back to
Phase 9 state.
Three interventions to reduce CPU memory subsystem contention at high
concurrency (c=20 j=15):

1. Serialize async_dealloc threads (static mutex in C++ and Rust) to
   prevent concurrent munmap() TLB shootdown storms. Alone: negligible.

2. Reduce groth16_pool from 192 to 32 threads (gpu_threads=32 config).
   Cuts b_g2_msm L3 cache footprint from ~1.1 GiB to ~192 MB. b_g2_msm
   slows from 0.5s to 1.7s but runs outside GPU lock. Best result:
   36.7s/proof (3.4% improvement over Phase 9 baseline of 38.0s).

3. Memory-bandwidth throttle: global AtomicI32 flag set by C++ around
   b_g2_msm, checked by Rust SpMV every 64 chunks with yield_now().
   No additional gain over Intervention 2 alone.

Also tested gw=3 (37.2s) and gw=4 (37.4s) — both worse due to CPU
contention from additional GPU workers.

Optimal config: gw=2, pw=10, gpu_threads=32 → 36.7s/proof.
Decouple b_g2_msm CPU computation from the GPU worker loop so the GPU
worker can pick up the next synthesized partition ~1.7s faster. The C++
generate_groth16_proofs_c is refactored into start (returns pending
handle after GPU lock release) + finalize (joins b_g2_msm, runs
epilogue). GPU workers spawn a separate tokio finalizer task and
immediately loop back for the next job.

Key changes:
- C++ groth16_pending_proof struct holds all shared state on the heap
- generate_groth16_proofs_start_c / finalize_groth16_proof_c split API
- Fix use-after-free: prep_msm_thread now reads provers_owned (heap
  copy) instead of the stack parameter that goes out of scope
- Rust FFI: start_groth16_proof, finish_groth16_proof, drop_pending_proof
- Bellperson: PendingProofHandle<E>, prove_start(), finish_pending_proof()
- Pipeline: gpu_prove_start() / gpu_prove_finish(), PendingGpuProof alias
- Engine: GPU worker restructured with spawned finalizer task; extracted
  process_partition_result() and process_monolithic_result() helpers
- SynthesisCapacityHint struct added (was referenced but undefined)
- Removed unused PR generic from start_groth16_proof FFI

Benchmark (gw=2 pw=10 gt=32, c=20 j=15): 37.1s/proof throughput
(vs 38.0s Phase 11 baseline, ~2.4% improvement).
…re fix

Three improvements on top of the Phase 12 split API (99c31c2):

1. Early a/b/c free: After prove_start returns (GPU done with NTT+MSM),
   clear prover.a/b/c evaluation vectors (~12 GiB per partition). Only
   density bitvecs + assignment data are kept for background b_g2_msm.

2. Channel capacity auto-scaling: Size the synthesis→GPU channel to
   max(synthesis_lookahead, partition_workers) instead of hardcoded 1.
   Completed syntheses drain into the channel buffer without blocking,
   preventing synthesis output pile-up on send().

3. Partition permit held through send: The partition semaphore permit
   is now held until AFTER the channel send succeeds (not just through
   synthesis). With channel capacity = pw, sends are non-blocking so
   this adds zero latency. Bounds total in-flight synthesis outputs
   to partition_workers, preventing unbounded memory growth.

Also adds buffer flight counters (atomic, tracing::debug) for memory
diagnostics and converts eprintln to log::debug in bellperson dealloc.

Benchmark results (20 proofs, j=20, gw=2, gt=32):
  pw=10: 38.5s/proof, 321 GiB peak RSS (was 367 GiB)
  pw=12: 37.7s/proof, 400 GiB peak RSS (was OOM at 668 GiB!)
  pw=14: 37.8s/proof, 457 GiB peak RSS
  pw=16: 38.4s/proof, 510 GiB peak RSS

Optimal config: pw=12 — best throughput with bounded memory.
Document Phase 12 split API and memory backpressure in cuzk-project.md:
- Split API architecture, use-after-free fix, early a/b/c free
- Memory backpressure (channel auto-scaling, permit-through-send)
- Buffer flight counters, memory budget analysis

Add low-memory benchmark sweep (pw=1/2/5/7/10/12 × gw=1/2):
- Memory scales as ~69 + pw×20 GiB (measured)
- 128 GiB: pw=2 gw=1 → 110 GiB peak, 152s/proof
- 256 GiB: pw=7 gw=1 → 208 GiB peak, 53s/proof
- 384 GiB: pw=10 gw=2 → 271 GiB peak, 43s/proof
- gw=2 adds no benefit below pw=10 (GPU synthesis-starved)

Update cuzk.example.toml with measured RAM-tier recommendations,
optimal defaults (gpu_threads=32, partition_workers=12), and
guidance for gw=1 vs gw=2 based on partition worker count.
Wire the cuzk persistent GPU SNARK proving daemon into Curio's harmony
task scheduler for PoRep C2, SnapDeals Prove, and PSProve tasks.

When configured, Curio delegates SNARK computations to cuzk over gRPC
instead of spawning per-proof child processes via ffiselect. Vanilla
proofs are still generated locally (require sector data on disk), then
sent to the daemon for GPU proving, then verified locally.

Go integration:
- lib/cuzk/: gRPC client wrapper (generated proto + client.go)
- lib/ffi/cuzk_funcs.go: PoRepSnarkCuzk, ProveUpdateCuzk on SealCalls
- deps/config/types.go: CuzkConfig (Address, MaxPending, ProveTimeout)
- cmd/curio/tasks/tasks.go: create cuzk.Client, pass to task constructors
- tasks/seal, tasks/snap, tasks/proofshare: cuzk branch in Do/CanAccept/TypeDetails

Build system:
- Makefile: add 'make cuzk' target (cargo build, requires nvcc+cargo)
- Deliberately not in BINS/BUILD_DEPS so CI is unaffected
- install-cuzk/uninstall-cuzk targets, cargo clean in make clean

Vendored Rust forks (complete crates for cargo build):
- extern/bellpepper-core: full crate (was partially tracked)
- extern/supraseal-c2: full crate (was partially tracked)
- extern/bellperson, extern/cuzk: already fully tracked

When Cuzk.Address is empty (default), behavior is identical to before.
No impact on existing deployments.