Skip to content

Conversation

@p4u
Copy link
Contributor

@p4u p4u commented Dec 19, 2025

Description

This PR addresses critical stability and memory issues in the ICICLE Groth16 backend, enabling reliable proof generation for large circuits while drastically reducing GPU memory consumption.

Problems Solved

  1. Illegal Memory Access Crashes: Large MSMs (especially BLS12-377 G2 with >4M constraints) were causing "illegal memory access" errors due to bugs in the C++ backend's internal chunking logic.

  2. Race Conditions & Deadlocks: Concurrent MSM execution was causing data races, deadlocks, and occasional proof corruptions due to instability in the C++/CUDA backend's concurrency handling.

  3. Excessive GPU Memory Usage: Pre-loading all proving key vectors at initialization consumed 80-90% of available GPU memory, preventing multiple circuits from being loaded and limiting maximum circuit size.

  4. Memory Leaks: Device memory was not being freed promptly, causing accumulation over multiple proofs and eventual OOM errors.

Solution Overview

  • Commit 1: Go-side MSM Chunking
  • Commit 2: Sequential MSM Execution
  • Commit 3: On-Demand Vector Loading

Impact

Stability:

  • Eliminates "illegal memory access" crashes on large circuits
  • Prevents race conditions and deadlocks
  • Stable proof generation across all supported curves

Performance:

  • Proof times remain similar (PCIe transfer overhead is negligible vs MSM computation)
  • Sequential execution trades minimal performance for significant stability gains
  • Verified on RTX 5070 Ti (16GB):
    • BLS12-377 (3.5M constraints): ~3.5s
    • BW6-761 (4M constraints): ~6s
    • BN254 (16M constraints): ~10s

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How has this been tested?

  • Integration Tests: Full test suite with custom circuits
  • Memory Stability: Sequential proof generation with circuit switching (3 circuits in loop)
  • Proof Verification: All generated proofs verified successfully
  • Long-running Tests: Multi-hour stress tests without memory leaks or crashes (TODO)

Test Environment:

  • NVIDIA RTX 5070 Ti (16GB VRAM)
  • CUDA 12.x
  • Ubuntu Linux
  • Multiple circuit types (BN254, BLS12-377, BLS12-381, BW6-761)

Checklist:

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (docs update can follow if needed)
  • I have added tests that prove my fix is effective or that my feature works (existing tests now pass reliably)
  • I did not modify files generated from templates (only modified template, regenerated all curves)
  • golangci-lint does not output errors locally
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Additional Notes

This PR prioritizes stability and reliability over maximum parallelism. The C++/CUDA backend's concurrency bugs made parallel MSM execution unreliable, leading to occasional proof corruptions and crashes.

  1. Predictable behavior across all circuit sizes and curves
  2. Efficient memory usage enabling production deployments
  3. Maintainable code with simplified control flow

Note

Improves stability and drastically reduces GPU memory usage in ICICLE Groth16 across BN254, BLS12-377/381, BW6-761.

  • Add Go-side MSM chunking for G1/G2 with memory-aware sizing, window tuning via ICICLE_MSM_MAX_WINDOW, and a safe chunk cap; replace C++ internal chunking
  • Serialize all GPU work per device with a global mutex and per-device NTT domain tracking/release to avoid races and witness corruption
  • Switch to on-demand loading of proving key vectors; remove preloading of G1.{A,B,K,Z}, G2.B, and commitment keys; free device memory immediately after use
  • Add optional PinToGPU to pre-load and reuse vectors, plus FreeGPUResources() to release pinned buffers and DenDevice
  • Rework commitment MSMs/POK and proof steps to use chunked helpers, temporary device views, and explicit frees; add runtime.KeepAlive to prevent GC-related copy issues
  • Introduce ICICLE_DEBUG logging, minor config/cleanup helpers, and safer defaults across curves

Written by Cursor Bugbot for commit 373b7ed. This will update automatically on new commits. Configure here.

p4u added 3 commits December 16, 2025 17:00
…d fix data race

- Implemented msmChunkedG1/G2 in Go to handle large MSMs by slicing inputs. This bypasses the internal C++ multi-chunking logic which was crashing on large BLS12-377 G2 MSMs (illegal memory access).
- Enforced maximum chunk size of 2^18 (262144) in Go wrapper to ensure stability across all curves.
- Aligned memory safety margins (bucketFactor=4.0, reduced=0.7) and indices estimation (4x) with C++ backend requirements.
- Fixed a data race in Prove where the function returned before the G2 MSM computation was complete.
- Updated generator template to reflect these changes for all curves.
- Removed parallel RunOnDevice calls for AR1, BS1, and BS2, consolidating them into a single sequential execution block.
- Removed channel-based synchronization (chArDone, chBs1Done, chBs2Done) in favor of direct sequential ordering to eliminate deadlocks and simplify control flow.
- Ensured computeKRS runs after all other MSMs within the same device context, preventing potential backend concurrency issues.
Drastically reduce GPU memory usage by loading proving key vectors only when
needed during proof generation, instead of pre-loading everything at initialization.

- Modified setupDevicePointers() to skip bulk loading of large vectors
- Added on-demand loading helpers:
  * loadG1(): loads with Montgomery→Standard conversion (for A,B,K,Z)
  * loadG1Raw(): preserves Montgomery form (for Commitment Keys)
  * loadG2(): loads with Montgomery→Standard conversion (for G2.B)
- Updated Prove() to load vectors on-demand in compute functions:
  * computeAR1(): loads G1.A, uses for MSM, frees immediately
  * computeBS1(): loads G1.B, uses for MSM, frees immediately
  * computeBS2(): loads G2.B, uses for MSM, frees immediately
  * computeKRS(): loads G1.K and G1.Z, uses for MSMs, frees immediately
- Fixed memory leak: privateCommittedValuesDevice now freed after POK computation
- Added nil pointer guards on all .Free() calls to prevent panics
- Commitment keys loaded on-demand during hint execution

Impact:
- Memory usage peak reduced (80-90%)
- Enables larger circuits and more concurrent proof operations
- Maintains correctness with proper Montgomery form handling
… scope

Concurrent calls to Prove() could simultaneously initialize pk.deviceInfo,
causing double initialization and GPU memory corruption.

Add setupMu sync.Mutex to ProvingKey struct and protect the entire
setupDevicePointers() function with Lock/Unlock. This ensures atomic
check-and-initialize semantics.

Fix race condition where the nttDomainMu mutex was released before
ReleaseDomain() and InitDomain() operations completed, allowing concurrent
threads to corrupt domain state.
…ry leaks

GPU Memory Leak in load functions: loadG1() and loadG2() now free GPU memory
before returning on Montgomery

NTT Domain Race Condition: added per-device RWMutex (nttDomainRWByDevice)
for NTT domain protection
@p4u
Copy link
Contributor Author

p4u commented Dec 19, 2025

I think it is now quite solid, thanks to cursor bugbot too

If true, the proving key is not uloaded from GPU memory, allowing
faster proving (win of 30-50 ms on a bls12-377 3M constraints circuit)
@p4u
Copy link
Contributor Author

p4u commented Dec 19, 2025

Added PinToGPU option for icicle proving keys. So the caller can choose to keep the Proving key in GPU memory, to avoid spending time on load/unload. On my tests, the win is about 30-50 ms for a bls12-377 3.5M constraint circuit. By default this option is set to false.

}

log.Info().Msg("All GPU resources freed")
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: FreeGPUResources may cause double-free when called multiple times

The FreeGPUResources documentation states "It is safe to call this multiple times" but the implementation doesn't support this contract. After calling Free() on each DeviceSlice, the pointers are not reset, so AsUnsafePointer() != nil may still return true on subsequent calls. Additionally, pk.deviceInfo is never set to nil after cleanup. This can lead to double-free corruption if the method is called more than once. The code needs to either reset the DeviceSlice values after freeing them or set pk.deviceInfo = nil at the end of the function.

Additional Locations (2)

Fix in Cursor Fix in Web

icicle_runtime.RunOnDevice(device, func(args ...any) {
for i := range pk.CommitmentKeys {
commitmentKeyBasisHost := icicle_core.HostSliceFromElements(pk.CommitmentKeys[i].Basis)
commitmentKeyBasisExpSigmaHost := icicle_core.HostSliceFromElements(pk.CommitmentKeys[i].BasisExpSigma)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: NTT domain unconditionally reinitialized for each ProvingKey

The InitDomain call is executed unconditionally for every ProvingKey setup, regardless of whether the domain is already initialized and large enough. The code tracks domain size via nttDomainMaxByDevice and only releases when a larger domain is needed (requestedDomainSize > currentMax), but then always calls InitDomain. When a smaller circuit is set up after a larger one, this calls InitDomain with a smaller generator on an already-initialized domain. Depending on ICICLE library behavior, this could panic or corrupt the NTT domain state, breaking proof generation for subsequent larger circuits. The InitDomain call needs to be guarded by the same condition that updates nttDomainMaxByDevice.

Additional Locations (2)

Fix in Cursor Fix in Web

Replace RWMutex with exclusive per-device mutex to prevent concurrent
CUDA stream access to shared NTT domain, fixing sporadic constraint
violations under load.
}

log.Info().Msg("All GPU resources freed")
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FreeGPUResources may double-free GPU memory on repeated calls

The FreeGPUResources documentation claims "It is safe to call this multiple times" but this is incorrect. After calling Free() on device slices like pk.PinnedG1Device.A, the pointer is not reset to nil/zero. Since pk.deviceInfo is also never set to nil, subsequent calls pass both the pk.deviceInfo == nil check and the AsUnsafePointer() != nil checks, resulting in double-free of already-freed GPU memory. This could cause crashes or memory corruption.

Additional Locations (2)

Fix in Cursor Fix in Web

commitmentKeysDeviceDone := make(chan struct{})
pk.CommitmentKeysDevice.Basis = make([]icicle_core.DeviceSlice, len(pk.CommitmentKeys))
pk.CommitmentKeysDevice.BasisExpSigma = make([]icicle_core.DeviceSlice, len(pk.CommitmentKeys))
icicle_runtime.RunOnDevice(device, func(args ...any) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NTT domain may be incorrectly reinitialized with smaller generator

InitDomain is called unconditionally even when requestedDomainSize <= currentMax. When a smaller circuit's proving key is loaded after a larger one, InitDomain is called with the smaller root of unity (rouIcicle). This could corrupt the existing larger NTT domain or cause incorrect NTT computations. The InitDomain call should be conditional on requestedDomainSize > currentMax or currentMax == 0.

Additional Locations (2)

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant