
Conversation

@themighty1
Collaborator

This PR improves QS check performance (5x on native builds and 2.5x in the browser) by parallelizing the chi computation using a bootstrapped-chis approach: for each parallel lane, an independent chi is computed from the original chi.

Additionally, it fuses the chi computation with the existing parallel terms computation, avoiding the need to allocate the chis in memory.

Benchmarking the improvement

The benches were done with checked sizes of 200K, 400K, and 600K. Going higher gave no meaningful improvement, so I used just those three values.

In the native bench, "Elements per sec" means "AND gates per second".

The benches were run on this branch: https://github.com/themighty1/mpz/tree/feat/parallel_chis. The head of the branch corresponds to the current (pre-PR) approach; to get the new approach, check out commit 498a3e9.

Before this PR

native bench
cargo bench -p mpz-zk-core --bench prover --features rayon -- check

prover/check/200K       time:   [4.2744 ms 4.2961 ms 4.3089 ms]
                        thrpt:  [47.530 Melem/s 47.671 Melem/s 47.914 Melem/s]
prover/check/400K       time:   [11.789 ms 11.813 ms 11.844 ms]
                        thrpt:  [34.043 Melem/s 34.133 Melem/s 34.202 Melem/s]
prover/check/600K       time:   [19.133 ms 19.176 ms 19.213 ms]
                        thrpt:  [31.312 Melem/s 31.373 Melem/s 31.442 Melem/s]

browser bench
cargo run --release --bin wasm-bench-runner -- -g zk_prover_core/check -c 8 --iterations 5 --samples 1

=== zk_prover_core ===
Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      206.17       41234.00       4.97M
zk_prover_core/check_400k                      372.35       74469.00       5.41M
zk_prover_core/check_600k                      564.65      112930.00       5.33M

After this PR

native bench
cargo bench -p mpz-zk-core --bench prover --features rayon -- check

prover/check/200K       time:   [1.0239 ms 1.0318 ms 1.0381 ms]
                        thrpt:  [197.29 Melem/s 198.49 Melem/s 200.01 Melem/s]
prover/check/400K       time:   [5.2671 ms 5.3024 ms 5.3709 ms]
                        thrpt:  [75.072 Melem/s 76.041 Melem/s 76.550 Melem/s]
prover/check/600K       time:   [4.0322 ms 4.1950 ms 4.3712 ms]
                        thrpt:  [137.63 Melem/s 143.41 Melem/s 149.20 Melem/s]

browser bench
cargo run --release --bin wasm-bench-runner -- -g zk_prover_core/check -c 8 --iterations 10 --samples 3

Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      170.62       17028.67      12.03M
zk_prover_core/check_400k                      297.88       31230.33      12.91M
zk_prover_core/check_600k                      461.40       48697.50      12.35M

Since 200K performs so well on native builds, I also modified the default config accordingly.

@themighty1 themighty1 requested a review from sinui0 December 23, 2025 09:51
Comment on lines 64 to 85
// Bootstrap 16 values via squaring
let mut bootstrapped = [Block::ZERO; 16];
let mut current = chi;
for b in &mut bootstrapped {
    *b = current;
    current = current.gfmul(current);
}

// Hash each to get independent starting points
let mut starts = [Block::ZERO; 16];
for (i, boot) in bootstrapped.iter().enumerate() {
    let mut hasher = Hasher::new();
    hasher.update(&boot.to_bytes());
    hasher.update(&(i as u64).to_le_bytes());
    hasher.update(&(segment_size as u64).to_le_bytes());
    let hash = hasher.finalize();
    starts[i] =
        Block::try_from(&hash.as_bytes()[..16]).expect("hash should be at least 16 bytes");
}

starts
}
Collaborator

This seems unnecessary; we can just use chi to seed a PRG and then use it to derive the evaluation points fully independently. See the ChaCha docs and set_word_pos.

Although, because ChaCha produces 64-byte blocks, we probably want to divide the terms into chunks of 4 to avoid wasting cycles.
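A minimal sketch of what that could look like, assuming rand_chacha's ChaCha12Rng and assuming the 16-byte chi has been expanded into a 32-byte seed; the helper name derive_lane_rng and its parameters are illustrative, not from the PR. Each chi is 16 bytes, i.e. 4 ChaCha words, so a lane can jump straight to its slice of a single keystream:

use rand_chacha::{rand_core::SeedableRng, ChaCha12Rng};

// Hypothetical helper: position one shared keystream at the start of `lane`'s segment.
fn derive_lane_rng(seed: [u8; 32], lane: u128, segment_size: u128) -> ChaCha12Rng {
    let mut rng = ChaCha12Rng::from_seed(seed);
    // 16 bytes per chi = 4 u32 words, so lane k starts at word k * segment_size * 4.
    rng.set_word_pos(lane * segment_size * 4);
    rng
}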

@sinui0
Collaborator

sinui0 commented Dec 23, 2025

Also, I didn't test it, but watch out for non-determinism when using Rayon: it uses work stealing, which may cause ordering to differ across machines. This should be fine with the PRG approach if you set the PRG position based on an enumerator.

@themighty1
Collaborator Author

Good call. I added PRGs and we got a 30% speedup in wasm.

=== zk_prover_core ===
Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      271.41       13606.45      15.05M
zk_prover_core/check_400k                      516.28       25710.42      15.68M
zk_prover_core/check_600k                      770.23       38695.18      15.55M
zk_prover_core/check_800k                     1023.33       51088.32      15.66M
zk_prover_core/check_1m                       1241.31       61924.07      16.23M
zk_prover_core/check_10m                     12004.62      602480.85      16.60M

although the native build took a 5% hit:

prover/check/200K       time:   [1.0790 ms 1.0822 ms 1.0871 ms]
                        thrpt:  [188.38 Melem/s 189.24 Melem/s 189.80 Melem/s]

Regarding the Rayon non-determinism: here all values get XOR-reduced, and since XOR is commutative and associative, the reduction order doesn't affect the result.
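A tiny illustration of that point (not PR code), using rayon:

use rayon::prelude::*;

// XOR folding is commutative and associative, so work stealing can only
// change the order in which pieces are combined, never the final value.
fn xor_all(vals: &[u64]) -> u64 {
    vals.par_iter().copied().reduce(|| 0, |a, b| a ^ b)
}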

@themighty1
Collaborator Author

@sinui0 , ready for review

@themighty1 themighty1 requested a review from sinui0 December 24, 2025 13:18
Collaborator

@sinui0 left a comment

Some suggestions, which also apply to the verifier side.

Comment on lines 105 to 109
// Computation with pre-split lanes.
const PARALLELISM: usize = 16;
let n = macs.len();
let segment_size = n.div_ceil(PARALLELISM);
let starts = Self::compute_chi_starts(chi);
Collaborator

A couple of comments:

  • Hardcoding a "parallelism" number doesn't seem right. The goal is to saturate the CPU cores while also optimizing for cache efficiency. So we should choose a chunk size, e.g. 1024, which is large enough to saturate caches but small enough to take advantage of work stealing.
  • We don't need multiple PRGs; ChaCha12 already has a set_stream API for producing independent streams.
Suggested change
// Computation with pre-split lanes.
const PARALLELISM: usize = 16;
let n = macs.len();
let segment_size = n.div_ceil(PARALLELISM);
let starts = Self::compute_chi_starts(chi);
const CHUNK_SIZE: usize = 1024;
let seed = *transcript.finalize().as_bytes();
let rng = ChaCha12Rng::from_seed(seed);
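For reference, a standalone sketch (not PR code) of the set_stream behavior mentioned above: two generators with the same seed but different stream numbers produce independent keystreams.

use rand_chacha::{
    rand_core::{RngCore, SeedableRng},
    ChaCha12Rng,
};

fn main() {
    let mut a = ChaCha12Rng::from_seed([7u8; 32]);
    let mut b = a.clone();
    b.set_stream(1); // same key, independent keystream
    assert_ne!(a.next_u64(), b.next_u64());
}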

Comment on lines 111 to 128

let process_segment = |segment: &[Triple], mut rng: ChaCha12Rng| {
    use rand_chacha::rand_core::RngCore;

    let mut u_acc = Block::ZERO;
    let mut v_acc = Block::ZERO;

    for &triple in segment {
        let mut chi_bytes = [0u8; 16];
        rng.fill_bytes(&mut chi_bytes);
        let chi = Block::from(chi_bytes);

        let (u, v) = compute_terms(triple, chi);
        u_acc ^= u;
        v_acc ^= v;
    }

    (u_acc, v_acc)
};
Collaborator

Might be able to prevent a copy with this

Suggested change

let process_segment = |segment: &[Triple], mut rng: ChaCha12Rng| {
    use rand_chacha::rand_core::RngCore;

    let mut u_acc = Block::ZERO;
    let mut v_acc = Block::ZERO;

    for &triple in segment {
        let mut chi_bytes = [0u8; 16];
        rng.fill_bytes(&mut chi_bytes);
        let chi = Block::from(chi_bytes);

        let (u, v) = compute_terms(triple, chi);
        u_acc ^= u;
        v_acc ^= v;
    }

    (u_acc, v_acc)
};

let process_segment = |rng: &mut ChaCha12Rng, segment: &[Triple]| {
    use rand_chacha::rand_core::RngCore;

    let mut u_acc = Block::ZERO;
    let mut v_acc = Block::ZERO;
    let mut chi = Block::ZERO;

    for &triple in segment {
        rng.fill_bytes(chi.as_bytes_mut());

        let (u, v) = compute_terms(triple, chi);
        u_acc ^= u;
        v_acc ^= v;
    }

    (u_acc, v_acc)
};

.map(|(macs, chi)| compute_terms(macs, chi))
.par_chunks(segment_size)
.zip(starts.into_par_iter())
.map(|(segment, chi_start)| process_segment(segment, chi_start))
Collaborator

@sinui0 commented Dec 26, 2025

map_with will clone the PRG as needed. We ensure the PRG stream is always independent (and deterministic) by keying it to the chunk ID.

Suggested change
.map(|(segment, chi_start)| process_segment(segment, chi_start))
.enumerate()
.map_with(
    rng,
    |rng, (stream_id, segment)| {
        rng.set_stream(stream_id as u64);
        process_segment(rng, segment)
    }
)
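Putting the pieces together, here is a self-contained sketch of the pattern (not the PR's actual code: Block is stood in for by u128, the per-triple work is a placeholder, and the function name is made up). It keys each chunk's keystream to its chunk index and rewinds the word position, so the bytes drawn for a chunk never depend on which worker runs it, and the final XOR reduce is order-insensitive:

use rand_chacha::{
    rand_core::{RngCore, SeedableRng},
    ChaCha12Rng,
};
use rayon::prelude::*;

const CHUNK_SIZE: usize = 1024;

fn parallel_check_sum(items: &[u128], seed: [u8; 32]) -> u128 {
    let rng = ChaCha12Rng::from_seed(seed);
    items
        .par_chunks(CHUNK_SIZE)
        .enumerate()
        .map_with(rng, |rng, (stream_id, chunk)| {
            // Key the stream to the chunk index and rewind the position so the
            // bytes drawn for a chunk are the same regardless of work stealing.
            rng.set_stream(stream_id as u64);
            rng.set_word_pos(0);
            let mut acc = 0u128;
            let mut bytes = [0u8; 16];
            for &item in chunk {
                rng.fill_bytes(&mut bytes);
                let chi = u128::from_le_bytes(bytes);
                // Placeholder for compute_terms(triple, chi).
                acc ^= item.wrapping_mul(chi);
            }
            acc
        })
        // XOR is commutative and associative, so the reduction order is irrelevant.
        .reduce(|| 0, |a, b| a ^ b)
}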

@themighty1
Collaborator Author

themighty1 commented Dec 29, 2025

I applied the latest suggested pattern and tested various segment sizes in my branch (I needed a separate branch because it has benches for various batch_size values, which mpz doesn't have): https://github.com/themighty1/mpz/tree/perf/optimized_check_chunking

Bottom line:

  • with the latest pattern we lost a few more percentage points in the native bench (for a total of 15% compared to my initial approach in this PR)
  • SEGMENT_SIZE == 512 is optimal for both native and wasm builds.
  • batch_size == 200K is optimal for native and 800K-1M is optimal for wasm.

All the native benches below were run with:

cargo bench -p mpz-zk-core --bench prover --features rayon -- check

All the wasm benches were run with the following command (changing the suffix to 400k, 600k, etc.):

cargo run --release --bin wasm-bench-runner -- -b zk_prover_core/check_200k -c 8 --iterations 10 --samples 20

Segment size 256

// native bench

prover/check/200K       time:   [1.1760 ms 1.1854 ms 1.1951 ms]
                        thrpt:  [171.37 Melem/s 172.77 Melem/s 174.15 Melem/s]

// browser bench

=== zk_prover_core ===
Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      135.42       13383.10      15.30M
zk_prover_core/check_400k                      255.12       25830.95      15.61M
zk_prover_core/check_600k                      361.73       36295.55      16.58M
zk_prover_core/check_800k                      480.04       48402.00      16.53M
zk_prover_core/check_1m                        597.90       59850.10      16.79M

Segment size 512

// native bench

prover/check/200K       time:   [1.1714 ms 1.1748 ms 1.1805 ms]
                        thrpt:  [173.49 Melem/s 174.33 Melem/s 174.84 Melem/s]

// browser bench

=== zk_prover_core ===
Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      135.18       13584.40      15.08M
zk_prover_core/check_400k                      249.83       25220.17      15.99M
zk_prover_core/check_600k                      364.98       36880.28      16.31M
zk_prover_core/check_800k                      477.86       47902.18      16.70M
zk_prover_core/check_1m                        596.65       59848.50      16.79M

Segment size 1024

// native bench

prover/check/200K       time:   [1.1973 ms 1.2036 ms 1.2101 ms]
                        thrpt:  [169.24 Melem/s 170.16 Melem/s 171.05 Melem/s]

// browser bench

=== zk_prover_core ===
Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      135.51       13484.58      15.19M
zk_prover_core/check_400k                      248.35       25221.50      15.99M
zk_prover_core/check_600k                      369.40       37581.78      16.01M
zk_prover_core/check_800k                      481.18       48139.70      16.62M
zk_prover_core/check_1m                        594.15       59629.20      16.85M

Benchmarking even higher segment_size values showed that throughput only got worse for both native and wasm builds.

Segment size 12000

// native bench

prover/check/200K       time:   [1.5742 ms 1.5786 ms 1.5839 ms]
                        thrpt:  [129.30 Melem/s 129.74 Melem/s 130.10 Melem/s]

// browser bench

=== zk_prover_core ===
Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      172.70       17294.93      11.84M
zk_prover_core/check_400k                      281.62       28403.13      14.20M
zk_prover_core/check_600k                      394.71       39889.83      15.08M
zk_prover_core/check_800k                      520.02       52072.68      15.36M
zk_prover_core/check_1m                        627.75       62817.25      16.00M

@themighty1
Collaborator Author

@sinui0 , ready for review.
We need to make sure our wasm build of QS uses 800K as the batch size. I assume this will be handled outside of mpz via a config.
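For illustration, one hypothetical way such a config could pick the batch size per build target (not how the PR handles it; the constant name is made up):

// Hypothetical sketch: select the QS batch size per build target.
#[cfg(target_arch = "wasm32")]
const QS_BATCH_SIZE: usize = 800_000;
#[cfg(not(target_arch = "wasm32"))]
const QS_BATCH_SIZE: usize = 200_000;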

@themighty1 themighty1 requested a review from sinui0 December 29, 2025 11:58
@themighty1 themighty1 merged commit ddb94fd into dev Jan 2, 2026
3 checks passed
@themighty1 themighty1 deleted the perf/bootstrapped_chis branch January 2, 2026 08:32