
Conversation

@themighty1
Collaborator

This PR improves QS check performance (5x on native builds and 2.5x in the browser) by parallelizing the chi computation using a bootstrapped-chis approach: for each parallel lane, an independent chi is computed from the original chi.

Additionally, it fuses the chi computation with the existing parallel terms computation, avoiding the need to allocate the chis in memory.

Benchmarking the improvement

The benches were done with checked sizes of 200K, 400K, and 600K. Going higher gave no meaningful improvement, so I used just those three values.

In the native bench, "Elements per sec" means "AND gates per second".

The benches were run on this branch: https://github.com/themighty1/mpz/tree/feat/parallel_chis. The head of the branch corresponds to the current (pre-PR) approach; to get the new approach, check out commit 498a3e9.

Before this PR

native bench
cargo bench -p mpz-zk-core --bench prover --features rayon -- check

prover/check/200K       time:   [4.2744 ms 4.2961 ms 4.3089 ms]
                        thrpt:  [47.530 Melem/s 47.671 Melem/s 47.914 Melem/s]
prover/check/400K       time:   [11.789 ms 11.813 ms 11.844 ms]
                        thrpt:  [34.043 Melem/s 34.133 Melem/s 34.202 Melem/s]
prover/check/600K       time:   [19.133 ms 19.176 ms 19.213 ms]
                        thrpt:  [31.312 Melem/s 31.373 Melem/s 31.442 Melem/s]

browser bench
cargo run --release --bin wasm-bench-runner -- -g zk_prover_core/check -c 8 --iterations 5 --samples 1

=== zk_prover_core ===
Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      206.17       41234.00       4.97M
zk_prover_core/check_400k                      372.35       74469.00       5.41M
zk_prover_core/check_600k                      564.65      112930.00       5.33M

After this PR

native bench
cargo bench -p mpz-zk-core --bench prover --features rayon -- check

prover/check/200K       time:   [1.0239 ms 1.0318 ms 1.0381 ms]
                        thrpt:  [197.29 Melem/s 198.49 Melem/s 200.01 Melem/s]
prover/check/400K       time:   [5.2671 ms 5.3024 ms 5.3709 ms]
                        thrpt:  [75.072 Melem/s 76.041 Melem/s 76.550 Melem/s]
prover/check/600K       time:   [4.0322 ms 4.1950 ms 4.3712 ms]
                        thrpt:  [137.63 Melem/s 143.41 Melem/s 149.20 Melem/s]

browser bench
cargo run --release --bin wasm-bench-runner -- -g zk_prover_core/check -c 8 --iterations 10 --samples 3

Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      170.62       17028.67      12.03M
zk_prover_core/check_400k                      297.88       31230.33      12.91M
zk_prover_core/check_600k                      461.40       48697.50      12.35M

Since 200K performs so well on native builds, I also modified the default config accordingly.

@themighty1 themighty1 requested a review from sinui0 December 23, 2025 09:51
Comment on lines 64 to 85
// Bootstrap 16 values via squaring
let mut bootstrapped = [Block::ZERO; 16];
let mut current = chi;
for b in &mut bootstrapped {
    *b = current;
    current = current.gfmul(current);
}

// Hash each to get independent starting points
let mut starts = [Block::ZERO; 16];
for (i, boot) in bootstrapped.iter().enumerate() {
    let mut hasher = Hasher::new();
    hasher.update(&boot.to_bytes());
    hasher.update(&(i as u64).to_le_bytes());
    hasher.update(&(segment_size as u64).to_le_bytes());
    let hash = hasher.finalize();
    starts[i] =
        Block::try_from(&hash.as_bytes()[..16]).expect("hash should be at least 16 bytes");
}

starts
}
Collaborator

This seems unnecessary; we can just use chi to seed a PRG and then use it to derive the evaluation points fully independently. See the ChaCha docs and set_word_pos.

Although, because ChaCha produces 64-byte blocks, we probably want to divide the terms into chunks of 4 to avoid wasting cycles.
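A minimal sketch of what that could look like, assuming rand_chacha's ChaCha12Rng and assuming the 16-byte chi has been expanded into a 32-byte seed; the helper name derive_lane_rng and its parameters are illustrative, not from the PR. Each chi is 16 bytes, i.e. 4 ChaCha words, so a lane can jump straight to its slice of a single keystream:

use rand_chacha::{rand_core::SeedableRng, ChaCha12Rng};

// Hypothetical helper: position one shared keystream at the start of `lane`'s segment.
fn derive_lane_rng(seed: [u8; 32], lane: u128, segment_size: u128) -> ChaCha12Rng {
    let mut rng = ChaCha12Rng::from_seed(seed);
    // 16 bytes per chi = 4 u32 words, so lane k starts at word k * segment_size * 4.
    rng.set_word_pos(lane * segment_size * 4);
    rng
}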

@sinui0
Collaborator

sinui0 commented Dec 23, 2025

Also, I didn't test it, but watch out for non-determinism when using Rayon: it uses work stealing, which may cause ordering to differ across machines. This should be fine with the PRG approach if you set the PRG position based on an enumerator.

@themighty1
Collaborator Author

Good call. I added PRGs and we got a 30% speedup in wasm.

=== zk_prover_core ===
Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      271.41       13606.45      15.05M
zk_prover_core/check_400k                      516.28       25710.42      15.68M
zk_prover_core/check_600k                      770.23       38695.18      15.55M
zk_prover_core/check_800k                     1023.33       51088.32      15.66M
zk_prover_core/check_1m                       1241.31       61924.07      16.23M
zk_prover_core/check_10m                     12004.62      602480.85      16.60M

although the native build took a 5% hit:

prover/check/200K       time:   [1.0790 ms 1.0822 ms 1.0871 ms]
                        thrpt:  [188.38 Melem/s 189.24 Melem/s 189.80 Melem/s]

Regarding the Rayon non-determinism: here all values get XOR-reduced, and since XOR is commutative and associative, the reduction order doesn't affect the result.
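A tiny illustration of that point (not PR code), using rayon:

use rayon::prelude::*;

// XOR folding is commutative and associative, so work stealing can only
// change the order in which pieces are combined, never the final value.
fn xor_all(vals: &[u64]) -> u64 {
    vals.par_iter().copied().reduce(|| 0, |a, b| a ^ b)
}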

@themighty1
Collaborator Author

@sinui0 , ready for review

@themighty1 themighty1 requested a review from sinui0 December 24, 2025 13:18
Collaborator

@sinui0 left a comment

Some suggestions, which also apply to the verifier side.

Comment on lines 105 to 109
// Computation with pre-split lanes.
const PARALLELISM: usize = 16;
let n = macs.len();
let segment_size = n.div_ceil(PARALLELISM);
let starts = Self::compute_chi_starts(chi);
Collaborator

A couple of comments:

  • Hardcoding a "parallelism" number doesn't seem right. The goal is to saturate the CPU cores while also optimizing for cache efficiency. So we should choose a chunk size, e.g. 1024, which is large enough to saturate caches but small enough to take advantage of work stealing.
  • We don't need multiple PRGs; ChaCha12 already has a set_stream API for producing independent streams.
Suggested change
// Computation with pre-split lanes.
const PARALLELISM: usize = 16;
let n = macs.len();
let segment_size = n.div_ceil(PARALLELISM);
let starts = Self::compute_chi_starts(chi);
const CHUNK_SIZE: usize = 1024;
let seed = *transcript.finalize().as_bytes();
let rng = ChaCha12Rng::from_seed(seed);
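For reference, a standalone sketch (not PR code) of the set_stream behavior mentioned above: two generators with the same seed but different stream numbers produce independent keystreams.

use rand_chacha::{
    rand_core::{RngCore, SeedableRng},
    ChaCha12Rng,
};

fn main() {
    let mut a = ChaCha12Rng::from_seed([7u8; 32]);
    let mut b = a.clone();
    b.set_stream(1); // same key, independent keystream
    assert_ne!(a.next_u64(), b.next_u64());
}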

Comment on lines 111 to 128

let process_segment = |segment: &[Triple], mut rng: ChaCha12Rng| {
    use rand_chacha::rand_core::RngCore;

    let mut u_acc = Block::ZERO;
    let mut v_acc = Block::ZERO;

    for &triple in segment {
        let mut chi_bytes = [0u8; 16];
        rng.fill_bytes(&mut chi_bytes);
        let chi = Block::from(chi_bytes);

        let (u, v) = compute_terms(triple, chi);
        u_acc ^= u;
        v_acc ^= v;
    }

    (u_acc, v_acc)
};
Collaborator

Might be able to prevent a copy with this

Suggested change

let process_segment = |segment: &[Triple], mut rng: ChaCha12Rng| {
    use rand_chacha::rand_core::RngCore;

    let mut u_acc = Block::ZERO;
    let mut v_acc = Block::ZERO;

    for &triple in segment {
        let mut chi_bytes = [0u8; 16];
        rng.fill_bytes(&mut chi_bytes);
        let chi = Block::from(chi_bytes);

        let (u, v) = compute_terms(triple, chi);
        u_acc ^= u;
        v_acc ^= v;
    }

    (u_acc, v_acc)
};

let process_segment = |rng: &mut ChaCha12Rng, segment: &[Triple]| {
    use rand_chacha::rand_core::RngCore;

    let mut u_acc = Block::ZERO;
    let mut v_acc = Block::ZERO;
    let mut chi = Block::ZERO;

    for &triple in segment {
        rng.fill_bytes(chi.as_bytes_mut());

        let (u, v) = compute_terms(triple, chi);
        u_acc ^= u;
        v_acc ^= v;
    }

    (u_acc, v_acc)
};

.map(|(macs, chi)| compute_terms(macs, chi))
.par_chunks(segment_size)
.zip(starts.into_par_iter())
.map(|(segment, chi_start)| process_segment(segment, chi_start))
Collaborator

@sinui0 commented Dec 26, 2025

map_with will clone the PRG as needed. We ensure the PRG stream is always independent (and deterministic) by keying it to the chunk ID.

Suggested change
.map(|(segment, chi_start)| process_segment(segment, chi_start))
.enumerate()
.map_with(
    rng,
    |rng, (stream_id, segment)| {
        rng.set_stream(stream_id as u64);
        process_segment(rng, segment)
    }
)
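Putting the pieces together, here is a self-contained sketch of the pattern (not the PR's actual code: Block is stood in for by u128, the per-triple work is a placeholder, and the function name is made up). It keys each chunk's keystream to its chunk index and rewinds the word position, so the bytes drawn for a chunk never depend on which worker runs it, and the final XOR reduce is order-insensitive:

use rand_chacha::{
    rand_core::{RngCore, SeedableRng},
    ChaCha12Rng,
};
use rayon::prelude::*;

const CHUNK_SIZE: usize = 1024;

fn parallel_check_sum(items: &[u128], seed: [u8; 32]) -> u128 {
    let rng = ChaCha12Rng::from_seed(seed);
    items
        .par_chunks(CHUNK_SIZE)
        .enumerate()
        .map_with(rng, |rng, (stream_id, chunk)| {
            // Key the stream to the chunk index and rewind the position so the
            // bytes drawn for a chunk are the same regardless of work stealing.
            rng.set_stream(stream_id as u64);
            rng.set_word_pos(0);
            let mut acc = 0u128;
            let mut bytes = [0u8; 16];
            for &item in chunk {
                rng.fill_bytes(&mut bytes);
                let chi = u128::from_le_bytes(bytes);
                // Placeholder for compute_terms(triple, chi).
                acc ^= item.wrapping_mul(chi);
            }
            acc
        })
        // XOR is commutative and associative, so the reduction order is irrelevant.
        .reduce(|| 0, |a, b| a ^ b)
}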

@themighty1
Collaborator Author

themighty1 commented Dec 29, 2025

I applied the latest suggested pattern and tested various segment sizes in my branch (I needed a separate branch because it has benches for various batch_size values, which mpz doesn't have): https://github.com/themighty1/mpz/tree/perf/optimized_check_chunking

Bottom line:

  • with the latest pattern we lost a few more percentage points in the native bench (for a total of 15% compared to my initial approach in this PR)
  • SEGMENT_SIZE == 512 is optimal for both native and wasm builds.
  • batch_size == 200K is optimal for native and 800K-1M is optimal for wasm.

All the native benches below were run with:

cargo bench -p mpz-zk-core --bench prover --features rayon -- check

All the wasm benches were run with the following command (changing the suffix to 400k, 600k, etc.):

cargo run --release --bin wasm-bench-runner -- -b zk_prover_core/check_200k -c 8 --iterations 10 --samples 20

Segment size 256

// native bench

prover/check/200K       time:   [1.1760 ms 1.1854 ms 1.1951 ms]
                        thrpt:  [171.37 Melem/s 172.77 Melem/s 174.15 Melem/s]

// browser bench

=== zk_prover_core ===
Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      135.42       13383.10      15.30M
zk_prover_core/check_400k                      255.12       25830.95      15.61M
zk_prover_core/check_600k                      361.73       36295.55      16.58M
zk_prover_core/check_800k                      480.04       48402.00      16.53M
zk_prover_core/check_1m                        597.90       59850.10      16.79M

Segment size 512

// native bench

prover/check/200K       time:   [1.1714 ms 1.1748 ms 1.1805 ms]
                        thrpt:  [173.49 Melem/s 174.33 Melem/s 174.84 Melem/s]

// browser bench

=== zk_prover_core ===
Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      135.18       13584.40      15.08M
zk_prover_core/check_400k                      249.83       25220.17      15.99M
zk_prover_core/check_600k                      364.98       36880.28      16.31M
zk_prover_core/check_800k                      477.86       47902.18      16.70M
zk_prover_core/check_1m                        596.65       59848.50      16.79M

Segment size 1024

// native bench

prover/check/200K       time:   [1.1973 ms 1.2036 ms 1.2101 ms]
                        thrpt:  [169.24 Melem/s 170.16 Melem/s 171.05 Melem/s]

// browser bench

=== zk_prover_core ===
Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      135.51       13484.58      15.19M
zk_prover_core/check_400k                      248.35       25221.50      15.99M
zk_prover_core/check_600k                      369.40       37581.78      16.01M
zk_prover_core/check_800k                      481.18       48139.70      16.62M
zk_prover_core/check_1m                        594.15       59629.20      16.85M

Benchmarking even higher segment_size values showed that throughput only got worse for both native and wasm builds.

Segment size 12000

// native bench

prover/check/200K       time:   [1.5742 ms 1.5786 ms 1.5839 ms]
                        thrpt:  [129.30 Melem/s 129.74 Melem/s 130.10 Melem/s]

// browser bench

=== zk_prover_core ===
Name                                      Median (ms)  Per-iter (us)  AND gates/s
----------------------------------------------------------------------------------
zk_prover_core/check_200k                      172.70       17294.93      11.84M
zk_prover_core/check_400k                      281.62       28403.13      14.20M
zk_prover_core/check_600k                      394.71       39889.83      15.08M
zk_prover_core/check_800k                      520.02       52072.68      15.36M
zk_prover_core/check_1m                        627.75       62817.25      16.00M

@themighty1
Collaborator Author

@sinui0 , ready for review.
We need to make sure our wasm build of QS uses 800K as the batch size. I assume this will be handled outside of mpz via a config.
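For illustration, one hypothetical way such a config could pick the batch size per build target (not how the PR handles it; the constant name is made up):

// Hypothetical sketch: select the QS batch size per build target.
#[cfg(target_arch = "wasm32")]
const QS_BATCH_SIZE: usize = 800_000;
#[cfg(not(target_arch = "wasm32"))]
const QS_BATCH_SIZE: usize = 200_000;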

@themighty1 themighty1 requested a review from sinui0 December 29, 2025 11:58
@themighty1 themighty1 merged commit ddb94fd into dev Jan 2, 2026
3 checks passed
@themighty1 themighty1 deleted the perf/bootstrapped_chis branch January 2, 2026 08:32