perf(zk-core): use bootstrapped chis for the check #362
Conversation
crates/zk-core/src/check.rs (Outdated)
```rust
    // Bootstrap 16 values via squaring
    let mut bootstrapped = [Block::ZERO; 16];
    let mut current = chi;
    for b in &mut bootstrapped {
        *b = current;
        current = current.gfmul(current);
    }

    // Hash each to get independent starting points
    let mut starts = [Block::ZERO; 16];
    for (i, boot) in bootstrapped.iter().enumerate() {
        let mut hasher = Hasher::new();
        hasher.update(&boot.to_bytes());
        hasher.update(&(i as u64).to_le_bytes());
        hasher.update(&(segment_size as u64).to_le_bytes());
        let hash = hasher.finalize();
        starts[i] =
            Block::try_from(&hash.as_bytes()[..16]).expect("hash should be at least 16 bytes");
    }

    starts
}
```
This seems unnecessary; we can just use chi to seed a PRG and then use it to derive the evaluation points fully independently. See the ChaCha docs and set_word_pos.
Although, because ChaCha produces 64-byte blocks, we probably want to divide the terms into chunks of 4 to avoid wasting cycles.
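For illustration only (not the PR's code), a minimal sketch of this idea: seed a ChaCha PRG and use set_word_pos to jump each chunk straight to its position in the keystream. The helper name, the `[u8; 32]` seed, and the `[u8; 16]` stand-in for Block are assumptions here.

```rust
use rand_chacha::rand_core::{RngCore, SeedableRng};
use rand_chacha::ChaCha12Rng;

// Hypothetical helper: `chi_seed` would be derived from chi, and `[u8; 16]`
// stands in for the real `Block` type.
fn derive_chis_for_chunk(chi_seed: [u8; 32], chunk_index: usize, chunk_len: usize) -> Vec<[u8; 16]> {
    let mut rng = ChaCha12Rng::from_seed(chi_seed);
    // Each chi is 16 bytes = 4 ChaCha words; keeping `chunk_len` a multiple of 4
    // aligns chunks with ChaCha's 64-byte blocks so no keystream is wasted.
    rng.set_word_pos(chunk_index as u128 * chunk_len as u128 * 4);
    (0..chunk_len)
        .map(|_| {
            let mut chi = [0u8; 16];
            rng.fill_bytes(&mut chi);
            chi
        })
        .collect()
}
```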
Also, I didn't test it, but watch out for non-determinism when using Rayon: it uses work stealing, which may cause the processing order to differ across machines. This should be fine with the PRG approach if you set the PRG position based on an enumerator.
Good call, I added PRGs and we got a 30% speedup in wasm, although the native build took a 5% hit. Regarding Rayon non-determinism: here all values get reduced (XOR-folded), so the order does not matter and non-determinism is not an issue.
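For illustration only (not the PR's code), a sketch of why such an XOR reduction is insensitive to Rayon's scheduling; the u128 pairs stand in for the real (Block, Block) accumulators:

```rust
use rayon::prelude::*;

// XOR is commutative and associative, so however work stealing orders the
// per-chunk partial accumulators, the folded result is identical.
fn fold_partials(partials: &[(u128, u128)]) -> (u128, u128) {
    partials
        .par_iter()
        .copied()
        .reduce(|| (0, 0), |(u0, v0), (u1, v1)| (u0 ^ u1, v0 ^ v1))
}
```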
@sinui0, ready for review.
sinui0 left a comment:
Some suggestions, which also apply to the verifier side.
crates/zk-core/src/check.rs (Outdated)
```rust
// Computation with pre-split lanes.
const PARALLELISM: usize = 16;
let n = macs.len();
let segment_size = n.div_ceil(PARALLELISM);
let starts = Self::compute_chi_starts(chi);
```
Couple of comments:
- Hardcoding a "parallelism" number doesn't seem right. The goal is to saturate the CPU cores while also optimizing for cache efficiency. So we should choose a chunk size, e.g. 1024, which is large enough to saturate caches but small enough to take advantage of work stealing.
- We don't need multiple PRGs; ChaCha12 already has a set_stream API for producing independent streams.
```diff
-// Computation with pre-split lanes.
-const PARALLELISM: usize = 16;
-let n = macs.len();
-let segment_size = n.div_ceil(PARALLELISM);
-let starts = Self::compute_chi_starts(chi);
+const CHUNK_SIZE: usize = 1024;
+let seed = *transcript.finalize().as_bytes();
+let rng = ChaCha12Rng::from_seed(seed);
```
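For reference, a minimal standalone sketch (assuming rand_chacha 0.3; not the PR's code) of the set_stream API mentioned above: one seed, many independent deterministic streams.

```rust
use rand_chacha::rand_core::{RngCore, SeedableRng};
use rand_chacha::ChaCha12Rng;

fn main() {
    let mut rng = ChaCha12Rng::from_seed([0u8; 32]);
    let (mut a, mut b) = ([0u8; 16], [0u8; 16]);

    // Same seed, different stream IDs: independent, reproducible outputs.
    rng.set_stream(0);
    rng.fill_bytes(&mut a);

    rng.set_stream(1);
    rng.fill_bytes(&mut b);

    assert_ne!(a, b);
}
```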
crates/zk-core/src/check.rs (Outdated)
```rust
let process_segment = |segment: &[Triple], mut rng: ChaCha12Rng| {
    use rand_chacha::rand_core::RngCore;

    let mut u_acc = Block::ZERO;
    let mut v_acc = Block::ZERO;

    for &triple in segment {
        let mut chi_bytes = [0u8; 16];
        rng.fill_bytes(&mut chi_bytes);
        let chi = Block::from(chi_bytes);

        let (u, v) = compute_terms(triple, chi);
        u_acc ^= u;
        v_acc ^= v;
    }

    (u_acc, v_acc)
};
```
Might be able to prevent a copy with this
```diff
-let process_segment = |segment: &[Triple], mut rng: ChaCha12Rng| {
-    use rand_chacha::rand_core::RngCore;
-    let mut u_acc = Block::ZERO;
-    let mut v_acc = Block::ZERO;
-    for &triple in segment {
-        let mut chi_bytes = [0u8; 16];
-        rng.fill_bytes(&mut chi_bytes);
-        let chi = Block::from(chi_bytes);
-        let (u, v) = compute_terms(triple, chi);
-        u_acc ^= u;
-        v_acc ^= v;
-    }
-    (u_acc, v_acc)
-};
+let process_segment = |rng: &mut ChaCha12Rng, segment: &[Triple]| {
+    use rand_chacha::rand_core::RngCore;
+    let mut u_acc = Block::ZERO;
+    let mut v_acc = Block::ZERO;
+    let mut chi = Block::ZERO;
+    for &triple in segment {
+        rng.fill_bytes(chi.as_bytes_mut());
+        let (u, v) = compute_terms(triple, chi);
+        u_acc ^= u;
+        v_acc ^= v;
+    }
+    (u_acc, v_acc)
+};
```
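As a note on the suggested version: it fills the existing Block in place through as_bytes_mut() instead of building a temporary [u8; 16] and converting it, and it takes the RNG by mutable reference rather than by value, so one generator can be reused across segments.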
crates/zk-core/src/check.rs (Outdated)
```diff
-    .map(|(macs, chi)| compute_terms(macs, chi))
+    .par_chunks(segment_size)
+    .zip(starts.into_par_iter())
+    .map(|(segment, chi_start)| process_segment(segment, chi_start))
```
map_with will clone the PRG as needed. We ensure the PRG stream is always independent (and deterministic) by keying it to the chunk ID.
```diff
-    .map(|(segment, chi_start)| process_segment(segment, chi_start))
+    .enumerate()
+    .map_with(
+        rng,
+        |rng, (stream_id, segment)| {
+            rng.set_stream(stream_id as u64);
+            process_segment(rng, segment)
+        },
+    )
```
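Putting the chunking, map_with, and set_stream pieces together, a hedged end-to-end sketch: the Block, Triple, and compute_terms definitions below are placeholders, and the explicit set_word_pos(0) reset is an added assumption so that each chunk's keystream position does not depend on how many chunks a given RNG clone has already processed.

```rust
use rand_chacha::rand_core::{RngCore, SeedableRng};
use rand_chacha::ChaCha12Rng;
use rayon::prelude::*;

// Placeholder stand-ins for illustration only.
type Block = u128;
type Triple = (Block, Block, Block);

fn compute_terms(triple: Triple, chi: Block) -> (Block, Block) {
    // Placeholder for the real term computation.
    (triple.0 ^ chi, triple.1 ^ triple.2 ^ chi)
}

fn check(triples: &[Triple], seed: [u8; 32]) -> (Block, Block) {
    const CHUNK_SIZE: usize = 1024;
    let rng = ChaCha12Rng::from_seed(seed);

    let process_segment = |rng: &mut ChaCha12Rng, segment: &[Triple]| {
        let mut u_acc = 0u128;
        let mut v_acc = 0u128;
        let mut chi_bytes = [0u8; 16];
        for &triple in segment {
            rng.fill_bytes(&mut chi_bytes);
            let chi = Block::from_le_bytes(chi_bytes);
            let (u, v) = compute_terms(triple, chi);
            u_acc ^= u;
            v_acc ^= v;
        }
        (u_acc, v_acc)
    };

    triples
        .par_chunks(CHUNK_SIZE)
        .enumerate()
        .map_with(rng, |rng, (stream_id, segment)| {
            // Key the keystream to the chunk index so the derived chis are
            // deterministic regardless of Rayon's work-stealing order.
            rng.set_stream(stream_id as u64);
            rng.set_word_pos(0);
            process_segment(rng, segment)
        })
        // XOR-fold the partial accumulators; order does not matter.
        .reduce(|| (0, 0), |(u0, v0), (u1, v1)| (u0 ^ u1, v0 ^ v1))
}
```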
I applied the latest pattern suggested and tested various segment sizes in my branch (I needed a separate branch because it has benches for the various segment sizes). Bottom line:

All the native benches below were run with …, and all the wasm benches were run with … (changing the suffix to 400k, 600k, etc).

Segment size 256
// native bench
// browser bench

Segment size 512
// native bench
// browser bench

Segment size 1024
// native bench
// browser bench

Benching higher segment_size values showed that, for both native and wasm benches, the throughput only got worse.

Segment size 12000
// native bench
// browser bench
@sinui0, ready for review.
This PR improves QS check performance (5x on the native build and 2.5x in the browser) by parallelizing the chi computation using the bootstrapped chis approach: for each parallel lane, an independent chi is computed from the original chi.
Additionally, it fuses the chi computation with the existing parallel terms computation, avoiding the need to allocate the chis in memory.
Benchmarking the improvement
The benches were done with checked sizes 200K, 400K and 600K. Going higher gave no meaningful improvement, so I used just those 3 values.
In the native bench "Elements per sec" means "AND gates per second".
The benches were run on this branch https://github.com/themighty1/mpz/tree/feat/parallel_chis - the head of the branch corresponds to the current (before this PR) approach. You need to check out commit 498a3e9, which contains the new approach.
Before this PR
After this PR
Since 200K performs so well on native builds, I also modified the default config accordingly.