Dual-tree differential + microbench harness for Plonky3's Poseidon2 over Goldilocks on aarch64 NEON.
Imports p3-goldilocks from two Plonky3 checkouts side-by-side — a baseline tree (audited, read-only) and an h1 tree (experimental, under test) — runs the same permute through both, and asserts bitwise + canonical equality. Originally built to validate experimental edits to the NEON ASM atoms in aarch64_neon/utils.rs and aarch64_neon/poseidon2_asm.rs without rebuilding the surrounding crate graph each time.
Four oracles, layered cheapest-to-strongest:
- Cheap properties (`smoke.rs`): determinism, lane independence, lane symmetry, composition, distinct-input/distinct-output. Catches catastrophic collapses (e.g. lane aliasing) the moment `cargo test` runs.
- Per-layer oracle (`layer_oracles.rs`): dual asm vs scalar asm vs scalar generic, each of the three layers in isolation. When the full-permute oracle fires, this says which layer broke without bisecting.
- Full-permute oracles (`permute_oracle.rs`, `asm_vs_generic.rs`): bitwise across trees and three-way differential within a tree, W=8.
- Wider widths (`wider_widths.rs`): cross-tree bitwise + h1 asm-vs-generic at W ∈ {12, 16, 20}.
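The oracles above share one pattern: feed the same input to two implementations and compare the outputs both bitwise and after canonical reduction. The sketch below illustrates that pattern only. `toy_ref`, `toy_fast`, `agree`, and `canonical` are hypothetical names, and the round function is a placeholder rather than Poseidon2 or anything from Plonky3; the Goldilocks modulus is the only real value.

```rust
/// Goldilocks prime p = 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;
const W: usize = 8;

/// Map a possibly non-canonical u64 representative into [0, P).
fn canonical(x: u64) -> u64 {
    if x >= P { x - P } else { x }
}

fn add_mod(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % (P as u128)) as u64
}

fn mul_mod(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % (P as u128)) as u64
}

/// Hypothetical reference: x -> x^7 + sum(state). NOT Poseidon2.
fn toy_ref(state: &mut [u64; W]) {
    let sum = state.iter().fold(0u64, |acc, &x| add_mod(acc, x));
    for x in state.iter_mut() {
        let x2 = mul_mod(*x, *x);
        let x4 = mul_mod(x2, x2);
        let x6 = mul_mod(x4, x2);
        *x = add_mod(mul_mod(x6, *x), sum);
    }
}

/// Hypothetical "optimized" variant: same math, different evaluation order.
fn toy_fast(state: &mut [u64; W]) {
    let sum = state.iter().rev().fold(0u64, |acc, &x| add_mod(acc, x));
    for x in state.iter_mut() {
        let x2 = mul_mod(*x, *x);
        let x3 = mul_mod(x2, *x);
        let x4 = mul_mod(x2, x2);
        *x = add_mod(sum, mul_mod(x3, x4));
    }
}

/// Differential check: bitwise equality and canonical equality.
fn agree(input: [u64; W]) -> bool {
    let (mut a, mut b) = (input, input);
    toy_ref(&mut a);
    toy_fast(&mut b);
    let bitwise = a == b;
    let canon = a.iter().zip(b.iter()).all(|(x, y)| canonical(*x) == canonical(*y));
    bitwise && canon
}

/// Small deterministic generator (splitmix64) so the sketch needs no crates.
fn splitmix64(seed: &mut u64) -> u64 {
    *seed = seed.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *seed;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

fn main() {
    let mut seed = 1u64;
    for _ in 0..1000 {
        let mut input = [0u64; W];
        for x in input.iter_mut() {
            *x = splitmix64(&mut seed);
        }
        assert!(agree(input), "implementations diverged on {:?}", input);
    }
    println!("differential check passed");
}
```

Checking canonical equality separately from bitwise equality matters when an implementation is allowed to return unreduced representatives: two outputs can denote the same field element while differing in their raw bits.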
Cycle/instruction measurement uses perf_event_open with grouped counters (cycles + instructions scheduled together) so the IPC ratio doesn't drift under PMU multiplexing. Two bench tools share that counter module:
- `src/bin/cycle_bench.rs`: lightweight A/B microbench, JSON-line output suitable for CI regression checks.
- `benches/poseidon2_layers.rs`: criterion target with Tukey filtering + confidence intervals for serious before/after runs.
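Grouping matters because the kernel schedules a counter group onto the PMU as a unit, so a single `read()` on the group leader returns cycles and instructions covering the same time window. The sketch below decodes such a read and derives IPC from two snapshots. It assumes `read_format` is exactly `PERF_FORMAT_GROUP` (layout: a `u64` count followed by one `u64` value per event) with cycles as the leader and instructions as the second event. `GroupSample`, `parse_group_read`, and `ipc` are hypothetical names, and the harness's real `cycle_counter` module may be laid out differently.

```rust
/// One snapshot of the two grouped counters.
#[derive(Debug, PartialEq)]
struct GroupSample {
    cycles: u64,
    instructions: u64,
}

/// Decode the buffer filled by read() on the group leader, assuming
/// read_format == PERF_FORMAT_GROUP:  u64 nr; u64 values[nr];
/// Assumed event order: values[0] = cycles, values[1] = instructions.
fn parse_group_read(buf: &[u8]) -> Option<GroupSample> {
    let mut words = buf.chunks_exact(8).map(|c| {
        let mut w = [0u8; 8];
        w.copy_from_slice(c); // chunks_exact guarantees 8 bytes
        u64::from_ne_bytes(w)
    });
    let nr = words.next()?;
    if nr != 2 {
        return None; // not the two-event group this sketch expects
    }
    let cycles = words.next()?;
    let instructions = words.next()?;
    Some(GroupSample { cycles, instructions })
}

/// Instructions per cycle between two snapshots of the same group.
fn ipc(before: &GroupSample, after: &GroupSample) -> Option<f64> {
    let dc = after.cycles.checked_sub(before.cycles)?;
    let di = after.instructions.checked_sub(before.instructions)?;
    if dc == 0 {
        return None;
    }
    Some(di as f64 / dc as f64)
}

/// Helper for the demo: serialize words the way the kernel would.
fn to_bytes(words: &[u64]) -> Vec<u8> {
    words.iter().flat_map(|w| w.to_ne_bytes()).collect()
}

fn main() {
    // Synthetic buffers standing in for two real read() results.
    let before = parse_group_read(&to_bytes(&[2, 1_000, 2_500])).unwrap();
    let after = parse_group_read(&to_bytes(&[2, 3_000, 7_500])).unwrap();
    println!("ipc = {:?}", ipc(&before, &after));
}
```

Returning `None` on an unexpected event count or a zero cycle delta keeps a misconfigured group from silently producing a bogus ratio.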
    src/
      lib.rs                  re-exports the two trees as `baseline` and `h1`
      cycle_counter.rs        perf_event_open wrapper (cycles + instructions, grouped)
      bin/cycle_bench.rs      lightweight A/B microbench, JSON output
    benches/
      poseidon2_layers.rs     criterion benches in CPU cycles (3 fns × 2 trees)
    tests/
      permute_oracle.rs       full-permute bitwise oracle across trees, W=8
      layer_oracles.rs        per-layer oracle (dual asm vs scalar asm vs generic)
      asm_vs_generic.rs       three-way differential per tree, W=8
      canonicality.rs         round-constant canonicality check
      smoke.rs                cheap properties: determinism, lane indep/symmetry, ...
      wider_widths.rs         cross-tree + asm-vs-generic at W ∈ {12, 16, 20}
      dual_tree_compiles.rs   sanity: both trees link into the same binary
      common/mod.rs           shared scaffolding (per-tree helper modules)
    scripts/
      bench-pi.sh             Pi 5 bench wrapper; pins core, checks governor + paranoid
Both Plonky3 trees are pulled as git deps pinned by rev in Cargo.toml. No local checkouts required — cargo build resolves everything from upstream.
Default pins:
| tree | commit | meaning |
|---|---|---|
| baseline | b6380137… | parent of PR #1619 (pre-edit reference) |
| h1 | af65376f… | merge of PR #1623 (both edits landed) |
To A/B different commits, edit the rev = "..." values in Cargo.toml. Both trees can point at the same repo (default — upstream Plonky3) or at forks. The two revs only need to expose p3-goldilocks, p3-field, p3-symmetric, and p3-poseidon2 as workspace members.
# Full oracle + property suite (aarch64 host required)
cargo test
# Criterion benches in CPU cycles (Linux + aarch64 + PMU access)
cargo bench --bench poseidon2_layers
# Lightweight A/B bench, one fn at a time
cargo run --release --bin cycle_bench -- --fn internal --tree both
cargo run --release --bin cycle_bench -- --fn external_initial --tree both --json
# Pi 5 wrapper: pins to core 3, checks governor + perf_event_paranoid
./scripts/bench-pi.sh quick # cycle_bench, both trees, 3 fns
./scripts/bench-pi.sh criterion # full criterion run
./scripts/bench-pi.sh both

PMU access requires kernel.perf_event_paranoid <= 1:

echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid

For stable cycle numbers, set the cpufreq governor to performance:

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

- aarch64 NEON only. The dual-w8 ASM the harness exercises is aarch64-specific. On x86 / macOS dev hosts the bench targets compile a stub `main()` that exits cleanly, so `cargo bench` doesn't break the workflow.
- Linux for cycle measurement. PMU access goes through `perf_event_open`; the `cycle_counter` module is `#[cfg(target_os = "linux")]`. The harness still runs its tests on non-Linux aarch64 hosts, just without cycle counts.
- Dependency wiring. The two Plonky3 trees are wired as git deps pinned by rev. If you switch them to path deps on local checkouts, adjust `Cargo.toml` for your local layout.
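The platform gating described above can be sketched with `cfg` attributes: one function body is compiled on the supported target and a stub everywhere else, so the binary always builds and exits cleanly. This is an illustration under assumptions, not the harness's actual source. `run_bench` and `SUPPORTED` are hypothetical names, and combining the OS and architecture checks into a single predicate is a simplification.

```rust
/// True only where the real benches could run (assumed predicate).
const SUPPORTED: bool = cfg!(all(target_os = "linux", target_arch = "aarch64"));

#[cfg(all(target_os = "linux", target_arch = "aarch64"))]
fn run_bench() -> &'static str {
    // A real bench would open the PMU counters and time the ASM here.
    "bench"
}

#[cfg(not(all(target_os = "linux", target_arch = "aarch64")))]
fn run_bench() -> &'static str {
    // Stub: nothing to measure on this host, so return without failing.
    "skipped"
}

fn main() {
    println!("supported = {}, result = {}", SUPPORTED, run_bench());
}
```

Because exactly one of the two `run_bench` definitions survives compilation on any given target, callers never need their own platform checks.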
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.