Perry vs 9 other runtimes on 9 identical benchmarks. All implementations
use f64/double arithmetic to match TypeScript's number type. No SIMD
intrinsics, no unsafe code, no non-default optimization flags — each
language's idiomatic release-mode build. A companion RESULTS_OPT.md
(phase 2 of this investigation) shows what happens when each language is
given flags equivalent to Perry's defaults.
See METHODOLOGY.md for iteration counts, clocks,
compiler versions, and a full explanation of which optimizations create
each delta.
Run date (Perry): 2026-05-06 — Perry commit main (v0.5.585),
both default and --fast-math columns.
Run date (other languages): 2026-04-25 (v0.5.249-era
polyglot run). Their numbers haven't moved (same compilers, same
hardware); a full polyglot rerun is on the follow-up list.
Hardware: Apple M1 Max (10 cores, 64 GB RAM), macOS 26.4.
Methodology: RUNS=11 per cell. Median wall-clock ms below;
full per-cell stats (median + p95 + σ + min + max) in RESULTS_AUTO.md.
Pinning: macOS scheduler hint via taskpolicy -t 0 -l 0
(P-core preferred via throughput/latency tiers, NOT strict affinity —
Apple does not expose unprivileged hard core pinning). Lower is better.
The two Perry columns (default / --fast-math): since v0.5.585,
LLVM reassoc + contract per-instruction fast-math flags on f64 ops
are opt-in. Default mode produces bit-exact f64 output with Node;
--fast-math permits the optimizer to reassociate FP chains and fuse
multiply-adds. The flag's win is concentrated on trivially-foldable
accumulator patterns (loop_overhead, math_intensive, accumulate);
on data-dependent kernels (loop_data_dependent), the columns are
identical because the sequential carry can't be reordered regardless.
See ../docs/src/cli/fast-math.md for the full behavior contract.
| Benchmark | Perry default | Perry --fast | Rust | C++ | Go | Swift | Java | Node | Bun | Python |
|---|---|---|---|---|---|---|---|---|---|---|
| fibonacci | 304 | 304 | 330 | 315 | 451 | 406 | 282 | 1022 | 589 | 16054 |
| loop_overhead | 95 | 12 | 98 | 98 | 98 | 143 | 100 | 54 | 46 | 3019 |
| loop_data_dependent | 221 | 221 | 229 | 129 | 128 | 233 | 229 | 322 | 232 | 10750 |
| array_write | 4 | 3 | 7 | 3 | 9 | 2 | 7 | 9 | 6 | 401 |
| array_read | 11 | 11 | 9 | 9 | 11 | 9 | 12 | 13 | 16 | 342 |
| math_intensive | 50 | 14 | 48 | 51 | 49 | 50 | 74 | 51 | 51 | 2238 |
| object_create | 2 | 0 | 0 | 0 | 0 | 0 | 5 | 11 | 6 | 164 |
| nested_loops | 17 | 17 | 8 | 8 | 10 | 8 | 11 | 18 | 21 | 484 |
| accumulate | 95 | 33 | 98 | 98 | 98 | 98 | 100 | 617 | 100 | 5048 |
New benchmark in v0.5.249: loop_data_dependent. Same shape as
loop_overhead but with a multiplicative carry through sum and
runtime-loaded array reads, so LLVM cannot apply reassoc, IV-simplify,
or autovectorization. The Rust version's loop body is verified at
the asm level (see bench.rs line 122) — emits a scalar fmul + fadd
chain with two array loads, no fold, no vectorize. The kernel
splits the field into two FP-contract clusters:

- FMA-contract pack (~128 ms): Go (default), C++ (`g++ -O3` on Apple
  Clang). Both fuse `sum * a + b` into a single `FMADDD` instruction
  (one IEEE-754 rounding instead of two), shortening the per-iteration
  dependency chain from ~6-8 cycles (FMUL→FADD on Apple silicon) to
  ~4 cycles (single `FMADDD`).
- No-contract pack (~229-235 ms): Perry, Rust default `-O`, Swift `-O`,
  Java without `-XX:+UseFMA`, Bun. All emit separate `FMUL` + `FADD`
  because contracting changes the observable IEEE-754 result by up to
  0.5 ULP per iteration and is not legal under these languages' default
  FP semantics.
LLVM matches the FMA pack with -ffast-math or
-ffp-contract=fast; Apple Clang's default at -O2+ already enables
contract. Verified at the asm level on rustc 1.94.1 / Apple Clang on
2026-04-25 — see bench.rs line 115 for the full asm dump and
toolchain notes. The dependency chain on sum is preserved in both
clusters; the win is one ISA-level fusion, not a fold or a vectorize.
Node 322 ms this run was a JIT-warm-up tail (σ=63, p95=447); on
quieter runs it lands in the no-contract pack alongside Bun.
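A minimal TypeScript sketch of the kernel shape described above (an assumed shape inferred from the text, not the actual suite file):

```typescript
// Hypothetical sketch of the loop_data_dependent shape: each iteration's
// multiply-add feeds the next through `sum`, so reassociation is never
// legal and the default and --fast-math columns match.
function dataDependentCarry(a: number[], b: number[]): number {
  let sum = 1.0;
  for (let i = 0; i < a.length; i++) {
    // The carry through `sum` serializes the loop: iteration i+1
    // cannot start its multiply until iteration i's add completes.
    sum = sum * a[i] + b[i];
  }
  return sum;
}

console.log(dataDependentCarry([1, 2], [3, 4])); // (1*1+3)*2+4 = 12
```

The only lever left on this shape is ISA-level contraction of each `sum * a[i] + b[i]` into one fused multiply-add, which is exactly the split between the two clusters above.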
Interesting tails surfaced by RUNS=11:

- Python `accumulate`: median 5052 ms but p95 9388 ms (σ=1454) — one
  run took 9.4 s, likely from GC pressure or thermal throttling during
  a 10 s+ tight loop. Best-of-5 hid this; the p95 is real.
- Python `math_intensive`: median 2244 ms but p95 4091 ms (σ=531) —
  same pattern.
- Most other cells have σ < 5% of median — distributions are tight.
The nested_loops and accumulate regressions vs the v0.5.164
baseline (8 → 17 ms and 24 → 33 ms) are still present and still
caused by the v0.5.237 generational-GC default flip — PERRY_GEN_GC=0
recovers the 8 / 24 ms baseline. The slight jitter on fibonacci
(302 → 312 ms median across runs) is genuine variance from the
underlying recursive call chain; best-of-5 had been picking the
lower tail, and the median is more representative.
These regressions trace to the v0.5.237 gen-GC default flip (this
table was last refreshed at v0.5.164, before generational GC became
the default): the per-allocation gen-GC machinery (write-barrier
potential, age-bump pass) is overhead that allocation-heavy compute
benches don't recoup. The trade-off was deliberate; gen-GC's wins on
long-running and RSS-sensitive workloads (test_memory_json_churn
115 → 91 MB) outweigh the small compute-bench regression. All other
cells are unchanged or slightly faster.
The original v0.5.164 fix (#140)
that established the current loop_overhead/math_intensive/accumulate
baseline: an asm sideeffect loop-body barrier (from #74) and an
over-eager i32 shadow counter were blocking LLVM's vectorizer on
pure-accumulator loops; v0.5.164 scoped the i32 shadow to counters that
actually appear in an index subtree and refined the loop-body barrier
to fire only on truly-empty bodies — restoring the <2 x double>
parallel-accumulator reduction (vectorization width 2, interleave count
4). Perry beats Rust 3–8× on these cells.
```sh
cd benchmarks/polyglot
bash run_all.sh        # best of 3 runs (default)
bash run_all.sh 5      # best of 5 runs
```

Required: Perry (`cargo build --release` from the repo root).
Optional (any subset works; missing runtimes show as `-`): Node.js,
Bun, Static Hermes (shermes), Rust (rustc), C++ (g++ or clang++),
Swift, Go, Java (javac + java), Python 3.
See METHODOLOGY.md for what each benchmark measures,
compiler versions, why certain cells look the way they do, and where Perry
wins (loop_overhead, math_intensive, accumulate, array_read) vs
where it matches the compiled pack (nested_loops, fibonacci) vs where
it loses (object_create).
Perry (--fast-math) 12 ms vs Rust/C++/Go/Swift/Java ~94 ms — Perry wins
7–8×. Two stacked optimizations produce this: (1) Perry emits reassoc +
contract flags on f64 ops because JS/TS number semantics can't observe
the difference (no signalling NaNs, no fenv access, no strict -0 rules
at the operator level), so LLVM's IndVarSimplify can rewrite the
N-iteration sum += 1.0 chain as an integer induction variable.
Rust/C++/Go/Swift default to strict-IEEE fadd, which has a 3-cycle
latency wall and cannot be reassociated. (2) On
top of the integer rewrite LLVM autovectorizes the body into a
<2 x double> parallel-accumulator reduction with interleave count 4
(four SIMD lanes worth of fadd happening per iteration). g++ -O3 -ffast-math on bench.cpp drops C++ from 96 ms to 11 ms, matching
Perry — same LLVM, same pipeline, one flag. See
RESULTS_OPT.md for the per-language opt-sweep (C++
opt = 12 ms, Rust opt = 24 ms on stable, Go = 99 ms because Go has no
fast-math flag).
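The induction-variable rewrite can be illustrated with a sketch (the loop shape is assumed from the description above, not taken from the actual bench source):

```typescript
// loopForm is the benchmark's shape; closedForm is what the
// IndVarSimplify rewrite effectively produces once reassociation
// makes the f64 adds a counted recurrence.
function loopForm(n: number): number {
  let sum = 0;
  for (let i = 0; i < n; i++) sum += 1.0; // N serial fadds
  return sum;
}

// The rewritten form: one multiply instead of N dependent adds.
function closedForm(n: number): number {
  return n * 1.0;
}
```

For integer-valued n below 2^53 the two functions agree bit-exactly, which is why the rewrite is observationally safe under JS number semantics.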
This cell regressed to 32 ms between v0.5.91 and v0.5.162 after an
asm sideeffect barrier (from #74) and an unconditional i32 shadow slot
started blocking the vectorizer. Restored in v0.5.164 via #140.
Perry 14 ms, Rust/C++/Go/Swift/Java/Node/Bun all ~46–49 ms — Perry
wins 3×. Same fast-math default + vectorization story as
loop_overhead: the reciprocal divide has a 10+ cycle latency, but
vectorizing into a parallel 4-accumulator reduction keeps 4 independent
divides in flight so the scheduler hides the latency. C++ -ffast-math
matches Perry at 14 ms per RESULTS_OPT.md. Regressed
to 48 ms at v0.5.162, restored in v0.5.164 (#140).
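The parallel-accumulator rewrite can be sketched by hand (function names hypothetical; this mimics in source what the vectorizer does in IR):

```typescript
// Naive shape: one serial accumulator, each add waits on its divide.
function reciprocalSumNaive(n: number): number {
  let sum = 0;
  for (let i = 1; i <= n; i++) sum += 1.0 / i;
  return sum;
}

// Four independent accumulators keep four divides in flight, hiding
// the 10+ cycle divide latency. The results differ from the naive
// form only by reassociation rounding, which fast-math permits.
function reciprocalSum4Acc(n: number): number {
  let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  let i = 1;
  for (; i + 3 <= n; i += 4) {
    s0 += 1.0 / i;
    s1 += 1.0 / (i + 1);
    s2 += 1.0 / (i + 2);
    s3 += 1.0 / (i + 3);
  }
  let sum = (s0 + s1) + (s2 + s3);
  for (; i <= n; i++) sum += 1.0 / i; // scalar tail
  return sum;
}
```

The two functions agree to within rounding noise; strict-IEEE semantics forbid the rewrite precisely because that noise is observable.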
Perry 24 ms, Rust/C++/Go/Swift/Java/Bun all 93–98 ms — Perry wins 4×.
Node 583 ms is an outlier because V8 doesn't inline the libm fmod call
on this pattern. Perry's type analysis emits srem instead of fmod
for the mod op (same optimization Node misses), LLVM vectorizes the
resulting integer chain + fadd reduction, and the two together beat
Rust's strict-IEEE scalar fmod+fadd by 4×. Regressed to 97 ms at
v0.5.162, restored in v0.5.164 (#140).
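A sketch of why the srem substitution is legal (assumed kernel shape, not the actual suite file): when both operands are integer-valued, JS's `%` (spec'd as an fmod-style float remainder) agrees with integer remainder, so a compiler that proves integrality can emit the cheaper integer op.

```typescript
// Hypothetical accumulate-with-mod kernel. `i % k` is fmod by JS
// semantics, but srem-foldable once i and k are proven integral.
function accumulateMod(n: number, k: number): number {
  let sum = 0;
  for (let i = 0; i < n; i++) {
    sum += i % k;
  }
  return sum;
}

console.log(accumulateMod(10, 3)); // 0+1+2+0+1+2+0+1+2+0 = 9
```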
Perry 3 ms, C++/Swift 9 ms, Rust 10 ms, Go 10 ms, Java 11 ms. Perry
detects for (let i = 0; i < arr.length; i++) as statically in-bounds,
skips the JS undefined-on-OOB check, caches the length at loop entry,
and maintains a parallel i32 counter so the index is never a float → int
conversion. LLVM then autovectorizes to NEON 2-wide f64. C++ std::vector
has no bounds check by default but pays the chunk-boundary check from
-O3's vectorizer framing. Rust's iterator form (not used here) matches
Perry — see bench_opt.rs (phase 2).
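A sketch of the loop shape the bounds-check elision keys on (assumed, mirroring the `for (let i = 0; i < arr.length; i++)` pattern named above):

```typescript
// `len` is read once at loop entry; `i` is provably in [0, len), so
// the per-store undefined-on-OOB check described above can be skipped.
function arrayWrite(arr: number[]): number[] {
  const len = arr.length; // length cached at loop entry
  for (let i = 0; i < len; i++) {
    arr[i] = i * 2.0; // statically in-bounds store
  }
  return arr;
}

console.log(arrayWrite(new Array(4).fill(0))); // [0, 2, 4, 6]
```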
Perry 2 ms, C++/Swift 2 ms, Rust 7 ms, Go 9 ms. Perry matches C++ here.
The Rust result is -O with bounds-checked indexing; .iter_mut() would
match Perry.
All compiled languages 8–10 ms. Perry 9 ms. This benchmark is cache-bound, not compute-bound — there is no optimization lever to pull. Perry matches the compiled pack.
Java 280 ms (JIT inlining), C++ 310 ms, Perry 311 ms, Rust 319 ms —
Java leads, and C++, Perry, and Rust land within 10 ms of each
other. Perry's type
inference refines the TS number parameter to i64 (because the function
only ever performs integer operations), producing add/sub/icmp (1 cycle
each) instead of the fadd/fsub/fcmp (2–3 cycles) that the f64-typed Rust
and C++ benchmarks emit. The reason Perry isn't dramatically further
ahead is that LLVM's recursion-folding optimizations on fib-shaped code
recover most of the gap at -O3. The Rust f64→i64 switch is a one-line
change (tested in bench_opt.rs) and drops Rust to ~280 ms.
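The benchmark's assumed shape, for reference (a naive recursive fib whose parameter only ever sees integer ops, which is what lets Perry refine number → i64):

```typescript
// Every operation on n is integral (sub, compare, add), so the f64
// parameter can be refined to an integer type without observable change.
function fib(n: number): number {
  return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

console.log(fib(20)); // 6765
```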
Rust/C++/Go/Swift 0 ms: the compiler proves the struct never escapes and
eliminates the whole loop. Java 5 ms, Bun 5 ms, Node 8 ms, Perry 2 ms,
Hermes 2 ms. Perry is competitive here only because of the v0.5.17
scalar-replacement pass; without it this benchmark was ~10 ms. The 0 ms
floor from statically-typed compiled languages is an inherent tradeoff of
compiling a dynamic language — see METHODOLOGY.md.
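An assumed sketch of the benchmark's shape: a per-iteration object that never escapes, which is what scalar replacement dissolves into locals (and what a statically-typed compiler deletes outright).

```typescript
// `p` never escapes its iteration, so a scalar-replacement pass can
// replace the allocation with two locals; with the result unused
// beyond the sum, escape analysis can go further and kill the loop.
function objectCreate(n: number): number {
  let acc = 0;
  for (let i = 0; i < n; i++) {
    const p = { x: i, y: i * 2 }; // short-lived, non-escaping
    acc += p.x + p.y;
  }
  return acc;
}

console.log(objectCreate(10)); // 3 * (0 + 1 + ... + 9) = 135
```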
- `bench.cpp` — C++17
- `bench.rs` — Rust (no dependencies)
- `bench.go` — Go
- `bench.swift` — Swift
- `bench.java` — Java
- `bench.py` — Python 3
- `bench.zig` — Zig (may need manual build; not in the current table)
- Perry / Node / Bun / Hermes run the TS files in `../suite/`
All implementations use the same algorithm, same data types (f64 /
double throughout), same iteration counts, and the same output format
(`benchmark_name:elapsed_ms`) so the runner can grep a single key per row.
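A minimal harness sketch matching that output contract (the `report` name and structure are hypothetical; the real runner's internals aren't shown here):

```typescript
// Times fn and emits the benchmark_name:elapsed_ms line the runner greps.
function report(name: string, fn: () => void): string {
  const t0 = Date.now();
  fn();
  return `${name}:${Date.now() - t0}`;
}

const line = report("accumulate", () => {
  let s = 0;
  for (let i = 0; i < 1e6; i++) s += i % 7;
});
console.log(line); // prints something like accumulate:12
```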