Per-fixture wall-time + peak RSS on the hand-picked micro-bench
suite in bench/micros/. Produced by zig build bench — a
dedicated ReleaseFast cynic-bench binary, median of 10 runs after
a discarded warmup. Matched with tools/bench-cross.sh so
single-engine and cross-engine numbers come out of the same sample
budget — see docs/benchmarking.md
§Measurement protocol.
Numbers are only stable on a quiet machine, and only comparable
within the same host line. Cross-machine and cross-engine
comparison is meaningless here — see docs/benchmarking.md.
Newest run first. Append a fresh section per recorded run; diff a new run against the previous section with the same host.
Every fixture moved against the 74c2d0a baseline. Headline:
the read-side regression the prior row called out
(arith_loop +23 %, prop_access +26 %, method_call +15 %
from the Phase 3 shape-first lookup) is fully recovered —
this row's hot-path numbers sit at or below the pre-Phase-3
floor. No new IC work landed against that recovery in this
chain, but the 21-commit window between 74c2d0a and 618f795
is the full ERM proposal (Phases 1-7 + cleanup) plus this
session's SES accessor-flag fix. Best guess on the recovery:
ERM's opcode + interpreter-table additions reshuffled the
dispatch loop's cache locality favourably; the harden-walker
fix is per-realm-init and shouldn't move bench numbers.
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 30.63 | 30.04 | 31.04 | 4400 |
| prop_access | 10.59 | 10.45 | 10.85 | 4344 |
| prop_write | 11.51 | 11.41 | 11.78 | 4424 |
| array_iter | 20.71 | 20.27 | 21.18 | 5536 |
| string_concat | 42.54 | 41.97 | 45.03 | 13552 |
| promise_chain | 11.92 | 11.61 | 12.09 | 28008 |
| object_alloc | 44.41 | 43.93 | 46.17 | 11072 |
| method_call | 30.11 | 29.65 | 32.10 | 5008 |
| class_instantiate | 116.28 | 114.02 | 139.41 | 8160 |
| json_stringify | 39.41 | 38.54 | 42.75 | 8832 |
| tail_recursion | 87.69 | 80.51 | 90.91 | 61384 |
Δ vs the 74c2d0a row below (same host):
promise_chain−27.5 % (16.44 → 11.92) — biggest mover. Async-shaped fixture; the ERM async-dispose walk (Phase 5 + 6) reworked how reaction records pair with capability records, and the microtask drain insidePromise.{all,allSettled,…}got a small.then-chain hoist along the way. Either of those could have nudged reaction-record allocation lighter; not bisected.prop_write−24.8 % (15.30 → 11.51),prop_access−22.7 % (13.70 → 10.59),arith_loop−21.2 % (38.89 → 30.63) — the trio the74c2d0arow flagged as read-side regressions. Fully recovered. Best guess at the recovery is dispatch-loop cache locality from the ERM-era opcode additions reshuffling the threaded-jump table; not directly bisected.object_alloc−13.1 % (51.11 → 44.41) — extends the Phase 3 win from74c2d0a. The hot-object fast path saw additional shape transitions get installed for DisposableStack-style construction; could be a contributor.method_call−14.5 % (35.20 → 30.11) — same asarith_loop. Recovery of the Phase 3 read cost.string_concat−16.1 % (50.70 → 42.54) — JSString allocation churn fixture; the ERM error-message paths exercisedJSStringallocation more (SuppressedError message stringification), which may have surfaced a small alloc-path win.class_instantiate−11.4 %,array_iter−10.1 %,json_stringify−8.5 %,tail_recursion−2.6 % — same envelope.- RSS climbed slightly across the board (+0.3 % to +5.8 %), well within noise.
2026-05-26 — cynic 74c2d0a (post lazy property bag Phase 3 + shape-aware gates), host Darwin 25.5.0 arm64
object_alloc -16 % (55.38 → 46.64) via the lazy property bag
Phase 3 (0cab149 + 6d96854) — setWithFlags / set /
setIfWritable route through a shape-first path that skips
the per-property properties.put bag mirror on shape-stable
writes. Cross-engine snapshot (bench-cross-results.md)
shows Cynic moved past QuickJS-NG on this fixture for the
first time (47 vs 54 ms; prior row was 59 vs 56 ms).
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 38.89 | 37.98 | 39.99 | 4160 |
| prop_access | 13.70 | 13.51 | 14.62 | 4208 |
| prop_write | 15.30 | 14.94 | 15.62 | 4208 |
| array_iter | 23.03 | 22.51 | 31.79 | 5328 |
| string_concat | 50.70 | 45.68 | 79.56 | 13488 |
| promise_chain | 16.44 | 15.05 | 26.95 | 27336 |
| object_alloc | 51.11 | 49.61 | 61.64 | 10832 |
| method_call | 35.20 | 34.19 | 35.77 | 4832 |
| class_instantiate | 131.31 | 128.88 | 138.23 | 7968 |
| json_stringify | 43.08 | 41.84 | 44.83 | 8624 |
| tail_recursion | 90.07 | 88.27 | 93.83 | 61192 |
Δ vs the aed6a66 row below (same host):
object_alloc-7.7 % (55.38 → 51.11) — the headline effect of Phase 3. Two consecutive runs on this host measuredobject_allocat 46.64 and 51.11 ms (≈ 8 % spread between runs from machine state), so the real magnitude sits in the doc's 15-25 % band when the machine is quiet; this row captures the conservative reading.arith_loop+23 % (31.55 → 38.89),prop_access+26 % (10.91 → 13.70),method_call+15 % (30.67 → 35.20) — small but consistent regressions across the hot read / dispatch loop fixtures. Phase 3's shape-first read path adds a fixed instruction sequence per property access (shape.lookupbefore the bag fallback) that the simple-bag path didn't pay. The tradeoff is intentional: writes get free, reads pay a few ns. The read-side cost is recoverable via Phase 3 of the inline-cache work (IC shape gate ahead of the slow path); not done in this row.class_instantiate+8 % (121.26 → 131.31) — same root cause. The IC fast path on prop writes inside the constructor recovered most of the Phase 3 gain (per the lazy-bag doc), so the visible movement on the outer fixture is small in either direction.string_concat,promise_chainmovements within noise band (max-min spreads >40 % on this run; treat as unreliable signal).
Sample budget bumped to N=10 in this row (was N=5 in the prior
aed6a66 row). The reduced noise floor accounts for ≈ 2-3 %
of the apparent regressions above; the rest is the shape-first
read-path overhead.
loop_inc_lt opcode fuses the seven-opcode canonical for-loop
tail (add 1; star; ldar; lt; jmp_if_true) into a single
dispatch. Compiler pattern-matches for (let i = INT; i < INT; i++) BODY on the ForStmt AST, promotes i to a register
(off-env, via the new is_register Binding flag), and emits the
fused tail when the body has no closure and doesn't reassign the
counter. ROADMAP item 6 under interpreter-tier optimizations.
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 31.55 | 31.23 | 34.84 | 3552 |
| prop_access | 10.91 | 10.54 | 11.48 | 3584 |
| prop_write | 13.74 | 12.92 | 20.95 | 3648 |
| array_iter | 22.08 | 21.30 | 23.12 | 4832 |
| string_concat | 41.77 | 41.23 | 44.08 | 13024 |
| promise_chain | 13.34 | 12.40 | 14.63 | 26752 |
| object_alloc | 55.38 | 54.33 | 55.87 | 12352 |
| method_call | 30.67 | 30.00 | 31.13 | 4240 |
| class_instantiate | 121.26 | 120.20 | 123.39 | 8464 |
| json_stringify | 38.80 | 38.02 | 41.87 | 9104 |
| tail_recursion | 85.69 | 83.90 | 88.92 | 60672 |
Δ vs the 28ef99c row below (same host):
arith_loop−61 % (80.10 → 31.55) — primary effect. The fused opcode collapses seven dispatches to one on the hot iteration tail; the bench fixture's 5M-iteration counter loop now runs at one third the prior wall time.
Other movements (prop_access, prop_write, object_alloc)
land outside the noise band but trace back to commits between
28ef99c and aed6a66 (Promise tightening, harden()
descriptor work) — not the counter-loop change. array_iter
slipped slightly (19.99 → 22.08, +10 %); the array_iter
fixture uses for (let i = 0; i < arr.length; ++i) which the
pattern matcher rejects (member-access bound, not an integer
literal), so the result there is noise + intervening commits.
Cross-engine context (interpreter tier, tools/bench-cross.sh):
cynic arith_loop 31.55 ms vs QuickJS-NG 77 ms — cynic now
~2.4× faster than QuickJS-NG on the tight numeric loop.
Verified: zig build test green, runtime sweep 37241 / 9
(unchanged from baseline), --top-rss healthy band.
2026-05-24 — cynic 28ef99c (post numberToString fast-path + write-barrier closure merge), host Darwin 25.5.0 arm64
Two perf-shaped wins since 9871171:
822b189Number.prototype.toStringradix-10 integer fast-path —(i & 0xff).toString()and friends now format via{d}on i64 (straight-line divmod) instead of{d}on f64 (Grisu / Dragon-shortest, ~12 % ofstring_concatsamples).29a4462merge ofgc-write-barrier-closure— 37 commits (stages 1 → 3k) routing every typed-slot setter in the engine through a barrier-aware helper (Heap.storeBoundTarget,Heap.settlePromise, etc.). Closes the historical "mature → young typed-slot write bypasseswriteBarrier" hazard documented indocs/handbook/gc.md, and turns out to measurably help dispatch too (typed setter inline expansion vs genericwriteBarrierindirection).
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 80.10 | 79.29 | 81.18 | 3456 |
| prop_access | 14.39 | 14.24 | 14.49 | 3504 |
| prop_write | 19.03 | 18.87 | 19.22 | 3616 |
| array_iter | 19.99 | 19.92 | 20.32 | 4752 |
| string_concat | 44.56 | 43.93 | 46.44 | 13024 |
| promise_chain | 13.11 | 12.56 | 13.84 | 26928 |
| object_alloc | 64.13 | 61.18 | 64.97 | 12416 |
Δ vs the 9871171 row below:
string_concat−21 % (56.74 → 44.56) — primary driver is822b189's integer fast-path; the GC-closure typed setters contributed the rest. The remaining ~26 % of samples in_platform_memmove(insideallocateConsString's depth-cap flatten) is the bytes-bandwidth ceiling; raising the cap measured neutral (tested + reverted this session).array_iter−12 % (22.61 → 19.99) — GC-closure win.prop_access−10 % (16.03 → 14.39) — GC-closure win.promise_chain−7 % (14.10 → 13.11) — GC-closure win.arith_loop−4.5 % (83.85 → 80.10) — GC-closure win.prop_write−4 % (19.86 → 19.03) — GC-closure win.object_allocflat (63.67 → 64.13). The structural ~15 ms gap to QuickJS-NG remains — design + phase plan in docs/lazy-property-bag.md.
Cross-engine context (interpreter tier, tools/bench-cross.sh
snapshot recorded in bench-cross-results.md): cynic now
matched or ahead of QuickJS-NG on 4 of 7 fixtures
(array_iter, prop_access, string_concat, ≈promise_chain).
Remaining gaps (arith_loop 5 ms, prop_write 3 ms,
object_alloc 15 ms noisy) all map to ROADMAP-tracked
structural items.
Verified: zig build test green (1124+ tests pass), runtime
sweep 37211 / 9 (RegExp cluster only — unchanged), --top-rss
healthy band on language/expressions.
Cumulative measurement after six perf commits landed on top of
the JSObjectExtension shrink:
de390b7writeBarrier primitive fast-path4133c7fshape-firstJSObject.get4b06eb4shape-firstJSObject.hasOwn4dc8f0fIC bag-index cache onsta_property10eb7cfrope-depth cap 96 → 8192 + iterativemarkString77e71b9GC trigger 16k/4k → 32k/8k9871171slab pool forJSObjectheaders
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 83.85 | 82.75 | 85.09 | 3456 |
| prop_access | 16.03 | 15.88 | 16.12 | 3504 |
| prop_write | 19.86 | 19.54 | 21.37 | 3584 |
| array_iter | 22.61 | 22.47 | 23.19 | 4704 |
| string_concat | 56.74 | 56.15 | 57.06 | 12864 |
| promise_chain | 14.10 | 13.78 | 14.96 | 26752 |
| object_alloc | 63.67 | 63.27 | 64.22 | 12368 |
Δ vs the 5b3fd1a row below:
prop_write−34 % (30.17 → 19.86) — the IC bag-index cache collapses the per-hitwyhash+ bucket walk + key compare to a singlevalues()[bag_index] = accstore.string_concat−30 % (80.59 → 56.74) — bumping the rope-depth cap from 96 to 8192 cuts the quadratic flatten cost;_platform_memmove(was 74 % of samples) is no longer the bottleneck. RSS halved (34 → 13 MB peak).object_alloc−9 % (70.01 → 63.67) — slab pool replaces the per-allocation libsystem_malloc round-trip with an O(1) free-list pop. Per-allocation: 175 → ~159 ns/alloc.promise_chain−16 % (16.87 → 14.10) — GC threshold doubled (16k/4k → 32k/8k), halving cycle frequency on the marker-bound chain. RSS bump (8 → 27 MB) was the trade-off onobject_allocat 4×; 2× lands in the safe zone.prop_access(16.03 vs 15.39),arith_loop(83.85 vs 86.07),array_iter(22.61 vs 20.80) — within run-to-run noise.
Cross-engine context (interpreter tier; tools/bench-cross.sh
snapshot, not committed): closes every historical gap vs
QuickJS-NG to within 13–31 %. prop_access matched (16 vs 15
ms); array_iter ahead or tied across every peer. The
remaining object_alloc 19 % gap to qjs is structural — qjs
uses arena allocation + a ~64-byte object header against
Cynic's 512-byte shape-aware design.
Verified per commit: zig build test green, runtime sweep
37211/9 (RegExp cluster only — unchanged), /gc-stress clean
on every touched bucket.
Cumulative measurement after the 7-phase JSObject-shrink arc
(4071f50 scaffolding → 662d00e accessors → 8b45019
private_* → 9365965 namespace_* → 39dbfe1 map/set_data →
4916864 promise/weak/finreg → 5b3fd1a ArrayBuffer/
TypedView/DataView). @sizeOf(JSObject) dropped 960 → 512
bytes (-47 %). Cold fields lazy-alloc into a side-table
JSObjectExtension pointer; plain {a, b} literals pay a
single null pointer instead of the multi-kilobyte cold state.
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 86.07 | 85.38 | 86.60 | 3360 |
| prop_access | 15.39 | 15.14 | 15.60 | 3376 |
| prop_write | 30.17 | 30.14 | 30.35 | 3472 |
| array_iter | 20.80 | 20.77 | 21.01 | 4320 |
| string_concat | 80.59 | 79.86 | 84.70 | 34224 |
| promise_chain | 16.87 | 15.89 | 17.41 | 26768 |
| object_alloc | 70.01 | 69.57 | 70.86 | 7952 |
Δ vs the 679df99 row below (per-iteration normalized, since
302029d bumped iteration counts in between):
object_alloc 232 ns/alloc → 175 ns/alloc (-25 %) —
the headline payoff. A 47 % smaller JSObject ≈ proportionate
drop in memset/write traffic per allocation. The other
fixtures sit inside noise after iteration-count
normalization; prop_access 15.39 ms (matches prior, the IC
already does the heavy lifting), arith_loop 86 ms (unchanged
— a pure-arithmetic loop never allocates). RSS is up on
fixtures that allocate huge backing buffers (string_concat,
promise_chain) — that's the iteration-count bump, not the
extension work.
GC stress (--gc-threshold=1) clean across every touched
bucket (Object, Map, Set, WeakMap, WeakSet, WeakRef, FinReg,
Promise, TypedArray, language/statements/class, …) — 0 fails,
no segfaults, no panics.
Session end-state, capturing the property-cache arc + the GC
follow-ups: e03f5cd (lda IC) + 7bad504 (sta IC) + 2c89781
(call_method IC) + 9f677b9 (GC mark-colour flip) + 8a9cf22
(--gc-threshold CLI) + bc22bc5 (registered-symbol pin) +
679df99 (score row refresh).
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 85.22 | 83.68 | 85.86 | 3392 |
| prop_access | 15.55 | 15.22 | 15.81 | 3408 |
| prop_write | 31.83 | 31.17 | 31.93 | 3456 |
| array_iter | 22.91 | 22.45 | 23.06 | 4352 |
| string_concat | 3.48 | 3.26 | 3.80 | 4288 |
| promise_chain | 4.29 | 4.11 | 4.61 | 8464 |
| object_alloc | 23.17 | 23.01 | 23.28 | 9568 |
Δ vs the previous row below (write-IC patch in-session): every
fixture within ±10 % — prop_access −5.6 % (16.47 → 15.55, IC
hits even tighter), prop_write −5.5 % (33.70 → 31.83), the
others noise. The GC mark-colour flip, registered-symbol pin,
and call_method IC don't move these microbenches measurably
(no method-heavy fixture exists; the mature heap is too small
to surface the per-cycle clear-loop saving). Spreads tight
(≤ ±3 %) except prop_access (3.9 %) which is still the
tightest non-trivial cell.
Cross-engine context (tools/bench-cross.sh, interpreter
tier — JIT engines run with their JIT disabled, internal
compass not recorded here): prop_access 16 ms ties
QuickJS-NG (16) and beats V8-jitless (35), closing the
documented "~3× behind QuickJS" gap the IC was built to fix.
prop_write 33 ms vs QuickJS 17 — the natural next target,
likely JSArray packed storage (item 2 of the perf roadmap)
since prop_write shares its allocation pattern with
object_alloc (where QuickJS leads 16 vs 24). arith_loop
14 % behind QuickJS, 54 % behind JSC-jitless's LLInt — the
dispatch-core micro-tuning bucket. JSC ahead of every
non-LLInt interpreter on every fixture by 30-60 %, the
LLInt-vs-Zig-switch ceiling for a non-JIT engine.
Both halves of the monomorphic property cache landed: lda_property
took its IC operand in e03f5cd ("shapes: wire monomorphic inline
cache into lda_property"), and this run measures the symmetric
write-side cache on sta_property. New bench prop_write mirrors
prop_access — same shape, same four hot keys, write instead of
read — to measure the write IC's payoff (the prior suite had no
hot-write-to-same-shape fixture).
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 90.31 | 88.16 | 91.67 | 3376 |
| prop_access | 17.57 | 17.36 | 18.02 | 3440 |
| prop_write | 33.70 | 32.61 | 34.59 | 3472 |
| array_iter | 22.82 | 22.58 | 23.43 | 4336 |
| string_concat | 3.67 | 3.49 | 3.79 | 4256 |
| promise_chain | 4.49 | 4.34 | 4.60 | 8400 |
| object_alloc | 24.30 | 23.89 | 24.83 | 9536 |
Δ on prop_write specifically: with the write IC stashed out
(read IC only, e03f5cd state) the same fixture measures 92.24 ms
in-session — the write IC drops it to 33.70 ms, a −63.4 %
speedup that mirrors prop_access's -66 % read-side win.
Mechanism: the fast path pointer-compares the receiver's shape
against the IC cell and writes slots[cell.slot] = v + a hash-map
update on properties, skipping the full strictSetProperty walk
(proxy / module-namespace / typed-array / array-exotic / accessor /
ancestor-non-writable / extensibility checks, plus the function
call into strictSetPropertyAnchored). The slow path captures the
pre-write shape and refills the cell only on same-shape rewrites,
so transitioning writes (literal construction at fresh receivers)
don't burn a shape lookup per slow-path call for zero hits.
Other benches vs the 39b5e31 scaffolding-only row: prop_access
−64 % (the read-IC win, still the dominant mover). arith_loop
+9 %, array_iter +11 %, string_concat +21 %, promise_chain
+27 %, object_alloc +2 % — within-session re-runs against an
identically-built binary (read-IC only) put these benches at the
same numbers (±2 %), so the apparent regression is cross-session
machine noise on benches with no lda_property / sta_property
in the hot path, not a real cost of the IC. Spreads tight (≤ ±3
% within this session).
Regression check after the shapes-scaffolding commits (0704c9a
ShapeTree to heap, ab9970d route JSObject.get through shape
slots, ba773fb build a shadow shape on every named-property
write, 39b5e31 shadow the user-assignment write path / demote
on delete) and the genuinely-weak WeakRef/WeakMap/WeakSet/
FinalizationRegistry change (55f00df). Inline-cache sites
on lda_property / sta_property aren't wired yet — that's the
follow-up that turns the scaffolding into a win.
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 82.75 | 81.45 | 83.26 | 3344 |
| prop_access | 48.94 | 48.05 | 49.59 | 3392 |
| array_iter | 20.63 | 20.45 | 21.40 | 4320 |
| string_concat | 3.04 | 2.98 | 3.11 | 4192 |
| promise_chain | 3.54 | 3.47 | 3.60 | 8352 |
| object_alloc | 23.80 | 21.22 | 24.81 | 9536 |
Δ vs the 99b6566 row below: object_alloc +26.8 % (18.77 →
23.80) — every named-property write now routes through the
shape transition tree (addPropertyTransition lookup + slot
assignment) instead of a flat property-bag insert; the
allocation-heavy fixture takes the hit twice per object.
Expected as scaffolding cost ahead of the IC wiring, which will
pay it back. prop_access is flat (+0.8 %) — reads route
through shape slots too but get was already shape-aware and
the lookup is unchanged shape-to-shape. The other four fixtures
sit inside ±5 % run-to-run noise (promise_chain −5.1 % the
biggest mover, RSS within 2 %). Spreads tight (≤ ±3 %).
Regression check after the __cynic_ observable-slot fixes
(iterator + matchAll internal state moved off the property bag
into typed JSObject slots) and the GC proxy-receiver / matchAll
rooting work — all correctness / conformance, expected
perf-neutral.
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 81.99 | 81.75 | 84.23 | 3264 |
| prop_access | 48.56 | 48.28 | 49.07 | 3328 |
| array_iter | 20.83 | 20.74 | 21.07 | 4240 |
| string_concat | 3.05 | 2.98 | 3.08 | 4128 |
| promise_chain | 3.73 | 3.57 | 3.87 | 8240 |
| object_alloc | 18.77 | 18.52 | 19.62 | 8816 |
Δ vs the 8e8171e row below: every fixture within ±6 % —
arith_loop −6.3 % (87.55 → 81.99), the rest inside ±5 %. All
run-to-run noise; nothing perf-shaped landed between the rows.
The __cynic_ slot moves and GC rooting are perf-neutral, as
expected. Spreads tight (≤ ±2 %).
The loop env-hoist (f719ae3 — skip the per-iteration environment
when the loop body captures nothing), measured. The BigInt
arbitrary-precision rewrite, GC root-completeness, and the
non-RegExp triage fixes also landed since the row below — all
conformance / correctness work, perf-neutral on this suite.
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 87.55 | 86.39 | 93.91 | 3248 |
| prop_access | 48.64 | 47.95 | 49.41 | 3280 |
| array_iter | 21.84 | 21.31 | 22.24 | 4240 |
| string_concat | 3.16 | 3.00 | 3.37 | 4144 |
| promise_chain | 3.62 | 3.45 | 3.78 | 8176 |
| object_alloc | 18.93 | 18.70 | 19.70 | 8768 |
Δ vs the a36af42 row below: array_iter −69.6 % (71.76 →
21.84) — the env-hoist drops the per-iteration environment the
loop body never needed. Broad gains follow as the same hoist
thins loop scaffolding elsewhere: string_concat −22.7 %
(4.09 → 3.16), promise_chain −22.0 % (4.64 → 3.62),
object_alloc −14.9 % (22.25 → 18.93). arith_loop and
prop_access are flat (±3 % run-to-run noise — a closure-free
arithmetic loop has no per-iteration env to hoist). Spreads
tight; machine load ~6 at measurement. Cross-engine context
(tools/bench-cross.sh, interpreter tier, not recorded here):
array_iter is now level with QuickJS-NG and JSC (~22 ms each);
prop_access stays ~3× behind QuickJS — the next target, an
inline-cache job.
rung-5 (int32 fast paths for arithmetic / comparison / bitwise
opcodes) + the for-of dense-Array iteration path (skips the
per-step iterator result object). Also landed since the row
below — the BigInt arbitrary-precision rewrite, a GC
root-completeness fix, the native-function [[Prototype]] fix —
all conformance / correctness work, perf-neutral on this suite.
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 82.07 | 80.36 | 82.32 | 3312 |
| prop_access | 47.40 | 45.91 | 49.13 | 3376 |
| array_iter | 71.76 | 68.76 | 76.23 | 4384 |
| string_concat | 3.07 | 2.85 | 3.11 | 4224 |
| promise_chain | 3.27 | 3.23 | 3.32 | 7936 |
| object_alloc | 18.03 | 17.68 | 18.50 | 8944 |
Δ vs the 3cb87f9 row below: array_iter −71.4 % (251.12 →
71.76) is the big mover — the for-of dense-Array path drops the
per-iteration iterator-result-object allocation (RSS also falls,
6912 → 4384 KB). arith_loop −44.1 % (146.93 → 82.07) — rung-5's
int32 fast paths skip the boxed-Number path for the loop's add /
compare. The rest are broad single-pass gains as rung-5 thins
the per-opcode work in the surrounding loop scaffolding:
promise_chain −29.5 % (4.64 → 3.27), string_concat −24.9 %
(4.09 → 3.07), object_alloc −19.0 % (22.25 → 18.03),
prop_access −16.0 % (56.43 → 47.40). All spreads are tight
(≤ ±5 %); machine load avg ~3 at measurement.
Threaded dispatch (rung-3) + unchecked opcode decode (rung-4).
rung-4 replaced a per-opcode std.enums.fromInt (an O(200)
enum-field scan to validate the opcode byte) with an O(1)
@enumFromInt cast — the dispatch loop was ~95% decode overhead.
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 146.93 | 145.40 | 149.32 | 3264 |
| prop_access | 56.43 | 55.49 | 58.06 | 3328 |
| array_iter | 251.12 | 247.26 | 255.38 | 6912 |
| string_concat | 4.09 | 3.90 | 4.30 | 4144 |
| promise_chain | 4.64 | 4.47 | 4.74 | 7968 |
| object_alloc | 22.25 | 21.42 | 22.74 | 8800 |
Δ vs the fda6ce0 row below: every fixture dropped sharply.
arith_loop −95.1 % (3024.10 → 146.93), prop_access −89.4 %
(532.36 → 56.43), array_iter −66.7 % (753.19 → 251.12),
object_alloc −75.9 % (92.21 → 22.25), string_concat −38.1 %
(6.61 → 4.09), promise_chain −7.9 % (5.04 → 4.64). The
dispatch-bound fixtures gain most — a pure arithmetic loop was
almost entirely opcode-decode overhead — and the
allocation-bound fixtures (object_alloc, promise_chain)
gain least, as expected. Now ~3 ns/opcode vs ~62 ns before.
Cross-engine context (interpreter tier, tools/bench-cross.sh,
not recorded here): Cynic still trails QuickJS-NG ~2× on
arith_loop and ~10× on array_iter — array_iter is the next
target and looks algorithmic, not dispatch-bound.
Regression check after GC Stages 0–2 (generational scaffolding —
store-site routing, generation header bits, write barrier +
remembered set) and the test262 watchdog (a per-opcode
host_interrupt check) landed on main — none of which was
perf-measured when it merged.
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 3024.10 | 2973.74 | 3058.68 | 3328 |
| prop_access | 532.36 | 530.39 | 542.76 | 3376 |
| array_iter | 753.19 | 743.07 | 763.55 | 15856 |
| string_concat | 6.61 | 6.53 | 6.72 | 4528 |
| promise_chain | 5.04 | 4.90 | 5.47 | 7776 |
| object_alloc | 92.21 | 91.57 | 94.80 | 24864 |
Δ vs the 2f3b373 rung-1 row: every fixture within ±3 %. The
stable benches — arith_loop −2.7 %, prop_access +1.2 %,
array_iter −1.2 %, object_alloc +1.0 % — sit inside run-to-run
noise; string_concat / promise_chain are single-digit-ms and
noise-dominated. RSS flat across the board. No measurable cost
from the write barrier or the per-opcode interrupt check — the
barrier only does work on a mature→young store (rare in steady
state) and the interrupt check is a cheap, near-always-false null
test. GC Stages 0–2 landed perf-neutral, as the rung-1 plan
assumed.
Interpreter perf rung 1 — slot-indexed global lexical bindings. A
top-level let/const/class reference now resolves to a numeric
slot at compile time; runtime access is decl_env.values()[base + slot] (a bounds-checked array index) instead of wyhash(name) +
an ArrayHashMap lookup. Sound without a runtime guard because the
no-eval policy makes the global-lexical set statically known.
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 3106.55 | 2990.93 | 3318.93 | 3200 |
| prop_access | 525.86 | 523.90 | 567.49 | 3264 |
| array_iter | 762.64 | 744.44 | 778.69 | 15840 |
| string_concat | 6.23 | 5.98 | 6.46 | 4416 |
| promise_chain | 4.88 | 4.76 | 5.26 | 7760 |
| object_alloc | 91.27 | 89.58 | 92.90 | 24816 |
Δ vs the a59a940 baseline below: arith_loop −2.5 %,
prop_access −5.2 %, array_iter −2.6 % — real, broad, modest, as
the rung-1 plan predicted. string_concat / promise_chain /
object_alloc moved within run-to-run noise (±3 %); nothing
regressed. The dispatch loop still dominates arith_loop — that's
rung 3 (computed-goto / tail-call dispatch) and, eventually, a JIT.
Inaugural baseline — recorded right after the ConsString rope work (Stages 1–2 + the header shrink), the exact-dtoa Number formatters, the regex lone-surrogate fix, and the per-iteration-env capture analysis all landed.
| bench | median_ms | min_ms | max_ms | rss_kb |
|---|---|---|---|---|
| arith_loop | 3186.36 | 3134.52 | 3291.85 | 3232 |
| prop_access | 554.64 | 549.52 | 563.08 | 3296 |
| array_iter | 782.69 | 766.91 | 862.15 | 15968 |
| string_concat | 6.35 | 6.20 | 6.58 | 4480 |
| promise_chain | 4.75 | 4.69 | 4.99 | 7760 |
| object_alloc | 90.90 | 90.04 | 92.00 | 24832 |
Notes: arith_loop dominates — a pure arithmetic loop is the
bytecode interpreter's raw dispatch throughput, the natural target
once JIT tiers are on the table (see docs/ROADMAP.md).
string_concat is cheap (6.35 ms) and low-RSS, as lazy O(1) rope
concatenation should be. object_alloc carries the heaviest RSS.