## Problem

naml is 2.2x slower than Bun on a binary tree benchmark (100k inserts, 100k searches, recursive traversals). The target was Rust-level performance.

- naml: 166.75 ms (Cranelift JIT)
- Bun: 74.81 ms (JavaScriptCore JIT)
## Root Cause
Every array operation (arr[i], arr.push(v), arr.len()) compiles to a full C ABI runtime function call instead of inline Cranelift IR. This accounts for the majority of the gap.
| Operation | naml (current) | Rust/Bun equivalent | Overhead |
|---|---|---|---|
| `arr[i]` | `call naml_array_get` (~50 cycles) | Inline `mov` (~1-2 cycles) | ~50x |
| `arr.push(v)` | `call naml_array_push` (~100 cycles) | Inline store + capacity check (~5 cycles) | ~20x |
| `arr.len()` | `call naml_array_len` (~40 cycles) | Inline field load (~1 cycle) | ~40x |
With 100k+ array operations in hot paths, this adds ~10-20M excess CPU cycles.
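The ~10-20M figure can be sanity-checked from the table. A minimal sketch, assuming roughly 200k get-type operations and 100k pushes in the hot path (the exact op counts are assumptions, not measurements):

```rust
// Back-of-envelope check of the "~10-20M excess cycles" estimate. Per-op
// cycle costs come from the table above; the op counts are assumptions.
fn excess_cycles(ops: u64, call_cost: u64, inline_cost: u64) -> u64 {
    ops * (call_cost - inline_cost)
}

// ~200k gets at 50 vs 2 cycles   -> 9.6M excess cycles
// ~100k pushes at 100 vs 5 cycles -> 9.5M excess cycles
// Total ~19M, squarely inside the 10-20M range claimed above.
```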
## Additional Bottlenecks

### P0 — Inline array operations

- `naml_array_get` → emit Cranelift IR: load the `data` pointer from the array struct, bounds check, `load [data + index * 8]`
- `naml_array_set` → same pattern with `store`
- `naml_array_len` → single field load from the array struct offset
- `naml_array_push` → inline fast path (`len < capacity`), call the runtime only for realloc
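The semantics the inlined IR must reproduce can be sketched in Rust. This assumes a `{len, cap, data}` header layout with 8-byte elements; naml's actual array struct may differ:

```rust
// Hypothetical naml array header; the real field order and names may differ.
#[repr(C)]
struct NamlArray {
    len: u64,
    cap: u64,
    data: *mut i64, // elements are 8 bytes, hence `index * 8` addressing
}

// What the inlined Cranelift IR for `arr[i]` must do: one field load for
// the length, a bounds check (a trap in generated code), then a single
// indexed load instead of a full C ABI call.
unsafe fn inline_get(arr: *const NamlArray, index: u64) -> i64 {
    let len = (*arr).len; // single field load
    if index >= len {
        panic!("index out of bounds"); // becomes a trap in the JIT output
    }
    *(*arr).data.add(index as usize) // load [data + index * 8]
}
```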
### P1 — Cache function signatures

- Every call site does `make_signature()` + `declare_function()` (131 occurrences in codegen)
- Should declare each runtime function once and reuse the `FuncRef`
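The caching pattern is ordinary memoization. A minimal sketch, using a plain `HashMap` and a `u32` stand-in for Cranelift's `FuncRef` (the real cache would key the actual runtime function names and store real `FuncRef`s):

```rust
use std::collections::HashMap;

// Stand-in for cranelift's FuncRef in this sketch.
type FuncRefId = u32;

struct RuntimeFnCache {
    refs: HashMap<&'static str, FuncRefId>,
    next_id: FuncRefId,
    declarations: u32, // counts how many times we actually "declare"
}

impl RuntimeFnCache {
    fn new() -> Self {
        Self { refs: HashMap::new(), next_id: 0, declarations: 0 }
    }

    // Declare once, reuse thereafter: the expensive make_signature() +
    // declare_function() pair runs only on the first request per name.
    fn get_or_declare(&mut self, name: &'static str) -> FuncRefId {
        if let Some(&r) = self.refs.get(name) {
            return r; // cache hit: no re-declaration
        }
        self.declarations += 1; // where make_signature() would actually run
        let id = self.next_id;
        self.next_id += 1;
        self.refs.insert(name, id);
        id
    }
}
```

With this in place, the 131 declare sites collapse to 131 cheap map lookups and one declaration per distinct runtime function.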
### P1 — Booleans as i8 instead of i64

- All bools use `types::I64`, wasting registers and cache
- Every comparison does `icmp` → `uextend` to i64, an unnecessary widening
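The invariant the change must preserve can be sketched in Rust: a comparison result is already a 0/1 value at any width, so widening is only required at the point where a bool is stored into an i64 slot, not after every `icmp`:

```rust
// Keep the icmp result narrow: 0/1 fits in a byte and can feed branches
// and boolean ops directly.
fn cmp_as_i8(a: i64, b: i64) -> i8 {
    (a < b) as i8
}

// The only place a uextend is genuinely required: widening a stored flag
// back to a full i64 slot.
fn widen_for_storage(flag: i8) -> i64 {
    flag as i64
}
```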
### P2 — Skip refcount for function-local array args

- Arrays passed to recursive functions (`tree_size(lefts, rights, node)`) get an atomic incref/decref per call frame
- These arrays are never freed mid-recursion, so the refcounting is pure overhead
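The elision argument in miniature: the caller's reference keeps the array alive for the entire recursion, so borrowed parameters need no refcount traffic. A toy sketch with a single array and heap-style child indexing (the real benchmark uses separate `lefts`/`rights` arrays):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Toy refcounted array handle standing in for a naml array.
struct RcArray {
    refcount: AtomicU64,
    data: Vec<i64>,
}

// Current scheme: every call frame pays an atomic inc and dec, even though
// the count can never reach zero while the caller holds its reference.
fn tree_size_counted(arr: &RcArray, node: usize) -> u64 {
    arr.refcount.fetch_add(1, Ordering::Relaxed); // incref on entry
    let size = if node >= arr.data.len() {
        0
    } else {
        1 + tree_size_counted(arr, 2 * node + 1)
          + tree_size_counted(arr, 2 * node + 2)
    };
    arr.refcount.fetch_sub(1, Ordering::Relaxed); // decref on exit
    size
}

// Proposed scheme: a borrowed parameter skips refcounting entirely; the
// result is identical because the array outlives the whole recursion.
fn tree_size_borrowed(arr: &RcArray, node: usize) -> u64 {
    if node >= arr.data.len() {
        0
    } else {
        1 + tree_size_borrowed(arr, 2 * node + 1)
          + tree_size_borrowed(arr, 2 * node + 2)
    }
}
```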
### P3 — Loop-invariant hoisting

- `arr.len()` is re-called on every loop iteration even when the array doesn't change
- No bounds-check elimination for `i < len`-guarded loops
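What the optimized lowering should look like, sketched in Rust: when the loop body never mutates the array, the length is loaded once before the loop, and the `i < len` guard makes the per-access bounds check provably redundant:

```rust
// The shape the compiler should emit for `while i < arr.len()` over an
// unmodified array: len is hoisted out of the loop and lives in a register.
fn sum_hoisted(arr: &[i64]) -> i64 {
    let len = arr.len(); // hoisted: computed once before the loop
    let mut sum = 0;
    let mut i = 0;
    while i < len {
        // With `i < len` established by the loop guard, the per-iteration
        // bounds check on this access can be elided.
        sum += arr[i];
        i += 1;
    }
    sum
}
```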
### P3 — Array literal bulk init

- `[1, 2, 3]` compiles to `array_new()` + N × `array_push()` calls
- Should be a single `array_from_values()` call
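The two lowerings, sketched with `Vec<i64>` standing in for a naml array (the `array_from_values()` runtime entry point is the one proposed above, not an existing API):

```rust
// Current lowering: array_new() followed by one runtime call per element,
// with possible capacity regrowth along the way.
fn from_pushes(values: &[i64]) -> Vec<i64> {
    let mut arr = Vec::new(); // array_new()
    for &v in values {
        arr.push(v); // N separate array_push() runtime calls
    }
    arr
}

// Proposed lowering: one allocation sized up front, one bulk copy.
fn from_values(values: &[i64]) -> Vec<i64> {
    let mut arr = Vec::with_capacity(values.len()); // single allocation
    arr.extend_from_slice(values); // single memcpy-style copy
    arr
}
```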
## Expected Impact
Inlining array ops alone (P0) should close most of the 2.2x gap. Combined with P1 fixes, naml should match or beat Bun and approach native Rust speed for array-heavy workloads.
## Benchmark

Run `benches/run_bench.sh` to reproduce. Both naml and Bun produce identical output:

```
nodes: 100000
found: 100000
sum: 49999950000
```