
Performance: inline array operations and close 2.2x gap with Bun #35

@mjm918

Description


Problem

naml is 2.2x slower than Bun on a binary tree benchmark (100k inserts, 100k searches, recursive traversals). The target was Rust-level performance.

naml:  166.75 ms (Cranelift JIT)
Bun:    74.81 ms (JavaScriptCore JIT)

Root Cause

Every array operation (arr[i], arr.push(v), arr.len()) compiles to a full C ABI runtime function call instead of inline Cranelift IR. This accounts for the majority of the gap.

| Operation | naml (current) | Rust/Bun equivalent | Overhead |
|---|---|---|---|
| `arr[i]` | call `naml_array_get` (~50 cycles) | inline `mov` (~1-2 cycles) | ~50x |
| `arr.push(v)` | call `naml_array_push` (~100 cycles) | inline store + capacity check (~5 cycles) | ~20x |
| `arr.len()` | call `naml_array_len` (~40 cycles) | inline field load (~1 cycle) | ~40x |

With 100k+ array operations in hot paths, this adds ~10-20M excess CPU cycles.

Additional Bottlenecks

P0 — Inline array operations

  • naml_array_get → emit Cranelift IR: load data ptr from array struct, bounds check, load [data + index * 8]
  • naml_array_set → same pattern with store
  • naml_array_len → single field load from array struct offset
  • naml_array_push → inline fast path (len < capacity), call runtime only for realloc

P1 — Cache function signatures

  • Every call site does make_signature() + declare_function() (131 occurrences in codegen)
  • Should declare each runtime function once and reuse FuncRef

P1 — Booleans as i8 instead of i64

  • All bools use types::I64 — wastes registers and cache
  • Every comparison does icmp + uextend to i64, an unnecessary widening step

P2 — Skip refcount for function-local array args

  • Arrays passed to recursive functions (tree_size(lefts, rights, node)) get atomic incref/decref per call frame
  • These arrays are never freed mid-recursion — refcount is pure overhead

P3 — Loop-invariant hoisting

  • arr.len() is re-called every loop iteration even when array doesn't change
  • No bounds check elimination for i < len guarded loops

P3 — Array literal bulk init

  • [1, 2, 3] compiles to array_new() + N × array_push() calls
  • Should be a single array_from_values() call

Expected Impact

Inlining array ops alone (P0) should close most of the 2.2x gap. Combined with P1 fixes, naml should match or beat Bun and approach native Rust speed for array-heavy workloads.

Benchmark

Run benches/run_bench.sh to reproduce. Both naml and Bun produce identical output:

nodes: 100000
found: 100000
sum: 49999950000
