
Performance: inline array operations and close 2.2x gap with Bun #35

@mjm918

Description


Problem

naml is 2.2x slower than Bun on a binary tree benchmark (100k inserts, 100k searches, recursive traversals). The target was Rust-level performance.

naml:  166.75 ms (Cranelift JIT)
Bun:    74.81 ms (JavaScriptCore JIT)

Root Cause

Every array operation (arr[i], arr.push(v), arr.len()) compiles to a full C ABI runtime function call instead of inline Cranelift IR. This accounts for the majority of the gap.

| Operation | naml (current) | Rust/Bun equivalent | Overhead |
|---|---|---|---|
| `arr[i]` | call `naml_array_get` (~50 cycles) | inline `mov` (~1-2 cycles) | ~50x |
| `arr.push(v)` | call `naml_array_push` (~100 cycles) | inline store + capacity check (~5 cycles) | ~20x |
| `arr.len()` | call `naml_array_len` (~40 cycles) | inline field load (~1 cycle) | ~40x |

With 100k+ array operations in hot paths, this adds ~10-20M excess CPU cycles.

Additional Bottlenecks

P0 — Inline array operations

  • naml_array_get → emit Cranelift IR: load data ptr from array struct, bounds check, load [data + index * 8]
  • naml_array_set → same pattern with store
  • naml_array_len → single field load from array struct offset
  • naml_array_push → inline fast path (len < capacity), call runtime only for realloc

P1 — Cache function signatures

  • Every call site does make_signature() + declare_function() (131 occurrences in codegen)
  • Should declare each runtime function once and reuse FuncRef

P1 — Booleans as i8 instead of i64

  • All bools use types::I64 — wastes registers and cache
  • Every comparison does icmp + uextend to i64, an unnecessary widening step

P2 — Skip refcount for function-local array args

  • Arrays passed to recursive functions (tree_size(lefts, rights, node)) get atomic incref/decref per call frame
  • These arrays are never freed mid-recursion — refcount is pure overhead

P3 — Loop-invariant hoisting

  • arr.len() is re-called every loop iteration even when array doesn't change
  • No bounds check elimination for i < len guarded loops

P3 — Array literal bulk init

  • [1, 2, 3] compiles to array_new() + N × array_push() calls
  • Should be a single array_from_values() call

Expected Impact

Inlining array ops alone (P0) should close most of the 2.2x gap. Combined with P1 fixes, naml should match or beat Bun and approach native Rust speed for array-heavy workloads.

Benchmark

Run benches/run_bench.sh to reproduce. Both naml and Bun produce identical output:

nodes: 100000
found: 100000
sum: 49999950000
