Releases: john-rocky/CoreML-LLM
v1.5.0 — Qwen3-VL 2B stateful MLState + multifunction prefill
Phase 1 ship: ports the ANEMLL Qwen3-1.7B recipe onto Qwen3-VL 2B so the
text backbone runs ANE-resident and TTFT drops ~4× on vision prompts vs
v1.4.0.
iPhone 17 Pro (measured)
| | v1.5.0 | v1.4.0 |
|---|---|---|
| decode | 22–24 tok/s | ~10 tok/s |
| phys_footprint | 256 MB | 1.7 GB |
| vision TTFT (1st turn, ~200 tok prompt) | 2.7 s | ~5.5 s |
| vision TTFT (with chat history, ~600 tok) | 4.0 s | ~16 s |
The 6.4× memory drop is the headline — the KV cache lives inside the ANE via MLState + `slice_update`, so there is no silent GPU spill. Decode is 2.4× faster on top of that, and vision TTFT is 2–4× shorter.
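A minimal NumPy sketch of the `slice_update`-style in-place KV write (shapes here are illustrative, not the shipped ones; the real graph does this on an ANE-resident MLState tensor, so no KV bytes ever cross back into Swift):

```python
import numpy as np

# Preallocated KV cache state, written in place each step — the Core ML
# graph does the equivalent with slice_update on an MLState tensor.
n_kv_heads, max_seq, head_dim = 8, 2048, 128
kv_cache = np.zeros((n_kv_heads, max_seq, head_dim), dtype=np.float16)

def write_kv(cache, new_kv, pos):
    """Write T new key/value rows at position pos (slice_update semantics)."""
    t = new_kv.shape[1]
    cache[:, pos:pos + t, :] = new_kv  # in-place: no marshaling out of the state
    return cache

batch = np.ones((n_kv_heads, 8, head_dim), dtype=np.float16)  # prefill_b8, T=8
step = np.ones((n_kv_heads, 1, head_dim), dtype=np.float16)   # infer, T=1
write_kv(kv_cache, batch, 0)   # one batched prefill step fills 8 slots
write_kv(kv_cache, step, 8)    # decode appends one slot at a time
```

Both functions of the multifunction `.mlpackage` (`infer` and `prefill_b8`) share this one state, which is why prefill gets the 8× per-token win without duplicating the cache.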
What landed
- `MLState` + `slice_update` KV writes — no Swift KV marshaling
- Multifunction `.mlpackage` per chunk: `infer` (T=1) + `prefill_b8` (T=8) sharing one `kv_cache_0` state
- `chunk_0_vision` multifunction with DeepStack injection + T=8 batched prefill — the 196 image-pad tokens get the same 8× per-token speedup
- New ModelDownloader entry "Qwen3-VL 2B (stateful, Phase 1)" pulling from a dedicated HF repo (`mlboydaisuke/qwen3-vl-2b-stateful-coreml`)
Backward compat
The legacy qwen3vl_2b entry (v1.4.0 recurrent + batched T=8) is left
untouched alongside the new entry. Existing users keep working; the new
entry is opt-in via the picker.
Not included (Phase 2)
- KV state reuse across chat turns (replies after the first turn should go near-instant)
- App-launch ANE compile pre-warm
- Prompt-builder restructure so the second turn of a vision chat benefits from KV reuse
Diagnostic
Per-prefill breakdown is logged once per generate:
[Qwen3VL2BStateful] prefill inputIds=N batchedText=A×8 batchedVision=B×8 t1Steps=K elapsed=Xms
Credits
Recipe ported from Anemll/Anemll's
Qwen3-1.7B Core ML build.
v1.3.0 — Qwen3-VL 2B (text + vision on iPhone ANE)
First vision-capable Qwen on iPhone ANE. 28-layer GQA text backbone + DeepStack-injected Qwen3-VL vision tower, shipping end-to-end image description at 7.5 tok/s on iPhone 17 Pro.
Highlights
- Qwen3-VL 2B text on ANE — 4 INT8 body chunks × 7 layers + chunk_head. 93% ANE per body chunk, 84.6% head. Max context 2048. mmap'd fp16 `embed_weight.bin` keeps `phys_footprint` ~200 MB.
- Vision encoder on ANE — fixed-grid 448×448, 196 image tokens per picture via `spatial_merge=2`. 90.5% ANE @ INT8 palettized (406 MB).
- DeepStack-aware `chunk_0_vision` on ANE — same 7 layers as `chunk_0` plus `ds_0`/`ds_1`/`ds_2` inputs and a `visual_active` scalar gate. 93.1% ANE. On image-pad token steps the generator memcpys the merger row into `hidden_in` and flips the gate; otherwise zeroed DeepStack buffers + gate=0 keep the graph static.
- Interleaved mRoPE for image tokens — Qwen3-VL's `mrope_section=[24, 20, 20]` with `mrope_interleaved=True` cycles T/H/W across the first 60 dims of half head_dim. Swift synthesizes per-image cos/sin; text tokens after the image span get their RoPE position shifted to match HF's `get_rope_index`.
- HF-processor-matching patchify in Swift — Core ML vision input is pre-patchified to `(784, 1536)` so the in-graph reshape stays rank 5. An earlier rank-10 in-graph permute compiled on Mac Studio ANE but faulted the iPhone A18 Pro ANE with `EXC_BAD_ACCESS` at `MLModel(contentsOf:)`.
- Bundle: `mlboydaisuke/qwen3-vl-2b-coreml` — 2.9 GB (text chunks + `chunk_0_vision` + `embed_weight.bin` + `vision.mlpackage`).
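The interleaved dim assignment can be sketched in a few lines (the function name is illustrative; this only shows the cycling rule, not the cos/sin synthesis):

```python
# With mrope_interleaved=True, rotary dims cycle T, H, W, T, H, W, ...
# until each axis in mrope_section has spent its budget.
def interleave_mrope_dims(mrope_section):
    remaining = list(mrope_section)
    axes = []  # one entry per rotary dim: 0=T, 1=H, 2=W
    while any(r > 0 for r in remaining):
        for axis, r in enumerate(remaining):
            if r > 0:
                axes.append(axis)
                remaining[axis] -= 1
    return axes

dims = interleave_mrope_dims([24, 20, 20])
# pattern starts T,H,W,T,H,W...; once H and W are exhausted,
# the remaining dims all fall to the T axis
```
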
iPhone 17 Pro results
| Path | Decode | Prefill (text) | Vision prefill (196 tokens) |
|---|---|---|---|
| Text only | ~7.5 tok/s | recurrent | — |
| Text + image | ~7.5 tok/s | recurrent | ~26 s (first token) |
Integration notes
- Swift: `Qwen3VL2BGenerator` accepts an optional `visionFeatures` + `imagePadTokenId`; `LLMRunner` auto-detects both the encoder and the `chunk_0_vision` chunk and flips `hasVision = true` only when both load.
- Device deploy: `devicectl device copy to --domain-type appDataContainer --domain-identifier com.example.CoreMLLLMChat --source .../vision.mlmodelc --destination Documents/Models/qwen3-vl-2b/qwen3_vl_2b_vision/vision.mlmodelc` (also works for `chunk_0_vision.mlmodelc`). `ModelDownloader` lists both paths so the HF download populates them too.
Full diff: #130
v1.2.0 — N=1024 prefill + faster load for Gemma 4 E2B
Highlights
Gemma 4 E2B — N=1024 batched prefill
- Doubles prefill capacity from 512 → 1024 tokens, unlocking multi-turn chat with >512-token context without falling back to per-token decode.
- Shipped via new HF branch `n1024`; existing `main` branch (N=512) preserved for older app builds.
- Weight.bin is shared with decode chunks — download size unchanged.
- Fix (a878c44): `writeSlidingFromPrefill` now writes only the last W source positions when `realLen > W`. Prior code crashed with `EXC_BAD_ACCESS` on any prompt > 512 tokens once `PREFILL_N > W`.
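The fixed write rule can be sketched like this (names are illustrative, not the Swift API; the point is clamping the copy to the window before indexing):

```python
# Sliding-window cache write after a batched prefill: the cache holds
# only W slots, so when realLen > W only the *last* W source positions
# may be copied. The pre-fix code copied realLen rows → buffer overrun.
def write_sliding_from_prefill(cache_w, prefill_kv, real_len):
    w = len(cache_w)
    n = min(real_len, w)                       # the fix: clamp to the window
    src = prefill_kv[real_len - n:real_len]    # last n surviving positions
    cache_w[:n] = src                          # slot 0 = oldest survivor
    return cache_w

W = 512
cache = [None] * W
kv = list(range(1024))                         # realLen = 1024 > W
write_sliding_from_prefill(cache, kv, 1024)
```

For `realLen ≤ W` the clamp is a no-op, which is why cached N=512 models are unaffected.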
Faster time-to-usable
`LLM_DEFER_PREFILL` default-on (14a9965): decode chunks load synchronously, prefill chunks load in the background. ~80 s → ~35 s on iPhone 17 Pro. A first prompt during the load window falls back to per-token decode. Set `LLM_DEFER_PREFILL=0` to opt out.
Other
- Gemma 3 — FunctionGemma-270M + EmbeddingGemma-300M on ANE (#129)
- perf: prefill ↔ decode transition prewarm (eeaa488, dc22c06)
- docs: iPhone ANE is realLen-aware — multifunction variants not a lever (22fef1c)
- revert: PrefixCache default-on (d471a7f) — math didn't favor broad deployment
Compatibility
- New clones: default to HF `n1024` branch → N=1024 prefill
- Existing clones without `git pull`: keep using HF `main` → N=512 (no change)
- Existing clones that `git pull` + rebuild: get the Swift SWA fix; existing cached N=512 models keep working (fix is no-op for realLen ≤ W)
v1.1.0 — Qwen3.5 2B on iPhone ANE
What's new
First 2B-class hybrid SSM + attention LLM shipped on CoreML.
Qwen3.5 2B on iPhone ANE
- Qwen3.5 2B on iPhone ANE — 24-layer hybrid SSM + attention (2.04 B params) split into 4 INT8 transformer chunks (6 layers each). Every chunk compiles on iPhone 17 Pro ANE at ≥ 90% op placement. ~17 tok/s decode, 2048-token context. Bundle: `mlboydaisuke/qwen3.5-2B-CoreML`.
- mmap'd fp16 embed sidecar — `embed_tokens.weight` ships as a raw 1 GB fp16 file (`embed_weight.bin`); Swift does `mmap(..., MAP_PRIVATE)` with `MADV_RANDOM`. Only the handful of rows touched per prompt page in, and those pages stay "clean" so they don't count against `phys_footprint`. Reported memory during generation: ~200 MB (vs ~2 GB with a CoreML chunk_embed mlpackage that would dequantize the full 1 GB into process memory).
- Why 4 chunks (not 2) — palettization shrinks disk, not ANE memory: INT8 weights re-expand to fp16 inside the ANE region, and iPhone ANE's per-mlprogram compile envelope rejected the 2-chunk split (~2 GB fp16/chunk) with `MILCompilerForANE: Couldn't communicate with a helper application` → silent GPU fallback at a 3.4 GB Metal heap and 7 tok/s. 4 chunks at ≤ 1 GB INT8 (≤ ~1.7 GB fp16) each match the Gemma 4 E4B envelope that's proven ANE-resident.
- 2048-token context window — bumped from 128 so chat turns don't truncate after ~10 lines. Only full-attention state scales with `max_seq` (6 layers × 2 × `max_seq` × 256 × fp16 × 2); +22 MB total vs the 128-token ceiling.
- App binary shrunk ~5 GB — removed stale Qwen3.5-0.8B mlpackages from the Xcode target.
- ChatView tap-to-dismiss — tap outside the input field or drag-scroll to dismiss the keyboard.
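The "+22 MB" claim follows directly from the formula in the context-window bullet; a quick sanity check:

```python
# Only full-attention state scales with max_seq:
# 6 layers × 2 (K and V) × max_seq × 256 × 2 bytes (fp16) × 2.
def full_attn_kv_bytes(max_seq, layers=6, dim=256):
    return layers * 2 * max_seq * dim * 2 * 2

delta_mb = (full_attn_kv_bytes(2048) - full_attn_kv_bytes(128)) / (1024 * 1024)
# ≈ 22.5 MB going from the 128-token ceiling to 2048, matching the note
```
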
Device spec (iPhone 17 Pro)
| metric | value |
|---|---|
| Decode | 17 tok/s |
| phys_footprint (inference) | ~200 MB |
| Metal heap (sustained) | 0 GB |
| ANE placement (chunk_a/b/c/d) | 90.7% / 91.1% / 90.7% / 90.8% |
| Context window | 2048 tokens |
| Bundle (HF) | 2.4 GB |
Full changelog: docs/RELEASE_v1_1_0.md.
v1.0.3 — INT8 Qwen3.5 + cleaner defaults
Qwen3.5 0.8B becomes measurably faster and smaller. Ship quality confirmed on Mac ANE + iPhone 17 Pro.
Headline numbers (Mac ANE, 80-token long-gen)
| variant | decode tok/s | bundle | loops? |
|---|---|---|---|
| FP16 greedy (prior) | 43 | 1.5 GB | none |
| INT8 greedy (new default) | 52 (+20%) | 754 MB (50%) | none |
iPhone 17 Pro shows the same relative speedup (ANE decode: 20 → 28 tok/s).
Why INT8 is faster, not just smaller
k-means palettization keeps weights in an 8-bit index + fp16 LUT form that ANE's dequant hardware unpacks inline with matmul. Net compute is ~20% faster than raw fp16 on Apple's NPU because memory bandwidth (not raw matmul throughput) is the bottleneck for a 0.8B model.
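A rough sketch of what 8-bit palettization looks like mechanically — weights become a uint8 index tensor plus a 256-entry fp16 LUT (quantile-initialized Lloyd iterations stand in for the real k-means; this is an illustration, not the coremltools implementation):

```python
import numpy as np

# 8-bit palettization: store a uint8 index per weight + a 256-entry fp16
# LUT. Disk/RAM traffic halves vs fp16; ANE dequantizes inline with matmul.
def palettize(weights, n_centroids=256, iters=10):
    flat = weights.ravel().astype(np.float32)
    lut = np.quantile(flat, np.linspace(0, 1, n_centroids))  # init on quantiles
    for _ in range(iters):                                   # Lloyd refinement
        idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
        for c in range(n_centroids):
            if np.any(idx == c):
                lut[c] = flat[idx == c].mean()
    idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
    return idx.astype(np.uint8).reshape(weights.shape), lut.astype(np.float16)

w = np.random.randn(64, 64).astype(np.float16)
indices, lut = palettize(w)
recon = lut[indices]   # what the ANE materializes on the fly during matmul
```

The 1-byte index vs 2-byte fp16 weight is where the ~50% bundle shrink (1.5 GB → 754 MB) comes from.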
Correctness fixes bundled in
- Full EOS stop set: 248044 (`<|endoftext|>`) + 248045 (`<|im_start|>`) + 248046 (`<|im_end|>`) + `tok.eosTokenId`. Prior releases missed 248044, which let the model emit `<|endoftext|>` as a visible literal and then fabricate a fake "Human:" follow-up turn.
- System-role filter in chat template: drops UI-status system messages ("Loading…", "Model loaded!") so they don't get forwarded as real system prompts that derail the instruct model.
- Multi-byte UTF-8 streaming: accumulated-decode + diff-emit so emoji 😊 and CJK glyphs that span multiple tokens render cleanly instead of showing U+FFFD mojibake.
- Stride-safe logit reading: fallback to contiguous layout if Core ML reports zero strides for an ANE-dispatched fp16 output (seen in the wild as "!!!!" degenerate output).
Swift marshaling: custom MLFeatureProvider
Qwen35Generator now uses a Qwen35DecodeFeatures adapter that delegates state-input lookup to the previous decode call's MLFeatureProvider — 48 MLFeatureValue wrappers per step become zero. Combined with reusable MLMultiArrays for token/position/cos/sin and a native Float16 NEON fastArgmax, this cut iPhone decode latency from ~70 ms/step to ~35 ms/step.
Diagnostic
On load the generator now prints the MLComputePlan op placement so you can confirm ANE vs GPU at a glance:
[Qwen35] compute plan (int8, requested=ANE): total=2218 ANE=2008 (90.5%) GPU=0 (0.0%) CPU=5 (0.2%)
Included PRs
- #120 INT8 palettized decode as default
- #118 UTF-8 streaming fix
- #117 README refresh for v1.0.0
- #116 ANE default restore
- (#119 Gemma marshal cache closed — measured impact <0.5 tok/s, not shipping.)
Memory footprint
A 0.8B model inherently needs its weights in RAM (~750 MB INT8, ~1.4 GB fp16 equivalent once dequantized on device). Total app memory on iPhone in ANE mode sits around 1.6-2 GB. "0 GB Metal heap" means the GPU is not allocating — ANE runs on the unified-memory weight mmap + its own plan cache.
v1.0.2 — UTF-8 streaming fix + docs
Hotfix for emoji and CJK glyph rendering in Qwen3.5 chat output.
Fix
Qwen's BPE tokenizer often splits multi-byte UTF-8 (😊, some CJK glyphs) across multiple tokens. The per-token decode path in v1.0.0 / v1.0.1 yielded partial byte sequences that rendered as U+FFFD replacement characters (mojibake).
v1.0.2 accumulates all generated token ids and decodes the full sequence each step, emitting only the diff. Multi-byte codepoints are now emitted as a single unit when complete.
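The accumulate-and-diff loop can be sketched as follows (the toy tokenizer is hypothetical — it just reproduces the failure mode of an emoji split across byte-level tokens):

```python
# Accumulated-decode + diff-emit: decode ALL generated ids each step and
# emit only the suffix past what was already shown; hold back any trailing
# U+FFFD that signals a still-incomplete multi-byte sequence.
emoji = "😊".encode("utf-8")
fake_vocab = {1: b"Hi ", 2: emoji[:2], 3: emoji[2:], 4: b"!"}  # toy BPE split

def fake_decode(ids):
    data = b"".join(fake_vocab[i] for i in ids)
    return data.decode("utf-8", errors="replace")

emitted, shown = [], ""
for tok in [1, 2, 3, 4]:
    emitted.append(tok)
    full = fake_decode(emitted)
    new = full[len(shown):]
    if new and not new.endswith("\ufffd"):  # partial bytes → wait for more
        shown = full
```

Per-token decode would have emitted the replacement character at token 2; the diff loop waits until token 3 completes the codepoint.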
Included
- #118 fix(qwen3.5): preserve multi-byte UTF-8 in streaming decode
- #117 docs: README refreshed for v1.0.0 — added Qwen3.5 performance table, model-selection guide, What's new entry
- #116 fix(qwen3.5): ANE as default compute units
Current behavior (iPhone 17 Pro, CPU+ANE default)
- Decode: ~20 tok/s
- Prefill (recurrent via decode): ~20 tok/s
- Prefill (non-stateful 2-chunk bench): ~170 tok/s
- Metal heap sustained: 0 GB
- Total app memory: ~1.6 GB (fp16 weight mmap + ANE plan cache)
- Chat template applied automatically (instruct-tuned)
- Emoji and CJK now render correctly mid-stream
v1.0.1 — ANE default restored
Hotfix for v1.0.0. The default compute units for Qwen3.5 had been temporarily switched to GPU during profile/benchmark debugging and weren't restored before the ship commit. v1.0.0 ran on GPU by default, consuming ~3 GB of extra Metal heap beyond the model weights.
Fix
Default restored to .cpuAndNeuralEngine for both prefill and decode. Behaves as the v1.0.0 release notes originally described.
Memory note (applies to both ANE and GPU paths)
A 0.8B fp16 model inherently needs ~1.4 GB of RAM just for weights — this is unified memory on Apple Silicon, shared between CPU mmap and the ANE plan cache. Total app memory on ANE mode sits around 1.6-2 GB (weights + ~200 MB ANE runtime + ~300 MB app baseline).
"0 GB Metal heap" means the GPU is not allocating additional buffer memory. On GPU mode, Metal allocates another ~3 GB on top of the 1.6 GB baseline, bringing total to ~4.6 GB.
| path | Metal heap | total app memory |
|---|---|---|
| ANE (default) | 0 GB | ~1.6 GB |
| GPU (bit-exact) | ~3 GB | ~4.6 GB |
| CPU | 0 GB | ~1.6 GB |
Included
- #116 fix(qwen3.5): restore ANE default compute units
v1.0.0 — Qwen3.5 0.8B shipping on iPhone ANE
First stable release. Ships Qwen3.5 0.8B (hybrid Gated-DeltaNet SSM + attention) as a first-class chat model in the iOS sample app, running on Apple Neural Engine.
This is also the first CoreML port of a hybrid SSM/attention LLM on iPhone we're aware of — prior CoreML LLMs have been pure Transformer.
What's new
Model: mlboydaisuke/qwen3.5-0.8B-CoreML
- 1.4 GB fp16 decode mlpackage (prefill performed recurrently via the same model)
- Runs on CPU / GPU / Apple Neural Engine
- 99.9% ANE operator placement
iPhone 17 Pro performance (decode, steady state):
- ANE: 20 tok/s, 0 GB Metal heap
- GPU: 22 tok/s, ~3 GB Metal heap, bit-exact with fp32
- Prefill (non-stateful 2-chunk path): 170 tok/s on ANE — 3.0× LiteRT-LM baseline
App integration:
- Available Models → "Qwen3.5 0.8B (ANE)" → chat via the regular ChatView
- Qwen chat template applied automatically (instruct-tuned, proper `<|im_start|>`/`<|im_end|>` wrapping)
- EOS detection and graceful early stop
- Photo/video/mic pickers auto-hide (text-only model)
Precision note
On ANE, argmax on the 248K-vocab logits is fp16-fragile — strict greedy top-1 matches the fp32 oracle only 60% of the time on short prompts. However:
- oracle top-1 is in ANE top-3 for 100% of tested positions (both Mac M4 and iPhone A18)
- Hidden state over 24 layers stays at cos ≥ 0.9998 vs fp32
- Generated text is semantically equivalent to fp32 PyTorch (measured on 3 prompts × 50 tokens: same characters, same plot, same narrative arc)
Sampling-mode generation (temperature > 0) is effectively indistinguishable from fp32 output.
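The top-k containment check behind that claim is easy to express (random logits stand in for real model outputs here; the real harness compares per-position fp16 ANE logits against the fp32 oracle):

```python
import numpy as np

# Fraction of positions where the fp32 oracle's top-1 token appears
# inside the fp16 run's top-k.
def oracle_in_topk(fp16_logits, fp32_logits, k=3):
    hits = 0
    for a, b in zip(fp16_logits, fp32_logits):
        topk = np.argsort(a)[-k:]          # fp16 top-k token ids
        hits += int(np.argmax(b) in topk)  # oracle top-1 contained?
    return hits / len(fp32_logits)

rng = np.random.default_rng(0)
fp32 = rng.standard_normal((8, 50257)).astype(np.float32)  # simulated oracle
fp16 = fp32.astype(np.float16)                             # simulated precision loss
rate = oracle_in_topk(fp16, fp32, k=3)
```
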
Behind the ship
Swift marshal optimizations (decode: 13 → 22 tok/s on GPU, 14 → 20 tok/s on ANE):
- Reusable MLMultiArrays + cached MLFeatureValue wrappers for the 4 non-state inputs
- Custom MLFeatureProvider that delegates state lookup to previous output — skips 48 MLFeatureValue wraps per step
- Single-pass Float16 argmax via native NEON compare (no Float32 conversion buffer)
- vDSP-accelerated sampling path
- Explicit memset on state init (MLMultiArray isn't guaranteed to zero-init on iOS)
Research artifacts (non-shipping) in conversion/:
- Chunk-split decode with fp32 boundary (proves ANE drift is uniform per-layer)
- In-graph argmax variant (measured 2 ms/step slower on Mac — argmax forces CPU placement)
- Conv2d 1x1 replacement for ANE precision (no effect)
- MLState API migration attempt (GPU 180 tok/s achievable but parity broken — Gemma-era issue reconfirmed)
Installation
Open the sample app → tap Get Model → select Qwen3.5 0.8B (ANE) → download (~1.4 GB). First load takes ~4 min on ANE for E5 compile (cached thereafter).
Full PR: #112
v0.8.0 — Gemma 4 E4B
Gemma 4 E4B on Apple Neural Engine. Second shipping model option for CoreML-LLM, alongside E2B.
Highlights
- Gemma 4 E4B — 42 layers, hidden=2560, 2 KV heads, text-only decoder
- ~14 tok/s baseline decode on iPhone 17 Pro at INT4
- 100% ANE placement — verified via `MLComputePlan`
- Bundle published at `mlboydaisuke/gemma-4-E4B-coreml` (5.5 GB INT4-palettized)
What it does
Switch between E2B and E4B in the Models picker. E2B stays byte-identical to v0.7.0 (multimodal: image + video + audio + text). E4B is text-only but higher quality — 4B effective parameters vs E2B's 2B.
| | E2B (v0.7.0) | E4B (new) |
|---|---|---|
| Parameters | ~2B effective | ~4B effective |
| num_hidden_layers | 35 | 42 |
| hidden_size | 1536 | 2560 |
| num_key_value_heads | 1 | 2 |
| Decode speed | ~31 tok/s | ~14 tok/s |
| Per-step latency | ~32 ms | ~71 ms |
| ANE placement | 99.78% | 100% |
| Bundle size (INT4) | 3.1 GB | 5.5 GB |
Under the hood
- Generalized Gemma 4 conversion pipeline — chunk boundaries and KV-producer layer indices are derived from the HF model config. `compute_chunk_boundaries(config)` gives E2B's hand-tuned `[(0,8),(8,15),(15,25),(25,35)]` unchanged; E4B yields `[(0,12),(12,24),(24,33),(33,42)]`. Adding future Gemma 4 variants is a registry entry away.
- Output alias convention — the chunk-2 producer KV outputs are named `kv13_*`/`kv14_*` regardless of actual layer index (E2B: L13/L14, E4B: L22/L23). Keeps ~20 Swift call sites untouched across variants.
- Dynamic KV cache shapes in Swift — `ChunkedEngine` reads `(slots, num_kv_heads)` from each chunk's `K_sliding_in`/`K_full_in` input description. E2B (nkv=1, 7/1 + 5/2) unchanged; E4B (nkv=2, 10/2 + 10/2) just works.
- Safer model switching — `LLMRunner.loadModel` now releases the previous model before allocating the new one, avoiding a ~8 GB double-buffer peak during E2B ↔ E4B swaps.
- Self-healing downloader — skips prefill weight-sharing when prefill metadata wasn't part of the fetched file list; the engine cleans up zero-metadata `prefill_chunk*.mlmodelc` directories on launch.
- One-shot bundle builder — `python conversion/build_gemma4_bundle.py --model gemma4-e4b --ctx 2048` produces a complete ready-to-ship directory (chunks + compiled `.mlmodelc` + INT8 embeds + INT8 PLE + RoPE + tokenizer + `model_config.json`).
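The "registry entry away" idea can be sketched like this — the boundary values are the ones quoted above, but the registry shape and validator are illustrative, not the actual pipeline code:

```python
# Per-variant chunk boundaries keyed by model id, validated to tile the
# layer stack contiguously from 0 to num_layers.
CHUNK_REGISTRY = {
    "gemma4-e2b": {"num_layers": 35,
                   "boundaries": [(0, 8), (8, 15), (15, 25), (25, 35)]},
    "gemma4-e4b": {"num_layers": 42,
                   "boundaries": [(0, 12), (12, 24), (24, 33), (33, 42)]},
}

def validate_boundaries(entry):
    bs, n = entry["boundaries"], entry["num_layers"]
    assert bs[0][0] == 0 and bs[-1][1] == n      # covers the whole stack
    for (a, b), (c, d) in zip(bs, bs[1:]):
        assert b == c and a < b                  # contiguous, non-empty chunks
    return True

assert all(validate_boundaries(e) for e in CHUNK_REGISTRY.values())
```

A future variant then only needs a new entry that passes the same validation.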
Upgrading
- Swift Package users — bump to `from: "0.8.0"`. E2B behavior preserved byte-for-byte; E4B appears as a new `ModelInfo.gemma4e4b` option.
- iOS sample app — Clean Build Folder, rebuild. The Models picker will show "Gemma 4 E4B" alongside E2B.
- Converters — new `--model gemma4-e4b` flag for `build_verify_chunks.py`; new `build_gemma4_bundle.py` end-to-end bundler.
See the README for the updated architecture, performance tables, and conversion recipes. Full diff: #97.
Not in this release
- Vision / audio towers for E4B (text-only decoder only)
- Prefill chunks for E4B (decode-only ship — TTFT is unbatched)
- Speculative decoding (MTP/EAGLE-3 drafter) for E4B
v0.7.0 — Video Multimodal
Video understanding on iPhone — fully on-device
Gemma 4 E2B now processes video clips through a native video vision encoder, running entirely on-device via CoreML.
What's new
- Native video vision encoder (
vision_video.mlmodelc) — traces the HF vision tower at video-grade resolution (384x384, 64 tokens/frame). Parity vs HF forward: cosine = 1.0000. Ships as part of the Gemma 4 E2B bundle on HuggingFace (3.1 GB total). Falls back to Swift-side 2x2 pooling when absent. - Uniform frame sampling —
maxFramesdistributed evenly across the full clip duration (not just the first N seconds).fpscaps the sampling rate for short clips. Matches Gemma 4'snum_framessemantic. - Per-frame thumbnails in chat — the user's video message bubble shows the exact frames the encoder received, with
MM:SScaptions. <|video|>placeholder token (258884) — uses the correct video-specific token id per HF'sGemma4Processor, with bidirectional attention within each frame's vision group during prefill.- Static output shape —
vision_video.mlmodelcoutput is[1, 64, 1536](fully static), eliminating the E5RT dynamic-shape error on iOS. - Audio encoder fix — re-uploaded
audio.mlmodelcto HF with consistent spec + weights (fixes error code -14 from a prior partial upload).
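The sampling rule can be sketched in a few lines (function and parameter names are illustrative; the shipped code is Swift):

```python
# Uniform frame sampling: maxFrames timestamps spread evenly over the
# whole clip; fps caps the frame count for short clips so we don't
# oversample a 2-second video.
def sample_timestamps(duration_s, max_frames, fps):
    n = min(max_frames, max(1, int(duration_s * fps)))
    # midpoints of n equal spans → even coverage of the full duration,
    # not just the first N seconds
    return [duration_s * (i + 0.5) / n for i in range(n)]

long_clip = sample_timestamps(60.0, 6, 1.0)   # 6 frames across all 60 s
short_clip = sample_timestamps(2.0, 6, 1.0)   # fps-capped to 2 frames
```
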
Usage
let llm = try await CoreMLLLM.load(model: .gemma4e2b)
let analysis = try await llm.generate(
"Describe this video frame by frame.",
videoURL: URL(fileURLWithPath: "/path/to/clip.mp4"),
    videoOptions: .init(fps: 1.0, maxFrames: 6))
PRs included
- #81 — Phase 1 pool fallback + Phase 2 native encoder
- #82 — Uniform frame sampling + per-frame thumbnails
- #84 — `<|video|>` placeholder + bidirectional vision group mask
- #95 — Background download fix (adopt surviving tasks)
Performance
No decode speed regression. Video encoder runs on .cpuAndGPU (same as image/audio encoders). Decode chunks remain on ANE at 99.78% placement.
| Metric | v0.6.2 | v0.7.0 |
|---|---|---|
| Bundle size | 2.8 GB | 3.1 GB |
| Modalities | Image + Audio + Text | Image + Video + Audio + Text |
| Decode speed | ~31 tok/s | ~31 tok/s |