Releases: john-rocky/CoreML-LLM

v1.5.0 — Qwen3-VL 2B stateful MLState + multifunction prefill

25 Apr 03:55

Choose a tag to compare

Phase 1 ship: ports the ANEMLL Qwen3-1.7B recipe onto Qwen3-VL 2B so the
text backbone runs ANE-resident and TTFT on vision prompts drops 2–4× vs
v1.4.0.

iPhone 17 Pro (measured)

                                            v1.5.0        v1.4.0
decode                                      22–24 tok/s   ~10 tok/s
phys_footprint                              256 MB        1.7 GB
vision TTFT (1st turn, ~200 tok prompt)     2.7 s         ~5.5 s
vision TTFT (with chat history, ~600 tok)   4.0 s         ~16 s

The ~6.6× memory drop (1.7 GB → 256 MB) is the headline — the KV cache
lives inside the ANE via MLState + slice_update, so there is no silent GPU
spill. Decode is ~2.3× faster on top, and vision TTFT is 2–4× shorter.

What landed

  • MLState + slice_update KV writes — no Swift KV marshaling
  • Multifunction .mlpackage per chunk: infer (T=1) + prefill_b8 (T=8)
    sharing one kv_cache_0 state
  • chunk_0_vision multifunction with DeepStack injection + T=8 batched
    prefill — the 196 image-pad tokens get the same 8× per-token speedup
  • New ModelDownloader entry "Qwen3-VL 2B (stateful, Phase 1)" pulling
    from a dedicated HF repo
    (mlboydaisuke/qwen3-vl-2b-stateful-coreml)
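The write pattern behind the first bullet can be sketched in numpy (hypothetical shapes; the real graph does this with MIL's slice_update on an MLState-backed tensor):

```python
import numpy as np

# Hypothetical shapes: one KV head, 16 slots of head_dim=4.
MAX_SEQ, HEAD_DIM = 16, 4
kv_cache = np.zeros((MAX_SEQ, HEAD_DIM), dtype=np.float16)  # persists across calls, like MLState

def write_kv(cache: np.ndarray, new_kv: np.ndarray, pos: int) -> None:
    """In-place slice write — the numpy analogue of MIL's slice_update.
    Nothing is copied in or out of the cache as a whole."""
    cache[pos : pos + new_kv.shape[0]] = new_kv

# T=8 batched prefill writes 8 rows at once...
write_kv(kv_cache, np.ones((8, HEAD_DIM), dtype=np.float16), pos=0)
# ...then T=1 decode appends one row per step into the same state.
write_kv(kv_cache, np.full((1, HEAD_DIM), 2, dtype=np.float16), pos=8)
```

Because the cache mutates in place inside the state, nothing KV-sized ever crosses the Swift ↔ Core ML boundary — hence "no Swift KV marshaling".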

Backward compat

The legacy qwen3vl_2b entry (v1.4.0 recurrent + batched T=8) is left
untouched alongside the new entry. Existing users keep working; the new
entry is opt-in via the picker.

Not included (Phase 2)

  • KV state reuse across chat turns (replies after the first turn should
    become near-instant)
  • App-launch ANE compile pre-warm
  • Prompt-builder restructure so vision chat 2nd turn benefits from KV
    reuse

Diagnostic

Per-prefill breakdown is logged once per generate:

[Qwen3VL2BStateful] prefill inputIds=N batchedText=A×8 batchedVision=B×8 t1Steps=K elapsed=Xms

Credits

Recipe ported from Anemll/Anemll's
Qwen3-1.7B Core ML build.

v1.3.0 — Qwen3-VL 2B (text + vision on iPhone ANE)

23 Apr 15:36
03b11d2

Choose a tag to compare

First vision-capable Qwen on iPhone ANE. 28-layer GQA text backbone + DeepStack-injected Qwen3-VL vision tower, shipping end-to-end image description at 7.5 tok/s on iPhone 17 Pro.

Highlights

  • Qwen3-VL 2B text on ANE — 4 INT8 body chunks × 7 layers + chunk_head. 93% ANE per body chunk, 84.6% head. Max context 2048. mmap'd fp16 embed_weight.bin keeps phys_footprint ~200 MB.
  • Vision encoder on ANE — fixed-grid 448×448, 196 image tokens/picture via spatial_merge=2. 90.5% ANE @ INT8 palettized (406 MB).
  • DeepStack-aware chunk_0_vision on ANE — same 7 layers as chunk_0 plus ds_0/ds_1/ds_2 inputs and a visual_active scalar gate. 93.1% ANE. On image-pad token steps the generator memcpys the merger row into hidden_in and flips the gate; otherwise zeroed DeepStack buffers + gate=0 so the graph stays static.
  • Interleaved mRoPE for image tokens — Qwen3-VL's mrope_section=[24, 20, 20] with mrope_interleaved=True cycles T/H/W across the first 60 dims of half head_dim. Swift synthesizes per-image cos/sin; text tokens after the image span get their RoPE position shifted to match HF's get_rope_index.
  • HF-processor-matching patchify in Swift — Core ML vision input is pre-patchified (784, 1536) so the in-graph reshape stays rank 5. Earlier rank-10 in-graph permute compiled on Mac Studio ANE but faulted iPhone A18 Pro ANE with EXC_BAD_ACCESS at MLModel(contentsOf:).
  • Bundle: mlboydaisuke/qwen3-vl-2b-coreml — 2.9 GB (text chunks + chunk_0_vision + embed_weight.bin + vision.mlpackage).
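For the interleaved mRoPE above, the per-dim axis assignment can be reconstructed from mrope_section alone; a small sketch (`interleaved_axes` is our name, not the repo's):

```python
def interleaved_axes(sections):
    """Map each rotary dim to its position axis (0=T, 1=H, 2=W).

    With mrope_interleaved=True the axes cycle T,H,W,T,H,W,...;
    once an axis exhausts its section budget it drops out of the
    cycle, so the remaining dims all come from the larger axis.
    """
    remaining = list(sections)
    axes = []
    while sum(remaining) > 0:
        for axis in range(len(sections)):
            if remaining[axis] > 0:
                axes.append(axis)
                remaining[axis] -= 1
    return axes

axes = interleaved_axes([24, 20, 20])  # Qwen3-VL's mrope_section
# First 60 dims cycle T/H/W; the final 4 dims are all T (axis 0),
# because H and W exhaust their 20-dim budgets first.
```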
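The HF-matching patchify is, in numpy terms, roughly the following — assuming patch_size=16 and temporal_patch_size=2, which is what yields (784, 1536) for a 448×448 image; the exact dim ordering inside each row follows the HF processor and may differ from this sketch:

```python
import numpy as np

def patchify(frames: np.ndarray, patch: int = 16, t_patch: int = 2) -> np.ndarray:
    """Flatten (T, C, H, W) pixels into (num_patches, patch_dim) rows,
    mirroring the HF processor so the in-graph reshape stays rank 5.
    For a single image the frame is duplicated to fill t_patch."""
    t, c, h, w = frames.shape
    if t % t_patch:                     # single image → repeat the frame
        frames = np.repeat(frames, t_patch, axis=0)
        t = frames.shape[0]
    gh, gw = h // patch, w // patch
    x = frames.reshape(t // t_patch, t_patch, c, gh, patch, gw, patch)
    x = x.transpose(0, 3, 5, 1, 2, 4, 6)          # group by patch grid
    return x.reshape((t // t_patch) * gh * gw, t_patch * c * patch * patch)

img = np.zeros((1, 3, 448, 448), dtype=np.float16)
patches = patchify(img)   # (784, 1536), matching the Core ML vision input
```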

iPhone 17 Pro results

path           decode       prefill (text)   vision prefill (196 tokens)
Text only      ~7.5 tok/s   recurrent        —
Text + image   ~7.5 tok/s   recurrent        ~26 s (first token)

Integration notes

  • Swift: Qwen3VL2BGenerator accepts an optional visionFeatures + imagePadTokenId; LLMRunner auto-detects both the encoder and the chunk_0_vision chunk and flips hasVision = true only when both load.
  • Device deploy: devicectl device copy to --domain-type appDataContainer --domain-identifier com.example.CoreMLLLMChat --source .../vision.mlmodelc --destination Documents/Models/qwen3-vl-2b/qwen3_vl_2b_vision/vision.mlmodelc (also works for chunk_0_vision.mlmodelc). The ModelDownloader lists both paths so the HF download populates them too.

Full diff: #130

v1.2.0 — N=1024 prefill + faster load for Gemma 4 E2B

23 Apr 09:20

Choose a tag to compare

Highlights

Gemma 4 E2B — N=1024 batched prefill

  • Doubles prefill capacity from 512 → 1024 tokens, unlocking multi-turn
    chat with >512-token context without falling back to per-token decode.
  • Shipped via new HF branch n1024;
    existing main branch (N=512) preserved for older app builds.
  • Weight.bin is shared with decode chunks — download size unchanged.
  • Fix (a878c44): writeSlidingFromPrefill now writes only the last
    W source positions when realLen > W. Prior code crashed with
    EXC_BAD_ACCESS on any prompt > 512 tokens once PREFILL_N > W.
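The shape of the a878c44 fix, sketched in Python with hypothetical names (W = sliding-window slot count; the real code is Swift over MLMultiArrays):

```python
def write_sliding_from_prefill(window, kv_rows, real_len):
    """Copy prefill KV into a W-slot sliding cache.

    When real_len > W only the last W source positions fit. The prior
    code indexed all real_len rows into the W-slot buffer and walked
    off the end — the EXC_BAD_ACCESS on prompts > 512 tokens.
    """
    W = len(window)
    src = kv_rows[max(0, real_len - W):real_len]   # last W positions only
    window[:len(src)] = src
    return len(src)

window = [None] * 4                     # toy W=4 cache
n = write_sliding_from_prefill(window, list(range(10)), real_len=10)
# window now holds positions 6..9 — the fix is a no-op when real_len <= W
```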

Faster time-to-usable

  • LLM_DEFER_PREFILL default-on (14a9965): decode chunks load
    synchronously, prefill chunks load in background. ~80s → ~35s on
    iPhone 17 Pro. First prompt during the load window falls back to
    per-token decode. Set LLM_DEFER_PREFILL=0 to opt out.

Other

  • Gemma 3 — FunctionGemma-270M + EmbeddingGemma-300M on ANE (#129)
  • perf: prefill ↔ decode transition prewarm (eeaa488, dc22c06)
  • docs: iPhone ANE is realLen-aware — multifunction variants not a lever (22fef1c)
  • revert: PrefixCache default-on (d471a7f) — math didn't favor broad deployment

Compatibility

  • New clones: default to HF n1024 branch → N=1024 prefill
  • Existing clones without `git pull`: keep using HF main → N=512 (no change)
  • Existing clones that `git pull` + rebuild: get the Swift SWA fix; existing cached N=512 models keep working (fix is no-op for realLen ≤ W)

Revert

  • `git revert b7bed2e` — revert HF branch switch (back to main / N=512)
  • `git revert 14a9965` — revert deferred-load default
  • `git revert a878c44` — revert N=1024 + SWA fix (requires rebuilding N=512 mlpackage)

v1.1.0 — Qwen3.5 2B on iPhone ANE

22 Apr 08:08
8e358de

Choose a tag to compare

What's new

First 2B-class hybrid SSM + attention LLM shipped on CoreML.

Qwen3.5 2B on iPhone ANE

  • Qwen3.5 2B on iPhone ANE — 24-layer hybrid SSM + attention (2.04 B params) split into 4 INT8 transformer chunks (6 layers each). Every chunk compiles on iPhone 17 Pro ANE at ≥ 90% op placement. ~17 tok/s decode, 2048-token context. Bundle: mlboydaisuke/qwen3.5-2B-CoreML.
  • mmap'd fp16 embed sidecar — embed_tokens.weight ships as a raw 1 GB fp16 file (embed_weight.bin) that Swift maps with mmap(..., MAP_PRIVATE) and MADV_RANDOM. Only the handful of rows touched per prompt page in, and those pages stay "clean" so they don't count against phys_footprint. Reported memory during generation: ~200 MB (vs ~2 GB with a CoreML chunk_embed mlpackage that would dequantize the full 1 GB into process memory).
  • Why 4 chunks (not 2) — palettization shrinks disk, not ANE memory: INT8 weights re-expand to fp16 inside the ANE region, and iPhone ANE's per-mlprogram compile envelope rejected the 2-chunk split (~2 GB fp16/chunk) with MILCompilerForANE: Couldn't communicate with a helper application → silent GPU fallback at 3.4 GB Metal heap and 7 tok/s. 4 chunks at ≤ 1 GB INT8 (≤ ~1.7 GB fp16) each match the Gemma 4 E4B envelope that's proven ANE-resident.
  • 2048-token context window — bumped from 128 so chat turns don't truncate after ~10 lines. Only full-attention state scales with max_seq (6 layers × 2 × max_seq × 256 × fp16 × 2); +22 MB total vs the 128-token ceiling.
  • App binary shrunk ~5 GB — removed stale Qwen3.5-0.8B mlpackages from the Xcode target.
  • ChatView tap-to-dismiss — tap outside the input field or drag-scroll to dismiss the keyboard.
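The mmap'd embed sidecar has a direct Python analogue via np.memmap. A self-contained sketch with a tiny stand-in vocab and an assumed hidden size of 2048 (illustration only — the real file's dimensions come from the model config):

```python
import numpy as np, os, tempfile

HIDDEN = 2048            # assumed hidden size, for illustration only
VOCAB = 1000             # tiny stand-in vocab so the sketch is self-contained

# Stand-in for embed_weight.bin: raw row-major fp16, no header.
path = os.path.join(tempfile.mkdtemp(), "embed_weight.bin")
rows = np.tile(np.arange(VOCAB, dtype=np.float16)[:, None], (1, HIDDEN))
rows.tofile(path)

# mmap instead of load: only the pages backing the rows you index are
# faulted in, and they stay file-backed ("clean"), which is why
# phys_footprint barely moves during generation.
table = np.memmap(path, dtype=np.float16, mode="r", shape=(VOCAB, HIDDEN))
row = np.array(table[42])            # one token's embedding → one page-in
```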

Device spec (iPhone 17 Pro)

metric                          value
Decode                          17 tok/s
phys_footprint (inference)      ~200 MB
Metal heap (sustained)          0 GB
ANE placement (chunk_a/b/c/d)   90.7% / 91.1% / 90.7% / 90.8%
Context window                  2048 tokens
Bundle (HF)                     2.4 GB

Full changelog: docs/RELEASE_v1_1_0.md.

v1.0.3 — INT8 Qwen3.5 + cleaner defaults

22 Apr 03:24
837bf48

Choose a tag to compare

Qwen3.5 0.8B becomes measurably faster and smaller. Ship quality confirmed on Mac ANE + iPhone 17 Pro.

Headline numbers (Mac ANE, 80-token long-gen)

variant                     decode tok/s   bundle          loops?
FP16 greedy (prior)         43             1.5 GB          none
INT8 greedy (new default)   52 (+20%)      754 MB (−50%)   none

iPhone 17 Pro shows the same relative speedup (ANE decode: 20 → 28 tok/s).

Why INT8 is faster, not just smaller

k-means palettization keeps weights in an 8-bit index + fp16 LUT form that ANE's dequant hardware unpacks inline with matmul. Net compute is ~20% faster than raw fp16 on Apple's NPU because memory bandwidth (not raw matmul throughput) is the bottleneck for a 0.8B model.
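A toy illustration of the storage format — a 256-entry fp16 LUT plus one uint8 index per weight, halving the tensor vs fp16. The real converter picks the LUT with k-means; quantile centers stand in here to keep the sketch dependency-free:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float16)   # toy weight matrix

# Palettize: 256-entry fp16 LUT + one uint8 index per weight.
lut = np.quantile(w.astype(np.float32), np.linspace(0, 1, 256)).astype(np.float16)
idx = np.abs(w[..., None].astype(np.float32) - lut.astype(np.float32)).argmin(-1).astype(np.uint8)

dequant = lut[idx]                             # what ANE unpacks inline with matmul
ratio = w.nbytes / (idx.nbytes + lut.nbytes)   # ≈ 2× smaller than raw fp16
```

The bandwidth argument falls out of `ratio`: the matmul streams half the bytes per weight, and the inline dequant is effectively free on the NPU.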

Correctness fixes bundled in

  • Full EOS stop set: 248044 (<|endoftext|>) + 248045 (<|im_start|>) + 248046 (<|im_end|>) + tok.eosTokenId. Prior releases missed 248044, which let the model emit <|endoftext|> as a visible literal and then fabricate a fake "Human:" follow-up turn.
  • System-role filter in chat template: drops UI-status system messages ("Loading…", "Model loaded!") so they don't get forwarded as real system prompts that derail the instruct model.
  • Multi-byte UTF-8 streaming: accumulated-decode + diff-emit so emoji 😊 and CJK glyphs that span multiple tokens render cleanly instead of showing U+FFFD mojibake.
  • Stride-safe logit reading: fallback to contiguous layout if Core ML reports zero strides for an ANE-dispatched fp16 output (seen in the wild as "!!!!" degenerate output).

Swift marshaling: custom MLFeatureProvider

Qwen35Generator now uses a Qwen35DecodeFeatures adapter that delegates state-input lookup to the previous decode call's MLFeatureProvider — 48 MLFeatureValue wrappers per step become zero. Combined with reusable MLMultiArrays for token/position/cos/sin and a native Float16 NEON fastArgmax, this cut iPhone decode latency from ~70 ms/step to ~35 ms/step.

Diagnostic

On load the generator now prints the MLComputePlan op placement so you can confirm ANE vs GPU at a glance:

[Qwen35] compute plan (int8, requested=ANE): total=2218 ANE=2008 (90.5%) GPU=0 (0.0%) CPU=5 (0.2%)

Included PRs

  • #120 INT8 palettized decode as default
  • #118 UTF-8 streaming fix
  • #117 README refresh for v1.0.0
  • #116 ANE default restore
  • (#119 Gemma marshal cache closed — measured impact <0.5 tok/s, not shipping.)

Memory footprint

A 0.8B model inherently needs its weights in RAM (~750 MB INT8, ~1.4 GB fp16 equivalent once dequantized on device). Total app memory on iPhone in ANE mode sits around 1.6-2 GB. "0 GB Metal heap" means the GPU is not allocating — ANE runs on the unified-memory weight mmap + its own plan cache.

v1.0.2 — UTF-8 streaming fix + docs

22 Apr 01:36
99cf93f

Choose a tag to compare

Hotfix for emoji and CJK glyph rendering in Qwen3.5 chat output.

Fix

Qwen's BPE tokenizer often splits multi-byte UTF-8 (😊, some CJK glyphs) across multiple tokens. The per-token decode path in v1.0.0 / v1.0.1 yielded partial byte sequences that rendered as U+FFFD replacement characters (mojibake).

v1.0.2 accumulates all generated token ids and decodes the full sequence each step, emitting only the diff. Multi-byte codepoints are now emitted as a single unit when complete.
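The accumulate-and-diff scheme looks like this with a toy byte-level tokenizer (the real code tracks token ids and decodes via the tokenizer; errors="ignore" is a stand-in that would also swallow genuinely invalid bytes):

```python
# Toy byte-level tokens: the emoji 😊 (4 UTF-8 bytes) split across two tokens.
tokens = [b"hi ", b"\xf0\x9f", b"\x98\x8a", b" ok"]

emitted, buf = [], b""
for tok in tokens:
    buf += tok
    # Decode the full accumulated byte string each step; an incomplete
    # trailing codepoint is held back instead of becoming U+FFFD.
    text = buf.decode("utf-8", errors="ignore")
    prev = "".join(emitted)
    emitted.append(text[len(prev):])    # emit only the new suffix

# emitted == ["hi ", "", "😊", " ok"] — the partial emoji is held back
# on step 2 and emitted whole on step 3.
```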

Included

  • #118 fix(qwen3.5): preserve multi-byte UTF-8 in streaming decode
  • #117 docs: README refreshed for v1.0.0 — added Qwen3.5 performance table, model-selection guide, What's new entry
  • #116 fix(qwen3.5): ANE as default compute units

Current behavior (iPhone 17 Pro, CPU+ANE default)

  • Decode: ~20 tok/s
  • Prefill (recurrent via decode): ~20 tok/s
  • Prefill (non-stateful 2-chunk bench): ~170 tok/s
  • Metal heap sustained: 0 GB
  • Total app memory: ~1.6 GB (fp16 weight mmap + ANE plan cache)
  • Chat template applied automatically (instruct-tuned)
  • Emoji and CJK now render correctly mid-stream

v1.0.1 — ANE default restored

22 Apr 01:23
5771589

Choose a tag to compare

Hotfix for v1.0.0. The default compute units for Qwen3.5 had been temporarily switched to GPU during profile/benchmark debugging and weren't restored before the ship commit. v1.0.0 ran on GPU by default, consuming ~3 GB of extra Metal heap beyond the model weights.

Fix

Default restored to .cpuAndNeuralEngine for both prefill and decode. Behaves as the v1.0.0 release notes originally described.

Memory note (applies to both ANE and GPU paths)

A 0.8B fp16 model inherently needs ~1.4 GB of RAM just for weights — this is unified memory on Apple Silicon, shared between CPU mmap and the ANE plan cache. Total app memory on ANE mode sits around 1.6-2 GB (weights + ~200 MB ANE runtime + ~300 MB app baseline).

"0 GB Metal heap" means the GPU is not allocating additional buffer memory. On GPU mode, Metal allocates another ~3 GB on top of the 1.6 GB baseline, bringing total to ~4.6 GB.

path              Metal heap   total app memory
ANE (default)     0 GB         ~1.6 GB
GPU (bit-exact)   ~3 GB        ~4.6 GB
CPU               0 GB         ~1.6 GB

Included

  • #116 fix(qwen3.5): restore ANE default compute units

v1.0.0 — Qwen3.5 0.8B shipping on iPhone ANE

22 Apr 01:15
1ecc9be

Choose a tag to compare

First stable release. Ships Qwen3.5 0.8B (hybrid Gated-DeltaNet SSM + attention) as a first-class chat model in the iOS sample app, running on Apple Neural Engine.

This is also the first CoreML port of a hybrid SSM/attention LLM on iPhone we're aware of — prior CoreML LLMs have been pure Transformer.

What's new

Model: mlboydaisuke/qwen3.5-0.8B-CoreML

  • 1.4 GB fp16 decode mlpackage (prefill performed recurrently via the same model)
  • Runs on CPU / GPU / Apple Neural Engine
  • 99.9% ANE operator placement

iPhone 17 Pro performance (decode, steady state):

  • ANE: 20 tok/s, 0 GB Metal heap
  • GPU: 22 tok/s, ~3 GB Metal heap, bit-exact with fp32
  • Prefill (non-stateful 2-chunk path): 170 tok/s on ANE — 3.0× LiteRT-LM baseline

App integration:

  • Available Models → "Qwen3.5 0.8B (ANE)" → chat via the regular ChatView
  • Qwen chat template applied automatically (instruct-tuned, proper <|im_start|>/<|im_end|> wrapping)
  • EOS detection and graceful early stop
  • Photo/video/mic pickers auto-hide (text-only model)

Precision note

On ANE, argmax on the 248K-vocab logits is fp16-fragile — strict greedy top-1 matches the fp32 oracle only 60% of the time on short prompts. However:

  • oracle top-1 is in ANE top-3 for 100% of tested positions (both Mac M4 and iPhone A18)
  • Hidden state over 24 layers stays at cos ≥ 0.9998 vs fp32
  • Generated text is semantically equivalent to fp32 PyTorch (measured on 3 prompts × 50 tokens: same characters, same plot, same narrative arc)

Sampling-mode generation (temperature > 0) is effectively indistinguishable from fp32 output.
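The fragility is easy to reproduce: fp16 spacing near a logit value of 12 is 2⁻⁷ ≈ 0.008, so fp32 gaps smaller than that collapse into ties (synthetic logits below, not model output):

```python
import numpy as np

logits32 = np.full(248_000, -5.0, dtype=np.float32)   # 248K-vocab stand-in
logits32[100] = 12.0001          # fp32 oracle top-1
logits32[50] = 12.0              # within fp16 spacing of the oracle

oracle = int(logits32.argmax())                         # index 100
fp16_pick = int(logits32.astype(np.float16).argmax())   # index 50 — tie breaks by index
topk = np.argsort(logits32.astype(np.float16))[-3:]     # oracle survives in top-3
```

This is why sampling (temperature > 0) papers over the problem: near-ties that flip greedy top-1 barely move the sampled distribution.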

Behind the ship

Swift marshal optimizations (decode: 13 → 22 tok/s on GPU, 14 → 20 tok/s on ANE):

  • Reusable MLMultiArrays + cached MLFeatureValue wrappers for the 4 non-state inputs
  • Custom MLFeatureProvider that delegates state lookup to previous output — skips 48 MLFeatureValue wraps per step
  • Single-pass Float16 argmax via native NEON compare (no Float32 conversion buffer)
  • vDSP-accelerated sampling path
  • Explicit memset on state init (MLMultiArray isn't guaranteed to zero-init on iOS)

Research artifacts (non-shipping) in conversion/:

  • Chunk-split decode with fp32 boundary (proves ANE drift is uniform per-layer)
  • In-graph argmax variant (measured 2 ms/step slower on Mac — argmax forces CPU placement)
  • Conv2d 1x1 replacement for ANE precision (no effect)
  • MLState API migration attempt (GPU 180 tok/s achievable but parity broken — Gemma-era issue reconfirmed)

Installation

Open the sample app → tap Get Model → select Qwen3.5 0.8B (ANE) → download (~1.4 GB). First load takes ~4 min on ANE for E5 compile (cached thereafter).

Full PR: #112

v0.8.0 — Gemma 4 E4B

18 Apr 05:11
eb19ecb

Choose a tag to compare

Gemma 4 E4B on Apple Neural Engine. Second shipping model option for CoreML-LLM, alongside E2B.

Highlights

  • Gemma 4 E4B — 42 layers, hidden=2560, 2 KV heads, text-only decoder
  • ~14 tok/s baseline decode on iPhone 17 Pro at INT4
  • 100% ANE placement — verified via MLComputePlan
  • Bundle published at mlboydaisuke/gemma-4-E4B-coreml (5.5 GB INT4-palettized)

What it does

Switch between E2B and E4B in the Models picker. E2B stays byte-identical to v0.7.0 (multimodal: image + video + audio + text). E4B is text-only but higher quality — 4B effective parameters vs E2B's 2B.

                      E2B (v0.7.0)    E4B (new)
Parameters            ~2B effective   ~4B effective
num_hidden_layers     35              42
hidden_size           1536            2560
num_key_value_heads   1               2
Decode speed          ~31 tok/s       ~14 tok/s
Per-step latency      ~32 ms          ~71 ms
ANE placement         99.78%          100%
Bundle size (INT4)    3.1 GB          5.5 GB

Under the hood

  • Generalized Gemma 4 conversion pipeline — chunk boundaries and KV-producer layer indices are derived from HF model config. compute_chunk_boundaries(config) gives E2B's hand-tuned [(0,8),(8,15),(15,25),(25,35)] unchanged; E4B yields [(0,12),(12,24),(24,33),(33,42)]. Adding future Gemma 4 variants is a registry entry away.
  • Output alias convention — the chunk-2 producer KV outputs are named kv13_*/kv14_* regardless of actual layer index (E2B: L13/L14, E4B: L22/L23). Keeps ~20 Swift call sites untouched across variants.
  • Dynamic KV cache shapes in Swift — ChunkedEngine reads (slots, num_kv_heads) from each chunk's K_sliding_in/K_full_in input description. E2B (nkv=1, 7/1 + 5/2) unchanged; E4B (nkv=2, 10/2 + 10/2) just works.
  • Safer model switching — LLMRunner.loadModel now releases the previous model before allocating the new one, avoiding an ~8 GB double-buffer peak during E2B ↔ E4B swaps.
  • Self-healing downloader — skips prefill weight-sharing when prefill metadata wasn't part of the fetched file list; the engine cleans up zero-metadata prefill_chunk*.mlmodelc directories on launch.
  • One-shot bundle builder — python conversion/build_gemma4_bundle.py --model gemma4-e4b --ctx 2048 produces a complete ready-to-ship directory (chunks + compiled .mlmodelc + INT8 embeds + INT8 PLE + RoPE + tokenizer + model_config.json).
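The registry idea from the first bullet, sketched. This is not the repo's compute_chunk_boundaries — the real function also weighs KV-producer layer placement, which this naive even-split fallback omits:

```python
# Known-good boundaries per variant, with a naive even split as the
# fallback for configs not yet in the registry.
REGISTRY = {
    "gemma4-e2b": [(0, 8), (8, 15), (15, 25), (25, 35)],   # hand-tuned
    "gemma4-e4b": [(0, 12), (12, 24), (24, 33), (33, 42)],
}

def chunk_boundaries(model: str, num_layers: int, chunks: int = 4):
    if model in REGISTRY:
        return REGISTRY[model]
    step, rem = divmod(num_layers, chunks)
    bounds, start = [], 0
    for i in range(chunks):
        end = start + step + (1 if i < rem else 0)  # spread the remainder
        bounds.append((start, end))
        start = end
    return bounds
```

Adding a future variant is then just another registry entry, which matches the release's "a registry entry away" claim.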

Upgrading

  • Swift Package users — bump to from: "0.8.0". E2B behavior preserved byte-for-byte; E4B appears as a new ModelInfo.gemma4e4b option.
  • iOS sample app — Clean Build Folder, rebuild. Models picker will show "Gemma 4 E4B" alongside E2B.
  • Converters — new --model gemma4-e4b flag for build_verify_chunks.py; new build_gemma4_bundle.py end-to-end bundler.

See the README for the updated architecture, performance tables, and conversion recipes. Full diff: #97.

Not in this release

  • Vision / audio towers for E4B (text-only decoder only)
  • Prefill chunks for E4B (decode-only ship — TTFT is unbatched)
  • Speculative decoding (MTP/EAGLE-3 drafter) for E4B

v0.7.0 — Video Multimodal

18 Apr 04:07

Choose a tag to compare

Video understanding on iPhone — fully on-device

Gemma 4 E2B now processes video clips through a native video vision encoder, running entirely on-device via CoreML.

What's new

  • Native video vision encoder (vision_video.mlmodelc) — traces the HF vision tower at video-grade resolution (384×384, 64 tokens/frame). Parity vs HF forward: cosine = 1.0000. Ships as part of the Gemma 4 E2B bundle on HuggingFace (3.1 GB total). Falls back to Swift-side 2×2 pooling when absent.
  • Uniform frame sampling — maxFrames distributed evenly across the full clip duration (not just the first N seconds). fps caps the sampling rate for short clips. Matches Gemma 4's num_frames semantic.
  • Per-frame thumbnails in chat — the user's video message bubble shows the exact frames the encoder received, with MM:SS captions.
  • <|video|> placeholder token (258884) — uses the correct video-specific token id per HF's Gemma4Processor, with bidirectional attention within each frame's vision group during prefill.
  • Static output shape — vision_video.mlmodelc output is [1, 64, 1536] (fully static), eliminating the E5RT dynamic-shape error on iOS.
  • Audio encoder fix — re-uploaded audio.mlmodelc to HF with consistent spec + weights (fixes error code -14 from a prior partial upload).
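The uniform-sampling rule can be sketched as follows (`sample_timestamps` and the midpoint placement are our choices; the release only specifies "distributed evenly" with an fps cap):

```python
def sample_timestamps(duration: float, max_frames: int, fps: float):
    """Pick frame timestamps spread evenly across the whole clip,
    capping the effective rate at `fps` for short clips."""
    n = min(max_frames, max(1, int(duration * fps)))
    # Midpoints of n equal segments → covers the full duration,
    # not just the first n/fps seconds.
    return [duration * (i + 0.5) / n for i in range(n)]

ts = sample_timestamps(duration=60.0, max_frames=6, fps=1.0)
# 6 frames spread across the minute: [5.0, 15.0, 25.0, 35.0, 45.0, 55.0]
```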

Usage

let llm = try await CoreMLLLM.load(model: .gemma4e2b)
let analysis = try await llm.generate(
    "Describe this video frame by frame.",
    videoURL: URL(fileURLWithPath: "/path/to/clip.mp4"),
    videoOptions: .init(fps: 1.0, maxFrames: 6))

PRs included

  • #81 — Phase 1 pool fallback + Phase 2 native encoder
  • #82 — Uniform frame sampling + per-frame thumbnails
  • #84 — <|video|> placeholder + bidirectional vision group mask
  • #95 — Background download fix (adopt surviving tasks)

Performance

No decode speed regression. Video encoder runs on .cpuAndGPU (same as image/audio encoders). Decode chunks remain on ANE at 99.78% placement.

metric         v0.6.2                 v0.7.0
Bundle size    2.8 GB                 3.1 GB
Modalities     Image + Audio + Text   Image + Video + Audio + Text
Decode speed   ~31 tok/s              ~31 tok/s