On-device LLMs on the Apple Neural Engine. Run Gemma 4, Qwen3.5, Qwen3-VL, FunctionGemma, and EmbeddingGemma on iPhone with CoreML — ANE-first, battery-friendly, no server.
MLX Swift is the right call when you want maximum GPU throughput; CoreML-LLM is what you use when the LLM should live on the ANE so the GPU stays free for the rest of the app.
| Model | Size | Task | iPhone 17 Pro decode | HuggingFace |
|---|---|---|---|---|
| Gemma 4 E2B | 3.1 GB | Text + image + video + audio | 34.2 tok/s (3-chunk) / 31.6 (4-chunk) | mlboydaisuke/gemma-4-E2B-coreml |
| Gemma 4 E4B | 5.5 GB | Text | ~14 tok/s | mlboydaisuke/gemma-4-E4B-coreml |
| Qwen3.5 2B | 2.4 GB | Text | ~17 tok/s (~200 MB RSS) | mlboydaisuke/qwen3.5-2B-CoreML |
| Qwen3.5 0.8B | 1.4 GB | Text | ~20 tok/s | mlboydaisuke/qwen3.5-0.8B-CoreML |
| Qwen3-VL 2B (stateful) | 2.3 GB | Text + image (DeepStack) | ~24 tok/s (256 MB RSS, TTFT 125 ms on resumed turn) | mlboydaisuke/qwen3-vl-2b-stateful-coreml |
| FunctionGemma-270M | 850 MB | Function calling | (specialist) | mlboydaisuke/functiongemma-270m-coreml |
| EmbeddingGemma-300M | 295 MB | Sentence embeddings (768/512/256/128) | (specialist) | mlboydaisuke/embeddinggemma-300m-coreml |
| Qwen3-VL 2B (legacy, recurrent) | 2.9 GB | Text + image (DeepStack) | ~7.5 tok/s | mlboydaisuke/qwen3-vl-2b-coreml |
| Qwen2.5 0.5B | 302 MB | Text | — | mlboydaisuke/qwen2.5-0.5b-coreml |
All numbers are iPhone 17 Pro A19 Pro, 2048-token context, ANE-only (no GPU fallback at runtime unless noted). Methodology: docs/BENCHMARKING.md.
Which one should I pick?
- Multimodal (image / video / audio) → Gemma 4 E2B
- Image + text chat, lowest memory + fastest follow-up → Qwen3-VL 2B (stateful)
- Text-only, maximum quality under ≤3 GB → Qwen3.5 2B
- Text-only, maximum quality → Gemma 4 E4B
- Text-only, fastest + smallest → Qwen3.5 0.8B
- Tool / function calling → FunctionGemma-270M
- Sentence embeddings / RAG → EmbeddingGemma-300M
Demo clips (Gemma 4): text (E2B), text (E4B), image, video, audio. Demo clip (Qwen3-VL 2B): image.
Models Zoo is a pre-built app shipping CoreML-LLM. Open it, pick a model, download, chat.
```bash
open Examples/CoreMLLLMChat/CoreMLLLMChat.xcodeproj
```

Set your development team → build to an iOS 18+ device → Get Model → download → chat. Compute units default to `.cpuAndNeuralEngine` (ANE).
```swift
dependencies: [
    .package(url: "https://github.com/john-rocky/CoreML-LLM", from: "1.4.0"),
]
```

```swift
import CoreMLLLM

// Download + load in one call
let llm = try await CoreMLLLM.load(model: .gemma4e2b) { print($0) }

// Simple / streaming / multi-turn
let answer = try await llm.generate("What is the capital of France?")
for await tok in try await llm.stream("Tell me a story") { print(tok, terminator: "") }

let messages: [CoreMLLLM.Message] = [
    .init(role: .user, content: "Hi!"),
    .init(role: .assistant, content: "Hello!"),
    .init(role: .user, content: "What is 2+2?"),
]
for await tok in try await llm.stream(messages) { print(tok, terminator: "") }

// Multimodal (Gemma 4)
let caption = try await llm.generate("Describe this image", image: cgImage)
let transcript = try await llm.generate("What did they say?", audio: pcmSamples)
let analysis = try await llm.generate(
    "Describe this video frame by frame.",
    videoURL: URL(fileURLWithPath: "/path/to/clip.mp4"),
    videoOptions: .init(fps: 1.0, maxFrames: 6))

// Fastest decode on iPhone 17 Pro A19 Pro: opt into the 3-chunk path.
// Set in the Xcode scheme: Environment Variables → LLM_3CHUNK = 1.
// +8.2 % tok/s, bit-equivalent to the default 4-chunk decode.
```

Downloads run in the background via `URLSessionConfiguration.background` with pause/resume support:
```swift
let url = try await ModelDownloader.shared.download(.gemma4e2b)
ModelDownloader.shared.pause()
ModelDownloader.shared.resumeDownload()
```

Two specialists with their own narrow Swift APIs. Ship them alongside a chat model (Gemma 4, Qwen3.5) for tool calling + RAG.
```swift
import CoreMLLLM

let dir = FileManager.default
    .urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]

// Function calling (850 MB, ≥ 92% ANE, batched prefill T=32)
let fg = try await FunctionGemma.downloadAndLoad(modelsDir: dir)
let (text, call) = try fg.generateFunctionCall(
    userPrompt: "Turn on the flashlight",
    tools: [[
        "type": "function",
        "function": [
            "name": "toggle_flashlight",
            "description": "Turn the phone flashlight on or off.",
            "parameters": ["type": "object", "properties": [:], "required": []],
        ],
    ]])
// call = "call:toggle_flashlight{}"

// Embeddings (295 MB, 99.80% ANE, Matryoshka 768/512/256/128)
let eg = try await EmbeddingGemma.downloadAndLoad(modelsDir: dir)
let vec = try eg.encode(text: "How do cats behave?",
                        task: .retrievalQuery, dim: 768)
```

Standalone sample at Examples/Gemma3Demo/ imports CoreMLLLM and exercises both without pulling in the Gemma 4 chat stack. Full I/O contracts in docs/FUNCTIONGEMMA.md + docs/EMBEDDINGGEMMA.md.
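EmbeddingGemma's Matryoshka training makes the smaller dimensions (512/256/128) prefixes of the full 768-d vector. The standard recipe for searching at a cheaper dimension is truncate-then-renormalize — a library-independent sketch (`matryoshka_truncate` is an illustrative name, not part of this package's API):

```python
import numpy as np

def matryoshka_truncate(vec, dim):
    """Keep the first `dim` components of a full 768-d Matryoshka
    embedding and re-normalize, so cosine similarity stays meaningful
    at the cheaper dimension."""
    assert dim in (768, 512, 256, 128)
    v = np.asarray(vec, dtype=np.float32)[:dim]
    return v / np.linalg.norm(v)
```

Storing only the full 768-d vector is enough: you can do a fast candidate scan at 128-d and re-rank the survivors at 768-d.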
```
Prompt ─┐
        ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│ Prefill ch1  │─►│ Prefill ch2  │─►│ Prefill ch3  │─►│ Prefill ch4  │─► first token
│ L0-7 + PLE   │  │ L8-14, kv13/ │  │ L15-24 shared│  │ L25-34 + LM  │
└──────────────┘  │ kv14 out     │  └──────────────┘  └──────────────┘
       │          └──────────────┘          ▲                 ▲
       │                 │                  │                 │
       │                 └──────────┬───────┴─────────────────┘
       │                            │  kv13_k/v, kv14_k/v (shared)
       ▼                            ▼
writes K/V to persistent SWA caches
       │
       ▼  (decode loop, 1 token per step)
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│ Decode ch1   │─►│ Decode ch2   │─►│ Decode ch3   │─►│ Decode ch4   │─► next token
└──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘
```
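The "writes K/V to persistent SWA caches" step can be sketched as a shift-based sliding-window cache: each decode step drops the oldest position and appends the newest, so attention cost stays O(W) instead of O(ctx). A hypothetical Python sketch (the shipped engine is `ChunkedEngine.swift`; `ShiftKVCache` is an illustrative name):

```python
import numpy as np

class ShiftKVCache:
    """Fixed-size sliding-window KV cache realized by shifting,
    mirroring the shift-based SWA caches used for 28/35 layers."""
    def __init__(self, window, heads, head_dim):
        self.k = np.zeros((heads, window, head_dim), dtype=np.float16)
        self.v = np.zeros_like(self.k)

    def append(self, k_new, v_new):
        # Shift left one position, write the new K/V into the last slot.
        # k_new / v_new: (heads, head_dim) for the single decoded token.
        self.k = np.concatenate([self.k[:, 1:], k_new[:, None]], axis=1)
        self.v = np.concatenate([self.v[:, 1:], v_new[:, None]], axis=1)
        return self.k, self.v
```

Shifting (rather than ring-buffer indexing) keeps the graph free of integer gather ops, which would otherwise fall back to CPU.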
As of v1.4.0, Gemma 4 E2B ships an opt-in 3-chunk decode: with `LLM_3CHUNK=1` set, ch2+ch3 collapse into a single 17-layer chunk — 3 ANE dispatches per token instead of 4, +8.2 % tok/s on iPhone 17 Pro (A19 Pro). See docs/THREE_CHUNK_MAC_BENCH.md.
| Technique | What | Why |
|---|---|---|
| ANERMSNorm | `cat([x,-x])` → LayerNorm → slice | ANE has optimized LayerNorm; bare RMSNorm is slow |
| Conv2d-Linear | `nn.Linear` → `nn.Conv2d(kernel_size=1)` | ANE executes Conv2d ~3× faster than matmul |
| In-graph argmax | Argmax inside the CoreML graph | Avoids shipping 256K logits from ANE to CPU |
| Manual softmax | max/sub/exp/sum/div with explicit fp16 casts | Prevents PyTorch fp16→fp32 upcast in `torch.exp` |
| Pre-computed RoPE | cos/sin as model inputs, looked up in Swift | Eliminates gather / greater_equal (int ops → CPU) |
| Explicit KV I/O | Plain tensor inputs/outputs, no `MLState` | Avoids int64 state indices that break ANE placement |
| Sliding window | Shift-based cache for 28/35 layers | O(W) per step instead of O(ctx) |
| Batched prefill | One CoreML call per 512-token chunk | Order-of-magnitude faster TTFT vs per-token |
| PLE in-graph | Conv2d projection + per-layer norm | 8 ms → 1.8 ms/token |
| 3-chunk decode (v1.4) | Merge chunk2+chunk3 into one 17-layer block | −1 ANE dispatch, +8.2 % tok/s |
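The ANERMSNorm trick rests on a simple identity: `cat([x, -x])` has zero mean, and its per-row variance equals `mean(x²)`, so LayerNorm over the doubled vector degenerates to exactly the RMS statistic. A NumPy sketch of the identity (illustrative only; the conversion pipeline implements this as a PyTorch module):

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = ((x - mu) ** 2).mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ane_rmsnorm(x, eps=1e-6):
    # cat([x, -x]) → LayerNorm → slice: the doubled vector has zero mean
    # and variance mean(x^2), so LayerNorm reduces to RMS scaling and
    # slicing the first half recovers RMSNorm(x) exactly.
    doubled = np.concatenate([x, -x], axis=-1)
    return layer_norm(doubled, eps)[..., : x.shape[-1]]
```

In the exported graph the LayerNorm maps onto the ANE's fused LayerNorm kernel, which is why this beats a hand-built RMSNorm.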
MLX Swift targets the Apple GPU (Metal). Great on a plugged-in Mac pushing a 70B. This library targets the ANE, which matters when:
- The GPU should stay free for rendering, games, or other ML work
- The LLM must coexist with foreground apps without competing for the same silicon
- You want the most power-efficient compute unit on Apple silicon
The two are complementary — run MLX on desktop, run CoreML-LLM inside an iPhone app.
```bash
cd conversion
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Qwen2.5 0.5B (~2 min)
python convert.py --model qwen2.5-0.5b --output ./output/qwen2.5-0.5b

# Gemma 4 — one-shot bundle builder (chunks + embeds + PLE + RoPE +
# tokenizer + model_config.json, ready for USB sideload or HF upload)
python build_gemma4_bundle.py --model gemma4-e2b --ctx 2048
python build_gemma4_bundle.py --model gemma4-e4b --ctx 2048

# Gemma 4 E2B 3-chunk decode (optional, +8.2 % tok/s on iPhone A19 Pro)
python build_gemma4_3way.py --model gemma4-e2b --ctx 2048
python install_3way_bundle.py

# Specialists
python build_functiongemma_bundle.py --ctx 2048 --quantize int8 --prefill-t 32
python build_embeddinggemma_bundle.py --max-seq-len 128 --quantize int8
```

Step-by-step: docs/ADDING_MODELS.md. Full reference (quant, .mlpackage → .mlmodelc, iPhone deployment): docs/CONVERSION.md.
| Topic | File |
|---|---|
| HF conversion, ANE tricks, INT4/INT8/W8A8 rationale | docs/CONVERSION.md |
| Adding a new architecture | docs/ADDING_MODELS.md |
| Benchmark methodology (tok/s, ANE %, memory) | docs/BENCHMARKING.md |
| 3-chunk decode (+8.2 %) | docs/THREE_CHUNK_MAC_BENCH.md |
| `.mlpackage` vs `.mlmodelc`, format gotchas | docs/DEPLOYMENT.md |
| Image pipeline | docs/MULTIMODAL.md |
| Video pipeline | docs/VIDEO_PHASE2_CONTINUATION.md |
| Audio pipeline | docs/AUDIO.md |
| 8K context roadmap, ANE-compat matrix | docs/SPEED_8K.md |
| FunctionGemma I/O contract | docs/FUNCTIONGEMMA.md |
| EmbeddingGemma I/O contract, Matryoshka recipe | docs/EMBEDDINGGEMMA.md |
| Research background, competitive landscape | docs/RESEARCH.md |
| Decision log (WFA, Flash, W8A8, Medusa, EAGLE-3, SDPA fusion, KV alias, Topology I) | docs/EXPERIMENTS.md |
Current release: v1.6.0 (release notes).
- v1.6.0 — Qwen3-VL 2B stateful Phase 2: cross-turn KV reuse + ANE prewarm. Same-prompt 2nd TTFT 4 s → 125 ms (~32×), vision-chat 2nd-turn TTFT 125 ms (target was <500 ms). LCP-matched MLState resume + image-pinned-to-first-user-turn prompt builder + per-chunk dummy predict at load (231 ms total).
- v1.5.0 — Qwen3-VL 2B stateful Phase 1: MLState + slice_update KV cache + multifunction prefill_b8. 24 tok/s decode at 256 MB phys_footprint on iPhone 17 Pro (vs 7.5 tok/s / 1.7 GB on the v1.3 recurrent build — 3.2× decode, 6.4× memory drop). 4-chunk INT8 + fp16 embed sidecar.
- v1.4.0 — Gemma 4 E2B 3-chunk decode (opt-in, `LLM_3CHUNK=1`): 31.6 → 34.2 tok/s on iPhone 17 Pro A19 Pro (+8.2 %). Bit-equivalent to 4-chunk by construction. Closes the ANE-ceiling sweep for E2B; five additional lossless probes (SDPA fusion, K=V alias, Topology I boundary search, blockwise palettization, native softmax) all landed as negative results — see docs/EXPERIMENTS.md.
- v1.3.0 — Qwen3-VL 2B (text + vision on ANE, 196 image tokens, DeepStack injection at L0/1/2, interleaved mRoPE for image tokens). 28-layer GQA, 2.9 GB bundle, ~7.5 tok/s text decode. (Recurrent KV — superseded by v1.5.0 stateful build; kept for backward compatibility.)
- v1.2.0 — FunctionGemma-270M (function calling, batched prefill T=32) and EmbeddingGemma-300M (99.80 % ANE, Matryoshka 768/512/256/128). Standalone `Gemma3Demo` sample.
- v1.1.0 — Qwen3.5 2B (4 INT8 chunks + mmap fp16 embed sidecar, ~200 MB phys_footprint for a 2B-param model).
- v1.0.0 — Qwen3.5 0.8B (first hybrid SSM+attention LLM on CoreML, 99.9 % ANE).
- v0.8.0 — Gemma 4 E4B (42-layer text decoder, 100 % ANE).
- v0.7.0 — Video multimodal (native 384×384 vision encoder, 64 tokens/frame).
- v0.6.2 — Audio multimodal (12-layer Conformer encoder).
Full history: GitHub Releases.
```
Sources/CoreMLLLM/              Swift Package (import CoreMLLLM)
  CoreMLLLM.swift               Public API — load, generate, stream
  ChunkedEngine.swift           SWA decode + prefill engine (3/4-chunk)
  FunctionGemma.swift           Function-calling specialist
  EmbeddingGemma.swift          Sentence-embedding specialist
  ModelDownloader.swift         Background download, pause/resume
  ImageProcessor.swift          Vision preprocessing (image + video)
  AudioProcessor.swift          Mel + Conformer
  …
Examples/CoreMLLLMChat/         iOS sample app (chat + multimodal)
Examples/Gemma3Demo/            Standalone sample (FunctionGemma + EmbeddingGemma)
conversion/                     Python conversion pipeline
  convert.py                    CLI entry point
  build_gemma4_bundle.py        One-shot Gemma 4 bundle builder
  build_gemma4_3way.py          3-chunk decode variant (v1.4)
  build_functiongemma_bundle.py
  build_embeddinggemma_bundle.py
  models/                       Per-architecture PyTorch traces
docs/                           Design docs, benchmarks, decision log
```
- Inference: iOS 18+ / macOS 15+
- Conversion: Python 3.10–3.12, coremltools 8+, PyTorch 2.2+
- Sample apps: Xcode 16+
MIT for the CoreML-LLM code. Model weights inherit the original licenses (Gemma weights: Gemma Terms of Use; Qwen weights: Apache 2.0; Qwen3-VL vision weights: Apache 2.0).