Releases: john-rocky/CoreML-LLM
v1.5.0 — Qwen3-VL 2B stateful MLState + multifunction prefill
Phase 1 ship: ports the ANEMLL Qwen3-1.7B recipe onto Qwen3-VL 2B so the
text backbone runs ANE-resident and TTFT drops ~4× on vision prompts vs
v1.4.0.
iPhone 17 Pro (measured)
| | v1.5.0 | v1.4.0 |
|---|---|---|
| decode | 22–24 tok/s | ~10 tok/s |
| phys_footprint | 256 MB | 1.7 GB |
| vision TTFT (1st turn, ~200 tok prompt) | 2.7 s | ~5.5 s |
| vision TTFT (with chat history, ~600 tok) | 4.0 s | ~16 s |
The 6.4× memory drop is the headline — the KV cache lives inside the ANE via MLState + `slice_update`, so there is no silent GPU spill. Decode is 2.4× faster on top of that, and vision TTFT is 2–4× shorter.
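A minimal NumPy sketch of the `slice_update`-style in-place KV write (shapes here are illustrative, not the shipped ones; the real graph does this on an ANE-resident MLState tensor, so no KV bytes ever cross back into Swift):

```python
import numpy as np

# Preallocated KV cache state, written in place each step — the Core ML
# graph does the equivalent with slice_update on an MLState tensor.
n_kv_heads, max_seq, head_dim = 8, 2048, 128
kv_cache = np.zeros((n_kv_heads, max_seq, head_dim), dtype=np.float16)

def write_kv(cache, new_kv, pos):
    """Write T new key/value rows at position pos (slice_update semantics)."""
    t = new_kv.shape[1]
    cache[:, pos:pos + t, :] = new_kv  # in-place: no marshaling out of the state
    return cache

batch = np.ones((n_kv_heads, 8, head_dim), dtype=np.float16)  # prefill_b8, T=8
step = np.ones((n_kv_heads, 1, head_dim), dtype=np.float16)   # infer, T=1
write_kv(kv_cache, batch, 0)   # one batched prefill step fills 8 slots
write_kv(kv_cache, step, 8)    # decode appends one slot at a time
```

Both functions of the multifunction `.mlpackage` (`infer` and `prefill_b8`) share this one state, which is why prefill gets the 8× per-token win without duplicating the cache.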
What landed
- `MLState` + `slice_update` KV writes — no Swift KV marshaling
- Multifunction `.mlpackage` per chunk: `infer` (T=1) + `prefill_b8` (T=8) sharing one `kv_cache_0` state
- `chunk_0_vision` multifunction with DeepStack injection + T=8 batched prefill — the 196 image-pad tokens get the same 8× per-token speedup
- New ModelDownloader entry "Qwen3-VL 2B (stateful, Phase 1)" pulling from a dedicated HF repo (`mlboydaisuke/qwen3-vl-2b-stateful-coreml`)
Backward compat
The legacy qwen3vl_2b entry (v1.4.0 recurrent + batched T=8) is left
untouched alongside the new entry. Existing users keep working; the new
entry is opt-in via the picker.
Not included (Phase 2)
- KV state reuse across chat turns (replies after the first turn should go near-instant)
- App-launch ANE compile pre-warm
- Prompt-builder restructure so the second turn of a vision chat benefits from KV reuse
Diagnostic
Per-prefill breakdown is logged once per generate:
[Qwen3VL2BStateful] prefill inputIds=N batchedText=A×8 batchedVision=B×8 t1Steps=K elapsed=Xms
Credits
Recipe ported from Anemll/Anemll's
Qwen3-1.7B Core ML build.
v1.3.0 — Qwen3-VL 2B (text + vision on iPhone ANE)
First vision-capable Qwen on iPhone ANE. 28-layer GQA text backbone + DeepStack-injected Qwen3-VL vision tower, shipping end-to-end image description at 7.5 tok/s on iPhone 17 Pro.
Highlights
- Qwen3-VL 2B text on ANE — 4 INT8 body chunks × 7 layers + chunk_head. 93% ANE per body chunk, 84.6% head. Max context 2048. mmap'd fp16 `embed_weight.bin` keeps `phys_footprint` ~200 MB.
- Vision encoder on ANE — fixed-grid 448×448, 196 image tokens per picture via `spatial_merge=2`. 90.5% ANE @ INT8 palettized (406 MB).
- DeepStack-aware `chunk_0_vision` on ANE — same 7 layers as `chunk_0` plus `ds_0`/`ds_1`/`ds_2` inputs and a `visual_active` scalar gate. 93.1% ANE. On image-pad token steps the generator memcpys the merger row into `hidden_in` and flips the gate; otherwise zeroed DeepStack buffers + gate=0 keep the graph static.
- Interleaved mRoPE for image tokens — Qwen3-VL's `mrope_section=[24, 20, 20]` with `mrope_interleaved=True` cycles T/H/W across the first 60 dims of half head_dim. Swift synthesizes per-image cos/sin; text tokens after the image span get their RoPE position shifted to match HF's `get_rope_index`.
- HF-processor-matching patchify in Swift — Core ML vision input is pre-patchified to `(784, 1536)` so the in-graph reshape stays rank 5. An earlier rank-10 in-graph permute compiled on Mac Studio ANE but faulted the iPhone A18 Pro ANE with `EXC_BAD_ACCESS` at `MLModel(contentsOf:)`.
- Bundle: `mlboydaisuke/qwen3-vl-2b-coreml` — 2.9 GB (text chunks + `chunk_0_vision` + `embed_weight.bin` + `vision.mlpackage`).
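The interleaved dim assignment can be sketched in a few lines (the function name is illustrative; this only shows the cycling rule, not the cos/sin synthesis):

```python
# With mrope_interleaved=True, rotary dims cycle T, H, W, T, H, W, ...
# until each axis in mrope_section has spent its budget.
def interleave_mrope_dims(mrope_section):
    remaining = list(mrope_section)
    axes = []  # one entry per rotary dim: 0=T, 1=H, 2=W
    while any(r > 0 for r in remaining):
        for axis, r in enumerate(remaining):
            if r > 0:
                axes.append(axis)
                remaining[axis] -= 1
    return axes

dims = interleave_mrope_dims([24, 20, 20])
# pattern starts T,H,W,T,H,W...; once H and W are exhausted,
# the remaining dims all fall to the T axis
```
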
iPhone 17 Pro results
| Path | Decode | Prefill (text) | Vision prefill (196 tokens) |
|---|---|---|---|
| Text only | ~7.5 tok/s | recurrent | — |
| Text + image | ~7.5 tok/s | recurrent | ~26 s (first token) |
Integration notes
- Swift: `Qwen3VL2BGenerator` accepts an optional `visionFeatures` + `imagePadTokenId`; `LLMRunner` auto-detects both the encoder and the `chunk_0_vision` chunk and flips `hasVision = true` only when both load.
- Device deploy: `devicectl device copy to --domain-type appDataContainer --domain-identifier com.example.CoreMLLLMChat --source .../vision.mlmodelc --destination Documents/Models/qwen3-vl-2b/qwen3_vl_2b_vision/vision.mlmodelc` (also works for `chunk_0_vision.mlmodelc`). `ModelDownloader` lists both paths so the HF download populates them too.
Full diff: #130
v1.2.0 — N=1024 prefill + faster load for Gemma 4 E2B
Highlights
Gemma 4 E2B — N=1024 batched prefill
- Doubles prefill capacity from 512 → 1024 tokens, unlocking multi-turn chat with >512-token context without falling back to per-token decode.
- Shipped via new HF branch `n1024`; existing `main` branch (N=512) preserved for older app builds.
- Weight.bin is shared with decode chunks — download size unchanged.
- Fix (a878c44): `writeSlidingFromPrefill` now writes only the last W source positions when `realLen > W`. Prior code crashed with `EXC_BAD_ACCESS` on any prompt > 512 tokens once `PREFILL_N > W`.
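The fixed write rule can be sketched like this (names are illustrative, not the Swift API; the point is clamping the copy to the window before indexing):

```python
# Sliding-window cache write after a batched prefill: the cache holds
# only W slots, so when realLen > W only the *last* W source positions
# may be copied. The pre-fix code copied realLen rows → buffer overrun.
def write_sliding_from_prefill(cache_w, prefill_kv, real_len):
    w = len(cache_w)
    n = min(real_len, w)                       # the fix: clamp to the window
    src = prefill_kv[real_len - n:real_len]    # last n surviving positions
    cache_w[:n] = src                          # slot 0 = oldest survivor
    return cache_w

W = 512
cache = [None] * W
kv = list(range(1024))                         # realLen = 1024 > W
write_sliding_from_prefill(cache, kv, 1024)
```

For `realLen ≤ W` the clamp is a no-op, which is why cached N=512 models are unaffected.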
Faster time-to-usable
`LLM_DEFER_PREFILL` default-on (14a9965): decode chunks load synchronously, prefill chunks load in the background. ~80 s → ~35 s on iPhone 17 Pro. A first prompt during the load window falls back to per-token decode. Set `LLM_DEFER_PREFILL=0` to opt out.
Other
- Gemma 3 — FunctionGemma-270M + EmbeddingGemma-300M on ANE (#129)
- perf: prefill ↔ decode transition prewarm (eeaa488, dc22c06)
- docs: iPhone ANE is realLen-aware — multifunction variants not a lever (22fef1c)
- revert: PrefixCache default-on (d471a7f) — math didn't favor broad deployment
Compatibility
- New clones: default to HF `n1024` branch → N=1024 prefill
- Existing clones without `git pull`: keep using HF `main` → N=512 (no change)
- Existing clones that `git pull` + rebuild: get the Swift SWA fix; existing cached N=512 models keep working (fix is no-op for realLen ≤ W)
v1.1.0 — Qwen3.5 2B on iPhone ANE
What's new
First 2B-class hybrid SSM + attention LLM shipped on CoreML.
Qwen3.5 2B on iPhone ANE
- Qwen3.5 2B on iPhone ANE — 24-layer hybrid SSM + attention (2.04 B params) split into 4 INT8 transformer chunks (6 layers each). Every chunk compiles on iPhone 17 Pro ANE at ≥ 90% op placement. ~17 tok/s decode, 2048-token context. Bundle: `mlboydaisuke/qwen3.5-2B-CoreML`.
- mmap'd fp16 embed sidecar — `embed_tokens.weight` ships as a raw 1 GB fp16 file (`embed_weight.bin`); Swift does `mmap(..., MAP_PRIVATE)` with `MADV_RANDOM`. Only the handful of rows touched per prompt page in, and those pages stay "clean" so they don't count against `phys_footprint`. Reported memory during generation: ~200 MB (vs ~2 GB with a CoreML chunk_embed mlpackage that would dequantize the full 1 GB into process memory).
- Why 4 chunks (not 2) — palettization shrinks disk, not ANE memory: INT8 weights re-expand to fp16 inside the ANE region, and iPhone ANE's per-mlprogram compile envelope rejected the 2-chunk split (~2 GB fp16/chunk) with `MILCompilerForANE: Couldn't communicate with a helper application` → silent GPU fallback at a 3.4 GB Metal heap and 7 tok/s. 4 chunks at ≤ 1 GB INT8 (≤ ~1.7 GB fp16) each match the Gemma 4 E4B envelope that's proven ANE-resident.
- 2048-token context window — bumped from 128 so chat turns don't truncate after ~10 lines. Only full-attention state scales with `max_seq` (6 layers × 2 × `max_seq` × 256 × fp16 × 2); +22 MB total vs the 128-token ceiling.
- App binary shrunk ~5 GB — removed stale Qwen3.5-0.8B mlpackages from the Xcode target.
- ChatView tap-to-dismiss — tap outside the input field or drag-scroll to dismiss the keyboard.
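The "+22 MB" claim follows directly from the formula in the context-window bullet; a quick sanity check:

```python
# Only full-attention state scales with max_seq:
# 6 layers × 2 (K and V) × max_seq × 256 × 2 bytes (fp16) × 2.
def full_attn_kv_bytes(max_seq, layers=6, dim=256):
    return layers * 2 * max_seq * dim * 2 * 2

delta_mb = (full_attn_kv_bytes(2048) - full_attn_kv_bytes(128)) / (1024 * 1024)
# ≈ 22.5 MB going from the 128-token ceiling to 2048, matching the note
```
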
Device spec (iPhone 17 Pro)
| metric | value |
|---|---|
| Decode | 17 tok/s |
| phys_footprint (inference) | ~200 MB |
| Metal heap (sustained) | 0 GB |
| ANE placement (chunk_a/b/c/d) | 90.7% / 91.1% / 90.7% / 90.8% |
| Context window | 2048 tokens |
| Bundle (HF) | 2.4 GB |
Full changelog: docs/RELEASE_v1_1_0.md.
v1.0.3 — INT8 Qwen3.5 + cleaner defaults
Qwen3.5 0.8B becomes measurably faster and smaller. Ship quality confirmed on Mac ANE + iPhone 17 Pro.
Headline numbers (Mac ANE, 80-token long-gen)
| variant | decode tok/s | bundle | loops? |
|---|---|---|---|
| FP16 greedy (prior) | 43 | 1.5 GB | none |
| INT8 greedy (new default) | 52 (+20%) | 754 MB (50%) | none |
iPhone 17 Pro shows the same relative speedup (ANE decode: 20 → 28 tok/s).
Why INT8 is faster, not just smaller
k-means palettization keeps weights in an 8-bit index + fp16 LUT form that ANE's dequant hardware unpacks inline with matmul. Net compute is ~20% faster than raw fp16 on Apple's NPU because memory bandwidth (not raw matmul throughput) is the bottleneck for a 0.8B model.
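A rough sketch of what 8-bit palettization looks like mechanically — weights become a uint8 index tensor plus a 256-entry fp16 LUT (quantile-initialized Lloyd iterations stand in for the real k-means; this is an illustration, not the coremltools implementation):

```python
import numpy as np

# 8-bit palettization: store a uint8 index per weight + a 256-entry fp16
# LUT. Disk/RAM traffic halves vs fp16; ANE dequantizes inline with matmul.
def palettize(weights, n_centroids=256, iters=10):
    flat = weights.ravel().astype(np.float32)
    lut = np.quantile(flat, np.linspace(0, 1, n_centroids))  # init on quantiles
    for _ in range(iters):                                   # Lloyd refinement
        idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
        for c in range(n_centroids):
            if np.any(idx == c):
                lut[c] = flat[idx == c].mean()
    idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
    return idx.astype(np.uint8).reshape(weights.shape), lut.astype(np.float16)

w = np.random.randn(64, 64).astype(np.float16)
indices, lut = palettize(w)
recon = lut[indices]   # what the ANE materializes on the fly during matmul
```

The 1-byte index vs 2-byte fp16 weight is where the ~50% bundle shrink (1.5 GB → 754 MB) comes from.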
Correctness fixes bundled in
- Full EOS stop set: 248044 (`<|endoftext|>`) + 248045 (`<|im_start|>`) + 248046 (`<|im_end|>`) + `tok.eosTokenId`. Prior releases missed 248044, which let the model emit `<|endoftext|>` as a visible literal and then fabricate a fake "Human:" follow-up turn.
- System-role filter in chat template: drops UI-status system messages ("Loading…", "Model loaded!") so they don't get forwarded as real system prompts that derail the instruct model.
- Multi-byte UTF-8 streaming: accumulated-decode + diff-emit so emoji 😊 and CJK glyphs that span multiple tokens render cleanly instead of showing U+FFFD mojibake.
- Stride-safe logit reading: fallback to contiguous layout if Core ML reports zero strides for an ANE-dispatched fp16 output (seen in the wild as "!!!!" degenerate output).
Swift marshaling: custom MLFeatureProvider
Qwen35Generator now uses a Qwen35DecodeFeatures adapter that delegates state-input lookup to the previous decode call's MLFeatureProvider — 48 MLFeatureValue wrappers per step become zero. Combined with reusable MLMultiArrays for token/position/cos/sin and a native Float16 NEON fastArgmax, this cut iPhone decode latency from ~70 ms/step to ~35 ms/step.
Diagnostic
On load the generator now prints the MLComputePlan op placement so you can confirm ANE vs GPU at a glance:
[Qwen35] compute plan (int8, requested=ANE): total=2218 ANE=2008 (90.5%) GPU=0 (0.0%) CPU=5 (0.2%)
Included PRs
- #120 INT8 palettized decode as default
- #118 UTF-8 streaming fix
- #117 README refresh for v1.0.0
- #116 ANE default restore
- (#119 Gemma marshal cache closed — measured impact <0.5 tok/s, not shipping.)
Memory footprint
A 0.8B model inherently needs its weights in RAM (~750 MB INT8, ~1.4 GB fp16 equivalent once dequantized on device). Total app memory on iPhone in ANE mode sits around 1.6-2 GB. "0 GB Metal heap" means the GPU is not allocating — ANE runs on the unified-memory weight mmap + its own plan cache.
v1.0.2 — UTF-8 streaming fix + docs
Hotfix for emoji and CJK glyph rendering in Qwen3.5 chat output.
Fix
Qwen's BPE tokenizer often splits multi-byte UTF-8 (😊, some CJK glyphs) across multiple tokens. The per-token decode path in v1.0.0 / v1.0.1 yielded partial byte sequences that rendered as U+FFFD replacement characters (mojibake).
v1.0.2 accumulates all generated token ids and decodes the full sequence each step, emitting only the diff. Multi-byte codepoints are now emitted as a single unit when complete.
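The accumulate-and-diff loop can be sketched as follows (the toy tokenizer is hypothetical — it just reproduces the failure mode of an emoji split across byte-level tokens):

```python
# Accumulated-decode + diff-emit: decode ALL generated ids each step and
# emit only the suffix past what was already shown; hold back any trailing
# U+FFFD that signals a still-incomplete multi-byte sequence.
emoji = "😊".encode("utf-8")
fake_vocab = {1: b"Hi ", 2: emoji[:2], 3: emoji[2:], 4: b"!"}  # toy BPE split

def fake_decode(ids):
    data = b"".join(fake_vocab[i] for i in ids)
    return data.decode("utf-8", errors="replace")

emitted, shown = [], ""
for tok in [1, 2, 3, 4]:
    emitted.append(tok)
    full = fake_decode(emitted)
    new = full[len(shown):]
    if new and not new.endswith("\ufffd"):  # partial bytes → wait for more
        shown = full
```

Per-token decode would have emitted the replacement character at token 2; the diff loop waits until token 3 completes the codepoint.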
Included
- #118 fix(qwen3.5): preserve multi-byte UTF-8 in streaming decode
- #117 docs: README refreshed for v1.0.0 — added Qwen3.5 performance table, model-selection guide, What's new entry
- #116 fix(qwen3.5): ANE as default compute units
Current behavior (iPhone 17 Pro, CPU+ANE default)
- Decode: ~20 tok/s
- Prefill (recurrent via decode): ~20 tok/s
- Prefill (non-stateful 2-chunk bench): ~170 tok/s
- Metal heap sustained: 0 GB
- Total app memory: ~1.6 GB (fp16 weight mmap + ANE plan cache)
- Chat template applied automatically (instruct-tuned)
- Emoji and CJK now render correctly mid-stream
v1.0.1 — ANE default restored
Hotfix for v1.0.0. The default compute units for Qwen3.5 had been temporarily switched to GPU during profile/benchmark debugging and weren't restored before the ship commit. v1.0.0 ran on GPU by default, consuming ~3 GB of extra Metal heap beyond the model weights.
Fix
Default restored to .cpuAndNeuralEngine for both prefill and decode. Behaves as the v1.0.0 release notes originally described.
Memory note (applies to both ANE and GPU paths)
A 0.8B fp16 model inherently needs ~1.4 GB of RAM just for weights — this is unified memory on Apple Silicon, shared between CPU mmap and the ANE plan cache. Total app memory on ANE mode sits around 1.6-2 GB (weights + ~200 MB ANE runtime + ~300 MB app baseline).
"0 GB Metal heap" means the GPU is not allocating additional buffer memory. On GPU mode, Metal allocates another ~3 GB on top of the 1.6 GB baseline, bringing total to ~4.6 GB.
| path | Metal heap | total app memory |
|---|---|---|
| ANE (default) | 0 GB | ~1.6 GB |
| GPU (bit-exact) | ~3 GB | ~4.6 GB |
| CPU | 0 GB | ~1.6 GB |
Included
- #116 fix(qwen3.5): restore ANE default compute units
v1.0.0 — Qwen3.5 0.8B shipping on iPhone ANE
First stable release. Ships Qwen3.5 0.8B (hybrid Gated-DeltaNet SSM + attention) as a first-class chat model in the iOS sample app, running on Apple Neural Engine.
This is also the first CoreML port of a hybrid SSM/attention LLM on iPhone we're aware of — prior CoreML LLMs have been pure Transformer.
What's new
Model: mlboydaisuke/qwen3.5-0.8B-CoreML
- 1.4 GB fp16 decode mlpackage (prefill performed recurrently via the same model)
- Runs on CPU / GPU / Apple Neural Engine
- 99.9% ANE operator placement
iPhone 17 Pro performance (decode, steady state):
- ANE: 20 tok/s, 0 GB Metal heap
- GPU: 22 tok/s, ~3 GB Metal heap, bit-exact with fp32
- Prefill (non-stateful 2-chunk path): 170 tok/s on ANE — 3.0× LiteRT-LM baseline
App integration:
- Available Models → "Qwen3.5 0.8B (ANE)" → chat via the regular ChatView
- Qwen chat template applied automatically (instruct-tuned, proper `<|im_start|>`/`<|im_end|>` wrapping)
- EOS detection and graceful early stop
- Photo/video/mic pickers auto-hide (text-only model)
Precision note
On ANE, argmax on the 248K-vocab logits is fp16-fragile — strict greedy top-1 matches the fp32 oracle only 60% of the time on short prompts. However:
- oracle top-1 is in ANE top-3 for 100% of tested positions (both Mac M4 and iPhone A18)
- Hidden state over 24 layers stays at cos ≥ 0.9998 vs fp32
- Generated text is semantically equivalent to fp32 PyTorch (measured on 3 prompts × 50 tokens: same characters, same plot, same narrative arc)
Sampling-mode generation (temperature > 0) is effectively indistinguishable from fp32 output.
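The top-k containment check behind that claim is easy to express (random logits stand in for real model outputs here; the real harness compares per-position fp16 ANE logits against the fp32 oracle):

```python
import numpy as np

# Fraction of positions where the fp32 oracle's top-1 token appears
# inside the fp16 run's top-k.
def oracle_in_topk(fp16_logits, fp32_logits, k=3):
    hits = 0
    for a, b in zip(fp16_logits, fp32_logits):
        topk = np.argsort(a)[-k:]          # fp16 top-k token ids
        hits += int(np.argmax(b) in topk)  # oracle top-1 contained?
    return hits / len(fp32_logits)

rng = np.random.default_rng(0)
fp32 = rng.standard_normal((8, 50257)).astype(np.float32)  # simulated oracle
fp16 = fp32.astype(np.float16)                             # simulated precision loss
rate = oracle_in_topk(fp16, fp32, k=3)
```
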
Behind the ship
Swift marshal optimizations (decode: 13 → 22 tok/s on GPU, 14 → 20 tok/s on ANE):
- Reusable MLMultiArrays + cached MLFeatureValue wrappers for the 4 non-state inputs
- Custom MLFeatureProvider that delegates state lookup to previous output — skips 48 MLFeatureValue wraps per step
- Single-pass Float16 argmax via native NEON compare (no Float32 conversion buffer)
- vDSP-accelerated sampling path
- Explicit memset on state init (MLMultiArray isn't guaranteed to zero-init on iOS)
Research artifacts (non-shipping) in conversion/:
- Chunk-split decode with fp32 boundary (proves ANE drift is uniform per-layer)
- In-graph argmax variant (measured 2 ms/step slower on Mac — argmax forces CPU placement)
- Conv2d 1x1 replacement for ANE precision (no effect)
- MLState API migration attempt (GPU 180 tok/s achievable but parity broken — Gemma-era issue reconfirmed)
Installation
Open the sample app → tap Get Model → select Qwen3.5 0.8B (ANE) → download (~1.4 GB). First load takes ~4 min on ANE for E5 compile (cached thereafter).
Full PR: #112
v0.8.0 — Gemma 4 E4B
Gemma 4 E4B on Apple Neural Engine. Second shipping model option for CoreML-LLM, alongside E2B.
Highlights
- Gemma 4 E4B — 42 layers, hidden=2560, 2 KV heads, text-only decoder
- ~14 tok/s baseline decode on iPhone 17 Pro at INT4
- 100% ANE placement — verified via `MLComputePlan`
- Bundle published at `mlboydaisuke/gemma-4-E4B-coreml` (5.5 GB INT4-palettized)
What it does
Switch between E2B and E4B in the Models picker. E2B stays byte-identical to v0.7.0 (multimodal: image + video + audio + text). E4B is text-only but higher quality — 4B effective parameters vs E2B's 2B.
| | E2B (v0.7.0) | E4B (new) |
|---|---|---|
| Parameters | ~2B effective | ~4B effective |
| num_hidden_layers | 35 | 42 |
| hidden_size | 1536 | 2560 |
| num_key_value_heads | 1 | 2 |
| Decode speed | ~31 tok/s | ~14 tok/s |
| Per-step latency | ~32 ms | ~71 ms |
| ANE placement | 99.78% | 100% |
| Bundle size (INT4) | 3.1 GB | 5.5 GB |
Under the hood
- Generalized Gemma 4 conversion pipeline — chunk boundaries and KV-producer layer indices are derived from the HF model config. `compute_chunk_boundaries(config)` gives E2B's hand-tuned `[(0,8),(8,15),(15,25),(25,35)]` unchanged; E4B yields `[(0,12),(12,24),(24,33),(33,42)]`. Adding future Gemma 4 variants is a registry entry away.
- Output alias convention — the chunk-2 producer KV outputs are named `kv13_*`/`kv14_*` regardless of actual layer index (E2B: L13/L14, E4B: L22/L23). Keeps ~20 Swift call sites untouched across variants.
- Dynamic KV cache shapes in Swift — `ChunkedEngine` reads `(slots, num_kv_heads)` from each chunk's `K_sliding_in`/`K_full_in` input description. E2B (nkv=1, 7/1 + 5/2) unchanged; E4B (nkv=2, 10/2 + 10/2) just works.
- Safer model switching — `LLMRunner.loadModel` now releases the previous model before allocating the new one, avoiding a ~8 GB double-buffer peak during E2B ↔ E4B swaps.
- Self-healing downloader — skips prefill weight-sharing when prefill metadata wasn't part of the fetched file list; the engine cleans up zero-metadata `prefill_chunk*.mlmodelc` directories on launch.
- One-shot bundle builder — `python conversion/build_gemma4_bundle.py --model gemma4-e4b --ctx 2048` produces a complete ready-to-ship directory (chunks + compiled `.mlmodelc` + INT8 embeds + INT8 PLE + RoPE + tokenizer + `model_config.json`).
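The "registry entry away" idea can be sketched like this — the boundary values are the ones quoted above, but the registry shape and validator are illustrative, not the actual pipeline code:

```python
# Per-variant chunk boundaries keyed by model id, validated to tile the
# layer stack contiguously from 0 to num_layers.
CHUNK_REGISTRY = {
    "gemma4-e2b": {"num_layers": 35,
                   "boundaries": [(0, 8), (8, 15), (15, 25), (25, 35)]},
    "gemma4-e4b": {"num_layers": 42,
                   "boundaries": [(0, 12), (12, 24), (24, 33), (33, 42)]},
}

def validate_boundaries(entry):
    bs, n = entry["boundaries"], entry["num_layers"]
    assert bs[0][0] == 0 and bs[-1][1] == n      # covers the whole stack
    for (a, b), (c, d) in zip(bs, bs[1:]):
        assert b == c and a < b                  # contiguous, non-empty chunks
    return True

assert all(validate_boundaries(e) for e in CHUNK_REGISTRY.values())
```

A future variant then only needs a new entry that passes the same validation.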
Upgrading
- Swift Package users — bump to `from: "0.8.0"`. E2B behavior preserved byte-for-byte; E4B appears as a new `ModelInfo.gemma4e4b` option.
- iOS sample app — Clean Build Folder, rebuild. The Models picker will show "Gemma 4 E4B" alongside E2B.
- Converters — new `--model gemma4-e4b` flag for `build_verify_chunks.py`; new `build_gemma4_bundle.py` end-to-end bundler.
See the README for the updated architecture, performance tables, and conversion recipes. Full diff: #97.
Not in this release
- Vision / audio towers for E4B (text-only decoder only)
- Prefill chunks for E4B (decode-only ship — TTFT is unbatched)
- Speculative decoding (MTP/EAGLE-3 drafter) for E4B
v0.7.0 — Video Multimodal
Video understanding on iPhone — fully on-device
Gemma 4 E2B now processes video clips through a native video vision encoder, running entirely on-device via CoreML.
What's new
- Native video vision encoder (
vision_video.mlmodelc) — traces the HF vision tower at video-grade resolution (384x384, 64 tokens/frame). Parity vs HF forward: cosine = 1.0000. Ships as part of the Gemma 4 E2B bundle on HuggingFace (3.1 GB total). Falls back to Swift-side 2x2 pooling when absent. - Uniform frame sampling —
maxFramesdistributed evenly across the full clip duration (not just the first N seconds).fpscaps the sampling rate for short clips. Matches Gemma 4'snum_framessemantic. - Per-frame thumbnails in chat — the user's video message bubble shows the exact frames the encoder received, with
MM:SScaptions. <|video|>placeholder token (258884) — uses the correct video-specific token id per HF'sGemma4Processor, with bidirectional attention within each frame's vision group during prefill.- Static output shape —
vision_video.mlmodelcoutput is[1, 64, 1536](fully static), eliminating the E5RT dynamic-shape error on iOS. - Audio encoder fix — re-uploaded
audio.mlmodelcto HF with consistent spec + weights (fixes error code -14 from a prior partial upload).
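The sampling rule can be sketched in a few lines (function and parameter names are illustrative; the shipped code is Swift):

```python
# Uniform frame sampling: maxFrames timestamps spread evenly over the
# whole clip; fps caps the frame count for short clips so we don't
# oversample a 2-second video.
def sample_timestamps(duration_s, max_frames, fps):
    n = min(max_frames, max(1, int(duration_s * fps)))
    # midpoints of n equal spans → even coverage of the full duration,
    # not just the first N seconds
    return [duration_s * (i + 0.5) / n for i in range(n)]

long_clip = sample_timestamps(60.0, 6, 1.0)   # 6 frames across all 60 s
short_clip = sample_timestamps(2.0, 6, 1.0)   # fps-capped to 2 frames
```
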
Usage
let llm = try await CoreMLLLM.load(model: .gemma4e2b)
let analysis = try await llm.generate(
"Describe this video frame by frame.",
videoURL: URL(fileURLWithPath: "/path/to/clip.mp4"),
    videoOptions: .init(fps: 1.0, maxFrames: 6))
PRs included
- #81 — Phase 1 pool fallback + Phase 2 native encoder
- #82 — Uniform frame sampling + per-frame thumbnails
- #84 — `<|video|>` placeholder + bidirectional vision group mask
- #95 — Background download fix (adopt surviving tasks)
Performance
No decode speed regression. Video encoder runs on .cpuAndGPU (same as image/audio encoders). Decode chunks remain on ANE at 99.78% placement.
| Metric | v0.6.2 | v0.7.0 |
|---|---|---|
| Bundle size | 2.8 GB | 3.1 GB |
| Modalities | Image + Audio + Text | Image + Video + Audio + Text |
| Decode speed | ~31 tok/s | ~31 tok/s |