This folder contains SwiftUI source files for a small iOS app that can load a .cellm model + tokenizer.json and run text generation through cellm’s Rust core via the C FFI.
- LLM text generation (tokenize prompt → prefill → decode tokens)
- VLM image description through the native `.cellm` path in `cellm-sdk` (vision encoder + multimodal prompt packing + text decode)
- Backend request from the iOS UI (`CPU`/`Metal`) through the FFI (`cellm_engine_create_v4`)
- KV encoding + TurboQuant runtime knobs are also passed through `cellm_engine_create_v4`
- Active backend reporting (`cellm_engine_backend_name`) so the app confirms what was selected
- One-tap sample asset download in-app (GitHub-hosted `.cellm` + sample image + tokenizer)
Note: backend selection is strict in this build. If Metal is requested and unavailable, engine creation fails instead of silently falling back to CPU.
- Native VLM path is currently CPU math in this phase.
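The strict-selection rule in the note above can be illustrated with a toy model in Rust. This is a sketch only: `Backend`, `CreateError`, `Engine`, the `metal_available` flag, and the returned name strings are hypothetical stand-ins, not the signatures of `cellm_engine_create_v4` or `cellm_engine_backend_name`.

```rust
/// Backends the UI can request (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cpu,
    Metal,
}

/// Error returned when the requested backend cannot be provided.
#[derive(Debug, PartialEq, Eq)]
enum CreateError {
    BackendUnavailable(Backend),
}

/// Stand-in for an engine handle; it only records the active backend.
#[derive(Debug)]
struct Engine {
    backend: Backend,
}

impl Engine {
    /// Strict creation: a Metal request on a device without Metal is an
    /// error, never a silent downgrade to CPU.
    fn create(requested: Backend, metal_available: bool) -> Result<Engine, CreateError> {
        match requested {
            Backend::Metal if !metal_available => {
                Err(CreateError::BackendUnavailable(Backend::Metal))
            }
            other => Ok(Engine { backend: other }),
        }
    }

    /// Reports which backend is active (the strings are assumptions).
    fn backend_name(&self) -> &'static str {
        match self.backend {
            Backend::Cpu => "cpu",
            Backend::Metal => "metal",
        }
    }
}
```

With this shape, the caller decides what to do on failure (for example, show an error and let the user pick `CPU`), and the reported backend name always matches what was requested.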
- Build the XCFramework used by the app target:
  `cd /cellm`
  `zsh scripts/build_xcframework.sh`
- Open the generated project:
  `/cellm/bindings/ios/CellmDemo.xcodeproj`
- Add the Swift files from this folder into your app target:
  - `CellmFFI.swift`
  - `LLMView.swift`
  - `VLMView.swift`
  - `CellmDemoApp.swift` (optional; or copy the view code into your app’s existing `App` entry)
- Build + Run on a real iPhone (recommended). You can either:
  - tap the in-app sample download buttons, or
  - use the document picker manually.

Manual picker flow:
- the model file (example: `qwen2.5-0.5b-int8-v1.cellm` or `gemma-4-E2B-it-int4-aggr-v5.cellmd`)
- the tokenizer file (example: `tokenizer.json`)
- backend (`Metal` recommended on iPhone/iPad with Apple GPU)
- In the `LLM` tab, tap Download Qwen stable model + tokenizer (~1.6 GB)
- Tap Run Qwen Smoke Test
- Default smoke prompt: `Return exactly one uppercase letter: R`
- The output panel shows generation diagnostics:
  - `prompt_tokens`
  - `generated_tokens`
  - `first_piece`
  - `prefill/decode/total` timing in ms
- In the `LLM` tab, tap Download Gemma 3 1B int8 model + tokenizer (~1.2 GB)
- Prompt examples:
  - What is the capital of France?
  - If I buy 12 donuts and eat 5, how many donuts are left for tomorrow?
- Choose `Metal` for acceleration, or `CPU` for deterministic CPU-only validation.
- https://huggingface.co/jeffasante/cellm-models/resolve/main/qwen2.5-0.5b-int8-v1/qwen2.5-0.5b-int8-v1.cellm?download=true
- https://huggingface.co/jeffasante/cellm-models/resolve/main/qwen2.5-0.5b-int8-v1/tokenizer.json?download=true
- https://huggingface.co/jeffasante/cellm-models/resolve/main/qwen2.5-0.5b-int8-v1/tokenizer_config.json?download=true
- https://github.com/jeffasante/cellm/blob/main/models/smollm2-135m-int8.cellm
- https://huggingface.co/jeffasante/cellm-models/resolve/main/gemma-4-E2B-it-int4-aggr-v5/gemma-4-E2B-it-int4-aggr-v5.cellmd?download=true
- https://huggingface.co/jeffasante/cellm-models/resolve/main/gemma-4-E2B-it-int4-aggr-v5/tokenizer.json?download=true
- https://huggingface.co/jeffasante/cellm-models/resolve/main/gemma-4-E2B-it-int4-aggr-v5/tokenizer_config.json?download=true
- https://huggingface.co/HuggingFaceTB/SmolLM2-135M/resolve/main/tokenizer.json
- https://github.com/jeffasante/cellm/blob/main/models/smolvlm-256m-int8.cellm
- https://github.com/jeffasante/cellm/blob/main/models/test_images/rococo_1.jpg
The app normalizes GitHub blob URLs to raw-download URLs before fetching.
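The normalization amounts to a small string rewrite. The app's own implementation is not shown in this README; the Rust function below (hypothetical name `normalize_github_blob_url`) only illustrates the mapping, assuming the standard GitHub layout `github.com/<owner>/<repo>/blob/<ref>/<path>` becomes `raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>`.

```rust
/// Rewrites a GitHub "blob" page URL into its raw-download form.
/// Any URL that does not match the expected layout is returned unchanged.
/// Hypothetical helper for illustration; not the app's actual code.
fn normalize_github_blob_url(url: &str) -> String {
    const PREFIX: &str = "https://github.com/";
    let rest = match url.strip_prefix(PREFIX) {
        Some(rest) => rest,
        None => return url.to_string(), // e.g. Hugging Face URLs pass through
    };
    // Expected layout after the prefix: <owner>/<repo>/blob/<ref>/<path...>
    // splitn(4) keeps "<ref>/<path...>" together as the last element.
    let parts: Vec<&str> = rest.splitn(4, '/').collect();
    if parts.len() == 4 && parts[2] == "blob" && !parts[3].is_empty() {
        format!(
            "https://raw.githubusercontent.com/{}/{}/{}",
            parts[0], parts[1], parts[3]
        )
    } else {
        url.to_string()
    }
}
```

For example, the sample image URL listed above would map to `https://raw.githubusercontent.com/jeffasante/cellm/main/models/test_images/rococo_1.jpg`, while the `huggingface.co/.../resolve/...` URLs are left as they are.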
For Gemma 3 in the `LLM` tab, Advanced Actions now support downloading only the model or only the tokenizer JSON files, for phone import workflows.
- Large model files are slow to load over the simulator and can exceed simulator storage limits. A physical device is the fastest way to validate end-to-end.
- Keep `tokenizer.json` next to the `.cellm` model when you manage files on disk; the app lets you pick both explicitly.
- Qwen and LLM backend selection are strict; no automatic CPU fallback when `Metal` is requested.
- Qwen compact int4 can still be degenerate; use the stable Qwen model when validating response quality on-device.
To reduce long decode stalls and memory churn on iPhone when using Qwen with Metal selected, we patched the KV cache Metal path in:
crates/cellm-cache/src/kvcache.rs
- Added reusable scratch buffers inside `MetalKvStorage`:
  - `k_f16`, `v_f16`, `k_f32`, `v_f32`, `q_f32`, `out_f32`, `bases_u32`
  - these are kept and grown as needed, instead of allocating every token step.
- Replaced tiny scalar Metal buffers with inline constants:
  - switched `base/len/seq/head_dim/...` kernel args to `set_bytes(...)`
  - removes many tiny per-dispatch buffer allocations.
- Wrapped command submission in `autoreleasepool`:
  - dispatch path now uses an autorelease pool around command buffer + encoder work
  - helps prevent memory buildup during long generation loops on iOS.
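The first change (scratch buffers that are kept and grown as needed) follows a common pattern, sketched below with plain `Vec<f32>` storage. This is an assumption-laden illustration: the real `MetalKvStorage` holds Metal buffers, and the `Scratch` type and `views` method here are hypothetical; only the field names `k_f32` and `v_f32` are borrowed from the list above.

```rust
/// Hypothetical scratch storage that lives as long as the cache does.
#[derive(Default)]
struct Scratch {
    k_f32: Vec<f32>,
    v_f32: Vec<f32>,
}

impl Scratch {
    /// Returns mutable views of exactly `len` elements each.
    /// The backing vectors grow only when they are too small, so a steady
    /// decode loop stops allocating once the largest size has been seen.
    /// Contents may be stale from an earlier step; callers must overwrite.
    fn views(&mut self, len: usize) -> (&mut [f32], &mut [f32]) {
        if self.k_f32.len() < len {
            self.k_f32.resize(len, 0.0);
        }
        if self.v_f32.len() < len {
            self.v_f32.resize(len, 0.0);
        }
        (&mut self.k_f32[..len], &mut self.v_f32[..len])
    }
}
```

Calling `views` once per token step with the same or a smaller `len` reuses the existing storage, which is the behavior the patch aims for instead of allocating fresh buffers on every step.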
- Lower allocation pressure in decode hot path.
- Lower risk of iOS memory kill (`IDEDebugSessionErrorDomain Code 11`) during long runs.
- Better baseline for the next phase (full attention/math parity and full Metal path).
- `cargo check --workspace` passed.
- `xcodebuild ... CellmDemo ...` simulator build passed (`BUILD SUCCEEDED`).