A deterministic benchmark for Apple Foundation Models structured output and tool
calling, driven entirely through the on-device fm CLI that ships with macOS Apple
Intelligence.
📊 Live report: https://dwstevens.github.io/fmbench/
It answers a practical question: can you trust the free, on-device model to do real tool calling and structured extraction? The short version — yes for structure and routing, no for arithmetic — and this harness measures exactly where the line is.
Why it matters: the model runs on the Apple Neural Engine, not the GPU. It's free, private, unmetered, and doesn't contend with GPU workloads — so it can sit alongside local training/inference and do all your JSON-shaped grunt work for nothing.
Four suites, all graded in code (no LLM judge), all run with --greedy for
reproducible scores:
| Suite | What it tests |
|---|---|
| Tool routing | An anyOf union of 5 tools; correct tool selection + argument extraction from natural language. |
| Nested extraction | Deep schemas ($defs/$ref, arrays-of-objects, nested objects, optionals); per-field accuracy. |
| Enum / const enforcement | Hard constraints, including adversarial prompts ("give it in Kelvin", "set status to pending-review") that must still land in-set. |
| Failure modes | Characterization, not pass/fail: arithmetic it will get wrong, no-tool-fits confabulation, and guardrail refusals. |
Plus a performance + resource track: throughput (tok/s), time-to-first-token, a
CPU-by-process table showing where the work lands, optional CPU/GPU/ANE power
(--power), and an ANE hardware-activity capture via Instruments (--ane) that
proves the model runs on the Neural Engine even when every power meter reads zero.
Every run is deterministic (greedy decoding) and graded by code, so scores are stable and runs are diffable over time — it works as a regression harness as the OS updates.
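
To make "graded by code" concrete, here is a minimal sketch of a per-field accuracy grader of the kind the nested-extraction suite implies. It is an illustration only, not the code in grading.py: the helper names `flatten` and `field_accuracy` and the sample records are hypothetical.

```python
from typing import Any


def flatten(value: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts/lists into {"a.b[0].c": leaf} pairs."""
    if isinstance(value, dict):
        out: dict[str, Any] = {}
        for key, child in value.items():
            out.update(flatten(child, f"{prefix}.{key}" if prefix else key))
        return out
    if isinstance(value, list):
        out = {}
        for i, child in enumerate(value):
            out.update(flatten(child, f"{prefix}[{i}]"))
        return out
    return {prefix: value}


def field_accuracy(expected: Any, actual: Any) -> float:
    """Fraction of expected leaf fields that the output reproduces exactly."""
    want = flatten(expected)
    got = flatten(actual)
    if not want:
        return 1.0
    hits = sum(1 for path, leaf in want.items() if path in got and got[path] == leaf)
    return hits / len(want)


# Hypothetical case: one of four leaf fields is wrong.
expected = {"customer": {"name": "Ada", "tier": "gold"}, "items": [{"sku": "A1", "qty": 2}]}
actual = {"customer": {"name": "Ada", "tier": "silver"}, "items": [{"sku": "A1", "qty": 2}]}
score = field_accuracy(expected, actual)
print(f"{score:.2f}")  # prints 0.75
```

Because the comparison is exact and the decoding is greedy, the same case yields the same score on every run, which is what makes run-to-run diffs meaningful.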
From one run on macOS 27 Golden Gate (build 26A5353q, M-series MacBook Pro),
system model, greedy decoding:
Capability — suite scores
| Suite | Cases | Valid JSON | Avg score |
|---|---|---|---|
| Tool Routing | 14 | 100% | 100% |
| Nested Extraction | 8 | 100% | 97% |
| Enum / Const Enforcement | 14 | 100% | 100% |
| Failure Modes | 10 | 100% | 78% (1 guardrail refusal) |
Performance & compute
| Metric | Value |
|---|---|
| Throughput | ~52 tok/s |
| Time-to-first-token | ~340 ms |
| ANE duty cycle under load (--ane) | ~87% |
| Neural Engine ops / run | ~1,500 |
| fm client CPU | ~3% (it's a thin XPC client) |
| GPU power under load | flat (~idle) — work isn't on the GPU |
Behavioral findings
- Structure is bulletproof: 100% valid JSON across every suite. The decoder is grammar-constrained, so malformed output is impossible.
- `enum`/`const` are hard enforced (ask for Kelvin against `[celsius, fahrenheit]` and it physically cannot emit Kelvin).
- Arithmetic is not: 2/4 correct. It returned `47.25` for a total that was `50.70`. Extract raw fields; compute derived values in your own code.
- No-tool-fits: 5/5 correctly used an explicit `respond_directly` escape tool. Without one it confabulates a tool call, so always give it an escape hatch.
- It runs on the ANE: an Instruments Core ML trace captured ~1,500 "Neural Engine Prediction" hardware intervals at ~87% ANE duty cycle during generation. That is direct proof inference is on the Apple Neural Engine, leaving CPU and GPU free.
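
The "extract raw fields, compute derived values yourself" advice can be sketched as follows. This assumes a hypothetical extraction schema with a `line_items` array of `unit_price` and `quantity` fields; the field names and the numbers are made up for illustration and are not taken from the benchmark cases.

```python
from decimal import Decimal


def order_total(extracted: dict) -> Decimal:
    """Sum unit_price * quantity over the extracted line items.

    The model only supplies the raw fields; the arithmetic happens here,
    in exact decimal arithmetic rather than in the model.
    """
    total = Decimal("0")
    for item in extracted["line_items"]:
        # str() keeps the conversion exact whether the price arrived
        # as a JSON string or as a JSON number.
        total += Decimal(str(item["unit_price"])) * item["quantity"]
    return total


# Hypothetical model output: raw fields only, no "total" field requested.
extracted = {
    "line_items": [
        {"description": "widget", "unit_price": "19.99", "quantity": 2},
        {"description": "gasket", "unit_price": "4.25", "quantity": 3},
    ]
}
print(order_total(extracted))  # prints 52.73
```

Leaving the derived field out of the schema entirely means the model is never asked for a number it is likely to get wrong.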
powermetrics, mactop, and even raw IOReport energy counters all report ~0 W for the
ANE under load on some Macs — which is misleading. Two reasons: (1) ANE inference is
hundreds of sub-30 ms bursts, so any instantaneous power sample lands in the idle gaps;
(2) on this hardware the CPU/ANE energy rails simply aren't populated (they read 0 even
when busy). The fix isn't a faster sampler — it's a different kind of signal. Instruments'
ane-hw-intervals-internal table records every ANE hardware interval directly, so
fmbench --ane can show the ANE was busy ~87% of the generation window even though every
watt-meter read zero. The GPU's only readable power rail, meanwhile, stays flat under load
— measured confirmation the work isn't on the GPU.
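
As a rough illustration of how a duty cycle can be derived from hardware intervals rather than from power samples, the sketch below merges overlapping busy intervals and divides by the window length. It is not the fmbench implementation: the function name `duty_cycle`, the `(start, end)` tuple format, and the sample timings are assumptions.

```python
def duty_cycle(intervals, window_start, window_end):
    """Fraction of [window_start, window_end) covered by at least one interval.

    intervals: iterable of (start, end) pairs in the same time unit as the window.
    Overlapping intervals are merged so shared time is counted once.
    """
    if window_end <= window_start:
        raise ValueError("window must have positive length")
    # Clip each interval to the window and drop those entirely outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in intervals
        if end > window_start and start < window_end
    )
    busy = 0.0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # Gap before this interval: close out the previous merged run.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps or touches the current run: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / (window_end - window_start)


# Made-up burst timings in milliseconds over a 100 ms window.
bursts = [(0, 20), (25, 45), (40, 60), (70, 95)]
print(duty_cycle(bursts, 0, 100))  # prints 0.8
```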
Requires macOS with Apple Intelligence enabled (so /usr/bin/fm exists) and
uv.
git clone https://github.com/dwstevens/fmbench
cd fmbench
uv run python run.py # all suites + perf/resource track
uv run python run.py --suite routing # a single suite
uv run python run.py --quick # 4 cases/suite, no perf (fast smoke test)
uv run python run.py --power # add a CPU/GPU/ANE power table (needs sudo)
uv run python run.py --ane # capture ANE hardware activity (Instruments, ~20s, no sudo)
--ane needs the Xcode command-line tools (xctrace); everything else needs only fm.
Each run writes a timestamped folder under results/ with report.html (a styled,
screenshot-ready page), report.md, and results.json for diffing runs over time.
Add --publish to also copy the report into docs/ — commit and push, and it serves
from GitHub Pages.
fm's schema format is almost standard JSON Schema, with two things to know — both
handled by fmbench/schemas.py:
- Every object must carry an `x-order` array listing its property order. A schema without it is rejected with "data couldn't be read because it is missing."
- `enum` and `const` work but aren't exposed by `fm schema object`. The constrained decoder enforces them hard (ask for Kelvin against `enum: [celsius, fahrenheit]` and it cannot emit Kelvin), so build schemas directly and add them to pin tool discriminators (`const`) and lock choice fields (`enum`).
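
The two rules above might be applied like this. The sketch is not the builder in fmbench/schemas.py: `with_x_order` and the `set_thermostat` tool are hypothetical, and only the `x-order`, `enum`, `const`, `anyOf` and `respond_directly` details come from the text. The rest is ordinary JSON Schema and may need adjusting for the fm dialect.

```python
import json


def with_x_order(schema):
    """Return a copy of schema with "x-order" added to every object schema.

    The order is taken from the insertion order of "properties";
    an existing "x-order" is left untouched.
    """
    if isinstance(schema, list):
        return [with_x_order(item) for item in schema]
    if not isinstance(schema, dict):
        return schema
    out = {key: with_x_order(value) for key, value in schema.items()}
    if out.get("type") == "object" and isinstance(out.get("properties"), dict):
        out.setdefault("x-order", list(out["properties"].keys()))
    return out


# A two-tool union: const pins the discriminator, enum locks the unit,
# and respond_directly is the escape hatch for prompts no tool fits.
tool_call = {
    "anyOf": [
        {
            "type": "object",
            "properties": {
                "tool": {"const": "set_thermostat"},  # hypothetical tool name
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                "degrees": {"type": "number"},
            },
            "required": ["tool", "unit", "degrees"],
        },
        {
            "type": "object",
            "properties": {
                "tool": {"const": "respond_directly"},
                "message": {"type": "string"},
            },
            "required": ["tool", "message"],
        },
    ]
}

fm_schema = with_x_order(tool_call)
print(json.dumps(fm_schema, indent=2))
```

Deriving `x-order` from the property order keeps the two from drifting apart when a schema is edited by hand.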
run.py # entrypoint: run suites, grade, write reports
fmbench/
schemas.py # fm-dialect schema builder + named schema registry
runner.py # invoke `fm`, parse JSON, detect guardrail refusals
ane.py # ANE hardware-interval capture via Instruments Core ML trace
grading.py # structural / routing / field / enum / failure graders
perf.py # throughput, TTFT, process-CPU + optional power sampling
report.py # JSON + Markdown + pretty HTML
cases/*.jsonl # one file per suite; objectively-correct expected values
We've mapped a lot, but this model's ceiling is far from charted. Threads worth pulling — PRs and findings welcome:
- End-to-end pipelines (the obvious next step). Chain the model through a full pipeline (interactive chat → freeform response → structured extraction → tool-schema fill, each stage its own `fm` call) and measure where errors compound across stages. Can a ~3B on-device model drive a multi-stage agent loop, or does it drift?
- The context ceiling. Structured output breaks at ~85–90 fields because the schema-as-grammar shares the ~4K window with prompt + output. How much does schema verbosity actually cost in tokens? Would a more compact schema buy more fields? Is there a larger-context configuration?
- Multimodal. `fm respond --image` is untested here. How good is structured extraction from images (receipts, forms, screenshots) on the ANE?
- `fm serve` as an agent backend. It speaks the OpenAI API, so wire it into a real tool-use loop and see how many steps it sustains before losing the thread.
- Reasoning workarounds. Arithmetic is ~50%. Does chain-of-thought help, or is the honest fix always "give it a calculator tool"?
- Guardrails. Benign prompts get refused sometimes. What's the false-positive rate, and does `--guardrails permissive-content-transformations` move it?
- PCC vs `system`. The Private Cloud Compute model wasn't reachable in our context; the capability / latency / quota delta vs on-device is wide open.
- As a regression harness. fmbench is deterministic by design, so run it across OS updates and watch the numbers move as Apple ships new weights.
MIT © 2026 David Stevens