Guidance for AI coding agents (and humans) working in fmbench. This is the
canonical agent guide; CLAUDE.md points here.

fmbench is a deterministic benchmark for Apple Foundation Models structured output and tool
calling, driven entirely through the on-device fm CLI (macOS Apple Intelligence). No
network, no third-party Python deps — standard library only. Outputs are graded in code,
not by an LLM judge, and every run uses --greedy so scores are reproducible.

- Requires macOS with Apple Intelligence enabled, so `/usr/bin/fm` exists. Check with
  `fm available` (expect "System model available").
- Python is managed with `uv`. Do not use bare `pip` or activate the venv manually.

uv run python run.py # all suites + perf/resource track
uv run python run.py --quick # 4 cases/suite, no perf — fast smoke test
uv run python run.py --suite routing # one suite (repeatable: --suite a --suite b)
uv run python run.py --no-perf # skip perf/resource sampling
uv run python run.py --power # add CPU/GPU/ANE power table (prompts for sudo)
uv run python run.py --ane # ANE hardware-interval capture via Instruments (~20s, needs xctrace)
uv run python -m py_compile run.py fmbench/*.py # syntax check
uv run python -m fmbench.schemas schemas # regenerate schema files

Use `--quick` while iterating: it exercises the full pipeline (schema → fm →
parse → grade → report) in ~20s without the multi-minute perf track.

- `run.py`: entrypoint. Loads cases, applies the per-suite default instruction (or a
  case override), runs each through `fm`, grades, prints colored tables, writes reports.
- `fmbench/schemas.py`: builds schemas in the fm dialect and exposes a named registry
  (`build_all()` / `write_all()`). Cases reference schemas by key.
- `fmbench/runner.py`: invokes `fm respond`, strips ANSI, parses JSON, and flags
  guardrail refusals as a distinct outcome.
- `fmbench/grading.py`: pure functions. Structural validation plus one grader per suite
  (`routing`, `extraction`, `constraints`, `failure_modes`). No I/O, no `fm` calls.
- `fmbench/perf.py`: throughput, time-to-first-token, per-process CPU sampling, and
  optional `powermetrics` power sampling.
- `fmbench/ane.py`: records a system-wide Instruments "Core ML" trace via `xctrace`
  while generating, exports the `ane-hw-intervals-internal` table, and reports ANE op
  count + active time + duty cycle. The authoritative "is it the ANE?" signal; no sudo.
- `fmbench/report.py`: renders JSON + Markdown + a self-contained HTML page.
- `cases/*.jsonl`: one file per suite; one JSON object per line.

Data flow: cases/<suite>.jsonl → run.py → runner.run_fm → grading.GRADERS[suite]
→ report.write_all → results/<timestamp>/{report.html,report.md,results.json}.
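
The runner step described above (strip ANSI, then parse JSON) can be sketched as follows. This is a minimal illustration, not the code in `fmbench/runner.py`: the names `strip_ansi` and `parse_output` and the `outcome` labels are hypothetical, and guardrail-refusal detection is left out because the refusal signal is not described here.

```python
import json
import re

# Matches CSI escape sequences such as "\x1b[32m" (color) or "\x1b[0m" (reset).
ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove terminal color/formatting codes so the payload is plain text."""
    return ANSI_RE.sub("", text)


def parse_output(raw: str) -> dict:
    """Hypothetical helper: classify raw CLI output as parsed JSON or a parse error."""
    cleaned = strip_ansi(raw).strip()
    try:
        return {"outcome": "ok", "value": json.loads(cleaned)}
    except json.JSONDecodeError as exc:
        # Keep the cleaned text so a report can show what the model produced.
        return {"outcome": "parse_error", "error": str(exc), "raw": cleaned}
```

Returning a tagged dict instead of raising keeps a malformed response as a scoreable outcome rather than a crashed run.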

- Every object must include an `x-order` array listing property order. Omit it and `fm`
  rejects the schema with "data couldn't be read because it is missing." The
  `obj()` / `union()` builders in `schemas.py` handle this; always go through them.
- `enum` and `const` work but are not exposed by `fm schema object`. The decoder enforces
  them hard (an out-of-enum value literally cannot be emitted). Use `const` to pin tool
  discriminators and `enum` to lock choice fields.
- Nested objects go in root-level `$defs` and are referenced with `$ref`. Unions (tool
  routers) use a root of `{title, $defs, anyOf:[{$ref}...]}`.

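
To make the dialect rules above concrete, here is a rough sketch of what builders like `obj()` and `union()` might produce. The signatures, the `type`/`required` keys, the `#/$defs/<title>` reference form, and the example tools are assumptions borrowed from standard JSON Schema; the real builders in `schemas.py` may differ.

```python
def obj(title: str, properties: dict) -> dict:
    """Hypothetical object builder: always emits x-order in declaration order."""
    return {
        "title": title,
        "type": "object",
        "properties": properties,
        "required": list(properties),  # assumption: every property is required
        "x-order": list(properties),   # the key fm rejects the schema without
    }


def union(title: str, members: list[dict]) -> dict:
    """Hypothetical tool router: members live in root $defs, root is anyOf of $refs."""
    return {
        "title": title,
        "$defs": {m["title"]: m for m in members},
        "anyOf": [{"$ref": f"#/$defs/{m['title']}"} for m in members],
    }


# Illustrative tools: const pins the discriminator, enum locks a choice field.
get_weather = obj("GetWeather", {
    "tool": {"const": "get_weather"},
    "city": {"type": "string"},
    "unit": {"enum": ["celsius", "fahrenheit"]},
})
set_timer = obj("SetTimer", {
    "tool": {"const": "set_timer"},
    "seconds": {"type": "integer"},
})
router = union("ToolRouter", [get_weather, set_timer])
```

Deriving `x-order` from the property dict means the order is stated once and cannot drift from the properties themselves.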

- Expected values in `cases/*.jsonl` must be objectively correct: the harness scores
  against ground truth, so a real model miss should surface as a low score, never be
  tuned away. String comparison is case/space-insensitive; numbers use a small tolerance.
- Known model limits are characterized, not failed: arithmetic and relative-date
  reasoning are unreliable, and guardrails occasionally false-positive on benign prompts.
  Keep these in the `failure_modes` suite as observations, excluded from capability scores.
- `results/` and `schemas/` are generated and gitignored. Don't commit them.
- Match the existing style: small focused modules, stdlib only, type hints, short
  docstrings explaining why.

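
The comparison rules above (case/space-insensitive strings, tolerant numbers) could look roughly like this. It is a sketch under stated assumptions, not the code in `grading.py`: the function names are hypothetical, "space-insensitive" is read as ignoring all whitespace, and the `1e-6` tolerance is a placeholder for whatever the graders actually use.

```python
import math


def normalize(text: str) -> str:
    """Lowercase and drop all whitespace (assumed reading of 'space-insensitive')."""
    return "".join(text.lower().split())


def values_match(expected, actual, tol: float = 1e-6) -> bool:
    """Hypothetical leaf comparison; tol is an assumed absolute tolerance."""
    # bool is a subclass of int, so handle it first to keep True from matching 1.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return expected is actual
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return math.isclose(expected, actual, abs_tol=tol)
    if isinstance(expected, str) and isinstance(actual, str):
        return normalize(expected) == normalize(actual)
    return expected == actual
```

Keeping the function pure (no I/O, no `fm` calls) matches the convention for graders and makes it trivial to unit test.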

- A new case: add a JSON line to the relevant `cases/<suite>.jsonl` with `id`, `suite`,
  `schema` (a registry key), `prompt`, and `expect` shaped for that suite's grader (see
  existing lines).
- A new schema: add a builder + registry entry in `schemas.py`; reference its key from
  cases. Run `--quick` to confirm `fm` accepts it.
- A new suite: add a grader to `grading.GRADERS`, a default instruction in `run.py`, a
  `cases/<suite>.jsonl`, and a title in `report.SUITE_TITLES`.

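
As an illustration of the "new case" step, the snippet below builds one JSONL line and checks that the five required keys are present. The helper `case_line`, the registry key `tool_router`, and the shape of `expect` are hypothetical; copy the real shape from existing lines in the suite's file.

```python
import json

REQUIRED_KEYS = ("id", "suite", "schema", "prompt", "expect")


def case_line(case: dict) -> str:
    """Hypothetical helper: validate required keys, return a single JSONL line."""
    missing = [key for key in REQUIRED_KEYS if key not in case]
    if missing:
        raise ValueError(f"case is missing keys: {missing}")
    # json.dumps emits no raw newlines, so the result is safe as one JSONL record.
    return json.dumps(case, ensure_ascii=False)


example = {
    "id": "routing-example-001",
    "suite": "routing",
    "schema": "tool_router",  # hypothetical registry key
    "prompt": "Set a timer for five minutes.",
    "expect": {"tool": "set_timer", "seconds": 300},  # hypothetical grader shape
}
print(case_line(example))
```

Append the printed line to `cases/routing.jsonl`, then run with `--quick` to confirm the case loads and grades.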
Run `uv run python run.py --quick` and confirm it completes with sensible per-suite
scores. For perf/report changes, do a full `uv run python run.py` and check the
generated `report.html`.