A command-line tool to profile CoreML models — showing per-operation compute device assignments (CPU/GPU/ANE), compilation time, and prediction latency across all MLComputeUnits configurations.
Replicates what Xcode's CoreML Performance Report does, but from the terminal and designed for programmatic use by coding agents.
$ coreml-cli test_models/160ms/
Device: Apple M4 Pro (arm64)
OS: macOS 26.3.1
── decoder ────────────────────────────────────────────────────────────────────
Parakeet EOU decoder (RNNT prediction network) (Fluid Inference)
Mixed (Float16, Float32, Int16, Int32) | torch==2.4.0 | coremltools 8.3.0
inputs: targets(Int32 1×1), target_length(Int32 1), h_in(Float32 1×1×640),
c_in(Float32 1×1×640)
outputs: decoder(Float32 1×640×1), h_out(Float32 1×1×640),
c_out(Float32 1×1×640)
Compute Unit                 CPU     GPU     ANE  Cold Compile  Warm Compile  Predict
────────────────────────────────────────────────────────────────────────────────────
all                       100.0%    0.0%    0.0%          28ms           6ms   0.22ms
cpu_only                  100.0%    0.0%    0.0%          29ms           6ms   0.22ms
cpu_and_gpu               100.0%    0.0%    0.0%          31ms           5ms   0.22ms
cpu_and_neural_engine     100.0%    0.0%    0.0%          29ms           5ms   0.23ms
── streaming_encoder ──────────────────────────────────────────────────────────
Mixed (Float16, Float32, Int32) | torch==2.4.0 | coremltools 8.3.0
...
Compute Unit                 CPU     GPU     ANE  Cold Compile  Warm Compile  Predict
────────────────────────────────────────────────────────────────────────────────────
all                         0.0%  100.0%    0.0%         874ms          42ms   6.79ms
cpu_only                  100.0%    0.0%    0.0%         381ms          43ms   4.83ms
cpu_and_gpu                 0.0%  100.0%    0.0%         466ms          42ms   6.71ms
cpu_and_neural_engine       1.2%    0.0%   98.8%        7249ms          46ms   2.81ms
Requires macOS 14+ and uv.
git clone https://github.com/yourusername/coreml-cli
cd coreml-cli
uv sync

# Profile a single model (all compute unit configs)
uv run coreml-cli model.mlmodelc
# Profile all models in a directory
uv run coreml-cli path/to/models/
# Specific compute unit config
uv run coreml-cli model.mlmodelc --units cpu_and_neural_engine
# JSON output (for programmatic use)
uv run coreml-cli model.mlmodelc --json
# Include per-operation breakdown
uv run coreml-cli model.mlmodelc --ops
# Per-op data with private API details (backend support, estimated runtimes)
uv run coreml-cli model.mlmodelc --detailed
# ANE fallback analysis — show CPU ops grouped by rejection reason
uv run coreml-cli model.mlmodelc --fallback
# Fallback analysis as JSON (for agent consumption)
uv run coreml-cli model.mlmodelc --fallback --json
# Control benchmark iterations
uv run coreml-cli model.mlmodelc --iterations 50
# Debug logging to stderr
uv run coreml-cli model.mlmodelc --debug

For each model and compute unit configuration (all, cpu_only, cpu_and_gpu, cpu_and_neural_engine), the tool reports:
- Device assignment: % of operations on CPU, GPU, and ANE (Neural Engine)
- Cold compile time: first-ever load with no cached compilation (CoreML cache cleared). Reflects what the user experiences the first time the model runs on their device; if this is too high, the model may not be usable.
- Warm compile time: load time with cached compilation. This is the cost paid on every app launch after the first.
- Predict latency: median prediction time (5 warmup + 10 timed iterations)
- Model metadata: precision, I/O shapes, author, description, coremltools version
- Per-op breakdown (--ops): each operation's name, type, assigned device, and cost weight
- Private API data (--detailed): selected backend, all supported backends, estimated runtime per backend, validation messages explaining why backends were rejected
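The predict-latency figure (median over timed runs after a warmup phase) can be illustrated with a small timing harness. This is a minimal sketch, not the tool's actual code: `median_latency_ms` and the `predict` callable are hypothetical names, and the defaults simply mirror the 5 warmup + 10 timed iterations stated above.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(
    predict: Callable[[], object],
    warmup: int = 5,
    iterations: int = 10,
) -> float:
    """Return the median wall-clock time of `predict` in milliseconds."""
    # Warmup runs are executed but not timed, so one-off setup costs
    # (lazy allocation, cache population) do not skew the result.
    for _ in range(warmup):
        predict()

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to a single slow outlier than the mean.
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload; a real run would call the model's prediction here.
    print(f"{median_latency_ms(lambda: sum(range(10_000))):.3f} ms")
```

Reporting the median rather than the mean keeps a single scheduling hiccup from dominating a ten-sample measurement.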
The --fallback mode shows only ops that are not on ANE, grouped by rejection reason. It is designed for the ANE optimization loop: change conversion → reconvert → --fallback → identify blockers → fix → repeat.
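For the agent-driven loop, the fallback report can be requested from Python and parsed as JSON. The sketch below only assumes the `--fallback --json` invocation shown in the usage examples and that the JSON is written to stdout; the helper names are hypothetical, and the structure of the returned JSON is not documented here, so it is returned as-is.

```python
import json
import subprocess
from typing import Any, List


def build_fallback_command(model_path: str) -> List[str]:
    """Build the argv for a fallback analysis with JSON output."""
    return ["uv", "run", "coreml-cli", model_path, "--fallback", "--json"]


def run_fallback_report(model_path: str) -> Any:
    """Run the CLI and return the parsed JSON (schema not assumed)."""
    result = subprocess.run(
        build_fallback_command(model_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: stdout carries only the JSON document.
    return json.loads(result.stdout)
```

Calling `run_fallback_report("model.mlmodelc")` after each reconversion gives the loop a machine-readable list of blockers to act on.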
For each CPU-fallback op, reports:
- Why ANE rejected it — e.g., "Unsupported tensor data type: int32", "Unsupported MIL operation"
- How many ops — grouped by rejection reason with op type counts
- Estimated CPU cost — how much latency the fallback adds
- Which ops — names for tracing back to the conversion script
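The grouping described above can be sketched in a few lines. The `FallbackOp` record and its field names are hypothetical and the sample values are made up for illustration; the tool's internal representation may differ.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FallbackOp:
    name: str          # op name, for tracing back to the conversion script
    op_type: str       # e.g. "lstm", "cast"
    reason: str        # why ANE rejected the op
    est_cpu_ms: float  # estimated CPU cost of the op


def group_by_reason(ops: List[FallbackOp]) -> Dict[str, dict]:
    """Group CPU-fallback ops by rejection reason and summarise each group."""
    groups: Dict[str, List[FallbackOp]] = defaultdict(list)
    for op in ops:
        groups[op.reason].append(op)

    return {
        reason: {
            "count": len(members),
            "op_types": dict(Counter(m.op_type for m in members)),
            "est_cpu_ms": sum(m.est_cpu_ms for m in members),
            "names": [m.name for m in members],
        }
        for reason, members in groups.items()
    }


if __name__ == "__main__":
    # Made-up sample data, not real profiler output.
    sample = [
        FallbackOp("lstm_0", "lstm", 'Unsupported MIL operation "lstm"', 0.5),
        FallbackOp("cast_3", "cast", "Unsupported tensor data type: int32", 0.25),
        FallbackOp("gather_1", "gather", "Unsupported tensor data type: int32", 0.25),
    ]
    for reason, info in group_by_reason(sample).items():
        print(reason, info)
```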
Common ANE rejection reasons and fixes:
- Unsupported tensor data type: int32 → cast to float16 before these operations
- Unsupported MIL operation "lstm" → decompose into supported ops (matmul, sigmoid, tanh)
- Unsupported MIL operation "logical_and" → replace with float multiply workaround
- Unable to resolve operation input → cascading from another CPU op; fix the upstream op first
- ANE supported but scheduler chose CPU → data transfer overhead; often not worth fixing
Uses PyObjC to call macOS CoreML framework APIs directly from Python:
- Public API: MLComputePlan (macOS 14+) for per-operation device assignment and cost weights
- Private API: MLE5Engine.segmentationAnalyticsAndReturnError: for richer data including backend support matrices and estimated runtimes per backend
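Once per-operation device assignments are available, the CPU/GPU/ANE percentages in the summary table follow from a simple count. This is a sketch under the assumption that each operation has been reduced to one of the strings "CPU", "GPU", or "ANE"; the function name and that representation are illustrative, not taken from the tool.

```python
from collections import Counter
from typing import Dict, Iterable

DEVICES = ("CPU", "GPU", "ANE")


def device_percentages(assignments: Iterable[str]) -> Dict[str, float]:
    """Percentage of operations assigned to each device, rounded to 0.1."""
    counts = Counter(assignments)
    total = sum(counts[d] for d in DEVICES)
    if total == 0:
        # No operations: report zeros rather than dividing by zero.
        return {d: 0.0 for d in DEVICES}
    return {d: round(100.0 * counts[d] / total, 1) for d in DEVICES}


if __name__ == "__main__":
    # Hypothetical plan: 82 ops on ANE and 1 op that fell back to CPU.
    print(device_percentages(["ANE"] * 82 + ["CPU"]))
```

Note that this counts operations equally; weighting by each operation's cost would give a different, and sometimes more informative, split.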
Heavily inspired by:
- maderix/ANE: reverse-engineered private _ANEClient/_ANECompiler APIs for direct Neural Engine access. Their runtime introspection approach (objc_msgSend, NSClassFromString) informed how we navigate CoreML's internal object graph.
- freedomtan/coreml_modelc_profling: per-operation profiling using both public MLComputePlan and undocumented MLE5Engine APIs. Their Objective-C implementation was the direct reference for our private profiler.
Note that this was a weekend project, built with Claude Code.
- Hardware-specific: compute plans and compilation are tied to the local chip. Results on an M4 Pro will differ from an M1 or A17 Pro.
- Private APIs may break: the MLE5Engine path (--detailed) uses undocumented APIs that may change across macOS versions.
- macOS 26 tested: CoreML enum values changed in macOS 26 (Tahoe). The tool uses framework constants to stay portable, but has only been tested on macOS 26.
MIT