[Perf] Run the Parakeet Conformer encoder on the Apple Neural Engine (CoreML) #199

Open
beshkenadze wants to merge 1 commit into Blaizzy:main from beshkenadze:feat/parakeet-coreml-ane-encoder

@beshkenadze

Contributor

Run the Parakeet Conformer encoder on the Apple Neural Engine (CoreML)

Summary

Adds an optional CoreML/ANE path for the Parakeet Conformer encoder, behind a flag.
The encoder (≈95% of the compute) runs on the Apple Neural Engine via CoreML while the
TDT decoder and chunking stay in MLX. Same transcript, lower power, and a small speedup
on top.

It plugs into the existing EncoderExecutionImplementation hook in ParakeetModel
(new .coreML case + enableCoreMLEncoder(modelURL:)), so the decode path is untouched.
It falls back to the MLX encoder if CoreML is unavailable. Public MLModel +
MLComputeUnits only — no private ANE APIs.
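
As a rough illustration of the "public MLModel + MLComputeUnits" route with a fallback, the sketch below loads a compiled CoreML model and returns nil on failure so the caller can keep using the MLX encoder. The helper name `loadANEEncoder` is hypothetical and not taken from the PR, and the choice of `.cpuAndNeuralEngine` is an assumption: the description does not say which `MLComputeUnits` value the wrapper uses.

```swift
import CoreML
import Foundation

/// Hypothetical helper (not from the PR): tries to load a compiled CoreML
/// encoder and returns nil on failure so the caller can stay on the MLX encoder.
@available(macOS 13.0, iOS 16.0, *)
func loadANEEncoder(compiledModelURL: URL) -> MLModel? {
    let configuration = MLModelConfiguration()
    // Assumption: prefer CPU + Neural Engine and keep the GPU out of the encoder path.
    configuration.computeUnits = .cpuAndNeuralEngine
    do {
        return try MLModel(contentsOf: compiledModelURL, configuration: configuration)
    } catch {
        // Loading failed (missing file, unsupported OS, ...): signal "use MLX instead".
        return nil
    }
}
```

Note that `MLModel(contentsOf:)` expects a compiled model; a `.mlpackage` would first need to be compiled, for example with `MLModel.compileModel(at:)`.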

# auto-downloads the prebuilt encoder from Hugging Face
mlx-audio-swift-stt --model beshkenadze/parakeet-tdt-0.6b-v3-mlx-fp16 \
    --audio in.wav --output-path out --ane --chunk-duration 9.95

// .off (default) · .on (default HF repo) · .repo("id") · .package(localURL)
let model = try await ParakeetModel.fromPretrained(repo, aneEncoder: .on)

Why

ANE has no public API — CoreML is the only sanctioned route, and MLX is GPU/Metal only.
Splitting the graph at the encoder boundary (static feed-forward → CoreML/ANE; the
autoregressive TDT loop → MLX) is a clean, lossless way to reach the ANE. The payoff is
mostly power/thermal (the encoder leaves the GPU), with a speedup as a bonus.

Results

Measured on M1 Max, parakeet-tdt-0.6b-v3, a 20.8-min TED-LIUM 3 talk, chunk 9.95s.

| Metric | all-MLX | hybrid (CoreML/ANE encoder) |
|---|---|---|
| ANE residency (encoder) | | 100% (0 CPU / 0 GPU ops, 0 graph interruptions) |
| WER vs reference | 7.28% | 7.11% · agreement 1.07% |
| RTF (Swift release, interleaved) | ~95× | ~131× (~1.38×) |
| GPU power (sustained) | 17.3 W | 3.0 W (÷5.8) |
| Package power | 23.4 W | 10.3 W (÷2.3); ANE encoder ≈ 0.9 W |

  • The transcript is reproduced ~1:1 (2786 vs 2802 words); the residual is the
    fp16-vs-bf16 difference (CoreML-fp16 is actually closer to fp32 than the shipped
    MLX-bf16 encoder).
  • Residency holds at the 1.1b variant too (100%, 0 interruptions).

How it works

  • Fixed input shape (a fixed mel-frame count, e.g. 1000 = 10s) — required for ANE
    residency; a dynamic (RangeDim) time axis drops it to 0%. The Swift wrapper pads each
    chunk's mel to the fixed length and crops the output via the subsampling formula. Keep
    --chunk-duration ≤ frames·10ms.
  • The ANE output MLMultiArray is stride-padded, so the wrapper reads it by strides.
  • Conversion (NeMo → .mlpackage) lives in tools/coreml-ane/ (convert_encoder.py +
    convert_traced.py); see the README. --fp16-io gives 100% ANE / 0 CPU ops.
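
The padding and stride-aware read described above can be sketched in plain Swift. This is a minimal illustration, not the PR's code: `padMel` and `compactFromStrided` are hypothetical names, zero padding is an assumption (the description only says the chunk is padded), and `Float` stands in for the fp16 values. In a real wrapper the shape and strides would come from `MLMultiArray.shape` and `MLMultiArray.strides`.

```swift
import Foundation

/// Pads a mel spectrogram (one row per frame) with zero rows up to `fixedFrames`.
/// Zero padding is an assumption for illustration.
func padMel(_ mel: [[Float]], toFrames fixedFrames: Int, melBins: Int) -> [[Float]] {
    precondition(mel.count <= fixedFrames, "chunk is longer than the fixed input shape")
    let zeroRow = [Float](repeating: 0, count: melBins)
    return mel + [[Float]](repeating: zeroRow, count: fixedFrames - mel.count)
}

/// Copies a strided buffer into a dense row-major array.
/// The element at index (i0, i1, ...) lives at offset i0*strides[0] + i1*strides[1] + ...
func compactFromStrided(_ storage: [Float], shape: [Int], strides: [Int]) -> [Float] {
    precondition(shape.count == strides.count, "shape and strides must have the same rank")
    let count = shape.reduce(1, *)
    var dense = [Float]()
    dense.reserveCapacity(count)
    var index = [Int](repeating: 0, count: shape.count)
    for _ in 0..<count {
        let offset = zip(index, strides).reduce(0) { $0 + $1.0 * $1.1 }
        dense.append(storage[offset])
        // Advance the multi-index like an odometer, last axis fastest.
        var axis = shape.count - 1
        while axis >= 0 {
            index[axis] += 1
            if index[axis] < shape[axis] { break }
            index[axis] = 0
            axis -= 1
        }
    }
    return dense
}

// Two real frames padded to a fixed length of four frames.
let padded = padMel([[1, 2], [3, 4]], toFrames: 4, melBins: 2)

// A 2x3 array stored with a row stride of 4 (one padding element per row).
let storage: [Float] = [1, 2, 3, 0, 4, 5, 6, 0]
let dense = compactFromStrided(storage, shape: [2, 3], strides: [4, 1])
```

Reading through the strides skips the padding elements, whereas a flat copy of the first six values would pick up the stray `0` at the end of the first row.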

Scope

  • Swift: ParakeetCoreMLEncoder.swift, ParakeetModel.swift (the .coreML case),
    App.swift (the --coreml-encoder flag).
  • Tooling: tools/coreml-ane/ converter + README.

Limitations / follow-ups

  • The .mlpackage is not bundled (it's large). A prebuilt one is hosted on Hugging Face
    (beshkenadze/parakeet-tdt-0.6b-v3-coreml-ane);
    users can also convert it via the tooling.
  • MLX↔CoreML marshaling currently uses CPU copies; a zero-copy IOSurface-backed
    MLMultiArray would lift the Swift RTF further (the power win is independent).
  • RTF numbers are M1 Max; newer ANE generations should do better.

Testing

  • New CI-safe unit tests (Tests/ParakeetCoreMLEncoderTests.swift, swift-testing): the
    output-length math matches the dw-striding formula, and a missing .mlpackage throws
    (→ MLX fallback). No ANE/model/network needed; swift test: 2/2 pass.
  • Builds clean (release); transcript parity verified against the all-MLX path on the
    full talk.
  • The decode path is unchanged (the encoder is swapped behind the existing hook).
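
For reference, the output-length math that the first test checks might look like the sketch below. It assumes the usual NeMo dw_striding setup for 8× subsampling (three conv stages with kernel 3, stride 2, padding 1); the function name and defaults are illustrative and not taken from the PR.

```swift
import Foundation

/// Encoder output frames after `stages` strided conv stages.
/// Each stage maps n -> floor((n + 2*padding - kernel) / stride) + 1.
/// Defaults assume an 8x dw_striding front end (an assumption, see above).
func subsampledLength(_ frames: Int, stages: Int = 3, kernel: Int = 3,
                      stride: Int = 2, padding: Int = 1) -> Int {
    guard frames > 0 else { return 0 }
    var length = frames
    for _ in 0..<stages {
        // With length >= 1 the numerator is non-negative, so integer division is a floor.
        length = (length + 2 * padding - kernel) / stride + 1
    }
    return length
}

let fixedOutput = subsampledLength(1000) // fixed 10 s input shape
let chunkOutput = subsampledLength(995)  // 9.95 s chunk
let shortOutput = subsampledLength(600)  // 6 s chunk
```

Under these assumptions a 1000-frame input yields 125 encoder frames, a 995-frame chunk also yields 125, and a 600-frame chunk yields 75, so the wrapper would keep only the first 75 of the 125 output frames in that case.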

@beshkenadze beshkenadze changed the title Run the Parakeet Conformer encoder on the Apple Neural Engine (CoreML) [Perf] Run the Parakeet Conformer encoder on the Apple Neural Engine (CoreML) Jun 8, 2026
@beshkenadze beshkenadze force-pushed the feat/parakeet-coreml-ane-encoder branch 2 times, most recently from cb58a8f to c332b3b Compare June 10, 2026 14:26
@lucasnewman

Collaborator

@beshkenadze It seems like the CoreML export should live in the python repo? Having python tools in this repo doesn't make a lot of sense to me as the toolchains are very distinct. Thoughts?

@beshkenadze

Contributor Author

Agreed. I'll move tools/coreml-ane/ out and co-locate each converter with its .mlpackage in the HF model repo — artifact + the exact script that produced it. Sound good?

Run the Parakeet Conformer encoder on the Apple Neural Engine via CoreML — the
.mlpackage is auto-downloaded from Hugging Face while the TDT decoder stays in
MLX. Opt-in encoder API (aneEncoder:) + CI-safe tests.
@beshkenadze beshkenadze force-pushed the feat/parakeet-coreml-ane-encoder branch from c931113 to 81f284d Compare June 14, 2026 20:43
