Run the Parakeet Conformer encoder on the Apple Neural Engine (CoreML) #12

Closed
beshkenadze wants to merge 8 commits into main from feat/parakeet-coreml-ane-encoder

@beshkenadze commented Jun 8, 2026

Run the Parakeet Conformer encoder on the Apple Neural Engine (CoreML)

Summary

Adds an optional CoreML/ANE path for the Parakeet Conformer encoder, behind a flag.
The encoder (≈95% of the compute) runs on the Apple Neural Engine via CoreML while the
TDT decoder and chunking stay in MLX. Same transcript, lower power, and a small speedup
on top.

It plugs into the existing EncoderExecutionImplementation hook in ParakeetModel
(new .coreML case + enableCoreMLEncoder(modelURL:)), so the decode path is untouched.
It falls back to the MLX encoder if CoreML is unavailable. Public MLModel +
MLComputeUnits only — no private ANE APIs.

# auto-downloads the prebuilt encoder from Hugging Face
mlx-audio-swift-stt --model beshkenadze/parakeet-tdt-0.6b-v3-mlx-fp16 \
    --audio in.wav --output-path out --ane --chunk-duration 9.95

// .off (default) · .on (default HF repo) · .repo("id") · .package(localURL)
let model = try await ParakeetModel.fromPretrained(repo, aneEncoder: .on)
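
The "fall back to MLX if CoreML is unavailable" behaviour from the summary can be sketched roughly as below. This is only an illustration under stated assumptions: `EncoderBackend`, `CoreMLUnavailable`, `loadCoreMLEncoder` and `chooseBackend` are hypothetical names, not the types in this PR, and the choice of `.cpuAndNeuralEngine` is a guess at a sensible compute-unit setting. Only public CoreML calls (`MLModelConfiguration`, `MLModel.compileModel(at:)`, `MLModel(contentsOf:configuration:)`) are used.

```swift
import Foundation
#if canImport(CoreML)
import CoreML
#endif

/// Hypothetical stand-in for the two encoder paths; not the repo's actual type.
enum EncoderBackend {
    case mlx
    case coreML
}

/// Thrown on platforms where CoreML cannot be imported.
struct CoreMLUnavailable: Error {}

/// Tries to compile and load the encoder package using public CoreML APIs only.
func loadCoreMLEncoder(at packageURL: URL) throws {
    #if canImport(CoreML)
    let config = MLModelConfiguration()
    if #available(macOS 13.0, iOS 16.0, tvOS 16.0, watchOS 9.0, *) {
        // Assumption: prefer CPU + Neural Engine so the encoder stays off the GPU.
        config.computeUnits = .cpuAndNeuralEngine
    } else {
        config.computeUnits = .all
    }
    // A .mlpackage has to be compiled before it can be loaded.
    let compiledURL = try MLModel.compileModel(at: packageURL)
    _ = try MLModel(contentsOf: compiledURL, configuration: config)
    #else
    throw CoreMLUnavailable()
    #endif
}

/// Picks CoreML when the package loads, otherwise falls back to MLX.
func chooseBackend(packageURL: URL?) -> EncoderBackend {
    guard let url = packageURL else { return .mlx }
    do {
        try loadCoreMLEncoder(at: url)
        return .coreML
    } catch {
        return .mlx
    }
}
```

Any load error, including a missing package or a platform without CoreML, lands in the `catch` branch, so the caller always gets a usable backend.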

Why

ANE has no public API — CoreML is the only sanctioned route, and MLX is GPU/Metal only.
Splitting the graph at the encoder boundary (static feed-forward → CoreML/ANE; the
autoregressive TDT loop → MLX) is a clean, lossless way to reach the ANE. The payoff is
mostly power/thermal (the encoder leaves the GPU), with a speedup as a bonus.

Results

Measured on M1 Max, parakeet-tdt-0.6b-v3, a 20.8-min TED-LIUM 3 talk, chunk 9.95s.

| Metric | all-MLX | hybrid (CoreML/ANE encoder) |
|---|---|---|
| ANE residency (encoder) | | 100% (0 CPU / 0 GPU ops, 0 graph interruptions) |
| WER vs reference | 7.28% | 7.11% · agreement 1.07% |
| RTF (Swift release, interleaved) | ~95× | ~131× (~1.38×) |
| GPU power (sustained) | 17.3 W | 3.0 W (÷5.8) |
| Package power | 23.4 W | 10.3 W (÷2.3); ANE encoder ≈ 0.9 W |

  • The transcript is reproduced ~1:1 (2786 vs 2802 words); the residual is the
    fp16-vs-bf16 difference (CoreML-fp16 is actually closer to fp32 than the shipped
    MLX-bf16 encoder).
  • Residency holds at the 1.1b variant too (100%, 0 interruptions).

How it works

  • Fixed input shape (a fixed mel-frame count, e.g. 1000 frames = 10 s) is required for
    ANE residency; a dynamic (RangeDim) time axis drops it to 0%. The Swift wrapper pads
    each chunk's mel to the fixed length and crops the output via the subsampling formula.
    Keep --chunk-duration ≤ frames × 10 ms (e.g. 9.95 s for 1000 frames).
  • The ANE output MLMultiArray is stride-padded, so the wrapper reads it by strides.
  • Conversion (NeMo → .mlpackage) lives in tools/coreml-ane/ (convert_encoder.py +
    convert_traced.py); see the README. --fp16-io gives 100% ANE / 0 CPU ops.
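
The pad-then-crop step above can be sketched as follows. This is a minimal illustration, not the wrapper's code: `encoderOutputFrames` and `padChunk` are hypothetical names, zero padding is an assumption, and the length recurrence assumes three stride-2 convolutions with kernel 3 and padding 1 (the usual shape of 8x dw-striding subsampling).

```swift
/// Hypothetical helper: frames left after `stages` stride-2 convolutions
/// (kernel 3, padding 1 on each side). Assumed to match 8x dw-striding.
func encoderOutputFrames(_ frames: Int, stages: Int = 3) -> Int {
    precondition(frames >= 1, "need at least one mel frame")
    var n = frames
    for _ in 0..<stages {
        n = (n + 2 * 1 - 3) / 2 + 1   // floor((n + 2*pad - kernel) / stride) + 1
    }
    return n
}

/// Zero-pads a [frame][feature] mel chunk to `fixedFrames` and reports how many
/// encoder output frames are valid, so the caller can crop the rest.
func padChunk(_ mel: [[Float]], fixedFrames: Int)
    -> (padded: [[Float]], validOutputFrames: Int) {
    precondition(!mel.isEmpty && mel.count <= fixedFrames,
                 "chunk must fit the fixed shape")
    let featureCount = mel[0].count
    let padding = Array(
        repeating: [Float](repeating: 0, count: featureCount),
        count: fixedFrames - mel.count
    )
    return (mel + padding, encoderOutputFrames(mel.count))
}
```

Under these assumptions a 9.95 s chunk (995 frames at a 10 ms hop) is padded to 1000 frames and 125 output frames are kept, which is also the full output length for 1000 input frames.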

Scope

  • Swift: ParakeetCoreMLEncoder.swift, ParakeetModel.swift (the .coreML case),
    App.swift (the --coreml-encoder flag).
  • Tooling: tools/coreml-ane/ converter + README.

Limitations / follow-ups

  • The .mlpackage is not bundled (it's large). A prebuilt one is hosted on Hugging Face
    (beshkenadze/parakeet-tdt-0.6b-v3-coreml-ane);
    users can also convert it via the tooling.
  • MLX↔CoreML marshaling currently uses CPU copies; a zero-copy IOSurface-backed
    MLMultiArray would lift the Swift RTF further (the power win is independent).
  • RTF numbers are M1 Max; newer ANE generations should do better.

Testing

  • New CI-safe unit tests (Tests/ParakeetCoreMLEncoderTests.swift, swift-testing): the
    output-length math matches the dw-striding formula, and a missing .mlpackage throws
    (→ MLX fallback). No ANE/model/network needed; swift test: 2/2 pass.
  • Builds clean (release); transcript parity verified against the all-MLX path on the
    full talk.
  • The decode path is unchanged (the encoder is swapped behind the existing hook).

beshkenadze and others added 3 commits June 5, 2026 09:13
…aming) (Blaizzy#193)

* perf(sortformer): single bulk readback in predsToSegments

predsToSegments built its result with a per-frame .item() GPU->CPU read
(one synchronous round-trip per frame, per speaker, per call). On long
streaming runs that is ~95k syncs and dominates the non-encoder time.

Replace it with a single bulk asArray() readback followed by pure-Swift
change detection. Output is bit-identical (verified 0.0% DER on a 32-min
2-speaker file vs the previous implementation; the existing
SortformerPostprocessingTests cover basic / empty / min-duration /
merge-gap / sorted cases). ~1.8x faster streaming end to end.

* test(sortformer): pin predsToSegments boundary times incl. trailing segment

The existing post-processing tests assert only segment counts/order, not
exact times, and never exercise a speaker active through the final frame
(the tail branch). Add predsToSegmentsExactBoundaries which pins the
start edge, the inactive-close edge, and the active-to-last-frame case —
locking the exact frame->time mapping the bulk-readback refactor preserves.
Co-authored-by: vanch <vanchye@outlook.com>
…aizzy#196)

- Fix blocking weight-load crash: prompt_kernel used integer @ModuleInfo keys
  (0/2) -> MLX-swift array misinterpretation. Remap to linear0/linear2.
- Add cache-aware streaming (NemotronASRStreaming.swift): per-layer attention +
  causal-conv caches + incremental causal subsampling (16-frame mel cache);
  generateStream now streams O(n) with no recompute.

Validated vs NeMo CUDA reference (FLEURS en-US 200u): offline 9.62%, streaming
9.43% (CUDA 9.58%); single-clip token-exact.
@beshkenadze beshkenadze force-pushed the feat/parakeet-coreml-ane-encoder branch from 648dac3 to bb8ac70 Compare June 8, 2026 10:58
ParakeetCoreMLEncoder is a drop-in for the MLX encoder that runs the Conformer on the
Apple Neural Engine via CoreML, wired through the existing EncoderExecutionImplementation
hook (.coreML case + enableCoreMLEncoder). Decoder and chunking stay in MLX. The model is
fixed-shape (ANE requirement): chunk mel is padded to the fixed length and the output
cropped via the subsampling formula; the stride-padded ANE output is read by strides.
Falls back to the MLX encoder if CoreML is unavailable. CLI: --coreml-encoder <path>.
Public MLModel + MLComputeUnits only.
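
Reading a stride-padded output "by strides" can be illustrated with plain arrays. The sketch below is an assumption-laden stand-in: `readStrided` is a hypothetical helper, the buffer is `Float` rather than fp16, and in real code the shape and strides would come from `MLMultiArray.shape` and `MLMultiArray.strides` (both expressed in elements, not bytes).

```swift
/// Copies a logical [rows x cols] matrix out of a flat buffer whose rows may be
/// padded, i.e. rowStride >= cols. Strides are in elements, not bytes.
func readStrided(_ buffer: [Float], rows: Int, cols: Int,
                 rowStride: Int, colStride: Int = 1) -> [Float] {
    precondition(rows > 0 && cols > 0, "empty shape")
    precondition((rows - 1) * rowStride + (cols - 1) * colStride < buffer.count,
                 "shape and strides exceed the buffer")
    var out = [Float]()
    out.reserveCapacity(rows * cols)
    for r in 0..<rows {
        for c in 0..<cols {
            // Index with the reported strides instead of assuming a dense layout.
            out.append(buffer[r * rowStride + c * colStride])
        }
    }
    return out
}

// Two rows of three values, each row padded to a stride of four elements.
let padded: [Float] = [1, 2, 3, 0, 4, 5, 6, 0]
let dense = readStrided(padded, rows: 2, cols: 3, rowStride: 4)
```

Treating the same buffer as dense (row stride 3) would pull the padding zeros into the data, which is the failure this kind of read avoids.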

convert_encoder.py traces the Conformer encoder at a fixed shape; convert_traced.py runs
the coremltools conversion in an isolated numpy<2 env (coremltools 9.0 + numpy>=2 fails on
a folded aten::Int const). Produces the fp16 MLProgram .mlpackage for --coreml-encoder.
README documents conversion + usage.

Expose subsampledLength as a static helper and verify it matches the dw-striding
output-length formula; assert a missing .mlpackage throws (model then falls back to
MLX). No ANE/model/network needed -> runs in CI. swift test: 2/2 pass.

Add ANEEncoder enum (.off default / .on / .repo(String) / .package(URL)) and
ParakeetModel.fromPretrained(aneEncoder:) so callers just flip it on. .on/.repo
download the .mlpackage from Hugging Face (default
beshkenadze/parakeet-tdt-0.6b-v3-coreml-ane) via the existing HubClient, cached.
CLI gains --ane. Verified end-to-end (download + transcribe, 2802 words, ~105x RT);
swift test: 3/3 pass.
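
The four-case option described in this commit message might be modelled along these lines. The case names follow the message; everything else (`EncoderSource`, `defaultANERepo`, `resolve`) is a hypothetical illustration of how each case could map to a package source, and the real declaration may differ.

```swift
import Foundation

/// Mirrors the four cases named in the commit message; the real type may differ.
enum ANEEncoder {
    case off
    case on
    case repo(String)
    case package(URL)
}

/// Hypothetical: where the encoder .mlpackage should come from.
enum EncoderSource: Equatable {
    case disabled
    case hub(repoID: String)
    case local(URL)
}

/// Default Hugging Face repo named in the PR description.
let defaultANERepo = "beshkenadze/parakeet-tdt-0.6b-v3-coreml-ane"

/// Maps the caller-facing option to a concrete package source.
func resolve(_ option: ANEEncoder) -> EncoderSource {
    switch option {
    case .off:
        return .disabled
    case .on:
        return .hub(repoID: defaultANERepo)
    case .repo(let id):
        return .hub(repoID: id)
    case .package(let url):
        return .local(url)
    }
}
```

Keeping `.on` as shorthand for the default repo means callers only need to spell out a repo ID or a local path when they deviate from the prebuilt encoder.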
@beshkenadze beshkenadze force-pushed the feat/parakeet-coreml-ane-encoder branch from 6b0cd38 to 1a7b407 Compare June 8, 2026 13:24
@beshkenadze

Superseded by upstream PR Blaizzy#199 (rebased onto upstream/main).

@beshkenadze beshkenadze closed this Jun 8, 2026